Apify and Crawlee Official Forum

Updated 2 months ago

Anyone have any example scraping multiple different websites?

The structure I am using does not look like the best.

I am basically creating several routers and then doing something like:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
  requestHandler: async (ctx) => {
    if (ctx.request.url.includes("url1")) {
      await url1Router(ctx);
    }

    if (ctx.request.url.includes("url2")) {
      await url2Router(ctx);
    }

    if (ctx.request.url.includes("url3")) {
      await url3Router(ctx);
    }

    // Note: this re-exports the whole dataset after every single request.
    await Dataset.exportToJSON("data.json");
  },

  // Comment this option out to scrape the full website.

  //   maxRequestsPerCrawl: 20,
});


This does not seem correct. Anyone with a better way?
8 comments
Create a route for each URL, then use labels to identify them.
@Marco, how far is that from what I am doing there? It seems like somewhere I will have to do it anyway. In the example above I did a router per URL: url1Router, url2Router are defined on a per-URL basis. Am I wrong?
It's actually very similar. Routes should be defined depending on your needs, so if you need a route per URL, just do that.
My concern is that I have multiple websites, not just different URLs. Each website might have two URLs that I have to scrape independently. Is that how you would do it, @Marco? Would you have multiple routers?
Oh, I see. I think I would still use one router, with labels such as "website1-page2", to keep things simple; a function called at the beginning would assign the correct label to each request based on the URL.