Apify and Crawlee Official Forum

Updated 3 weeks ago

Massive Scraper

Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations (some can share the same one). How can I achieve this in Crawlee so that they run in parallel and can all be executed with a single command, or also run in isolation?

Input, example repos, etc. would be highly appreciated.
2 comments
You gave very little information; developers would have more questions than answers after reading your message, which is probably why you didn't get any reply.

Here is a good example of a scraping system implementation with high scalability options and good monitoring tools. If you need a so-called one-time scraper it would be a bad example, but for a long-term project it is one of the best: https://github.com/68publishers/crawler

For future questions, I would recommend sticking to these guidelines: https://stackoverflow.com/help/how-to-ask
You can have multiple page handlers by using a router. This lets you change which handler a page is processed by: set the label property when enqueuing a link (see the handler sketch after the example below). Here's an example:


TypeScript
import {
  Configuration,
  createPlaywrightRouter,
  PlaywrightCrawler,
  PlaywrightCrawlerOptions,
} from 'crawlee';

const crawlerConfig = new Configuration({
  // config options
});

// Each label maps requests to their own handler function.
const router = createPlaywrightRouter();
router.addHandler('label1', label1Handler);
router.addHandler('label2', label2Handler);
router.addHandler('label3', label3Handler);

const crawlerOptions: PlaywrightCrawlerOptions = {
  requestHandler: router,
  // crawler options
};

const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);

// The label on each request decides which handler processes it.
await crawler.run([
  { url: 'url1', label: 'label1' },
]);
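
The handler functions receive the crawling context, and you can set the label when enqueuing links from inside a handler so newly discovered pages are routed to a different handler. A minimal sketch of what label1Handler could look like (the selector and label names are placeholders):

TypeScript
import { PlaywrightCrawlingContext } from 'crawlee';

const label1Handler = async ({ page, enqueueLinks, log }: PlaywrightCrawlingContext) => {
  log.info(`Processing ${page.url()}`);
  // Links matching the selector are enqueued with the 'label2' label,
  // so the router dispatches them to label2Handler.
  await enqueueLinks({
    selector: 'a.detail', // placeholder selector
    label: 'label2',
  });
};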
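
As for the original question about running many site-specific crawlers in parallel or in isolation: each crawler's run() is just a promise, so one pattern is a small entry script that keeps a factory per site and either starts them all with Promise.all or only the one named on the command line. A rough sketch, assuming hypothetical siteA/siteB modules that each export a function returning a fully configured crawler (in practice you'd also give each crawler its own Configuration or named request queue so their storages don't collide):

TypeScript
import { PlaywrightCrawler } from 'crawlee';
// Hypothetical per-site modules, each exporting a factory for a configured crawler.
import { buildSiteACrawler } from './crawlers/siteA';
import { buildSiteBCrawler } from './crawlers/siteB';

const crawlers: Record<string, () => PlaywrightCrawler> = {
  siteA: buildSiteACrawler,
  siteB: buildSiteBCrawler,
};

// `node main.js` runs every crawler in parallel; `node main.js siteA` runs one in isolation.
const target = process.argv[2];
if (target) {
  await crawlers[target]().run();
} else {
  await Promise.all(Object.values(crawlers).map((build) => build().run()));
}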