Apify and Crawlee Official Forum

Updated 3 months ago

Running multiple crawler instances at once

I'm scraping entire sites with multiple crawlers running at once, one per site, and I'm looking to scrape 50+ sites. I run the site scrapes concurrently from a start file that uses an event emitter to kick off each site's scrape, specifically the
Plain Text
await crawler.run(startUrls)
line.
Should I run them all at once in one terminal, or run each one in a separate terminal with its own script?
Also, is running multiple crawler instances at once a maintainable approach?
One final problem I'm running into: I'll run the start file, but I get this request queue error when I run multiple crawlers. When I run it again it sometimes works, but the error pops up inconsistently.
Request queue error:
Plain Text
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
[Error: ENOENT: no such file or directory, open 'C:\Users\haris\OneDrive\Documents\GitHub\periodicScraper01\pscrape\storage\request_queues\default\1Rk4szfVGlTLik4.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\Users\\haris\\OneDrive\\Documents\\GitHub\\periodicScraper01\\pscrape\\storage\\request_queues\\default\\1Rk4szfVGlTLik4.json'
}
21 comments
The default requestQueue is shared by Crawlee instances, so to get a separate queue per crawler you need to name it.
But the recommended approach is to not mix crawlers without a reason, i.e. consider crawling all the sites with a single crawler or (better) running one Actor per site, each with a single crawler.
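A minimal sketch of the single-crawler approach (the start URLs and handler logic here are placeholders, not from this thread):
Plain Text
import { CheerioCrawler } from 'crawlee';

// One crawler covers every site; requests for all 50+ sites
// share a single (default) request queue, so there are no
// competing queue files on disk.
const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // Stay within the site this request belongs to
        await enqueueLinks({ strategy: 'same-domain' });
        // ... per-site extraction logic goes here
    },
});

// Placeholder list standing in for your 50+ start URLs
const siteStartUrls = ['https://site-one.example/', 'https://site-two.example/'];
await crawler.run(siteStartUrls);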
How do you separate requestQueues for each instance?
Also, is there a limit to how many instances can be run? Because when I run more than 5-6 crawlers, I get this error:
Plain Text
Error: This crawler instance is already running, you can add more requests to it via `crawler.addRequests()`.
    at CheerioCrawler.run
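That error isn't a hard limit on the number of instances; Crawlee throws it when run() is called on a crawler instance that is already running, e.g. if the event emitter triggers the same crawler twice. A rough sketch of the difference, with placeholder URL arrays:
Plain Text
// Throws 'already running': the second run() starts
// before the first one has finished
crawler.run(siteUrls); // not awaited
await crawler.run(moreUrls); // same instance, still running

// Either await each run on its own instance...
await otherCrawler.run(siteUrls);
// ...or add requests to the crawler that is already running:
await crawler.addRequests(moreUrls);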
Use a named requestQueue.
like this:
Plain Text
// Open the 'my-queue' request queue
const queueWithName = await Actor.openRequestQueue('my-queue');
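And then one named queue per crawler, roughly like this (the queue names, URLs, and handlers are just examples):
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Each crawler gets its own named queue, so they stop
// fighting over the shared default queue on disk
const queueA = await Actor.openRequestQueue('site-a');
const queueB = await Actor.openRequestQueue('site-b');

const crawlerA = new CheerioCrawler({
    requestQueue: queueA,
    async requestHandler({ request }) { /* site A logic */ },
});
const crawlerB = new CheerioCrawler({
    requestQueue: queueB,
    async requestHandler({ request }) { /* site B logic */ },
});

await Promise.all([
    crawlerA.run(['https://site-a.example/']),
    crawlerB.run(['https://site-b.example/']),
]);

await Actor.exit();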
Is there a reason for this, i.e. for not mixing crawlers?
Hey, is there any way to run multiple crawlers without Actors/the Apify SDK? I know the recommended approach is a single crawler per instance,
but the issue is that it messes up the logging.
I need them to run sequentially for proper debugging.
Is there any way I can accomplish this?
I tried to run them like this:
Plain Text
try {
    await crawlerOne.run(urls);
} catch (err) {
    log.error(err);
}

try {
    await crawlerTwo.run(urls);
} catch (err) {
    log.error(err);
}
The issue with this approach is that the first crawler is skipped entirely, and crawler two exits on the first request,
saying the browser closed unexpectedly.
Can you provide some reproduction / full code of your implementation?

I guess it should be fine if you use something like this:

Plain Text
import { Actor } from 'apify';
import {
    CheerioCrawler,
    PuppeteerCrawler,
    createCheerioRouter,
    createPuppeteerRouter,
} from 'crawlee';

// Named queue for requests that need a real browser
const puppeteerRequestQueue = await Actor.openRequestQueue('for-puppeteer');

const cheerioRouter = createCheerioRouter();
const puppeteerRouter = createPuppeteerRouter();

const cheerioCrawler = new CheerioCrawler({
    requestHandler: cheerioRouter,
});

const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: puppeteerRequestQueue,
    requestHandler: puppeteerRouter,
});

cheerioRouter.addDefaultHandler(async ({ $, crawler }) => {
    // Add the request to the CheerioCrawler's (default) request queue
    if (SOMETHING) await crawler.addRequests([REQUEST]);
    // Our check tells us that this page must be handled with
    // Puppeteer, so we save the request in the puppeteerRequestQueue
    // to be handled after the CheerioCrawler has finished
    else await puppeteerRequestQueue.addRequest(REQUEST);
});

// ... Puppeteer handler

// Run the CheerioCrawler, which may enqueue some links for
// the PuppeteerCrawler
await cheerioCrawler.run();

// If any requests were added to puppeteerRequestQueue, they'll be
// handled by Puppeteer now
await puppeteerCrawler.run();
Hey, thanks, I'll try this out to fix the issue.
Also, if I'm not using Apify, can I still use Actor helpers like openRequestQueue?
This does work for separating out parameters such as maxRequestsPerMinute and maxRequestsPerCrawl, but is Actor the only way I can open this queue?
Because it gives me a warning that the Actor is not initialized.
Or should I initialize one even though I won't be using it?
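For reference, plain Crawlee can open named queues without the Apify SDK via RequestQueue.open, which avoids the Actor-not-initialized warning; a rough sketch (the queue name, limit, and URL are examples):
Plain Text
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named queue straight from Crawlee; no Actor.init() required
const queueOne = await RequestQueue.open('crawler-one');

const crawlerOne = new CheerioCrawler({
    requestQueue: queueOne,
    maxRequestsPerMinute: 60, // example per-crawler setting
    async requestHandler({ request }) { /* site logic */ },
});

await crawlerOne.run(['https://example.com/']);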