Apify and Crawlee Official Forum

Updated 3 months ago

CheerioCrawler hangs with 12 million URLs

Plain Text
// proxyConfiguration and router are defined elsewhere in the project
import { CheerioCrawler, RequestList, log } from 'crawlee';

const requestList = await RequestList.open('My-ReqList', allUrls, { persistStateKey: 'My-ReqList' });
console.log(requestList.length());
const crawler = new CheerioCrawler({
  requestList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
    log.debug(`Request ${request.url} failed 20 times.`);
  },
});
await crawler.run();


allUrls is an array containing 12 million URLs. I'm trying to load them into the CheerioCrawler, but the process hangs at 14 GB of RAM and never even logs requestList.length().

Can anybody help, please?
11 comments
Changed the code to:
Plain Text
console.log(`Starting to add ${allUrls.length} urls to the RequestList`);
const ReqList = new RequestList({
  sources: allUrls,
  persistRequestsKey: 'My-ReqList',
  keepDuplicateUrls: false,
});
await ReqList.initialize();
console.log(ReqList.length()); // length() is a method, not a property

const crawler = new CheerioCrawler({
  requestList: ReqList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
      log.debug(`Request ${request.url} failed 20 times.`);
  },
});

await crawler.run();

Tested on a smaller batch of 100k URLs, it works perfectly.
With 12M URLs, it has been running for 64 minutes now, stuck at 14.9 GB of memory usage (I increased the max Node memory to 32 GB; I have 128 GB available).

I will let it run longer because I still see CPU activity, but it looks like it has hung.
It takes 10 minutes to fetch all the URLs, but enqueueing them is a pain that for the moment doesn't work at all.
Are you running the crawler on the Apify platform?
Can you share a link to your run, please?

Also, can you please share the code where you assign the allUrls variable? Maybe there is a memory leak somewhere...
Are you getting it from the input?
Running locally, not on Apify.
Just a sec, I'll share the code in a Pastebin.
Try using RequestList instead of await crawler.addRequests(chunk);
it represents a big static list of URLs to crawl.

https://crawlee.dev/api/next/core/class/RequestList

I guess the issue is that your chunkSize is way too big and the scraper runs out of memory because of it.
Or you can try to pass your array to crawler.run().

Like here:
https://crawlee.dev/docs/next/examples/crawl-multiple-urls

Example:
Plain Text
// Run the crawler with the initial requests
await crawler.run([ // replace this inline array with your allUrls variable
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
    'http://www.example.com/page-3',
]);
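Applied to this thread, that would look something like the sketch below, assuming allUrls is the 12M-entry array from the first post (run() should then convert the strings into requests and enqueue them itself):
Plain Text
// Hand the whole URL array straight to run();
// the crawler wraps each string in a Request and enqueues it internally.
await crawler.run(allUrls);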
I have 128 GB of RAM and I allow 64 GB to Node.
I've tried RequestList; it also fails in a similar way.
I will try to put the array in crawler.run();
this one I haven't tried until now.
Then try to decrease chunkSize to 50-100k or something like that.
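As a rough sketch of that suggestion, assuming the enqueue loop in the Pastebin looks something like the following (only crawler.addRequests(chunk) is quoted in this thread; the loop shape and the 50k figure are assumptions):
Plain Text
const chunkSize = 50_000; // suggested ballpark above: 50-100k per batch
for (let i = 0; i < allUrls.length; i += chunkSize) {
  const chunk = allUrls.slice(i, i + chunkSize);
  // Each addRequests() call materializes this chunk as Request objects,
  // so smaller chunks keep peak memory lower.
  await crawler.addRequests(chunk);
}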
Please let's continue the discussion in one ticket (it's the same issue, right?):
https://discord.com/channels/801163717915574323/1092208304660414597