Apify and Crawlee Official Forum

Updated 3 months ago

CheerioCrawler hangs with 12 million URLs

Plain Text
// proxyConfiguration and router are defined elsewhere in the project
import { CheerioCrawler, RequestList, log } from 'crawlee';

const requestList = await RequestList.open('My-ReqList', allUrls, { persistStateKey: 'My-ReqList' });
console.log(requestList.length());
const crawler = new CheerioCrawler({
  requestList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
    log.debug(`Request ${request.url} failed 20 times.`);
  },
});
await crawler.run();


allUrls is an array containing 12 million URLs. I'm trying to load them into the CheerioCrawler, but the process hangs at 14 GB of RAM and never even logs requestList.length().

Can anybody help, please?
11 comments
Changed the code to:
Plain Text
console.log(`Starting to add ${allUrls.length} urls to the RequestList`);
const ReqList = new RequestList({
  sources: allUrls,
  persistRequestsKey: 'My-ReqList',
  keepDuplicateUrls: false,
});
await ReqList.initialize();
console.log(ReqList.length()); // length() is a method, not a property

const crawler = new CheerioCrawler({
  requestList: ReqList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
      log.debug(`Request ${request.url} failed 20 times.`);
  },
});

await crawler.run();

Tested on a smaller batch of 100k URLs, it works perfectly.
With 12M URLs, it has been running for 64 minutes now, stuck at 14.9 GB of memory usage (I increased the max Node memory to 32 GB; I have 128 GB available).

I will let it run longer because I still see CPU activity, but it looks like it has hung.
It takes 10 minutes to fetch all the URLs, but enqueueing them is a pain that for the moment doesn't work at all.
Are you running the crawler on the Apify platform?
Can you share a link to your run, please?

Also, can you please share the code where you assign the allUrls variable? Maybe there is a memory leak somewhere...
Are you getting it from the input?
Running locally, not on Apify.
Just a sec, I'll share the code in a Pastebin.
Try using RequestList instead of await crawler.addRequests(chunk);
it represents a big static list of URLs to crawl.

https://crawlee.dev/api/next/core/class/RequestList

I guess the issue is that your chunkSize is way too big and the scraper runs out of memory because of it.
Or you can try to pass your array to crawler.run().

Like here:
https://crawlee.dev/docs/next/examples/crawl-multiple-urls

Example:
Plain Text
// Run the crawler with the initial requests
await crawler.run([ // replace this inline array with your allUrls variable
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
    'http://www.example.com/page-3',
]);
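Applied to this thread, that would look something like the sketch below, assuming allUrls is the 12M-entry array from the first post (run() should then convert the strings into requests and enqueue them itself):
Plain Text
// Hand the whole URL array straight to run();
// the crawler wraps each string in a Request and enqueues it internally.
await crawler.run(allUrls);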
I have 128 GB of RAM and I allow 64 GB to Node.
I've tried RequestList; it also fails in a similar way.
I will try to put the array in crawler.run();
this one I haven't tried until now.
Then try to decrease chunkSize to 50-100k or something like that.
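As a rough sketch of that suggestion, assuming the enqueue loop in the Pastebin looks something like the following (only crawler.addRequests(chunk) is quoted in this thread; the loop shape and the 50k figure are assumptions):
Plain Text
const chunkSize = 50_000; // suggested ballpark above: 50-100k per batch
for (let i = 0; i < allUrls.length; i += chunkSize) {
  const chunk = allUrls.slice(i, i + chunkSize);
  // Each addRequests() call materializes this chunk as Request objects,
  // so smaller chunks keep peak memory lower.
  await crawler.addRequests(chunk);
}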
Please let's continue the discussion in one ticket (it's the same issue, right?):
https://discord.com/channels/801163717915574323/1092208304660414597