I changed the code to:
console.log(`Starting to add ${allUrls.length} urls to the RequestList`);

const ReqList = new RequestList({
    sources: allUrls,
    persistRequestsKey: 'My-ReqList',
    keepDuplicateUrls: false,
});
await ReqList.initialize();

// length() is a method on RequestList, not a property
console.log(ReqList.length());

const crawler = new CheerioCrawler({
    requestList: ReqList,
    proxyConfiguration,
    requestHandler: router,
    minConcurrency: 32,
    maxConcurrency: 256,
    maxRequestRetries: 20,
    navigationTimeoutSecs: 6,
    loggingInterval: 30,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 20 times.`);
    },
});

await crawler.run();
Tested on a smaller batch of 100k URLs, it works perfectly.
With 12M URLs, it has been running for 64 minutes now and is stuck at 14.9 GB of memory usage (I increased the max Node memory to 32 GB; I have 128 GB available).
I will let it run longer because I still see CPU activity, but it looks like it has hung.
It takes 10 minutes to collect all the URLs, but enqueueing them is a pain that, for the moment, doesn't work at all.
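To make "enqueueing them" concrete, here is a rough, untested sketch of the kind of chunked approach I mean: pushing the URLs into the crawler's default RequestQueue in batches instead of materialising one 12M-entry RequestList up front. This assumes Crawlee v3's `crawler.addRequests()`; the 100k chunk size is arbitrary and `allUrls`, `proxyConfiguration` and `router` are the same as above. I have not verified this at 12M scale.

import { CheerioCrawler } from 'crawlee';

// Arbitrary chunk size, not tuned
const CHUNK_SIZE = 100_000;

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    // ...same tuning options as above
});

for (let i = 0; i < allUrls.length; i += CHUNK_SIZE) {
    // waitForAllRequestsToBeAdded makes each call finish adding its chunk
    // before the loop moves on, so background batches don't pile up
    await crawler.addRequests(allUrls.slice(i, i + CHUNK_SIZE), {
        waitForAllRequestsToBeAdded: true,
    });
}

await crawler.run();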