Crawler skipping Jobs after processing 5,000-6,000 Requests

For the past few days, I have been running the crawler with a high number of jobs, and I have run into a problem.

I have found that not all jobs are processed by the CheerioCrawler, despite these jobs being added to the queue through addRequest([job]).

I can't reliably reproduce it; it happens after approximately 5,000-6,000 jobs.

My code doesn't crash; it continues to the next jobs in the BullMQ job queue without scraping the link.
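For reference, Crawlee's documented methods are RequestQueue.addRequest() for a single request and crawler.addRequests() for an array. Here is a minimal sketch of the crawler side of a setup like this, assuming the extracted product is published to Kafka from the requestHandler (the URL and extraction logic are placeholders, not from my actual code):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // keepAlive keeps the crawler running when its queue drains,
    // so requests added later are still processed.
    keepAlive: true,
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text(); // placeholder extraction
        log.info(`Scraped ${request.url} (${title})`);
        // ... build the product payload and publish it to Kafka here
    },
});

// Started once; with keepAlive the promise does not resolve when the queue is empty.
void crawler.run();

// For each incoming job, feed its URL into the crawler:
await crawler.addRequests([{ url: 'https://example.com/product/1' }]);
```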

This is the normal behavior, where a request reaches the requestHandler (see the CheerioCrawler info log):
[Attachment: image.png]
Here is where it starts misbehaving, and I have no idea why, because the jobs/URLs are valid. It seems the requests no longer reach the crawler:
[Attachment: image.png]
And at this point, my Kafka consumer no longer receives new data (products) from the scraper.
This issue is still there. Does anyone know how to solve it?
Did you use a loop to run your crawler continuously, as long as there are URLs? How do you do that?
I'm using BullMQ, and my CheerioCrawler has keepAlive set to true.

I use cron to dispatch jobs to the crawler. In this case, the worker stops working after the next batch of jobs.
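A minimal sketch of that wiring, assuming a BullMQ Worker consumes the cron-dispatched jobs and feeds each job's URL into the long-lived crawler from the earlier sketch (the queue name, Redis connection, and job payload shape are my assumptions, not from the actual setup):

```ts
import { Worker } from 'bullmq';

// Assumes `crawler` is the keepAlive CheerioCrawler from the sketch above.
const worker = new Worker(
    'scrape-jobs', // hypothetical queue name
    async (job) => {
        // Each cron-dispatched job is assumed to carry one URL in its payload.
        await crawler.addRequests([{ url: job.data.url }]);
    },
    { connection: { host: 'localhost', port: 6379 } },
);

// Surface job failures instead of losing them silently.
worker.on('failed', (job, err) => {
    console.error(`Job ${job?.id} failed:`, err);
});
```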