Apify and Crawlee Official Forum


How to throttle enqueuing URLs to the next router

Plain Text
        splitAndExecute({
            callback: async (urlBatch, batchIndex) => {
                // Log which batch of detail-page jobs is being enqueued
                logger.info(`Linkedin/Scraper - Enqueuing ${urlBatch.length} jobs to detail page handler - Batch ${batchIndex + 1}`);
                await enqueueLinks({
                    urls: urlBatch,
                    label: LinkedinRouterLabels.JOB_DETAIL_PAGE,
                    userData: createLinkedinRouterUserData(payload),
                    waitForAllRequestsToBeAdded: false
                });
                const minSleepTime = 2000 * (batchIndex + 1);
                const maxSleepTime = 3000 * (batchIndex + 1);
                await random_sleep(minSleepTime, maxSleepTime);

            },
            urls: jobDetailPageUrls,
            maxRequestsPerBatch: 2,
        });
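
(Note: splitAndExecute isn't part of Crawlee; judging from the call site it's a custom batching helper. A minimal sketch of what it might look like, with the signature and names assumed from the snippet above:)

Plain Text
// Hypothetical reconstruction of the splitAndExecute helper used above.
interface SplitAndExecuteOptions {
    callback: (urlBatch: string[], batchIndex: number) => Promise<void>;
    urls: string[];
    maxRequestsPerBatch: number;
}

async function splitAndExecute({ callback, urls, maxRequestsPerBatch }: SplitAndExecuteOptions): Promise<void> {
    // Walk the URL list in chunks of maxRequestsPerBatch and await the
    // caller-supplied callback (which enqueues and sleeps) for each chunk.
    for (let i = 0; i < urls.length; i += maxRequestsPerBatch) {
        const batchIndex = Math.floor(i / maxRequestsPerBatch);
        await callback(urls.slice(i, i + maxRequestsPerBatch), batchIndex);
    }
}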


Hello guys. I have a router that scrapes a URL list and enqueues the URLs to the next router.
However, I want to limit the enqueuing to throttle the requests to the website.

I've tried adding the crawler configuration, but it doesn't work as intended. Even when I set a limit of requests per minute or requests per crawl, it doesn't respect it.
Initially I thought that's because the limit is only checked after a URL list has been enqueued, so if you enqueue a list bigger than the limit in one go, the limit never takes effect.
E.g. if the limit is 10 requests and I enqueue 25 requests as a single array.

So I manually split the job-URLs array into multiple smaller batches.
However, this doesn't work either. The enqueuing does happen with sleep intervals in between, but the next router is still called all at once after all the batches are enqueued.
3 comments
Here's my CheerioCrawler config
Plain Text
    return new CheerioCrawler({
        proxyConfiguration,
        // maxRequestRetries: 1,
        maxConcurrency: 1,
        maxRequestsPerMinute: 2,
        maxRequestsPerCrawl: 10, // ! (for all routers i.e. preview/detail) Useful for testing. (In reality, it is more than this because of parallel requests)
        autoscaledPoolOptions: {
            desiredConcurrency: 1,
        },
        requestHandler: linkedinRouter,
    });
Hey,
If I understand correctly, you are trying to limit the frequency of requests that are being sent to the server, right? If so, you should enqueue all of the requests at once; with the maxRequestsPerMinute option set, CheerioCrawler will automatically limit the frequency of requests sent to the server.

By "enqueuing", you only add the requests to the RequestQueue, which then automatically feeds the crawler. The maxRequestsPerMinute option does not limit the enqueuing rate, but the number of requests that are processed per minute. There is no real advantage in limiting the enqueuing process in general.
Thank you @Milunnn. I tried that already and it wasn't working for some reason,
but now it's working. Thank you for the help πŸ™‚