splitAndExecute({
callback: async (urlBatch, batchIndex) => {
// log that we are enqueuing the nth batch of preview jobs from job-id etc
logger.info(`Linkedin/Scraper - Enqueuing ${urlBatch.length} jobs to detail page handler - Batch ${batchIndex + 1}`);
await enqueueLinks({
urls: urlBatch,
label: LinkedinRouterLabels.JOB_DETAIL_PAGE,
userData: createLinkedinRouterUserData(payload),
waitForAllRequestsToBeAdded: false
});
const minSleepTime = 2000 * (batchIndex + 1);
const maxSleepTime = 3000 * (batchIndex + 1);
await random_sleep(minSleepTime, maxSleepTime);
},
urls: jobDetailPageUrls,
maxRequestsPerBatch: 2,
});
Hello guys. I have a router that scrapes a list of URLs and enqueues them to the next router.
However, I want to limit the enqueuing so that requests to the website are throttled.
I've tried adding the crawler configuration, but it doesn't work as intended. Even when I set a limit on requests/min or requests/crawl etc., it doesn't respect it.
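For context, this is roughly the kind of crawler configuration I tried. The crawler class and the exact option values here are placeholders, not my actual setup:

import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Placeholder configuration: I expected these options to throttle how fast
// the detail pages get requested, but the limits were not respected.
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxConcurrency: 1,         // process one request at a time
    maxRequestsPerMinute: 10,  // cap on requests started per minute
    maxRequestsPerCrawl: 100,  // hard cap for the whole run
});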
Initially I thought it was because the limit check is only done after a whole URL list has been enqueued, so if you enqueue a list bigger than the limit in one go, the limit never takes effect.
E.g. if the limit is 10 requests and I enqueue 25 requests as a single array.
So I manually split the job-URLs array into multiple smaller batches (the splitAndExecute call above).
However, this does not work either. The enqueuing is definitely done with sleep intervals between batches, but the next router is still called all at once after all the batches have been enqueued.
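For reference, splitAndExecute is just a small helper I wrote that chunks the URL array and runs the callback per chunk, roughly like this (simplified sketch, not the exact implementation):

async function splitAndExecute({ urls, maxRequestsPerBatch, callback }: {
    urls: string[];
    maxRequestsPerBatch: number;
    callback: (urlBatch: string[], batchIndex: number) => Promise<void>;
}): Promise<void> {
    // Walk the URL list in chunks of maxRequestsPerBatch and await the callback
    // for each chunk, so the random_sleep between batches actually spaces them out.
    for (let i = 0; i < urls.length; i += maxRequestsPerBatch) {
        const batchIndex = i / maxRequestsPerBatch;
        const urlBatch = urls.slice(i, i + maxRequestsPerBatch);
        await callback(urlBatch, batchIndex);
    }
}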