Apify and Crawlee Official Forum

M. Shahzeb

        splitAndExecute({
            callback: async (urlBatch, batchIndex) => {
                // Log which batch of job-detail URLs is being enqueued for this job.
                logger.info(`Linkedin/Scraper - Enqueuing ${urlBatch.length} jobs to detail page handler - Batch ${batchIndex + 1}`);
                await enqueueLinks({
                    urls: urlBatch,
                    label: LinkedinRouterLabels.JOB_DETAIL_PAGE,
                    userData: createLinkedinRouterUserData(payload),
                    waitForAllRequestsToBeAdded: false,
                });
                // Sleep for a random, progressively longer interval before the next batch.
                const minSleepTime = 2000 * (batchIndex + 1);
                const maxSleepTime = 3000 * (batchIndex + 1);
                await random_sleep(minSleepTime, maxSleepTime);
            },
            urls: jobDetailPageUrls,
            maxRequestsPerBatch: 2,
        });


Hello guys. I have a router handler that scrapes a list of URLs and enqueues them to the next handler.
However, I want to limit the enqueuing so that requests to the website are throttled.

I've tried adding crawler configuration, but it doesn't work as intended: even when I set a requests-per-minute or requests-per-crawl limit, it isn't respected.
Initially I thought this was because the limit is only checked after a URL list has been enqueued, so if you enqueue a list bigger than the limit in one go, the limit never takes effect.
E.g. if the limit is 10 requests and I enqueue 25 requests as a single array.
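
For context, the crawler configuration I mean looks roughly like this (the crawler class and the exact limits are placeholders, not my real setup):

    import { createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';

    const router = createPlaywrightRouter();

    // Throttling options set on the crawler itself (illustrative values only).
    const crawler = new PlaywrightCrawler({
        requestHandler: router,
        maxRequestsPerMinute: 10, // how many requests may start per minute
        maxRequestsPerCrawl: 100, // total requests for the whole crawl
        maxConcurrency: 2,        // parallel requests at any one time
    });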

So I manually split the job-URLs array into multiple smaller batches, as in the snippet above, where splitAndExecute is a small helper of mine (sketched below).
However, this does not work either: the enqueuing is definitely done with sleep intervals between batches, but the next router still processes everything at once after all the batches have been enqueued.
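
For completeness, splitAndExecute is nothing fancy; it is roughly equivalent to this sketch (simplified from my actual helper):

    // Splits `urls` into batches of `maxRequestsPerBatch` and runs `callback` on each batch in order.
    async function splitAndExecute({
        urls,
        maxRequestsPerBatch,
        callback,
    }: {
        urls: string[];
        maxRequestsPerBatch: number;
        callback: (urlBatch: string[], batchIndex: number) => Promise<void>;
    }): Promise<void> {
        for (let batchIndex = 0; batchIndex * maxRequestsPerBatch < urls.length; batchIndex++) {
            const start = batchIndex * maxRequestsPerBatch;
            const urlBatch = urls.slice(start, start + maxRequestsPerBatch);
            await callback(urlBatch, batchIndex);
        }
    }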

M. Shahzeb
Hello all.
I've been trying to build an app that triggers a scraping job when the API is hit.

The initial endpoint hits a Crawlee router which has 2 handlers: one for scraping the URL-list pages and the other for scraping the details from each detail page. (The URL-list handler also enqueues the next URL-list page back to itself, btw.)

I'm saving the data from each of these scrapes inside a KV store, but I want a way to save all the data in the KV store related to a particular job into a database.
The attached screenshots are the MRE snippets from my code.
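
What I'm imagining is something along these lines once a job finishes (the key prefix and saveToDatabase are hypothetical placeholders, not actual code from my app):

    import { KeyValueStore } from 'crawlee';

    // Hypothetical placeholder for the real database layer.
    async function saveToDatabase(jobId: string, records: unknown[]): Promise<void> {
        // write `records` to the database here
    }

    // Collect every KV-store record saved under a job-specific key prefix and persist it.
    async function persistJobResults(jobId: string): Promise<void> {
        const store = await KeyValueStore.open();
        const records: unknown[] = [];
        await store.forEachKey(async (key) => {
            if (key.startsWith(`${jobId}-`)) {
                records.push(await store.getValue(key));
            }
        });
        await saveToDatabase(jobId, records);
    }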