makditeam

Gracefully closing the crawler with keepalive flag true

I'm using PuppeteerCrawler with keepAlive set to true, and I call crawler.run() without awaiting it.

This keeps the crawler running indefinitely, and any new requests I add to the request queue get processed.

(I'm using a non-persisted request queue.)

What I want is to close the crawler gracefully: when I get a signal to shut down, I want to process all pending requests in the request queue first and only then kill the crawler.

Right now, if I call crawler.teardown(), it abruptly closes the crawler instances without processing the pending requests.
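
To make the desired behaviour concrete, here is a minimal sketch of a "drain, then tear down" helper. It is only an illustration under assumptions, not a confirmed Crawlee recipe: `drainThenTeardown`, `QueueLike`, `CrawlerLike`, `pollMs` and `maxWaitMs` are hypothetical names, and the sketch relies only on the queue exposing `isFinished()` and the crawler exposing `teardown()`. It also assumes nothing enqueues new requests once the shutdown signal has arrived, otherwise the queue may never drain.

```typescript
// Minimal structural types: only the two methods this sketch relies on.
interface QueueLike {
    isFinished(): Promise<boolean>;
}

interface CrawlerLike {
    teardown(): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: poll the queue until it reports that every request
// has been handled (or until maxWaitMs has elapsed), then tear down.
// Resolves to true if the queue drained, false if the deadline was hit.
async function drainThenTeardown(
    crawler: CrawlerLike,
    queue: QueueLike,
    { pollMs = 500, maxWaitMs = 60000 } = {},
): Promise<boolean> {
    const deadline = Date.now() + maxWaitMs;
    let drained = await queue.isFinished();
    while (!drained && Date.now() < deadline) {
        await sleep(pollMs);
        drained = await queue.isFinished();
    }
    // Teardown runs in both cases so a stuck request cannot block shutdown forever.
    await crawler.teardown();
    return drained;
}
```

The helper could be called from a signal handler with the crawler and its request queue. The deadline is a design choice: without it, a single hanging request would keep the process alive indefinitely.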

2 comments

mmakditeam

Adding puppeteer dependency in crawlee

Hi, I'm using Crawlee's PuppeteerCrawler.

I've added Crawlee to my package.json like this:

{
   "crawlee": "^3.5.4"
}

Should I add puppeteer to my package.json as well?

I see that puppeteer is listed in Crawlee's peer dependencies, but it is marked as optional: https://www.npmjs.com/package/crawlee?activeTab=code

The issue is that puppeteer is bumped very often with bug fixes, and I've been stuck on puppeteer 21.1.x.

Ideally, I would depend only on Crawlee, and bumping Crawlee would also bump puppeteer to whatever version it requires or supports.

Reference: https://github.com/apify/crawlee/discussions/2101

2 comments

mmakditeam

Expire requests from request queue

Hello,

I have a use case where requests in the RequestQueue should expire after a specified time (e.g., 30 minutes). Is this achievable with the current API?

One possible approach is to set an epoch time in the userData when enqueuing a request. Then, when it reaches the preNavigationHooks phase, you can check the elapsed time against the specified limit and throw a NonRetryableError to prevent further processing of the request.
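
The approach described above could look roughly like the sketch below. All names here are hypothetical (`withEnqueueTime`, `assertNotExpired`, `enqueuedAt`, `MAX_AGE_MS`), and `ExpiredRequestError` is a local stand-in so the snippet runs on its own; in a real crawler the check would sit inside a preNavigationHooks function and throw NonRetryableError instead.

```typescript
// Local stand-in for the non-retryable error mentioned in the post.
class ExpiredRequestError extends Error {}

// 30 minutes, matching the example limit from the post.
const MAX_AGE_MS = 30 * 60 * 1000;

interface StampedUserData {
    enqueuedAt?: number; // epoch milliseconds, set at enqueue time
}

// Build the request object to enqueue, stamping it with the current time.
function withEnqueueTime(url: string, now: number = Date.now()) {
    return { url, userData: { enqueuedAt: now } };
}

// Throw if the request has been waiting longer than maxAgeMs.
// Requests without a timestamp are treated as never expiring.
function assertNotExpired(
    userData: StampedUserData,
    now: number = Date.now(),
    maxAgeMs: number = MAX_AGE_MS,
): void {
    const { enqueuedAt } = userData;
    if (enqueuedAt === undefined) return;
    const ageMs = now - enqueuedAt;
    if (ageMs > maxAgeMs) {
        throw new ExpiredRequestError(`Request expired after ${Math.round(ageMs / 1000)} s in the queue`);
    }
}
```

Passing `now` explicitly keeps the check easy to test. It does not change when the check runs, so the page has already been created by the time a hook can reject the request.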

However, this approach may not be the most elegant solution, and it has the side effect of creating a page object, which in turn opens a browser and creates an empty tab, consuming unnecessary resources.

Is there a more efficient and cleaner way to handle request expiration and avoid the overhead of using resources?

1 comment

Apify Discord Mirror
