Named Request Queues not getting purged

At a glance

The community member is facing an issue with request queues not getting purged after each crawl, causing the crawlers to resume from the previous session. They have tried various approaches like setting the purgeRequestQueue option, using purgeDefaultStorages(), and dropping the queue, but none of these have worked. The community member is unsure if using separate queues for each crawler is the best approach to isolate crawler-specific options like maxRequestsPerCrawl. The comments suggest that the default purge only works for the default (non-named) queue, and that named queues need to be dropped and re-instantiated to be purged. The community members discuss the awkwardness of this approach and whether it's necessary to use separate queues if the crawlers run sequentially.

AAltairSama2

Hey folks,

I am explicitly creating request queues for my crawlers to make sure that crawler specific run options such as maxRequestsPerCrawl can be set on a crawler to crawler basis, but the issue with this approach is that the request queues are not getting purged after every crawl, resulting in the crawlers resuming the session from before. These are the approaches I tried

I have tried setting the option purgeRequestQueue to true explicitly in the crawler.run() func but it results in this error

Did not expect property purgeRequestQueue to exist, got true in object options

`

2. setting it as a global variable in

crawlee.json

 (it looks like crawlee is not picking up my crawlee.json file at all, because I tried to set logging levels in it and crawlee didnt pick it up).

3. tried using

await purgeDefaultStorages()` in my entry file

None of these options are working, is there some other way to purge these queues? I know its set by default to purge them but its not working for my named queues.

Also, is using queues the best way to isolate crawler specific options for each crawler? because when I used the default queue and restricted crawls to some numeric value in one crawler, and when it shut down after reaching that value, all the other crawlers would also shut down logging that max requests per crawl has been reached despite me not having specified this option when I initialized the crawlers.

9 comments

AAltairSama2

P.S. also tried queue.drop() method, which dropped the whole queue instead of purging it as expected

AAltairSama2

also found RequestQueue.config.getStorageClient()

AAltairSama2

and I tried calling the purge function from it and it didnt work, P.S. typescript says purge might be undefined, so we need to first check if purge is not undefined before calling it

LLukas Krivka

The default purge only cares about default (non-named) queue. Named ones need to be droped. There is no purge method as to clean it up without creating new one.

AAltairSama2

thanks, this is exactly what I ended up doing, I still feel we should have a purge method for them too, because dropping a queue and then re-instantiating it seems awkard to me, just one last thing, is using a separate queue advisable if the only use case is to isolate crawler specific options like maxRequestsPerCrawl to each crawler?

LLukas Krivka

Generally each crawler should have separate queue otherwise it will mess up each others queue

AAltairSama2

even if they will run sequentially?

LLukas Krivka

Sorry for late reply. In that case it is fine but you have to finish the request inside the run and add new ones in between

AAltairSama2

hey thanks, then just as a precaution, I'm assuming we have to override options like maxRequestPerCrawl for each crawler everytime just to make sure that the config from a previous crawler doesnt bleed onto other crawlers, I played around with a single queue and this was the main issue I found, anyways the separate queues work fine aside from the awkardness around purging it (re:above comment).

Add a reply

Apify Discord Mirror

Named Request Queues not getting purged