Apify and Crawlee Official Forum

Updated 3 months ago

remove uniqueKey from queue blacklist

Hi all,

I'm scraping a weird website whose file attachment links rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

Everything between serve-file and file changes regularly.

My strategy to deal with this is to calculate the unique key based on the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?
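The idea of keying on the stable parts of the URL could look roughly like the sketch below. It assumes the only stable parts are the host, the serve-file segment, and everything from the last file segment onward; `stableUniqueKey` is a hypothetical helper name, not part of Crawlee.

```typescript
// Hypothetical helper: derive a uniqueKey from the parts of the URL that
// do not rotate. Assumes everything between "serve-file" and the last
// "file" path segment is volatile, as described in the post.
function stableUniqueKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  const segments = url.pathname.split('/').filter(Boolean);

  const start = segments.indexOf('serve-file');
  const end = segments.lastIndexOf('file');

  // URL does not match the expected layout: fall back to the full URL.
  if (start === -1 || end <= start) return rawUrl;

  // Keep everything up to "serve-file" and everything from "file" onward.
  const stable = [...segments.slice(0, start + 1), ...segments.slice(end)];
  return `${url.host}/${stable.join('/')}`;
}
```

The result can then be passed as `uniqueKey` when adding the request, so two rotations of the same attachment link map to the same key.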

Thanks!
4 comments
Hi @Crafty ,
You can remove a request from the RequestQueue by its id through the API.

If you only know the uniqueKey, then I suggest adding a new Request to the RequestQueue with the same uniqueKey. The response will have the attribute wasAlreadyPresent set to true, but you should also get back the stored Request data (including its id).

Once you have the Request id, you can send a DELETE HTTP request to delete it, see https://docs.apify.com/api/v2#tag/Request-queuesQueue/operation/requestQueue_request_delete
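For anyone who is on the Apify platform, the DELETE call could be put together roughly like this. This is only a sketch: `buildDeleteRequest` is a hypothetical helper, the path layout is my reading of the linked API reference, and passing the token as a bearer header is one of the supported options as far as I know.

```typescript
// Plain description of an HTTP request, so it can be handed to any client.
interface HttpRequestSpec {
  method: 'DELETE';
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper: builds the DELETE request for one queued request.
// Assumes the endpoint is /v2/request-queues/{queueId}/requests/{requestId}.
function buildDeleteRequest(queueId: string, requestId: string, token: string): HttpRequestSpec {
  const base = 'https://api.apify.com/v2/request-queues';
  return {
    method: 'DELETE',
    url: `${base}/${encodeURIComponent(queueId)}/requests/${encodeURIComponent(requestId)}`,
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Usage (not executed here):
//   const spec = buildDeleteRequest(queueId, result.requestId, process.env.APIFY_TOKEN ?? '');
//   await fetch(spec.url, { method: spec.method, headers: spec.headers });
```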
Hi @Pepa J, thanks for the help. I am actually working with only Crawlee and not Apify, but I found a method along the same lines. May I suggest a feature for either the request queue or the request queue client to more easily query a request by its uniqueKey?

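    const requestQueue = await crawler.getRequestQueue();
    // Re-adding a request with an existing uniqueKey does not enqueue it again,
    // but the result still carries the id of the stored request.
    const result = await requestQueue.addRequest({ url: 'https://google.com', uniqueKey: 'aaa', label: 'secondary' });
    log.info('result', result);
    if (result.wasAlreadyPresent) {
      log.info('already present');
      // Look up the stored request by the id returned from addRequest.
      const request = await requestQueue.getRequest(result.requestId);
      log.info('request', request);
    }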
@Crafty Ah I am sorry for the API mention.

I believe there are some architectural decisions around this. What you can do is create a map in memory and store the uniqueKey -> requestId relation there as key-value pairs. 🤔
Ah OK, sounds good. It would be interesting to know if there is a reason for it; surely addRequest must be doing this lookup under the hood.
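A minimal sketch of that in-memory map, assuming the result of addRequest exposes requestId and wasAlreadyPresent as in the snippet earlier in the thread. `UniqueKeyIndex` and `AddResult` are hypothetical names for illustration, not Crawlee types.

```typescript
// Shape of the fields used from the addRequest result (assumed from the
// snippet earlier in the thread).
interface AddResult {
  requestId: string;
  wasAlreadyPresent: boolean;
}

// Hypothetical helper: remembers which requestId belongs to which uniqueKey.
class UniqueKeyIndex {
  private readonly ids = new Map<string, string>();

  // Store the id returned when the request was added.
  record(uniqueKey: string, result: AddResult): void {
    this.ids.set(uniqueKey, result.requestId);
  }

  // Returns the stored id, or undefined if the key was never recorded.
  lookup(uniqueKey: string): string | undefined {
    return this.ids.get(uniqueKey);
  }

  // Drop the entry, e.g. after the request was removed from the queue.
  forget(uniqueKey: string): boolean {
    return this.ids.delete(uniqueKey);
  }
}

// Usage (not executed here):
//   const index = new UniqueKeyIndex();
//   index.record('aaa', await requestQueue.addRequest({ url, uniqueKey: 'aaa' }));
//   const id = index.lookup('aaa');
```

Note that the map lives only in the current process, so it would need to be rebuilt or persisted separately if the crawler is restarted.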