Apify Discord Mirror

Updated 5 months ago

remove uniqueKey from queue blacklist

At a glance

The community member is scraping a website with file attachment links that rotate every 5 minutes. Their strategy is to calculate a unique key based on the stable parts of the URL, and then remove any queued requests with the same unique key when the URL changes. The community member's question is how to remove a request that has hit its retry limit and been "blacklisted" from the request queue, so that the new URL can be processed.

In the comments, another community member suggests using the API to remove a request from the request queue based on its ID. If the community member only knows the unique key, they can add a new request with the same unique key, which will return a response indicating the request was already present, and then they can obtain the request ID and delete the request.

The original poster, who is working with Crawlee alone rather than the Apify platform, found a similar approach and suggested adding a feature to more easily query a request by its unique key.

The responder then suggests keeping an in-memory map of the unique key to request ID relation, noting that there may be architectural decisions behind the lack of a more direct way to query by unique key.

Hi all,

I'm scraping a weird website which has file attachment links that rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

Everything between serve-file and file changes regularly.

My strategy to deal with this is to calculate the unique key based on the 'stable' parts of the URL. Then, when I detect the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?
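The stable-key idea described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the helper name `stableUniqueKey` is hypothetical, and it assumes that the host, the `serve-file` prefix, and everything from the last `file` path segment onward are the stable parts, as the example URL suggests.

```javascript
// Hypothetical helper: derive a uniqueKey that ignores the rotating
// segments between "serve-file" and "file" in the URL path.
function stableUniqueKey(rawUrl) {
    const url = new URL(rawUrl);
    const segments = url.pathname.split('/').filter(Boolean);
    const start = segments.indexOf('serve-file');
    const end = segments.lastIndexOf('file');
    // Assumption: URLs that do not match the pattern keep their full URL as the key.
    if (start === -1 || end <= start) return rawUrl;
    // Keep everything up to "serve-file" and everything from "file" onward.
    const stable = [...segments.slice(0, start + 1), ...segments.slice(end)];
    return `${url.origin}/${stable.join('/')}`;
}

// Two rotations of the same (made-up) attachment link map to one key.
const firstKey = stableUniqueKey('https://example.com/serve-file/e111/l222/aa/bb/tokenA/1/37/file/report.pdf');
const secondKey = stableUniqueKey('https://example.com/serve-file/e999/l888/cc/dd/tokenB/1/37/file/report.pdf');
console.log(firstKey === secondKey);
```

The resulting string can then be passed as `uniqueKey` when adding the request, so a rotated link is recognised as the same logical file.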

Thanks!
4 comments
Hi @Crafty,
You can remove a request from the RequestQueue based on its id through the API.

If you only know the uniqueKey, I suggest adding a new Request to the RequestQueue with the same uniqueKey. The response will have the attribute wasAlreadyPresent set to true, but you should also obtain the stored Request data (with its id).

Once you have the Request id, you can send a DELETE HTTP request (see https://docs.apify.com/api/v2#tag/Request-queuesQueue/operation/requestQueue_request_delete) to delete it.
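As a rough sketch of that DELETE call: the helper names `buildDeleteRequest` and `deleteRequest` are hypothetical, and the endpoint path and bearer-token header below are assumptions based on the linked API reference, so they should be checked against the current docs before use. It also assumes a runtime with a global `fetch` (for example Node.js 18+).

```javascript
// Assumed endpoint shape: DELETE /v2/request-queues/{queueId}/requests/{requestId}
const APIFY_BASE_URL = 'https://api.apify.com/v2';

// Hypothetical helper: build the URL and fetch options for the delete call.
function buildDeleteRequest(queueId, requestId, token) {
    const url = `${APIFY_BASE_URL}/request-queues/${encodeURIComponent(queueId)}`
        + `/requests/${encodeURIComponent(requestId)}`;
    return {
        url,
        options: { method: 'DELETE', headers: { Authorization: `Bearer ${token}` } },
    };
}

// Hypothetical helper: send the request and fail loudly on a non-2xx status.
async function deleteRequest(queueId, requestId, token) {
    const { url, options } = buildDeleteRequest(queueId, requestId, token);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Delete failed with HTTP ${response.status}`);
    }
}
```

Calling `deleteRequest(queueId, requestId, token)` with the id obtained from the add-request response would then remove the stale entry so the new URL can be enqueued under the same uniqueKey.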
Hi @Pepa J, thanks for the help. I am actually working with only Crawlee and not Apify, but I found a method along the same lines. May I suggest a feature for either the request queue or the request queue client to more easily query a request by its uniqueKey?

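    // Assumes `log` is imported from 'crawlee' and `crawler` is an existing crawler instance.
    // Re-adding a request with an existing uniqueKey does not enqueue it again,
    // but the returned info carries the id of the stored request.
    const requestQueue = await crawler.getRequestQueue();
    const result = await requestQueue.addRequest({ url: 'https://google.com', uniqueKey: 'aaa', label: 'secondary' });
    log.info('result', result);
    if (result.wasAlreadyPresent) {
        log.info('already present');
        const request = await requestQueue.getRequest(result.requestId);
        log.info('request', request);
    }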
@Crafty Ah, I am sorry for the API mention.

I believe there are some architectural decisions around this. What you can do is create a map in memory and save the uniqueKey -> requestId relation there as key-value pairs. 🤔
Ah OK, sounds good. It would be interesting to know if there is a reason for it; surely addRequest must be doing it under the hood.
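The in-memory map suggested above might look something like this. It is a minimal sketch: `rememberRequest` and `lookupRequestId` are hypothetical helper names, and the only thing assumed about the queue is that the result of `addRequest` exposes a `requestId`, as in the snippet earlier in the thread. The stubbed result object stands in for a real queue response.

```javascript
// uniqueKey -> requestId, kept only for the lifetime of the process.
const requestIdsByUniqueKey = new Map();

// Hypothetical helper: record the id returned when a request is added.
function rememberRequest(uniqueKey, operationInfo) {
    requestIdsByUniqueKey.set(uniqueKey, operationInfo.requestId);
}

// Hypothetical helper: returns the stored id, or undefined if the key is unknown.
function lookupRequestId(uniqueKey) {
    return requestIdsByUniqueKey.get(uniqueKey);
}

// Usage with a stubbed addRequest result (the values are made up).
rememberRequest('aaa', { requestId: 'abc123', wasAlreadyPresent: false });
console.log(lookupRequestId('aaa'));
```

Since the map lives in memory only, it is lost when the process restarts; after a restart the re-add trick shown earlier in the thread would still be needed to recover an id.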