Apify and Crawlee Official Forum


Best practices/examples of hardening an actor that handles tens of thousands of records?

DanielDo told me to post this here instead of in #chat:

I'm looking for any helpful links, articles, or source code for writing actors that split a collection of objects from a dataset into paged collections for batching. I want to support actor input for capping the total number of dataset records that are allowed to be processed, the size of each page/batch, and so on.
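For the paging/capping part, a minimal sketch using the Apify SDK could look like the following. The input field names datasetId, maxRecords, and batchSize are made-up placeholders, not anything the platform defines:

TypeScript
import { Actor } from 'apify';

// Hypothetical input shape; the field names are placeholders.
type Input = {
    datasetId?: string;   // dataset to read records from (default dataset if omitted)
    maxRecords?: number;  // cap on the total records processed
    batchSize?: number;   // how many records to pull per page
};

await Actor.init();

const input = (await Actor.getInput<Input>()) ?? {};
const { datasetId, maxRecords = Infinity, batchSize = 100 } = input;

const dataset = await Actor.openDataset(datasetId);

let offset = 0;
let processed = 0;

while (processed < maxRecords) {
    const limit = Math.min(batchSize, maxRecords - processed);
    // getData() pages through the dataset with offset/limit.
    const { items } = await dataset.getData({ offset, limit });
    if (items.length === 0) break;

    for (const item of items) {
        // process one record here
        processed += 1;
    }
    offset += items.length;
}

await Actor.exit();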

The objects retrieved will have a url in one of their keys that the actor will then go fetch and save to the local fs, so I'd like to make sure the actor can stop and resume where it left off without redundant fetches or fs operations.
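For the stop/resume part, one approach that seems workable is to checkpoint progress in the key-value store and skip files that already exist on disk. This is only a sketch: DOWNLOAD_STATE, downloadImage, and checkpoint are illustrative names, global fetch assumes Node 18+, and it assumes it runs inside an already-initialized actor:

TypeScript
import { Actor } from 'apify';
import { existsSync, mkdirSync, writeFileSync } from 'node:fs';
import path from 'node:path';

// Resume point persisted across runs/migrations; 'DOWNLOAD_STATE' is just a made-up key name.
const state = (await Actor.getValue<{ offset: number }>('DOWNLOAD_STATE')) ?? { offset: 0 };

// Download one image into images/<identifier>/<filename>, skipping files a previous run already saved.
async function downloadImage(url: string, identifier: string): Promise<void> {
    const dir = path.join('images', identifier);
    const filePath = path.join(dir, path.basename(new URL(url).pathname));
    if (existsSync(filePath)) return; // avoid a redundant fetch + write after a resume

    const res = await fetch(url); // Node 18+ global fetch
    if (!res.ok) throw new Error(`Failed to fetch ${url}: ${res.status}`);
    mkdirSync(dir, { recursive: true });
    writeFileSync(filePath, Buffer.from(await res.arrayBuffer()));
}

// Persist how far we got so a restarted run can pick up from this offset.
async function checkpoint(offset: number): Promise<void> {
    state.offset = offset;
    await Actor.setValue('DOWNLOAD_STATE', state);
}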

The end goal is to go from a dataset with records in the shape of { image: 'https://..../x.png', identifier: 'My Image' } to a zipped archive of all of the images, with each image nested under a parent directory named after the identifier key of its record.
So, for a record of { image: 'https://..../x.png', identifier: 'My Image' }

I will end up with an archive that, when unzipped, will produce the following:

Plain Text
- Archive
  - My Image
    - x.png
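For the zipping step itself, one option (not Apify-specific) is the archiver package from npm, mapping each record to an identifier/filename entry inside the archive. A rough sketch, assuming the images were already downloaded under images/<identifier>/ as in the download sketch above:

TypeScript
import { createWriteStream } from 'node:fs';
import path from 'node:path';
import archiver from 'archiver'; // third-party npm package, not part of the Apify SDK

type ImageRecord = { image: string; identifier: string };

// Pack previously downloaded files into a zip where each file sits under its identifier folder,
// e.g. "My Image/x.png".
async function zipImages(records: ImageRecord[], outPath: string): Promise<void> {
    const output = createWriteStream(outPath);
    const archive = archiver('zip', { zlib: { level: 9 } });

    const finished = new Promise<void>((resolve, reject) => {
        output.on('close', () => resolve());
        archive.on('error', reject);
    });

    archive.pipe(output);
    for (const { image, identifier } of records) {
        const fileName = path.basename(new URL(image).pathname);
        const localPath = path.join('images', identifier, fileName); // where the download step saved it
        archive.file(localPath, { name: path.posix.join(identifier, fileName) });
    }
    await archive.finalize();
    await finished;
}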
Anyone? I could really use some help on this. The docs give just enough to spark my interest or only mention these topics in passing.
It'd be great if RequestQueue could be used outside of scrapers. Can we use it for queueing up image URLs to download?
Or is it only intended to be passed into playwright/puppeteer/crawlee?
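From what I can tell, a request queue can be opened on its own (via Actor.openRequestQueue) and drained manually, which would also give URL de-duplication for free on resume. A rough sketch, assuming that standalone usage works; 'image-downloads' is an arbitrary queue name:

TypeScript
import { Actor } from 'apify';

await Actor.init();

// The queue de-duplicates by URL/uniqueKey, so re-adding the same image URL after a resume is a no-op.
const queue = await Actor.openRequestQueue('image-downloads');
await queue.addRequest({
    url: 'https://example.com/x.png',
    userData: { identifier: 'My Image' },
});

// Drain the queue manually, no crawler involved.
let request = await queue.fetchNextRequest();
while (request) {
    try {
        // fetch request.url and save it under request.userData.identifier here
        await queue.markRequestHandled(request);
    } catch (err) {
        // Put the request back so a later pass (or a resumed run) can retry it.
        await queue.reclaimRequest(request);
    }
    request = await queue.fetchNextRequest();
}

await Actor.exit();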