Apify and Crawlee Official Forum

Best practices/examples of hardening an actor that handles tens of thousands of records?

I was told by DanielDo to post this here instead of in #chat:

I'm looking for helpful links, articles, or source code for writing actors that split a collection of objects from a dataset into paged collections for batching. I want to support actor input for capping the total number of dataset records that are allowed to be processed, setting the size of each page/batch, etc.

The objects retrieved will have a URL in one of their keys that the actor will then fetch and save to the local fs, so I'd like to make sure the actor can stop and resume where it left off without redundant fetches or fs operations.

The end goal is to go from a dataset with records in the shape of { image: 'https://..../x.png', identifier: 'My Image' } to a zipped archive of all the images, with each image nested under a parent directory named after the identifier key of its record.
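A minimal sketch of the paging/capping part, assuming the Apify SDK v3 Actor API (getInput, openDataset, and getData with offset/limit); the input fields datasetId, maxRecords, and batchSize are made-up names for illustration, not an established schema:

TypeScript
import { Actor } from 'apify';

await Actor.init();

// Input field names are assumptions for illustration only.
const input = (await Actor.getInput<{
    datasetId?: string;
    maxRecords?: number;
    batchSize?: number;
}>()) ?? {};
const { datasetId, maxRecords = 10_000, batchSize = 500 } = input;

const dataset = await Actor.openDataset(datasetId);

let offset = 0;
while (offset < maxRecords) {
    // Page through the dataset instead of loading tens of thousands of records at once.
    const { items } = await dataset.getData({
        offset,
        limit: Math.min(batchSize, maxRecords - offset),
    });
    if (items.length === 0) break;

    for (const record of items) {
        // Each record is expected to look like { image: 'https://..../x.png', identifier: 'My Image' };
        // the per-record download/fs work goes here.
    }

    offset += items.length;
}

await Actor.exit();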
4 comments
So, for a record of { image: 'https://..../x.png', identifier: 'My Image' }

I will end up with an archive that, when unzipped, produces the following:

Plain Text
- Archive
  - My Image
    - x.png
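For the download-and-zip step, one possible sketch: each image is written under a local directory named after its identifier, and the whole tree is zipped at the end. This assumes Node 18+ (global fetch) and the archiver npm package, neither of which comes from the thread, and it skips sanitizing identifier for filesystem-unsafe characters:

TypeScript
import { createWriteStream } from 'node:fs';
import { mkdir, writeFile } from 'node:fs/promises';
import path from 'node:path';
import archiver from 'archiver';

// Download one record's image into <rootDir>/<identifier>/<file name taken from the URL>.
async function saveRecord(
    record: { image: string; identifier: string },
    rootDir: string,
): Promise<void> {
    const dir = path.join(rootDir, record.identifier);
    await mkdir(dir, { recursive: true });

    const fileName = path.basename(new URL(record.image).pathname); // e.g. 'x.png'
    const res = await fetch(record.image); // Node 18+ global fetch
    if (!res.ok) throw new Error(`Failed to fetch ${record.image}: ${res.status}`);

    await writeFile(path.join(dir, fileName), Buffer.from(await res.arrayBuffer()));
}

// Zip rootDir so that unzipping reproduces Archive/My Image/x.png.
async function zipDirectory(rootDir: string, zipPath: string): Promise<void> {
    const output = createWriteStream(zipPath);
    const archive = archiver('zip', { zlib: { level: 9 } });

    const done = new Promise<void>((resolve, reject) => {
        output.on('close', () => resolve());
        archive.on('error', reject);
    });

    archive.pipe(output);
    archive.directory(rootDir, 'Archive'); // nest everything under a top-level Archive/ entry
    await archive.finalize();
    await done;
}

The archive.directory(rootDir, 'Archive') call is what produces the top-level Archive/ folder in the layout above; passing false as the second argument instead puts the entries at the zip root.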
Anyone? I could really use some help on this. The docs give just enough to spark my interest or mention it in passing.
It'd be great if RequestQueue could be used outside of scrapers. Can we use it for queueing up image URLs to download?
Or is it only intended to be passed into Playwright/Puppeteer/Crawlee?
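RequestQueue is not tied to a crawler; it can be opened and drained directly. A rough sketch using Crawlee's RequestQueue methods (open, addRequest, fetchNextRequest, markRequestHandled, reclaimRequest); downloadImage is a hypothetical stand-in for the actual download code (e.g. the saveRecord sketch above):

TypeScript
import { RequestQueue } from 'crawlee';

// Hypothetical helper; see the saveRecord sketch above for one way to implement it.
async function downloadImage(url: string, identifier: string): Promise<void> {
    // fetch the URL and write it under a directory named after identifier
}

const queue = await RequestQueue.open('image-downloads');

// Enqueue image URLs; requests are deduplicated by uniqueKey (the URL by default),
// so re-adding the same URL on a resumed run is a no-op.
await queue.addRequest({
    url: 'https://..../x.png', // placeholder URL from the thread
    userData: { identifier: 'My Image' },
});

// Drain the queue manually, without any crawler. Handled requests are persisted in the
// queue's storage, so a resumed/resurrected run picks up where it left off instead of
// re-downloading everything.
let request = await queue.fetchNextRequest();
while (request) {
    try {
        await downloadImage(request.url, request.userData.identifier);
        await queue.markRequestHandled(request);
    } catch {
        // Put the request back so it can be retried later.
        await queue.reclaimRequest(request);
    }
    request = await queue.fetchNextRequest();
}

On the platform the queue survives migrations and resurrections, which is what would give the stop/resume behaviour without redundant fetches; locally this depends on not purging the default storage between runs.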