Apify and Crawlee Official Forum

shovelandsandbox
Joined August 30, 2024
I was told to post this here instead of #chat by DanielDo:

I'm looking for helpful links, articles, or source code on writing actors that split a collection of objects from a dataset into paged collections for batching. I want the actor input to support capping the total number of dataset records processed, setting the size of each page/batch, and so on.
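
For illustration, the paging arithmetic can be sketched on its own, independent of any SDK. This is a minimal sketch, not an established Apify pattern: the input field names `maxRecords` and `batchSize` and the helper `planPages` are hypothetical names chosen for this example.

```typescript
// Hypothetical actor input; the field names are illustrative only.
interface BatchInput {
  maxRecords?: number; // cap on total records processed (undefined = no cap)
  batchSize: number; // number of records per page/batch
}

interface Page {
  offset: number;
  limit: number;
}

// Split `totalRecords` into consecutive { offset, limit } pages while
// honoring the optional cap on the total number of records.
function planPages(totalRecords: number, input: BatchInput): Page[] {
  if (!Number.isInteger(input.batchSize) || input.batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const cap = Math.min(totalRecords, input.maxRecords ?? totalRecords);
  const pages: Page[] = [];
  for (let offset = 0; offset < cap; offset += input.batchSize) {
    // The last page may be shorter than batchSize.
    pages.push({ offset, limit: Math.min(input.batchSize, cap - offset) });
  }
  return pages;
}
```

Each resulting page could then be passed as the `offset` and `limit` options when reading from the dataset (for example via `dataset.getData()` in the Apify SDK), assuming the record count is known up front.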

The objects retrieved will have a URL in one of their keys, which the actor will then fetch and save to the local filesystem, so I'd like to make sure the actor can stop and resume where it left off without redundant fetches or filesystem operations.
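
One way to think about the stop-and-resume requirement is a small checkpoint that records which URLs are already saved and where the next batch starts. The following is a sketch under assumptions: the `Checkpoint` shape and the helpers `pendingRecords`, `markDone`, and `completeBatch` are hypothetical, and persistence is left out.

```typescript
// Record shape as described in the post.
interface ImageRecord {
  image: string;
  identifier: string;
}

// Hypothetical checkpoint; it is plain JSON so it can be persisted between runs.
interface Checkpoint {
  nextOffset: number; // dataset offset where the next batch starts
  doneUrls: string[]; // URLs already fetched and written to disk
}

// Return only the records that still need fetching, skipping URLs finished
// in an earlier run and duplicates within the same batch.
function pendingRecords(batch: ImageRecord[], checkpoint: Checkpoint): ImageRecord[] {
  const done = new Set(checkpoint.doneUrls);
  const seen = new Set<string>();
  return batch.filter((record) => {
    if (done.has(record.image) || seen.has(record.image)) return false;
    seen.add(record.image);
    return true;
  });
}

// Record a finished download without mutating the previous checkpoint.
function markDone(checkpoint: Checkpoint, url: string): Checkpoint {
  if (checkpoint.doneUrls.includes(url)) return checkpoint;
  return { ...checkpoint, doneUrls: [...checkpoint.doneUrls, url] };
}

// Advance the offset once every record in a batch has been handled.
function completeBatch(checkpoint: Checkpoint, batchLength: number): Checkpoint {
  return { ...checkpoint, nextOffset: checkpoint.nextOffset + batchLength };
}
```

In an Apify actor, the checkpoint could plausibly be loaded with `Actor.getValue()` at startup and saved with `Actor.setValue()` after each download; the key name and save frequency are design choices this sketch does not settle.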

The end goal is to go from having a dataset with records in the shape of { image: 'https://..../x.png', identifier: 'My Image' } to a zipped archive of all of the images, with the images nested under parent directories named after the identifier key of each record.
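
As a rough sketch of the directory layout described above, a record can be mapped to a path inside the archive by sanitizing the identifier and taking the file name from the URL. The helper name `archivePath` and the sanitizing rules (anything other than letters, digits, `_`, and `-` becomes `_`) are assumptions made for this example.

```typescript
// Record shape as described in the post.
interface ImageRecord {
  image: string;
  identifier: string;
}

// Build "<sanitized identifier>/<file name>" for a record.
function archivePath(record: ImageRecord): string {
  // Directory name: collapse unsafe characters so "My Image" becomes "My_Image".
  const dir = record.identifier.trim().replace(/[^A-Za-z0-9_-]+/g, "_") || "unnamed";

  // Drop the query string and fragment, then the scheme and host.
  const path = record.image
    .replace(/[?#].*$/, "")
    .replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]*/i, "");

  // File name: last non-empty path segment, with a fallback.
  const lastSegment = path.split("/").filter(Boolean).pop() ?? "image";
  const file = lastSegment.replace(/[^A-Za-z0-9._-]+/g, "_").replace(/^\.+/, "") || "image";

  return `${dir}/${file}`;
}
```

Two records with the same identifier and the same file name would map to the same path, so a real implementation would need some way to disambiguate them.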
4 comments
I haven't been able to find any information on how accessing datasets via the client works for local development. Does this only work on the platform? I have a monorepo with two actors and I'd like to access a named dataset from one actor inside the other. If accessing the datasets of other actors is not possible via openDataset locally, what alternatives are there?
1 comment
The example monorepo (seen here: https://github.com/apify/actor-monorepo-example) doesn't cover how apify push is intended to be used. The only place you're able to run it is the root of the repository, but doing so prints the following to the console:

apify push
Info: Created actor with name undefined on Apify.
Info: Deploying actor 'undefined' to Apify.
Run: Updated version 0.0 for my-actor-5 actor.
Run: Building actor my-actor-5
8 comments
The documentation (https://docs.apify.com/cli/docs/vars) doesn't touch on how you access, from an actor's source, any environment variables set in actor.json. I'm using the monorepo example repository.
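
In a Node.js actor, environment variables are normally read at runtime through `process.env`, and this should also apply to values defined under `environmentVariables` in actor.json, though that is worth verifying against the current docs. A minimal sketch follows; the helper `requireEnv` and the variable name `MY_API_TOKEN` are hypothetical.

```typescript
// A map of environment variables; in Node.js this would be process.env.
type Env = Record<string, string | undefined>;

// Read a variable, falling back to a default, and fail loudly if it is missing.
function requireEnv(env: Env, name: string, fallback?: string): string {
  const value = env[name] ?? fallback;
  if (value === undefined || value === "") {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Intended usage inside an actor (MY_API_TOKEN is a made-up name):
//   const token = requireEnv(process.env, "MY_API_TOKEN");
```

Passing the environment in as an argument keeps the helper easy to test without touching the real process environment.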
6 comments
2023-06-02T21:07:23.777Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202105/...
2023-06-02T21:07:23.783Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:24.367Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202102/...
2023-06-02T21:07:24.947Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:24.965Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202304/...
2023-06-02T21:07:25.971Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:25.980Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:26.407Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202205/...
2023-06-02T21:07:26.504Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:27.089Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202211/...
2023-06-02T21:07:27.354Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202009/...
2023-06-02T21:07:27.511Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:28.055Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:28.486Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202108/...
2023-06-02T21:07:29.165Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202303/...
2023-06-02T21:07:29.298Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202010/...
2023-06-02T21:07:29.558Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:30.253Z INFO  Downloading image https://image.api.playstation.com/gs2-sec/appkgo/prod/C...
2023-06-02T21:07:31.284Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202302/...
2023-06-02T21:07:31.377Z INFO  Downloading image https://image.api.playstation.com/vulcan/ap/rnd/202301/...
2023-06-02T21:07:32.058Z INFO  BasicCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2023-06-02T21:07:32.265Z INFO  BasicCrawler: Crawl finished. Final request statistics: {"requestsFinished":20,"requestsFailed":0,"retryHistogram":[20],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":820,"requestsFinishedPerMinute":139,"requestsFailedPerMinute":0,"requestTotalDurationMillis":16405,"requestsTotal":20,"crawlerRuntimeMillis":8603}
2023-06-02T21:07:32.267Z INFO  All images in iteration 0 were processed
2023-06-02T21:07:32.268Z INFO  Archiving Images...
2023-06-02T21:07:32.318Z INFO  Archive has been written
2023-06-02T21:07:32.539Z INFO  Will save output data to: key-value-store
2023-06-02T21:07:32.541Z INFO  Post-download processed data length:
2023-06-02T21:07:32.911Z INFO  END OF ITERATION STATS:
2023-06-02T21:07:32.912Z INFO  *** STATS ***
2023-06-02T21:07:32.914Z INFO  Total: 20, Uploaded: 0, Failed: 20, Skipped: 0, Duplicates: 0
2023-06-02T21:07:33.074Z INFO  Downloading finished
2023-06-02T21:07:33.076Z INFO  Actor finished successfully (exit code 0)


I'm using lukaskrivka/images-download-upload to grab all image URLs from a dataset produced by one of my actors. You can see above that it retrieves the URLs successfully and appears to process and download them, yet the final stats report Uploaded: 0, Failed: 20 and the resulting zip file is empty every time. Does anyone have any insight, or suggestions for alternatives that work?
2 comments
$49/month is too steep for me at this point in time. I'm just exploring the platform and am hoping that it's possible to pay a handful of schrute bucks so I can run my actor locally on my own machine.
5 comments