Apify Discord Mirror

Updated last week

Append data to an existing dataset

At a glance
The community member is looking for a way to redirect the output of multiple runs of the same scraper to a single existing dataset, appending the new records. The community members discuss the Apify SDK's named-dataset feature, but the original poster is focused on the REST API and integrating it with existing Java code. The community members explain that each run has its own default dataset, and there is no way to instruct an actor to use a pre-existing custom dataset. The suggested solution is to create an "Integration Actor" that redirects the output to a custom dataset. <answer>Ah, yes: if you are using a pre-existing Actor, there is no way to redirect the output, unless the Actor has a parameter to support a custom dataset. Otherwise you can create an "Integration Actor" which will redirect the output to a custom dataset.</answer>
Useful resources
Is there a way to redirect the output of multiple runs of the same scraper to the same existing dataset, appending the new records? The order doesn't matter. Due to the limitations of the scraper I am using, I need to perform thousands of runs that each produce a very small amount of output, which I would like to add to an existing dataset (obviously with the same format or schema). I skimmed through the Apify API documentation and did not find anything about it.
Ah, yes: if you are using a pre-existing Actor, there is no way to redirect the output, unless the Actor has a parameter to support a custom dataset. Otherwise you can create an "Integration Actor" which will redirect the output to a custom dataset.
22 comments
Python
# Open (or create) a dataset with a fixed name; unlike the run's
# DEFAULT dataset, it persists and can be reused across runs.
ds = await Actor.open_dataset(name="my-dataset")
# Append a record (or a list of records) to it.
await ds.push_data(data)
Thanks for the quick answer. You are redirecting me to the SDK; is there an equivalent method in the REST API? (I have to integrate the calls with existing Java code.) Put another way, I'm asking if there is a way to call/run an actor specifying an existing dataset. I'm new to Apify, so I don't have a clear picture of the platform at the moment.
It would be quite strange if the SDK could do things that are not possible via the API, unless the merge was a local operation.
If I understand correctly, each run has its own new storage, no way to specify an existing one. To do a merge, I need to take every single storage created by the run and put it into the overall previously created remote storage. This is what the SDK does too, I guess.
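The merge the SDK performs can be reproduced directly against the REST API, which is what an external Java integration would do. A minimal sketch in Python, assuming the Apify API v2 dataset-items endpoints; the token and dataset IDs are placeholders:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_items_url(run_default_dataset_id: str, token: str) -> str:
    # Items written by one finished run to its DEFAULT dataset.
    return f"{API_BASE}/datasets/{run_default_dataset_id}/items?token={token}&format=json"

def append_items_url(custom_dataset_id: str, token: str) -> str:
    # POSTing a JSON array here appends records to the CUSTOM
    # dataset; existing items are kept.
    return f"{API_BASE}/datasets/{custom_dataset_id}/items?token={token}"

def copy_run_output(run_default_dataset_id: str, custom_dataset_id: str, token: str) -> None:
    # Read the (small) output of a finished run...
    with urllib.request.urlopen(run_items_url(run_default_dataset_id, token)) as resp:
        items = json.load(resp)
    # ...and append it to the shared dataset. Repeat once per run.
    req = urllib.request.Request(
        append_items_url(custom_dataset_id, token),
        data=json.dumps(items).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Calling `copy_run_output` once after each run consolidates all the small outputs into one dataset; since appends are independent, their order does not matter, which matches the original requirement.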
Yes, every run has a DEFAULT storage, but you can ignore the DEFAULT storage and use your own CUSTOM dataset, e.g. ds.push_data(data)
Otherwise, Actor.push_data(data) will use the DEFAULT dataset
Ok, I don't get what "ds" is because I don't know the SDK (you are referring to the Python SDK, I suppose), but I grasp the overall picture. Thanks.
ds is the CUSTOM dataset defined earlier: ds = await Actor.open_dataset(name="my-dataset")
@Helper may have more insight
Are you using a pre-existing Actor or writing a new one?
Ok, that makes sense 😅. I don't want to sound rude, but I'm focused on the REST API, not the SDK. As I said, I need to integrate it with existing Java code, so I need to go down to a lower level than the one the SDK offers. Given the links above, I understand how to create a remote dataset and how to store local data in it. From your answers I gather that there is no way to instruct an actor to use an existing (CUSTOM) dataset: you need to take the data from the DEFAULT dataset created by the run and move/copy it to the CUSTOM dataset.
For instance, if I have 1000 runs I will have 1000 small datasets plus the integration dataset.
I can't tell an actor to use a certain pre-existing dataset when I call it. The actor instance uses its own dataset (DEFAULT); end of story. Then I can move/copy this dataset to a larger one.
Ah, yes: if you are using a pre-existing Actor, there is no way to redirect the output, unless the Actor has a parameter to support a custom dataset. Otherwise you can create an "Integration Actor" which will redirect the output to a custom dataset.
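A minimal sketch of such an Integration Actor using the Python SDK. Assumptions: the actor is triggered by a "run succeeded" webhook whose payload carries the source run's `defaultDatasetId` under `resource`, the `apify` package is only importable on the platform (hence the guard), and `my-dataset` stands in for the real target name:

```python
try:
    from apify import Actor  # available when running on the Apify platform
except ImportError:
    Actor = None  # lets the helper below be used outside an actor run

def source_dataset_id(actor_input: dict) -> str:
    # Webhook payloads nest the run object under "resource"; fall back
    # to a flat key for manual test runs (assumed input shape).
    run = actor_input.get("resource") or actor_input
    return run.get("defaultDatasetId", "")

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        # Source: the DEFAULT dataset of the run that just finished.
        src = await Actor.open_dataset(id=source_dataset_id(actor_input))
        # Target: the shared CUSTOM dataset, created on first use.
        dst = await Actor.open_dataset(name="my-dataset")
        page = await src.get_data()
        await dst.push_data(page.items)
```

Wired to the scraper's "run succeeded" webhook, this actor fires after every run and appends that run's items to the shared dataset, so the thousand small default datasets never have to be touched by hand.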
So there are actors that might support a "custom dataset" parameter. Good to know. 😁 That is not a standard feature.
Thanks again. I'll check out the integration actors. Have a nice day.