Apify and Crawlee Official Forum


De-duplicate dataset results

I have an actor that returns a simple list of IDs. It's possible that during a run, concurrent processes can overlap and produce duplicate results. Is there any accepted way of avoiding this?

At the most basic level, I'd hoped I could do something simple like using the returned ID as the key in the dataset (i.e. a duplicate result would overwrite the existing entry, so no duplicate would be created), but this doesn't seem to work, presumably because each result is stored as a separate JSON file in the dataset.

I've also thought about opening the dataset and getting the full list of IDs, then only pushing IDs that aren't already present. This could work, but it adds overhead and also seems to introduce the possibility of race conditions.

So, is there any way to push only unique values to the dataset?
1 comment
There are two possible solutions:
1- Keep a global registry (e.g. a Set) of the IDs you have already pushed, and check it before pushing each entry to the dataset; see the first sketch after this list.
2- Run a deduplication Actor, such as https://apify.com/lukaskrivka/dedup-datasets, on your Actor's dataset after it finishes; see the second sketch below.
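For option 1, here is a minimal sketch using the Apify SDK v3. The handleResult helper is hypothetical and stands in for whatever produces your results. A plain in-memory Set is safe against the overlapping concurrent tasks you describe, since Node.js runs async tasks on a single event loop, but it is lost if the run migrates or restarts; for long runs you may want Actor.useState() to persist it.

```ts
import { Actor } from 'apify';

await Actor.init();

// IDs already pushed during this run. Checking this Set before pushing
// prevents duplicates from overlapping concurrent tasks.
const seenIds = new Set<string>();

// Hypothetical helper standing in for wherever your results come from.
async function handleResult(item: { id: string }) {
    if (seenIds.has(item.id)) return; // duplicate, skip it
    seenIds.add(item.id);
    await Actor.pushData(item);
}

await handleResult({ id: 'abc' });
await handleResult({ id: 'abc' }); // pushed only once

await Actor.exit();
```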
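For option 2, you can also trigger the deduplication Actor from code at the end of your run with Actor.call(). The input fields below (datasetIds, fields) are my best guess at that Actor's input schema, so verify them against the Actor's page before relying on this.

```ts
import { Actor } from 'apify';

await Actor.init();

// ... your scraping logic pushes items to the default dataset here ...

// Call the deduplication Actor on this run's default dataset.
// NOTE: the input shape is an assumption; check the schema at
// https://apify.com/lukaskrivka/dedup-datasets before use.
await Actor.call('lukaskrivka/dedup-datasets', {
    datasetIds: [Actor.getEnv().defaultDatasetId!],
    fields: ['id'], // deduplicate on the `id` field
});

await Actor.exit();
```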