Apify and Crawlee Official Forum

Approach to storing scraped data in a database (Postgres)

(Apologies for the cross-link: https://github.com/apify/crawlee/discussions/1577)

Hi, I recently discovered Crawlee and I'm trying to figure out how I can store the scraped data in a database instead of in the local directory storage.

Is there any plugin for that? How should I proceed to implement one? Do I have to write my own class implementing the StorageClient interface? If so, how do I inject it later so it gets used?

Thanks!
17 comments
You need to implement your own logic: instead of calling Dataset.pushData(), just run an INSERT against your DB.
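A minimal sketch of that pattern with the `pg` driver (the `products` table and its columns are made up for illustration):

```ts
import { CheerioCrawler } from 'crawlee';
import pg from 'pg';

// Hypothetical table: CREATE TABLE products (url TEXT PRIMARY KEY, title TEXT);
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Instead of Dataset.pushData(...), write the row straight to Postgres.
        await pool.query(
            'INSERT INTO products (url, title) VALUES ($1, $2) ON CONFLICT (url) DO NOTHING',
            [request.loadedUrl, $('title').text()],
        );
    },
});

await crawler.run(['https://crawlee.dev']);
await pool.end();
```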
Isn't it good practice, or at least beneficial, to implement StorageClient?
If you want your crawler to be practical and performant, I wouldn't recommend pushing into a Dataset and then into your PostgreSQL database. At that point, the Dataset would just be an unnecessary middleman.
The only way that'd be beneficial is if you'd like to validate the data with some custom scripts before actually pushing it into the production DB. Otherwise, just push directly into your DB.
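If you do want that validation step, one way to do it (a sketch, reusing the hypothetical `products` table from above) is to let the crawl fill the default Dataset as a staging area and validate each item before it reaches the production table:

```ts
import { Dataset } from 'crawlee';
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
const dataset = await Dataset.open(); // the default dataset the crawl pushed into

// Validate each staged item before inserting it into the production table.
await dataset.forEach(async (item) => {
    if (typeof item.url !== 'string' || typeof item.title !== 'string') return; // drop invalid rows
    await pool.query(
        'INSERT INTO products (url, title) VALUES ($1, $2) ON CONFLICT (url) DO NOTHING',
        [item.url, item.title],
    );
});

await pool.end();
```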
Thanks Matt. I mean implementing a custom StorageClient, so that when you call Dataset.pushData(), the data actually goes to Postgres instead of the local filesystem.
It's not a common case, so it's not covered by the SDK. IMHO, just use an external package like https://github.com/supabase/supabase
Yeah, I'm actually using a graph database to store crawl results, and it performs very well; the only hitch has been making sure that my logic for what constitutes a "unique item" etc. meshes with Crawlee's.
At that point, I'd recommend just using Sequelize to connect to your remote database and push data into it. Sequelize is (in my opinion) the best ORM.
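Something like this, for example (the `Product` model and its fields are hypothetical):

```ts
import { Sequelize, DataTypes } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL!); // e.g. postgres://user:pass@host:5432/db

// Hypothetical model for the scraped fields.
const Product = sequelize.define('Product', {
    url: { type: DataTypes.TEXT, primaryKey: true },
    title: DataTypes.TEXT,
});

await sequelize.sync(); // creates the table if it doesn't exist

// Inside a Crawlee requestHandler you'd then do something like:
await Product.upsert({ url: 'https://example.com', title: 'Example' });
```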
Hi all, I'm looking to push straight to Postgres. Wondering if anyone would be willing to share their implementation?
Sorry to ping, but did you implement this?
Sorry to necro an older thread, but I'm looking at pushing data into Postgres as well.
Is the suggestion to skip Dataset.pushData() entirely and just save directly into the DB?
I haven't seen any examples of using Postgres (or any database, for that matter).
This is something I want to do as well.
I use Supabase as my Postgres platform and simply await an insert into my table inside the request handler.
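Roughly like this with `@supabase/supabase-js` (the `pages` table is just an example name):

```ts
import { CheerioCrawler } from 'crawlee';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Insert straight into a hypothetical "pages" table instead of a Dataset.
        const { error } = await supabase
            .from('pages')
            .insert({ url: request.loadedUrl, title: $('title').text() });
        if (error) throw new Error(`Supabase insert failed: ${error.message}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```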
I recently implemented a custom storage client to store request queues in Postgres, as the storage costs for request queues on Apify were too high. It cut my costs from 500 USD per month to 25 (the 25 being for the managed Postgres service).

The same approach can be extended to store datasets, but I only did it for request queues. For datasets and key-value stores, the custom client still uses the Apify storage.
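For reference, a custom client gets wired in through the global `Configuration`. A minimal sketch of just the registration, assuming Crawlee v3's `useStorageClient()` API; `MemoryStorage` is used here only as a stand-in, since the Postgres-backed client itself is omitted:

```ts
import { Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Crawlee resolves all storages (datasets, key-value stores, request queues)
// through one object implementing the StorageClient interface from @crawlee/types.
// Swapping it out is a single call on the global Configuration; MemoryStorage
// here is just a placeholder for a Postgres-backed implementation.
Configuration.getGlobalConfig().useStorageClient(new MemoryStorage());
```

A hybrid client like the one described above would delegate the dataset and key-value store methods to the default client and implement only the request-queue methods against Postgres.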