Apify and Crawlee Official Forum

Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database rather than Crawlee's native KVS and Dataset. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
17 comments
Not sure about examples, but Crawlee already has a generic storage API that can be implemented. We already support two implementations - the Apify API and the local filesystem - so you can add a third.
I wanted to use some database but didn't find any good matches. Basically it's either something external in another cloud (like Firebase), otherwise it doesn't make much sense to use it. Embedded DBs are technically possible, but because of "migration" they're IMHO nearly useless (an actor might be shut down, moved to another server instance, and then restarted at any point during its runtime).
We're using ArangoDB — it's a "multi-model" database that has native support for MongoDB-style document storage and Neo4j-style graph queries in the same data store; it's proven very useful for complex analysis of large sites — queries like "find high-traffic pages that are fewer than 5 clicks from that page, but only if the links are in the main body of an article, not the footer".
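For a concrete flavor of that kind of query, here's a minimal sketch using arangojs. Everything here is illustrative: the graph name ('siteGraph'), the start document, and the region/monthlyVisits attributes are made-up placeholders, not Spidergram's actual schema.

TypeScript
import { Database, aql } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'crawl' });

// Traverse up to 5 hops outward from a starting page, keep only paths whose
// links were found in the article body, and return high-traffic destinations.
const cursor = await db.query(aql`
    FOR v, e, p IN 1..5 OUTBOUND 'pages/landing-page' GRAPH 'siteGraph'
        FILTER p.edges[*].region ALL == 'body'
        FILTER v.monthlyVisits > 10000
        RETURN DISTINCT v.url
`);
console.log(await cursor.all());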
Can you provide some references for the Crawlee generic storage API?
Example here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts
It uses this.client, which is any class that implements DatasetClient; see e.g. https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/dataset.ts#L34

I will tell the team to provide more examples
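To make that concrete, here's a minimal, untested sketch of what an ArangoDB-backed dataset client could look like, modeled on the memory-storage client linked above. Only pushItems() and listItems() are shown, the collection name is a placeholder, and a request queue backend would follow the same pattern against the RequestQueueClient interface.

TypeScript
import { Database } from 'arangojs';

// Hypothetical ArangoDB-backed dataset client mirroring the shape of the
// memory-storage DatasetClient linked above. get(), update(), delete(), and
// the remaining methods would be needed for a complete implementation.
class ArangoDatasetClient {
    constructor(
        private readonly db: Database,
        private readonly collectionName: string,
    ) {}

    // Backs Dataset.pushData(); Crawlee hands over a JSON string or an array.
    async pushItems(items: string | Record<string, unknown>[]): Promise<void> {
        const parsed = typeof items === 'string' ? JSON.parse(items) : items;
        const docs = Array.isArray(parsed) ? parsed : [parsed];
        await this.db.collection(this.collectionName).saveAll(docs);
    }

    // Backs Dataset.getData(); simple offset/limit paging via AQL.
    async listItems({ offset = 0, limit = 250 } = {}) {
        const cursor = await this.db.query({
            query: 'FOR doc IN @@col LIMIT @offset, @limit RETURN doc',
            bindVars: { '@col': this.collectionName, offset, limit },
        });
        const items = await cursor.all();
        return { items, count: items.length, offset, limit };
    }
}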
I think a more direct approach is https://github.com/arangodb/arangojs, with exactly:
JavaScript
const db = new Database({
  url: "http://YOURDOMAIN_OR_IP:8529",
  databaseName: "pancakes",
  auth: { username: "root", password: "hunter2" },
});

and make sure you handle your data alongside the handled requests; that should be enough. As already mentioned, you must have your own hosted solution.
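As a sketch of that "handle data alongside handled requests" idea, pushing results straight into Arango from a request handler might look like this; the 'pages' collection, the connection details, and the stored fields are placeholders, and the collection is assumed to already exist:

TypeScript
import { CheerioCrawler } from 'crawlee';
import { Database } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'pancakes' });
const pages = db.collection('pages'); // assumed to exist already

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        // Store the page record in Arango instead of calling Dataset.pushData().
        await pages.save({
            url: request.loadedUrl ?? request.url,
            title: $('title').text(),
            crawledAt: new Date().toISOString(),
        });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);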
Yeah, we're already using arangojs to map site data to a custom domain model! But we're finding that we have to do more and more housekeeping to ensure that Crawlee's request queue and other data stay in sync; unifying them seems like it would be a big win, but I was concerned we'd be biting off a huge chunk of work. From the code that was posted, it looks like it's at least in the realm of "reasonable to consider".
If you open-source the code for this, let me know.
It's quite rough at the moment, but the project we've been working on is already on GitHub: https://github.com/autogram-is/spidergram There's a lot of "ugh, we need to improve that" there — in particular, we have a clunky wrapper around PlaywrightCrawler that we're going to replace with a custom BrowserCrawler implementation, but it does the work.
Most of what we do is less "scraping" and more "building a map of several interlinked web sites and using graph queries to tease out structural patterns", which is why we end up going in a few slightly different directions.
https://github.com/autogram-is/spidergram/blob/main/OVERVIEW.md explains a bit more about the domain model it maintains
Oh, so you're not expecting to host your solution in the Apify cloud (https://github.com/autogram-is/spidergram/blob/main/package.json), right?
At least not for the time being — we've been doing all our work locally to bootstrap the project. We may eventually build out Apify actors for it, but at the moment we're just slinging around about 4-5 GB of crawled data locally, heh.
Well, as I see it, actors are designed to be isolated; it might be worth building with that in mind from the very beginning.
I still plan to check out the code; thanks for open-sourcing it!