Apify and Crawlee Official Forum

Custom storage provider for RequestQueue?

It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database rather than Crawlee's native KVS and Dataset. I'm curious whether there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
17 comments
Not sure about examples, but Crawlee already has a generic storage API that can be implemented. We already support two implementations - the Apify API and the local filesystem - so you can add a third.
I wanted to use some database but didn't find any good matches. Basically it's either something external in another cloud (like Firebase), otherwise it doesn't make much sense to use it. Embedded DBs are technically possible, but because of "migration" they're IMHO nearly useless (an actor might be shut down, moved to another server instance, and then restarted at any point during its runtime).
We're using ArangoDB — it's a "multi-model" database that has native support for MongoDB-style document storage and Neo4j-style graph queries in the same data store; it's proven very useful for complex analysis of large sites — queries like "find high-traffic pages that are fewer than 5 clicks from that page, but only if the links are in the main body of an article, not the footer".
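For a concrete flavor of that kind of query, here's a minimal sketch using arangojs. Everything here is illustrative: the graph name ('siteGraph'), the start document, and the region/monthlyVisits attributes are made-up placeholders, not Spidergram's actual schema.

TypeScript
import { Database, aql } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'crawl' });

// Traverse up to 5 hops outward from a starting page, keep only paths whose
// links were found in the article body, and return high-traffic destinations.
const cursor = await db.query(aql`
    FOR v, e, p IN 1..5 OUTBOUND 'pages/landing-page' GRAPH 'siteGraph'
        FILTER p.edges[*].region ALL == 'body'
        FILTER v.monthlyVisits > 10000
        RETURN DISTINCT v.url
`);
console.log(await cursor.all());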
Can you provide some references for the Crawlee generic storage API?
Example here: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/dataset.ts
It uses this.client, which is any class that implements DatasetClient; see e.g. https://github.com/apify/crawlee/blob/master/packages/memory-storage/src/resource-clients/dataset.ts#L34

I will tell the team to provide more examples
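To make that concrete, here's a minimal, untested sketch of what an ArangoDB-backed dataset client could look like, modeled on the memory-storage client linked above. Only pushItems() and listItems() are shown, the collection name is a placeholder, and a request queue backend would follow the same pattern against the RequestQueueClient interface.

TypeScript
import { Database } from 'arangojs';

// Hypothetical ArangoDB-backed dataset client mirroring the shape of the
// memory-storage DatasetClient linked above. get(), update(), delete(), and
// the remaining methods would be needed for a complete implementation.
class ArangoDatasetClient {
    constructor(
        private readonly db: Database,
        private readonly collectionName: string,
    ) {}

    // Backs Dataset.pushData(); Crawlee hands over a JSON string or an array.
    async pushItems(items: string | Record<string, unknown>[]): Promise<void> {
        const parsed = typeof items === 'string' ? JSON.parse(items) : items;
        const docs = Array.isArray(parsed) ? parsed : [parsed];
        await this.db.collection(this.collectionName).saveAll(docs);
    }

    // Backs Dataset.getData(); simple offset/limit paging via AQL.
    async listItems({ offset = 0, limit = 250 } = {}) {
        const cursor = await this.db.query({
            query: 'FOR doc IN @@col LIMIT @offset, @limit RETURN doc',
            bindVars: { '@col': this.collectionName, offset, limit },
        });
        const items = await cursor.all();
        return { items, count: items.length, offset, limit };
    }
}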
I think a more direct approach is https://github.com/arangodb/arangojs, with exactly:
JavaScript
const db = new Database({
  url: "http://YOURDOMAIN_OR_IP:8529",
  databaseName: "pancakes",
  auth: { username: "root", password: "hunter2" },
});

and make sure you handle your data alongside the handled requests; that should be enough. As already mentioned, you must have your own hosted solution.
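As a sketch of that "handle data alongside handled requests" idea, pushing results straight into Arango from a request handler might look like this; the 'pages' collection, the connection details, and the stored fields are placeholders, and the collection is assumed to already exist:

TypeScript
import { CheerioCrawler } from 'crawlee';
import { Database } from 'arangojs';

const db = new Database({ url: 'http://localhost:8529', databaseName: 'pancakes' });
const pages = db.collection('pages'); // assumed to exist already

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        // Store the page record in Arango instead of calling Dataset.pushData().
        await pages.save({
            url: request.loadedUrl ?? request.url,
            title: $('title').text(),
            crawledAt: new Date().toISOString(),
        });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);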
Yeah, we're already using arangojs to map site data to a custom domain model! But we're finding that we have to do more and more housekeeping to ensure that Crawlee's request queue and other data stay in sync; unifying them seems like it would be a big win, but I was concerned we'd be biting off a huge chunk of work. From the code that was posted, it looks like it's at least in the realm of "reasonable to consider".
If you open-source the code for this, let me know.
It's quite rough at the moment, but the project we've been working on is already on GitHub: https://github.com/autogram-is/spidergram There's a lot of "ugh, we need to improve that" there — in particular, we have a clunky wrapper around PlaywrightCrawler that we're going to replace with a custom BrowserCrawler implementation, but it does the work.
Most of what we do is less "scraping" and more "building a map of several interlinked web sites and using graph queries to tease out structural patterns", which is why we end up going in a few slightly different directions.
https://github.com/autogram-is/spidergram/blob/main/OVERVIEW.md explains a bit more about the domain model it maintains
Oh, so you're not expecting to host your solution in the Apify cloud (https://github.com/autogram-is/spidergram/blob/main/package.json), right?
At least not for the time being — we've been doing all our work locally to bootstrap the project. We may eventually build out Apify actors for it, but at the moment we're just slinging around about 4-5 GB of crawled data locally, heh.
Well, as I see it, actors are designed to be isolated; it might be worth building with that in mind from the very beginning.
I still plan to check out the code; thanks for open-sourcing it!