Not sure about examples, but Crawlee already has a generic storage API that can be implemented, and we already support two implementations: the Apify API and the local filesystem. So you can add a third implementation
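Roughly, a third implementation means providing the storage-client interface backed by your own store. The sketch below is a toy in-memory version standing in for an ArangoDB-backed one; the interface shape here is an assumption loosely modeled on Crawlee's `StorageClient` from `@crawlee/types` (the real interface has more methods), so treat the names as illustrative:

```typescript
// Assumed, simplified shape of a Crawlee-style storage client.
// The real interface also covers request queues, key-value stores, etc.
interface DatasetClient {
  pushItems(json: string): Promise<void>;
}

interface StorageClient {
  dataset(id: string): DatasetClient;
}

// Toy in-memory implementation; an ArangoDB-backed one would write to
// collections here instead of a Map.
class InMemoryStorageClient implements StorageClient {
  private data = new Map<string, unknown[]>();

  dataset(id: string): DatasetClient {
    const store = this.data;
    return {
      async pushItems(json: string): Promise<void> {
        const items = store.get(id) ?? [];
        items.push(...JSON.parse(json));
        store.set(id, items);
      },
    };
  }

  itemCount(id: string): number {
    return (this.data.get(id) ?? []).length;
  }
}
```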
I wanted to use a database but didn't find any good matches. Basically it's either something external in another cloud (like Firebase), and otherwise it doesn't make much sense to use one. Embedded DBs are technically possible, but because of "migration" they're imho nearly useless (an actor might be shut down, moved to another server instance, and then restarted at any point during runtime)
We're using ArangoDB: it's a "multi-model" database with native support for MongoDB-style document storage and Neo4j-style graph queries in the same data store. It's proven very useful for complex analysis of large sites, with queries like "find high-traffic pages that are fewer than 5 clicks from that page, but only if the links are in the main body of an article, not the footer".
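To make that concrete: in ArangoDB the "fewer than 5 clicks" part would be an AQL graph traversal (something like `FOR v IN 1..5 OUTBOUND start linksTo RETURN v`, with hypothetical collection names), but the same idea can be sketched as a plain breadth-first search over an in-memory link graph. The page names and link structure below are made up:

```typescript
// Toy link graph: page -> pages it links to. In the real setup this
// lives in ArangoDB as an edge collection and the traversal is AQL.
const links: Record<string, string[]> = {
  home: ["about", "blog"],
  about: ["team"],
  blog: ["post1"],
  post1: ["post2"],
  post2: ["archive"],
  team: [],
  archive: [],
};

// All pages reachable from `start` in at most `maxClicks` clicks.
function pagesWithinClicks(start: string, maxClicks: number): Set<string> {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxClicks; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const target of links[page] ?? []) {
        if (!seen.has(target)) {
          seen.add(target);
          next.push(target);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // the start page itself is 0 clicks away
  return seen;
}
```

The "high-traffic" and "link is in the article body, not the footer" filters would then be extra conditions on the vertices and edges of the traversal.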
Can you provide some references on the Crawlee generic storage API?
I think a more direct approach is using https://github.com/arangodb/arangojs directly:

```typescript
import { Database } from "arangojs";

const db = new Database({
  url: "http://YOURDOMAIN_OR_IP:8529",
  databaseName: "pancakes",
  auth: { username: "root", password: "hunter2" },
});
```

and making sure you handle the data along with the handled requests; that should be enough. As already mentioned, you must have your own hosted solution
Yeah, we're already using arangojs to map site data to a custom domain model! But we're finding that we have to do more and more housekeeping to ensure that crawlee's request queue and other data stay in sync; unifying them seems like it would be a big win, but I was concerned we'd be biting off a huge chunk of work. From the code that was posted, it looks like it's at least in the realm of "reasonable to consider"
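The sync problem in miniature: the handled-request state and the extracted domain record need to be committed together, or they drift apart. In practice that would be one ArangoDB stream transaction; below, a single method over an in-memory store stands in for that, and all names are made up for illustration:

```typescript
// Hypothetical unified store: one commit covers both the queue-side
// record and the domain-side record, so they can never disagree
// about which URLs have been processed.
interface HandledRequest { url: string; handledAt: string; }
interface PageRecord { url: string; title: string; }

class UnifiedStore {
  requests: HandledRequest[] = [];
  pages: PageRecord[] = [];

  // In a real ArangoDB-backed version this method body would run
  // inside a single transaction instead of two in-memory pushes.
  recordHandled(url: string, title: string): void {
    this.requests.push({ url, handledAt: new Date().toISOString() });
    this.pages.push({ url, title });
  }
}
```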
If you make the code for this open source, let me know.
it's quite rough at the moment, but the project we've been working on is already on github.
https://github.com/autogram-is/spidergram There's a lot of *ugh, we need to improve that* there; in particular, we have a clunky wrapper around PlaywrightCrawler that we're planning to replace with a custom BrowserCrawler implementation, but it does the work.
Most of what we do is less "scraping" and more "building a map of several interlinked web sites and using graph queries to tease out structural patterns", which is why we end up going in a few slightly-different directions
At least not for the time being — we’ve been doing all our work locally to bootstrap the project and may eventually build out Apify actors for it but atm, just slinging around about 4-5gb of crawled data locally, heh
Well, as I see it, actors are designed to be isolated, so it might be worth designing for that from the very beginning.
I still plan to check the code, thanks for open sourcing it