Apify and Crawlee Official Forum

eaton
Joined August 30, 2024
It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database, rather than Crawlee's native KVS and DataSet. I'm curious if there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
17 comments
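There's no official database-backed example that I know of, but Crawlee does let you swap out the whole storage layer by passing a custom `storageClient` to `Configuration` — this is the same seam that `@crawlee/memory-storage` plugs into. Below is a minimal TypeScript sketch of the shape such a backend could take. The `db` import is a hypothetical wrapper around your own database driver; the method names follow the `StorageClient` typings in `@crawlee/types` (with `@crawlee/memory-storage` as the reference implementation to crib from), and only the request-queue path is stubbed:

```ts
import { Configuration, PlaywrightCrawler } from 'crawlee';
import { db } from './db.js'; // hypothetical database wrapper

// Skeleton of a database-backed storage client. Crawlee resolves storage
// through a collection client (open/create a store by name) and a resource
// client (per-store operations); datasets and key-value stores follow the
// same split as the request queue sketched here.
class DatabaseStorageClient {
    requestQueues() {
        return {
            // Called when RequestQueue.open() is used.
            getOrCreate: async (name?: string) => db.upsertQueue(name ?? 'default'),
        };
    }

    requestQueue(id: string) {
        return {
            get: async () => db.getQueue(id),
            addRequest: async (request: unknown) => db.insertRequest(id, request),
            listHead: async ({ limit }: { limit: number }) => ({
                items: await db.peekRequests(id, limit),
            }),
            // ...plus getRequest, updateRequest, deleteRequest, etc. —
            // mirror @crawlee/memory-storage for the full surface.
        };
    }

    // dataset()/datasets() and keyValueStore()/keyValueStores() omitted;
    // they follow the same pattern.
}

const crawler = new PlaywrightCrawler(
    { /* requestHandler, etc. */ },
    new Configuration({ storageClient: new DatabaseStorageClient() as any }),
);
```

The `as any` cast is there because this is a partial sketch; a real implementation would satisfy the full interface. The nice side effect of going this route is exactly what you're after: the request pool becomes ordinary rows you can query and manage alongside the rest of your data.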
I'm working on a project that requires quite a few "blind" requests — hitting URLs that might be full-fledged pages, or might be (say) PDFs to download and archive, but whose URLs alone provide no real clues. Unfortunately, the examples I've found of intercepting requests and downloading files (rather than loading the URLs in a browser) all do their work in preNavigationHooks by examining the URL itself.

Aside from simply using a stub BasicCrawler to check headers first, canceling the full navigation attempt if it's unnecessary, and accepting that there will be unnecessary double-visits, does Crawlee's architecture offer any way to handle this scenario?
4 comments
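One Crawlee-native way around the double-visit, assuming the JS API: preNavigationHooks receive the full crawling context, so you can issue a lightweight HEAD request via the context's `sendRequest` helper and set `request.skipNavigation` when the target turns out not to be HTML — the requestHandler still fires and can download the body directly. A sketch:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request, sendRequest }) => {
            // Cheap HEAD request to sniff the content type before paying
            // for a full browser navigation.
            const head = await sendRequest({ method: 'HEAD' });
            const contentType = head.headers['content-type'] ?? '';
            if (!contentType.includes('text/html')) {
                // Skip the browser entirely; requestHandler still runs.
                request.skipNavigation = true;
                request.userData.contentType = contentType;
            }
        },
    ],
    async requestHandler({ request, page, sendRequest }) {
        if (request.skipNavigation) {
            // Fetch the raw bytes (e.g. a PDF) and archive them.
            const res = await sendRequest({ responseType: 'buffer' });
            // ...persist res.body wherever the archive lives
            return;
        }
        // Normal handling via `page` for real HTML documents.
    },
});
```

Strictly speaking the server still gets two hits (HEAD then GET) for non-HTML targets, but that's far cheaper than a stub BasicCrawler pass followed by a second full crawl.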