Apify and Crawlee Official Forum

eaton
Joined August 30, 2024
It's probably a little out of the ordinary, but I'm building a crawler project that stores a pretty large pile of information in a database, rather than Crawlee's native KVS and DataSet. I'm curious if there are any examples of using alternative backends to store Crawlee's own datasets and request queue? If possible I'd love to consolidate the storage in one place, particularly since it would allow me to query and manage the request pool a bit more easily…
17 comments
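There's no official database-backed example that I know of, but Crawlee does let you swap out the whole storage layer by passing a custom `storageClient` to `Configuration` — this is the same seam that `@crawlee/memory-storage` plugs into. Below is a minimal TypeScript sketch of the shape such a backend could take. The `db` import is a hypothetical wrapper around your own database driver; the method names follow the `StorageClient` typings in `@crawlee/types` (with `@crawlee/memory-storage` as the reference implementation to crib from), and only the request-queue path is stubbed:

```ts
import { Configuration, PlaywrightCrawler } from 'crawlee';
import { db } from './db.js'; // hypothetical database wrapper

// Skeleton of a database-backed storage client. Crawlee resolves storage
// through a collection client (open/create a store by name) and a resource
// client (per-store operations); datasets and key-value stores follow the
// same split as the request queue sketched here.
class DatabaseStorageClient {
    requestQueues() {
        return {
            // Called when RequestQueue.open() is used.
            getOrCreate: async (name?: string) => db.upsertQueue(name ?? 'default'),
        };
    }

    requestQueue(id: string) {
        return {
            get: async () => db.getQueue(id),
            addRequest: async (request: unknown) => db.insertRequest(id, request),
            listHead: async ({ limit }: { limit: number }) => ({
                items: await db.peekRequests(id, limit),
            }),
            // ...plus getRequest, updateRequest, deleteRequest, etc. —
            // mirror @crawlee/memory-storage for the full surface.
        };
    }

    // dataset()/datasets() and keyValueStore()/keyValueStores() omitted;
    // they follow the same pattern.
}

const crawler = new PlaywrightCrawler(
    { /* requestHandler, etc. */ },
    new Configuration({ storageClient: new DatabaseStorageClient() as any }),
);
```

The `as any` cast is there because this is a partial sketch; a real implementation would satisfy the full interface. The nice side effect of going this route is exactly what you're after: the request pool becomes ordinary rows you can query and manage alongside the rest of your data.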
I'm working on a project that requires quite a few "blind" requests — hitting URLs that might be full-fledged pages, or might be (say) PDFs to download and archive, but whose URLs alone provide no real clues. Unfortunately, the examples I've found of intercepting requests and downloading files (rather than loading the URLs in a browser) all do their work in preNavigationHooks by examining the URL itself.

Aside from simply using a stub BasicCrawler to check headers first, canceling the full navigation attempt if it's unnecessary, and accepting that there will be unnecessary double-visits, does Crawlee's architecture offer any way to handle this scenario?
4 comments
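One Crawlee-native way around the double-visit, assuming the JS API: preNavigationHooks receive the full crawling context, so you can issue a lightweight HEAD request via the context's `sendRequest` helper and set `request.skipNavigation` when the target turns out not to be HTML — the requestHandler still fires and can download the body directly. A sketch:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ request, sendRequest }) => {
            // Cheap HEAD request to sniff the content type before paying
            // for a full browser navigation.
            const head = await sendRequest({ method: 'HEAD' });
            const contentType = head.headers['content-type'] ?? '';
            if (!contentType.includes('text/html')) {
                // Skip the browser entirely; requestHandler still runs.
                request.skipNavigation = true;
                request.userData.contentType = contentType;
            }
        },
    ],
    async requestHandler({ request, page, sendRequest }) {
        if (request.skipNavigation) {
            // Fetch the raw bytes (e.g. a PDF) and archive them.
            const res = await sendRequest({ responseType: 'buffer' });
            // ...persist res.body wherever the archive lives
            return;
        }
        // Normal handling via `page` for real HTML documents.
    },
});
```

Strictly speaking the server still gets two hits (HEAD then GET) for non-HTML targets, but that's far cheaper than a stub BasicCrawler pass followed by a second full crawl.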