Apify and Crawlee Official Forum

Examine headers before loading full page?

I'm working on a project that requires quite a few "blind" requests: hitting URLs that might be full-fledged pages, or might be (say) PDFs to download and archive, but that provide no real clues in the URLs alone. Unfortunately, all of the examples I've found of intercepting requests and downloading files, rather than requesting the URLs as a browser would, do their work in preNavigationHooks by examining the URL itself.

Aside from simply using a stub BasicCrawler to check headers first, canceling the full navigation attempt if it's unnecessary, and accepting that there will be unnecessary double-visits, does Crawlee's architecture offer any way to handle this scenario?

4 comments

eeaton

After some noodling around, I ended up solving this the brute-force way: in a preNavigationHook, I send a HEAD request to the URL and check its headers; if it's a downloadable MIME type, I set request.skipNavigation = true and request.label = 'download'. Although this costs an extra request per URL, the impact should be pretty low, and it also lets me skip unnecessary full page loads, which should more than make up for it over the course of a large crawl.
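
A minimal TypeScript sketch of the HEAD-check idea described above, not the exact code from the project. The MIME list, the RequestLike shape, and the names isDownloadable, applyContentType, classifyByHead, and HeadContentType are all assumptions made up for illustration; only skipNavigation and label come from the post. The HEAD call itself is passed in as a function so the sketch does not commit to a particular HTTP client.

```typescript
// Hypothetical list of MIME types to treat as downloads; adjust for your crawl.
const DOWNLOADABLE_TYPES: readonly string[] = [
    'application/pdf',
    'application/zip',
    'application/octet-stream',
];

// Only the request fields this sketch touches (an assumption, not Crawlee's full Request type).
interface RequestLike {
    url: string;
    label?: string;
    skipNavigation?: boolean;
}

// Hypothetical helper type: performs a HEAD request and resolves to the
// Content-Type header value, or null if the header is missing.
type HeadContentType = (url: string) => Promise<string | null>;

// Strip parameters such as "; charset=utf-8" and compare case-insensitively.
function isDownloadable(contentType: string | null): boolean {
    if (!contentType) return false;
    const [mime = ''] = contentType.split(';');
    const normalized = mime.trim().toLowerCase();
    return DOWNLOADABLE_TYPES.some((type) => type === normalized);
}

// Mark the request so the browser navigation is skipped and the request is
// routed to a 'download' handler. Returns true if the request was re-labeled.
function applyContentType(request: RequestLike, contentType: string | null): boolean {
    if (!isDownloadable(contentType)) return false;
    request.skipNavigation = true;
    request.label = 'download';
    return true;
}

// Intended to be called from a preNavigationHook, for example:
//   preNavigationHooks: [async ({ request }) => classifyByHead(request, myHeadFn)]
// where myHeadFn is whatever HEAD implementation you use (hypothetical name).
async function classifyByHead(request: RequestLike, head: HeadContentType): Promise<void> {
    applyContentType(request, await head(request.url));
}
```

Some servers answer HEAD differently from GET (or reject it outright), so it is probably worth treating a failed or header-less HEAD response as "navigate normally", which is what the sketch does when the content type is null.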

eeaton

It’s been a while, but this technique has been working very well for us thus far.

Lukas Krivka

That sounds good. You would basically have to override the page.goto behavior so that it stops where you need. I'm not sure whether waitUntil: 'commit' would work for you, since I don't know exactly what is available at that point.
https://playwright.dev/docs/api/class-page#page-goto
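
A rough TypeScript sketch of what trying waitUntil: 'commit' could look like. It is untested against real downloads, so whether the headers are usable at that point (and how the browser reacts when the URL turns out to be a file) remains an open question. PageLike and ResponseLike are minimal stand-in types written for this example, covering only the goto and headers calls used here; the helper names contentTypeFrom and peekContentType are hypothetical.

```typescript
// Minimal stand-ins for the parts of Playwright's Page and Response used below.
interface ResponseLike {
    headers(): Record<string, string>;
}

interface PageLike {
    goto(
        url: string,
        options?: { waitUntil?: 'commit' | 'domcontentloaded' | 'load' | 'networkidle' },
    ): Promise<ResponseLike | null>;
}

// Read the Content-Type header from a response, or null if there is none.
// Playwright reports header names in lower case.
function contentTypeFrom(response: ResponseLike | null): string | null {
    if (!response) return null;
    return response.headers()['content-type'] ?? null;
}

// Navigate only until the response is committed, then report its content type
// so the caller can decide whether to keep waiting for the page or bail out.
async function peekContentType(page: PageLike, url: string): Promise<string | null> {
    const response = await page.goto(url, { waitUntil: 'commit' });
    return contentTypeFrom(response);
}
```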

Lukas Krivka

Another option would be to use preNavigationHooks with page.on('request') or page.on('response') and cut the whole request off once you get the first network response.
