Apify

Apify and Crawlee Official Forum


preNavigationHook needs to listen to response from network and change goToOptions.

Hey y'all, so basically I'm trying to check whether the response is application/pdf. If it is, the navigation should time out immediately and ideally skip the request.

async (crawlingContext, gotoOptions) => {
        const { page, request, crawler } = crawlingContext
        const queue = await crawler.getRequestQueue()
        const crawler_dto = request.userData.crawler_dto

        if (!request.url.endsWith('.pdf')) {
          gotoOptions.waitUntil = 'networkidle2'
          gotoOptions.timeout = 20000
          await page.setBypassCSP(true)
          await page.setExtraHTTPHeaders({
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
          })
          await page.setViewport({ width: 1440, height: 900 })
        }

        page.on('response', async (page_response) => {
          if (page_response.headers()['content-type'] === 'application/pdf') {
            gotoOptions.timeout = 1
          }
        })
      },
15 comments
preNavigationHooks run before the request is sent, so there is no response yet when the hook body executes, and the hook cannot wait for one. Instead, you have a couple of options:

1- Check the response in requestHandler: requestHandler is called after navigation has finished, and the navigation response is available there as `response` on the crawling context.

2- Check the response in postNavigationHooks: these hooks run after navigation has finished and the response has been received, before requestHandler is called.
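For option 2, here is a minimal sketch of a postNavigationHooks entry that inspects the Content-Type of the navigation response. It assumes a PuppeteerCrawler, where `response` on the crawling context is Puppeteer's response object for the navigation. `isPdfContentType` and `rejectPdfResponses` are illustrative names, not Crawlee APIs. Keep in mind that it only runs once navigation has finished, so it does not shorten the wait itself.

```javascript
// True when a Content-Type header value denotes a PDF, ignoring case and
// parameters such as "; charset=binary".
function isPdfContentType(value) {
    if (typeof value !== 'string') return false;
    return value.split(';')[0].trim().toLowerCase() === 'application/pdf';
}

// Hypothetical postNavigationHooks entry: fail PDF responses without retrying.
async function rejectPdfResponses({ request, response }) {
    // `response` may be missing, so guard before reading headers.
    const contentType = response ? response.headers()['content-type'] : undefined;
    if (isPdfContentType(contentType)) {
        request.noRetry = true; // ask Crawlee not to retry this request
        throw new Error(`PDFs are not supported: ${request.url}`);
    }
}

// Usage (assumed crawler setup):
// new PuppeteerCrawler({ postNavigationHooks: [rejectPdfResponses], /* ... */ })
```

Splitting on `;` is there because servers sometimes append parameters to the media type, which a strict `=== 'application/pdf'` comparison would miss.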
Thanks for the response.
The real problem is the timeout = 20000 (20 seconds) that has to pass before I know it's an application/pdf (from the network).
So router.addDefaultHandler doesn't get called for 20 seconds... or at all, since the request times out (it's an iframe-type PDF).
(The URL does not end in .pdf, so you can't technically tell it's a PDF until the network response loads.)
Try this:
import { NonRetryableError } from 'crawlee';

preNavigationHooks: [
    async ({ page }) => {
        page.on('response', async (page_response) => {
            if (page_response.headers()['content-type'] === 'application/pdf') {
                throw new NonRetryableError('PDFs are not supported');
            }
        });
    },
]
This would crash the process: you cannot throw inside a page.on event handler, because nothing awaits that handler.

I think you could do await page.waitForResponse instead and throw after it.

Actually, that would just get stuck because you don't navigate. I think you should then use `gotoOptions.waitUntil = 'domcontentloaded'` and handle the response type in requestHandler.
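On the problem of throwing inside page.on, here is a sketch of one way around it: let the listener only record what it saw, then act on the flag later in code that Crawlee actually awaits (a postNavigationHooks entry or requestHandler). `flagPdfResponses` and the `sawPdf` key in `request.userData` are made up for illustration, and matching with `startsWith` is an assumption to tolerate parameters after the media type.

```javascript
// Hypothetical preNavigationHooks entry: the 'response' listener never throws,
// it only stores a flag on the request for later, awaited code to check.
async function flagPdfResponses({ page, request }) {
    page.on('response', (pageResponse) => {
        const contentType = pageResponse.headers()['content-type'] || '';
        if (contentType.toLowerCase().startsWith('application/pdf')) {
            request.userData.sawPdf = true;
        }
    });
}

// Later, e.g. at the top of requestHandler:
// if (request.userData.sawPdf) { /* skip scraping this page */ return; }
```

The listener fires for every response on the page, including ones loaded inside iframes, so the flag means "some response was a PDF", not necessarily the main document.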
That's what I ended up doing last night, but then I end up getting some pages that don't load properly, because I should be using networkidle2.
Also, you can't really await the page.on('response'), so by the time you get application/pdf you might already be halfway through scraping that "pdf" page.
Wait, wtf... now I'm not even getting that response (on the link above); it's only returning the favicon.ico response!
So confused.
Ahh, because the request already happened by the time it's in the default handler.
You would have to do the networkidle2-style wait in requestHandler. There is no way to stop the page navigation in the middle.
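Putting that suggestion together, a rough sketch could look like this: navigate with `waitUntil: 'domcontentloaded'`, return early from requestHandler for PDFs, and only then wait for the network to settle. `page.waitForNetworkIdle` is a Puppeteer method; by default it waits for zero in-flight requests, so it is close to `networkidle2` but not identical. The function names and the 500 ms / 20 s values are assumptions, not something from the thread.

```javascript
// Hypothetical preNavigationHooks entry: let goto return once the DOM is parsed.
async function useFastNavigation(_context, gotoOptions) {
    gotoOptions.waitUntil = 'domcontentloaded';
    gotoOptions.timeout = 20000;
}

// Hypothetical requestHandler: check the response type first, wait for idle second.
async function handlePage({ page, request, response, log }) {
    const contentType = (response && response.headers()['content-type']) || '';
    if (contentType.toLowerCase().startsWith('application/pdf')) {
        log.info(`Skipping PDF response: ${request.url}`);
        return;
    }
    // Similar in spirit to networkidle2, but only after we know it is not a PDF.
    await page.waitForNetworkIdle({ idleTime: 500, timeout: 20000 });
    // ... scrape the page here ...
}

// Usage (assumed crawler setup):
// new PuppeteerCrawler({ preNavigationHooks: [useFastNavigation], requestHandler: handlePage })
```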