Apify and Crawlee Official Forum

Updated last year

playwright & pdf + error handling

Hello,
Playwright throws a net::ERR_ABORTED error when navigating to any kind of PDF file, so the only way I've figured out how to handle this is in a preNavigationHook, since I can't catch the error in the router handler.

Does anyone have any better suggestions? I'm wondering if I should have two crawlers: Playwright for normal pages, and then when it comes across a PDF, send it to Cheerio?

Thanks!
12 comments
Specifically, I'm doing a page.on('download', ...) in the preNavigationHook, which doesn't seem smart.
I'm also doing the workaround from https://github.com/microsoft/playwright/issues/7822 in the preNavigationHook.
tagging friends for help ❤️
Also, I guess another question is: how do you run two long-running crawlers? I ended up solving this by creating a BasicCrawler that just downloads the PDF, trying something like this:
await Promise.all([crawler.run(), pdf_crawler.run()])
And then this is my preNavigationHook, which I'm not sure is working so well.
Plain Text
    // import { Request as CrawleeRequest } from 'crawlee'
    // import type { Download } from 'playwright'
    async (crawlingContext, gotoOptions) => {
      gotoOptions.waitUntil = 'networkidle'
      const { page, request } = crawlingContext
      // Hand PDF navigations off to the dedicated PDF crawler.
      await page.route('**/*.pdf', async (route) => {
        request.noRetry = true
        console.log('running pdf', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
        // Abort the route so the navigation fails fast instead of hanging.
        await route.abort()
      })

      page.on('download', async (download: Download) => {
        request.noRetry = true
        console.log('running download', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
      })
    },
Hi,
generally I don't think that using a preNavigation hook is a bad idea. You might exclude the URLs of PDF files from crawling and download them directly via got-scraping.
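The exclude-and-download idea can be sketched like this. The partitioning helper is just a URL-extension heuristic (a made-up name, not a Crawlee API), and the got-scraping call at the end is left as a comment since it is a third-party dependency; the responseType: 'buffer' option is how got-based clients return binary bodies.

```typescript
// Partition URLs so PDF links skip Playwright entirely. Extension heuristic
// only; extensionless PDF URLs need a Content-Type check instead (see below).
function partitionUrls(urls: string[]): { pdfUrls: string[]; pageUrls: string[] } {
  const pdfUrls: string[] = [];
  const pageUrls: string[] = [];
  for (const url of urls) {
    const isPdf = new URL(url).pathname.toLowerCase().endsWith('.pdf');
    (isPdf ? pdfUrls : pageUrls).push(url);
  }
  return { pdfUrls, pageUrls };
}

const { pdfUrls, pageUrls } = partitionUrls([
  'https://example.com/files/report.pdf',
  'https://example.com/about',
]);
console.log(pdfUrls); // ['https://example.com/files/report.pdf']

// Hypothetical direct download with got-scraping (not run here):
// const { body } = await gotScraping({ url: pdfUrls[0], responseType: 'buffer' });
// await writeFile('report.pdf', body);
```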

Why do you want to open a PDF in Playwright? A PDF has no DOM structure, so the usual Playwright calls would not work on it anyway.

When using two crawlers, please make sure they each use a different RequestQueue, to avoid conflicts where one crawler processes requests meant for the other.

Not sure what you mean by sending the PDF to Cheerio; again, a PDF is not an HTML page, it has no DOM structure, and Cheerio can only work with HTML/XML documents.
Thanks so much for the response. I ended up doing exactly what you said with the different request queues, and I used the BasicCrawler to send in any PDF.

The problem is I can't just exclude PDF URLs, because some URLs don't have the file extension in them: they redirect to a PDF, or just use the Content-Type header to output one.
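Since the extension alone is unreliable, the check can key off the response's Content-Type header instead. A minimal sketch (the helper name and the plain header-map shape are assumptions; in Crawlee you would feed it the headers of the navigation response, e.g. inside a preNavigationHook or from a got-scraping HEAD request):

```typescript
// Decide whether a response is a PDF by Content-Type, falling back to the
// URL extension for servers that send a generic or missing type.
function isPdfResponse(headers: Record<string, string | undefined>, url: string): boolean {
  const contentType = (headers['content-type'] ?? '').toLowerCase();
  if (contentType.includes('application/pdf')) return true;
  return new URL(url).pathname.toLowerCase().endsWith('.pdf');
}

console.log(isPdfResponse({ 'content-type': 'application/pdf' }, 'https://example.com/doc?id=7')); // true
console.log(isPdfResponse({ 'content-type': 'text/html' }, 'https://example.com/about')); // false
```

This catches the redirect/extensionless case the thread describes, because the decision happens after the server has answered, not from the URL alone.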
It took me way too long to figure out the request queue thing.
Is there a way to make maxRequestsPerCrawl apply per request queue, and then create a new RequestQueue every time I have a "separate crawl"? And is there a way to open a new queue and set it on a specific crawler?

is there a way to make maxRequestsPerCrawl per request
It is a Crawler option, so it has to be set on the Crawler.
create a new RequestQueue every time I have a "separate crawl"
Yes, you may create a new RequestQueue whenever you want: await Actor.openRequestQueue("my-nw-request-queue-1")
I am not sure whether creating multiple default (unnamed) RequestQueues in a single run is allowed; I know it was an issue in the past.
is there a way to open a new queue and set it on a specific crawler?
You need to pass the RequestQueue to the Crawler options (the requestQueue option).