Apify and Crawlee Official Forum

Updated last year

playwright & pdf + error handling

Hello,
Playwright throws a net::ERR_ABORTED error when navigating to any kind of PDF file, so the only way I've figured out how to handle this is in a preNavigationHook, since I can't catch the error in the router handler.

Does anyone have any better suggestions? I'm wondering if I should have two crawlers: Playwright for normal pages, and then when it comes across a PDF, send it to Cheerio?

Thanks!
12 comments
Specifically, I'm doing a page.on('download', ...) in the preNavigationHook, which doesn't seem smart.
I'm also doing the workaround from https://github.com/microsoft/playwright/issues/7822 in the preNavigationHook.
tagging friends for help ❤️
Also, I guess another question is: how do you run two long-running crawlers? I ended up solving this by creating a BasicCrawler that just downloads the PDF, trying something like this:
await Promise.all([crawler.run(), pdf_crawler.run()])
And then this is my preNavigationHook, which I'm not sure is working so well.
Plain Text
    // import { Request as CrawleeRequest } from 'crawlee'
    // import type { Download } from 'playwright'
    async (crawlingContext, gotoOptions) => {
      gotoOptions.waitUntil = 'networkidle'
      const { page, request } = crawlingContext
      // Hand PDF navigations off to the dedicated PDF crawler.
      await page.route('**/*.pdf', async (route) => {
        request.noRetry = true
        console.log('running pdf', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
        // Abort the route so the navigation fails fast instead of hanging.
        await route.abort()
      })

      page.on('download', async (download: Download) => {
        request.noRetry = true
        console.log('running download', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
      })
    },
Hi,
generally I don't think that using a preNavigation hook is a bad idea. You might exclude the URLs of PDF files from crawling and download them directly via got-scraping.
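The exclude-and-download idea can be sketched like this. The partitioning helper is just a URL-extension heuristic (a made-up name, not a Crawlee API), and the got-scraping call at the end is left as a comment since it is a third-party dependency; the responseType: 'buffer' option is how got-based clients return binary bodies.

```typescript
// Partition URLs so PDF links skip Playwright entirely. Extension heuristic
// only; extensionless PDF URLs need a Content-Type check instead (see below).
function partitionUrls(urls: string[]): { pdfUrls: string[]; pageUrls: string[] } {
  const pdfUrls: string[] = [];
  const pageUrls: string[] = [];
  for (const url of urls) {
    const isPdf = new URL(url).pathname.toLowerCase().endsWith('.pdf');
    (isPdf ? pdfUrls : pageUrls).push(url);
  }
  return { pdfUrls, pageUrls };
}

const { pdfUrls, pageUrls } = partitionUrls([
  'https://example.com/files/report.pdf',
  'https://example.com/about',
]);
console.log(pdfUrls); // ['https://example.com/files/report.pdf']

// Hypothetical direct download with got-scraping (not run here):
// const { body } = await gotScraping({ url: pdfUrls[0], responseType: 'buffer' });
// await writeFile('report.pdf', body);
```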

Why do you want to open a PDF in Playwright? A PDF has no DOM structure, so the usual Playwright calls would not work on it anyway.

When using two crawlers, please make sure they each use a different RequestQueue, to avoid conflicts where one crawler processes requests meant for the other.

Not sure what you mean by sending the PDF to Cheerio; again, a PDF is not an HTML page, it has no DOM structure, and Cheerio can only work with HTML/XML documents.
Thanks so much for the response. I ended up doing exactly what you said with the different request queues, and I used the BasicCrawler to send in any PDF.

The problem is I can't just exclude PDF URLs, because some URLs don't have the file extension in them: they redirect to a PDF, or just use the Content-Type header to output one.
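Since the extension alone is unreliable, the check can key off the response's Content-Type header instead. A minimal sketch (the helper name and the plain header-map shape are assumptions; in Crawlee you would feed it the headers of the navigation response, e.g. inside a preNavigationHook or from a got-scraping HEAD request):

```typescript
// Decide whether a response is a PDF by Content-Type, falling back to the
// URL extension for servers that send a generic or missing type.
function isPdfResponse(headers: Record<string, string | undefined>, url: string): boolean {
  const contentType = (headers['content-type'] ?? '').toLowerCase();
  if (contentType.includes('application/pdf')) return true;
  return new URL(url).pathname.toLowerCase().endsWith('.pdf');
}

console.log(isPdfResponse({ 'content-type': 'application/pdf' }, 'https://example.com/doc?id=7')); // true
console.log(isPdfResponse({ 'content-type': 'text/html' }, 'https://example.com/about')); // false
```

This catches the redirect/extensionless case the thread describes, because the decision happens after the server has answered, not from the URL alone.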
It took me way too long to figure out the request queue thing.
Is there a way to make maxRequestsPerCrawl apply per request queue, and then create a new RequestQueue every time I have a "separate crawl"? And is there a way to open a new queue and set it on a specific crawler?

is there a way to make maxRequestsPerCrawl per request
It is a Crawler option, so it has to be set on the Crawler.
create a new RequestQueue every time I have a "separate crawl"
Yes, you may create a new RequestQueue whenever you want: await Actor.openRequestQueue("my-nw-request-queue-1")
I am not sure whether creating multiple default (unnamed) RequestQueues in a single run is allowed; I know it was an issue in the past.
is there a way to open a new queue and set it on a specific crawler?
You need to pass the RequestQueue to the Crawler options (the requestQueue option).