Apify and Crawlee Official Forum

Updated 3 months ago

error when crawling download link

Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link. The download link changes every time you visit the page. When I navigate to the download URL manually it works as expected (the file downloads and the tab closes). When I try to navigate to it with the Playwright crawler, however, I get a 403 error saying HMAC mismatch, but strangely the file still downloads? (I confirmed this by finding the download cache in my temp storage.) I'm not sure if this is some kind of anti-scraping functionality, but if so, why would it still download?
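To see both sides of this at once, a bare Playwright script along the lines below can log the /serve-file/ response status and whether a download event fires on the same visit. It's only a sketch: the button selector is a guess, and if the link opens a new tab the download event would have to be listened for on that popup page instead.

Plain Text
import { chromium } from 'playwright';

// Standalone check (not the crawler): log the /serve-file/ response status and
// whether a download event fires on the same visit.
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

page.on('response', (resp) => {
  if (resp.url().includes('/serve-file/')) {
    console.log('serve-file response status:', resp.status());
  }
});

await page.goto('https://dca-global.org/file/view/12756/interact-case-study-cedaci');

const downloadPromise = page.waitForEvent('download', { timeout: 15_000 }).catch(() => null);
await page.click('a[href*="/serve-file/"]'); // assumed selector for the download button
const download = await downloadPromise;
console.log(download ? `download fired: ${download.suggestedFilename()}` : 'no download event');

await browser.close();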

Here is my Crawlee config. Since it is a 403, my handler never gets called.

Plain Text
  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      errorHandler: (_context, error) => {
        // The first argument is the crawling context, not the request.
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
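For now the most I can do is log the final failure: a failedRequestHandler runs once retries are exhausted, so even though my requestHandler never fires for the 403, there is still a hook to inspect it. A sketch that would slot into the crawler options above, reusing the same logger:

Plain Text
      // Sketch: called after all retries for a request have failed, so the 403 on
      // the /serve-file/ URL can at least be logged with its context.
      failedRequestHandler: async ({ request }, error) => {
        logger.error(
          `gave up on ${request.url} after ${request.retryCount} retries: ${error.message}`,
        );
      },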
7 comments
Seems to be a cookie issue.
Not a cookie issue; that is just because when I tested the link in another browser, obviously the cookie didn't match.

Seems to be this issue, where a navigation turns into a download and Chromium throws its toys out of the pram:

https://github.com/microsoft/playwright-java/issues/541
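In plain Playwright terms, the pattern from that issue looks roughly like the sketch below (the /serve-file/ URL is a placeholder; a real one is session-bound and changes per visit): goto() rejects with net::ERR_ABORTED because the navigation became a download, but the download event still completes.

Plain Text
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Register the download listener before navigating so the event cannot be missed.
const downloadPromise = page.waitForEvent('download');
try {
  // Placeholder URL for illustration only.
  await page.goto('https://dca-global.org/serve-file/placeholder');
} catch (err) {
  // Expected: net::ERR_ABORTED, because Chromium turned the navigation into a download.
}
const download = await downloadPromise;
await download.saveAs(download.suggestedFilename());
await browser.close();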
I think I have a solution. It isn't perfect, but I was able to intercept the download in a preNavigationHook.

Plain Text
import { PlaywrightCrawler } from 'crawlee';
import path from 'path';

const crawler = new PlaywrightCrawler({
  headless: false, // run headful for debugging
  async requestHandler({ request, page, enqueueLinks, session }) {
    console.log(session?.getCookies(request.url));
    // Enqueue the per-session download links found on the page.
    await enqueueLinks({ globs: ['https://dca-global.org/serve-file/**'] });
  },
  async errorHandler({ request }) {
    // The aborted navigation ends up here; if the hook below flagged it as an
    // attachment, treat it as a successful download instead of a failure.
    if (request.userData['attachment']) {
      console.log('download, not error');
      request.noRetry = true;
    }
  },
  preNavigationHooks: [
    async (crawlingContext) => {
      crawlingContext.page.once('response', async (resp) => {
        const disposition = await resp.headerValue('content-disposition');
        if (disposition && crawlingContext.request.url === resp.request().url()) {
          crawlingContext.request.userData['attachment'] = true;
          const download = await crawlingContext.page.waitForEvent('download');
          await download.saveAs(path.join('./storage/downloads', download.suggestedFilename()));
        }
      });
    },
  ],
  sessionPoolOptions: {
    maxPoolSize: 1,
  },
});

const startUrls = ['https://dca-global.org/file/view/12756/interact-case-study-cedaci'];

await crawler.addRequests(startUrls);

await crawler.run();


I set the max pool size to 1 to ensure that the cookie was picked up on the previous navigation before downloading the file. The hook awaits the response, checks the disposition header and that it is the initial download and not something like a secondary image, sets a flag in the user data, and downloads the file. The trouble is that there is a potential race condition between setting the flag and the net::ERR_ABORTED error reaching the errorHandler.

any advice would be appreciated!
This should avoid the race condition

Plain Text
// Same crawler setup as above; only the changed options are shown.
type downloadPromise = {
  data: Buffer;
  suggestedName: string;
};

    async requestHandler({ enqueueLinks }) {
      await enqueueLinks({ globs: ['https://dca-global.org/serve-file/**'] });
    },
    async errorHandler({ request }) {
      if (request.userData['download']) {
        console.log('checking for download');
        try {
          const download = await request.userData.download as downloadPromise | undefined;
          if (download) {
            console.log('it was a file download, not an error');
          } else {
            console.log('no download, was actually an error');
          }
        } catch (err) {
          console.log('download failed');
          console.error(err);
        }
      }
    },
    preNavigationHooks: [
      async (crawlingContext) => {
        // Create the promise before navigation so the errorHandler can await it
        // instead of racing against a boolean flag.
        crawlingContext.request.userData['download'] = new Promise<downloadPromise | undefined>(async (resolve, reject) => {
          try {
            const response = await crawlingContext.page.waitForEvent('response');
            const disposition = await response.headerValue('content-disposition');
            if (disposition && crawlingContext.request.url === response.request().url()) {
              const download = await crawlingContext.page.waitForEvent('download');
              const stream = await download.createReadStream();
              const chunks: Buffer[] = [];

              stream.on('data', (chunk: Buffer) => {
                chunks.push(chunk);
              });

              stream.on('end', () => {
                resolve({ data: Buffer.concat(chunks), suggestedName: download.suggestedFilename() });
              });

              setTimeout(() => reject(new Error('download not complete after 15 seconds')), 15000);
            } else {
              resolve(undefined);
            }
          } catch (err) {
            reject(err);
          }
        });
      },
    ],
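
To round that off, the success branch can persist the buffer and stop Crawlee from retrying the "failed" navigation. A sketch only, assuming the same ./storage/downloads directory as the earlier snippet:

Plain Text
import { promises as fs } from 'fs';
import path from 'path';

    async errorHandler({ request }) {
      if (!request.userData['download']) return;
      try {
        const download = await request.userData.download as downloadPromise | undefined;
        if (download) {
          // It was a download, not a real failure: save it and skip retries.
          request.noRetry = true;
          await fs.mkdir('./storage/downloads', { recursive: true });
          await fs.writeFile(
            path.join('./storage/downloads', download.suggestedName),
            download.data,
          );
        }
      } catch (err) {
        console.error('download failed', err);
      }
    },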
Thank you for your description of the problem and solution.