bmax

enqueue urls / request queue not being unique

I'm seeing a lot of the same exact URL's being ran twice? Any ideas?

1 comment

{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshot

This error is happening consistently, even while only running 1 browser. When I load up the server and look at top. There are a bunch of long-running chrome processes that haven't been killed.

top attached.:

Error:

Plain Text

{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 16268 MB of 14071 MB (116%). Consider increasing available memory.","scraper":"web","url":"https://www.natronacounty-wy.gov/845/LegalPublic-Notices","place_id":"65a603fac769fa16f6596a8f"}

15 comments

bbmax

Reclaiming failed request back to the list or queue. Detected a session error, rotating session.

What exactly does this mean, and, how to solve this?

https://princeton.edu is the problematic website

1 comment

bbmax

Trying to optimize autoscale options

Hello,

I am running my scraper on an AWS 8gb cpu, 16gb memory ecs.

Plain Text

    maxConcurrency: 200,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,

Right now the avg cpu and mem are both like 88%. Is there anything I can do here to optimize more?

I also have CRAWLEE_AVAILABLE_MEMORY_RATIO=.8

9 comments

bbmax

I want the first urls that I enqueue to run before the next forefront:true.

I have two places that do forefront: true, the very first URLs that I start the crawler with.

Then each page will also have a set of urls that go on the front of the queue, but, I want the very first urls that start the crawl to have priority, how would I accomplish this?

I was thinking of two request queues but had a lot of problems with that as well.

1 comment

bbmax

Memory for only 1 browser is 12gb? How to ensure clean up after pages?

Hello,

do you have to manually call page.close or anything at the end of defaulthandler?

14 comments

bbmax

Does crawlee already implement this fingerprint generator?

https://github.com/apify/fingerprint-suite?

3 comments

bbmax

more than one request queue

Is it possible to create two request queues per
.run()?

4 comments

bbmax

preNavigationHook needs to listen to response from network and change goToOptions.

Hey y'all, so, basically I'm trying to see if the response is application/pdf, then, it should timeout immediately and ideally skipRequest.

Plain Text

async (crawlingContext, gotoOptions) => {
        const { page, request, crawler } = crawlingContext
        const queue = await crawler.getRequestQueue()
        const crawler_dto = request.userData.crawler_dto

        if (!request.url.endsWith('.pdf')) {
          gotoOptions.waitUntil = 'networkidle2'
          gotoOptions.timeout = 20000
          await page.setBypassCSP(true)
          await page.setExtraHTTPHeaders({
            'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
          })
          await page.setViewport({ width: 1440, height: 900 })
        }

        page.on('response', async (page_response) => {
          if (page_response.headers()['content-type'] === 'application/pdf') {
            gotoOptions.timeout = 1
          }
        })
      },

15 comments

bbmax

getMostPopularErrors

https://crawlee.dev/api/next/utils/class/ErrorTracker#getMostPopularErrors

How do I use this?

1 comment

bbmax

Running out of space with pupeteers user profiles on a long running scrape.

Plain Text

"Failed to launch browser. Please check the following:\n- Check whether the provided executable path \"/usr/bin/google-chrome\" is correct.\n- Try installing a browser, if it's missing, by running `npx @puppeteer/browsers install chromium --path [path]` and pointing `executablePath` to the downloaded executable (https://pptr.dev/browsers-api)\n\nThe original error is available in the `cause` property. Below is the error received when trying to launch a browser:\n","stack":"Failed to launch browser. Please check the following:\n- Check whether the provided executable path \"/usr/bin/google-chrome\" is correct.\n- Try installing a browser, if it's missing, by running `npx @puppeteer/browsers install chromium --path [path]` and pointing `executablePath` to the downloaded executable (https://pptr.dev/browsers-api)\n\nThe original error is available in the `cause` property. Below is the error received when trying to launch a browser:\n\nError: ENOSPC: no space left on device, mkdtemp '/tmp/puppeteer_dev_profile-pXEfmi'\nError thrown at:\n\n    at PuppeteerPlugin._throwAugmentedLaunchError (/home/app/node_modules/@crawlee/browser-pool/abstract-classes/browser-plugin.js:145:15)\n    at PuppeteerPlugin._launch (/home/app/node_modules/@crawlee/browser-

5 comments

bbmax

isFinishedFunction, check other crawler?

Hello,

two questions.

Is there a way to call this.isFinishedFunction so it calls the original function but also just add another web_crawler_queue isFinished on top of it? the uncommented out function I tried, worked somewhat, but, after a long running web_crawler_queue finished, it just kept giving me a stalled error, and, this crawler that this function belongs to never finished.

Plain Text

    autoscaledPoolOptions: {
      isFinishedFunction: async () => {
        const web_crawler_queue = await RequestQueue.open(place_id)
//        return this.isFinishedFunction() && await web_crawler_queue.isFinished()
        return await request_queue.isFinished() && await web_crawler_queue.isFinished()

      }
    },

8 comments

bbmax

Memory usage is spiking, even with CRAWLEE_AVAILABLE_MEMORY_RATIO set to .7

Any ideas on how to solve this?

4 comments

bbmax

long running scraper, 500+ pages for each crawl

Hello,

I have a playwright crawler that is listening to db changes, whenever a page gets added I want it to scrape and enqueue 500 links for the whole scraping process, but, there can be multiple things added to the DB at the same time. I've tried keepAlive and the maxRequests thing is hard to manage if we just keep adding urls to the same crawler.

My question is: what's the best way to create a playwright crawler that will automatically handle the processing of 500 pages for each start?

21 comments

bbmax

better error handling?

Plain Text

WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"icvnyTXX7zWJjgV","url":"https://www.gastongov.com/486/Transportation-Planning","retryCount":2}

any way to get more data / error handle better than this?

2 comments

bbmax

Need serious help scaling crawlee

I have an ECS instance with 4vCPU & 16gb RAM. My scaling options are the following:

Plain Text

    maxConcurrency: 200,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,

I am starting 4 of these crawlers at a time.
Here is a snapshot log:

Plain Text

{"time":"2024-04-15T00:09:08.818Z","level":"INFO","msg":"PuppeteerCrawler:AutoscaledPool: state","currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.106},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":1},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

Can anyone help me identify the correct settings so it is not maxed out?

29 comments

bbmax

playwright & pdf + error handling

Hello,
Playwright will throw an net::ERR_ABORTED when scraping any type of pdf file, so, the only way I've been able to figure out how to handle this is in a preNavigationHook, since, I can't catch the error in the routerhandler.

Does anyone have any better suggestions? I'm wondering if I should have two crawlers, playwright for normal pages, and then when it comes across an pdf send it to cheerio?

Thanks!

12 comments

Apify Discord Mirror

enqueue urls / request queue not being unique

{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshot

Reclaiming failed request back to the list or queue. Detected a session error, rotating session.

Trying to optimize autoscale options

I want the first urls that I enqueue to run before the next forefront:true.

Memory for only 1 browser is 12gb? How to ensure clean up after pages?

Does crawlee already implement this fingerprint generator?

more than one request queue

preNavigationHook needs to listen to response from network and change goToOptions.

getMostPopularErrors

Running out of space with pupeteers user profiles on a long running scrape.

isFinishedFunction, check other crawler?

Memory usage is spiking, even with CRAWLEE_AVAILABLE_MEMORY_RATIO set to .7

long running scraper, 500+ pages for each crawl

better error handling?

Need serious help scaling crawlee

playwright & pdf + error handling