Apify and Crawlee Official Forum

bmax
Offline, last seen last month
Joined August 30, 2024
I'm seeing a lot of the exact same URLs being run twice. Any ideas?
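For context, Crawlee deduplicates queued requests on request.uniqueKey, which is derived from a normalized URL by default, so the "same" URL can legitimately run twice if the stored keys differ. A minimal sketch for inspecting this (the example.com URLs and the explicit uniqueKey are illustrative):

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, log }) {
        // The queue deduplicates on uniqueKey, not on the raw URL string, so
        // logging both makes it easy to see why two "identical" URLs were kept:
        // differing query strings or an explicit uniqueKey override.
        log.info(`Handling ${request.url} (uniqueKey: ${request.uniqueKey})`);
    },
});

// An explicit uniqueKey bypasses the default URL-based deduplication,
// so the same URL is processed twice here on purpose.
await crawler.addRequests([
    { url: 'https://example.com/page' },
    { url: 'https://example.com/page', uniqueKey: 'second-visit' },
]);

await crawler.run();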
1 comment
This error is happening consistently, even while only running 1 browser. When I load up the server and look at top, there are a bunch of long-running Chrome processes that haven't been killed.

top output attached:

Error:
Plain Text
{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 16268 MB of 14071 MB (116%). Consider increasing available memory.","scraper":"web","url":"https://www.natronacounty-wy.gov/845/LegalPublic-Notices","place_id":"65a603fac769fa16f6596a8f"}    
15 comments
What exactly does this mean, and how do I solve it?


https://princeton.edu is the problematic website
1 comment
Hello,

I am running my scraper on an AWS ECS instance with 8 vCPUs and 16 GB of memory.

Plain Text
    maxConcurrency: 200,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,


Right now the average CPU and memory usage are both around 88%. Is there anything I can do here to optimize further?

I also have CRAWLEE_AVAILABLE_MEMORY_RATIO=.8
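With a full browser per page, maxConcurrency: 200 is far above what 16 GB can hold, so the autoscaled pool spends most of its time throttling; a lower ceiling plus an explicit floor tends to give steadier throughput. A hedged sketch, with illustrative numbers:

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Keep the ceiling within what 16 GB of memory can realistically hold
    // when every request runs in a Chrome page.
    minConcurrency: 2,
    maxConcurrency: 20,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,
    async requestHandler({ page }) {
        // ... scraping logic ...
    },
});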
9 comments
I have two places that use forefront: true: the very first URLs that I start the crawler with.

Then each page will also have a set of URLs that go to the front of the queue, but I want the very first URLs that started the crawl to have priority. How would I accomplish this?

I was thinking of two request queues but had a lot of problems with that as well.
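Since the queue only has a single forefront flag rather than priority levels, one hedged workaround with a single queue is to hold back forefront for page-discovered links until every seed has been handled. A minimal sketch, assuming the seeds are known up front and a module-level counter is acceptable in a single process:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

// Hypothetical seed list; the crawler's real start URLs go here.
const seeds = ['https://example.com/a', 'https://example.com/b'];
let pendingSeeds = seeds.length;

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, crawler }) {
        if (request.userData.isSeed) pendingSeeds -= 1;

        const queue = await crawler.getRequestQueue();
        const links = await page.$$eval('a', (as) => as.map((a) => a.href));

        for (const url of links) {
            // Discovered links only jump the queue once every seed has been
            // handled, so the seeds always keep the top spots.
            await queue.addRequest({ url }, { forefront: pendingSeeds === 0 });
        }
    },
});

await crawler.run(seeds.map((url) => ({ url, userData: { isSeed: true } })));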
1 comment
Hello,

Do you have to manually call page.close (or anything else) at the end of the default handler?
14 comments
Is it possible to create two request queues per .run()?
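A single run() drives one queue, but you can open a second, named queue with RequestQueue.open() and read or write it yourself alongside the crawler. A minimal sketch (the 'overflow' name, the a.defer selector, and the example URL are placeholders):

JavaScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// A crawler drives exactly one queue per run(), but nothing stops you from
// opening a second, named queue and feeding it from inside the handlers.
const mainQueue = await RequestQueue.open();               // default queue
const overflowQueue = await RequestQueue.open('overflow'); // hypothetical name

const crawler = new PlaywrightCrawler({
    requestQueue: mainQueue,
    async requestHandler({ page }) {
        // Links that a different crawler (or a later run) should process
        // can be parked in the second queue.
        const links = await page.$$eval('a.defer', (as) => as.map((a) => a.href));
        await overflowQueue.addRequests(links.map((url) => ({ url })));
    },
});

await crawler.run(['https://example.com']);
// A second crawler (or a later run) can consume overflowQueue afterwards.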
4 comments
Hey y'all, so basically I'm trying to see if the response is application/pdf; if it is, it should time out immediately and ideally skip the request.

Plain Text
async (crawlingContext, gotoOptions) => {
  const { page, request, crawler } = crawlingContext
  const queue = await crawler.getRequestQueue()
  const crawler_dto = request.userData.crawler_dto

  if (!request.url.endsWith('.pdf')) {
    gotoOptions.waitUntil = 'networkidle2'
    gotoOptions.timeout = 20000
    await page.setBypassCSP(true)
    await page.setExtraHTTPHeaders({
      'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    })
    await page.setViewport({ width: 1440, height: 900 })
  }

  // Intended to cut navigation short when the response turns out to be a PDF,
  // but by the time this listener fires, goto() has already been called with
  // the original timeout, so mutating gotoOptions here has no effect.
  page.on('response', async (page_response) => {
    if (page_response.headers()['content-type'] === 'application/pdf') {
      gotoOptions.timeout = 1
    }
  })
},
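One hedged alternative is to decide before navigation rather than during it: issue a cheap HEAD request in the hook and fail the request fast, without retries, when the response is a PDF, instead of waiting out the navigation timeout. This assumes sendRequest (got-scraping) is exposed on the hook's crawling context, which it is on recent Crawlee versions:

JavaScript
async (crawlingContext, gotoOptions) => {
    const { request, sendRequest } = crawlingContext;

    // Inspect the Content-Type without loading the page in the browser.
    const head = await sendRequest({ method: 'HEAD' });

    if ((head.headers['content-type'] ?? '').includes('application/pdf')) {
        // Fail fast and skip retries instead of burning the full goto timeout;
        // the request ends up in failedRequestHandler, which can ignore it.
        request.noRetry = true;
        throw new Error(`Skipping PDF response: ${request.url}`);
    }

    gotoOptions.waitUntil = 'networkidle2';
    gotoOptions.timeout = 20000;
},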
15 comments
Plain Text
"Failed to launch browser. Please check the following:\n- Check whether the provided executable path \"/usr/bin/google-chrome\" is correct.\n- Try installing a browser, if it's missing, by running `npx @puppeteer/browsers install chromium --path [path]` and pointing `executablePath` to the downloaded executable (https://pptr.dev/browsers-api)\n\nThe original error is available in the `cause` property. Below is the error received when trying to launch a browser:\n​","stack":"Failed to launch browser. Please check the following:\n- Check whether the provided executable path \"/usr/bin/google-chrome\" is correct.\n- Try installing a browser, if it's missing, by running `npx @puppeteer/browsers install chromium --path [path]` and pointing `executablePath` to the downloaded executable (https://pptr.dev/browsers-api)\n\nThe original error is available in the `cause` property. Below is the error received when trying to launch a browser:\n​\nError: ENOSPC: no space left on device, mkdtemp '/tmp/puppeteer_dev_profile-pXEfmi'\nError thrown at:\n\n    at PuppeteerPlugin._throwAugmentedLaunchError (/home/app/node_modules/@crawlee/browser-pool/abstract-classes/browser-plugin.js:145:15)\n    at PuppeteerPlugin._launch (/home/app/node_modules/@crawlee/browser-
5 comments
Hello,

two questions.

Is there a way to call this.isFinishedFunction so that it calls the original function but also adds another web_crawler_queue isFinished check on top of it? The uncommented-out function I tried worked somewhat, but after a long-running web_crawler_queue finished, it just kept giving me a stalled error, and the crawler that this function belongs to never finished.



Plain Text
    autoscaledPoolOptions: {
      isFinishedFunction: async () => {
        const web_crawler_queue = await RequestQueue.open(place_id)
        // Note: `this` is not bound to the pool inside this arrow function, so
        // there is no "original" isFinishedFunction to delegate to from here;
        // the line below checks both queues explicitly instead.
//        return this.isFinishedFunction() && await web_crawler_queue.isFinished()
        return await request_queue.isFinished() && await web_crawler_queue.isFinished()
      }
    },
8 comments
Hello,

I have a Playwright crawler that is listening to DB changes. Whenever a page gets added, I want it to scrape and enqueue 500 links for the whole scraping process, but there can be multiple things added to the DB at the same time. I've tried keepAlive, and the maxRequests limit is hard to manage if we just keep adding URLs to the same crawler.

My question is: what's the best way to create a Playwright crawler that will automatically handle the processing of 500 pages for each start?
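One hedged pattern is to spin up a short-lived crawler per DB event, each with its own named request queue and its own maxRequestsPerCrawl: 500, so the per-start budget is isolated. A minimal sketch; crawlForDbEvent, eventId, and startUrl are hypothetical names for whatever your DB listener provides:

JavaScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Hypothetical entry point, invoked once per database event.
async function crawlForDbEvent(eventId, startUrl) {
    // A named queue per event keeps the 500-request budget isolated instead of
    // sharing one long-lived keepAlive crawler.
    const requestQueue = await RequestQueue.open(`event-${eventId}`);

    const crawler = new PlaywrightCrawler({
        requestQueue,
        maxRequestsPerCrawl: 500,
        async requestHandler({ enqueueLinks }) {
            // ... scrape the page ...
            await enqueueLinks();
        },
    });

    await crawler.run([startUrl]);
    await requestQueue.drop(); // clean up the per-event queue when finished
}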
21 comments
Plain Text
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"icvnyTXX7zWJjgV","url":"https://www.gastongov.com/486/Transportation-Planning","retryCount":2}

Is there any way to get more data out of this, or to handle the error better?
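The crawler options expose hooks for exactly this: errorHandler fires before each retry and failedRequestHandler fires once retries are exhausted, and both receive the underlying error. A minimal sketch:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Called before each retry; useful for logging the underlying error.
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Retry ${request.retryCount} for ${request.url}: ${error.message}`);
    },
    // Called once the request has exhausted its retries.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Gave up on ${request.url}`, {
            error: error.message,
            // All errors collected across retries live on the request itself.
            errorMessages: request.errorMessages,
        });
    },
    async requestHandler({ page }) {
        // ... scraping logic ...
    },
});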
2 comments
I have an ECS instance with 4vCPU & 16gb RAM. My scaling options are the following:

Plain Text
    maxConcurrency: 200,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,

I am starting 4 of these crawlers at a time.
Here is a snapshot log:
Plain Text
{"time":"2024-04-15T00:09:08.818Z","level":"INFO","msg":"PuppeteerCrawler:AutoscaledPool: state","currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.106},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":1},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}    


Can anyone help me identify the correct settings so it is not maxed out?
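The log shows cpuInfo overloaded (actualRatio 1 against a limitRatio of 0.4), which is why the pool holds concurrency at 1: four browser crawlers are competing for 4 vCPUs. A hedged sketch of the usual knobs; the systemStatusOptions names below reflect my understanding of the current defaults (the 0.4 CPU limit in the log), so double-check them against your Crawlee version:

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Four browser crawlers on 4 vCPUs leave roughly one core each, so a
    // per-crawler ceiling in the single digits is more realistic than 200.
    maxConcurrency: 5,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            // Default is 0.4 (the "limitRatio" in the log); raising it lets the
            // pool tolerate more sustained CPU load before scaling down.
            maxCpuOverloadedRatio: 0.6,
        },
    },
    async requestHandler({ page }) {
        // ... scraping logic ...
    },
});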
29 comments
Hello,
Playwright will throw a net::ERR_ABORTED when scraping any kind of PDF file, so the only way I've been able to figure out how to handle this is in a preNavigationHook, since I can't catch the error in the router handler.

Does anyone have any better suggestions? I'm wondering if I should have two crawlers: Playwright for normal pages, and then when it comes across a PDF, send it to Cheerio?

Thanks!
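A sketch of the two-crawler idea: the Playwright crawler diverts PDF links into a named queue, and a plain HTTP crawler drains that queue and stores the bytes. The .pdf suffix check, the 'pdf-downloads' queue name, and the example URL are illustrative (content-type sniffing, like the HEAD-request hook above, could feed the same queue), and exclude / additionalMimeTypes availability depends on your Crawlee version:

JavaScript
import { PlaywrightCrawler, HttpCrawler, RequestQueue, KeyValueStore } from 'crawlee';

// Hypothetical named queue shared between the browser crawler and the HTTP one.
const pdfQueue = await RequestQueue.open('pdf-downloads');

const playwrightCrawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks }) {
        // HTML pages stay in this crawler; PDF links are diverted instead.
        await enqueueLinks({ exclude: ['**/*.pdf'] });
        const pdfLinks = await page.$$eval('a[href$=".pdf"]', (as) => as.map((a) => a.href));
        await pdfQueue.addRequests(pdfLinks.map((url) => ({ url })));
    },
});

const pdfCrawler = new HttpCrawler({
    requestQueue: pdfQueue,
    additionalMimeTypes: ['application/pdf'],
    async requestHandler({ request, body }) {
        // `body` is a Buffer here; store it or hand it to a PDF parser.
        const store = await KeyValueStore.open();
        await store.setValue(request.id, body, { contentType: 'application/pdf' });
    },
});

await playwrightCrawler.run(['https://example.com']);
await pdfCrawler.run();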
12 comments