Apify and Crawlee Official Forum

{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshot

This error is happening consistently, even while running only 1 browser. When I load up the server and look at top, there are a bunch of long-running Chrome processes that haven't been killed.

Output of top attached:

Error:
Plain Text
{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 16268 MB of 14071 MB (116%). Consider increasing available memory.","scraper":"web","url":"https://www.natronacounty-wy.gov/845/LegalPublic-Notices","place_id":"65a603fac769fa16f6596a8f"}    
Attachment: image.png
That top is with zero browsers currently running.
To debug this, the routes would be needed.
For example, even if you have await page.close() at the end of each handler, any process in the handler that hangs can lead to this.
It's hard to debug without the content of the routes.
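To illustrate the kind of hang that can hold a page open, here is a minimal sketch of a handler that wraps its work in a timeout; scrapeNotices and the 30-second limit are hypothetical placeholders, not taken from the original project.

Plain Text
// Minimal sketch of a hang-guarded request handler (hypothetical helper names).
const { createPuppeteerRouter } = require('crawlee');

const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ page, request, log }) => {
    // Wrap a potentially hanging step so it cannot keep the page alive forever.
    const withTimeout = (promise, ms) =>
        Promise.race([
            promise,
            new Promise((_, reject) =>
                setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)),
        ]);

    try {
        // scrapeNotices is a placeholder for whatever work the route does.
        await withTimeout(scrapeNotices(page), 30_000);
    } catch (err) {
        log.warning(`Handler failed for ${request.url}: ${err.message}`);
    }
    // No manual page.close() needed; Crawlee closes the page after the handler returns.
});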
I see you're using Node.js. I would suggest that you kill all actively running browsers/child processes (page.close() and browser.close() are not enough, especially when the script hangs).

When you launch a browser, get its process ID (browser.process().pid) and manually kill that process when you're done with the browser. You can use this library - https://www.npmjs.com/package/tree-kill. So instead of browser.close(), do:

Plain Text
const kill = require('tree-kill');
// Get the PID of the Chromium process Puppeteer spawned and kill it
// together with all of its child processes.
const browserPid = browser.process().pid;
kill(browserPid);


Use with caution though; only kill the process when you completely don't need the browser anymore 😅
Another option is to use a fairly obsolete library, https://github.com/thomasdondorf/puppeteer-cluster. You can control the concurrency, and it efficiently manages all the browsers/pages running on the server. See the example of running an express server with browsers on it - https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js 👍

PS: this library is not maintained but for the most part, it gets the job done. πŸ˜€
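For context, basic puppeteer-cluster usage looks roughly like this; a minimal sketch based on that library's README, with placeholder values for the concurrency cap and the queued URL.

Plain Text
// Minimal puppeteer-cluster sketch: up to 2 pages processed in parallel.
const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito context per worker
        maxConcurrency: 2,                        // hard cap on parallel pages
    });

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        console.log(url, await page.title());
    });

    cluster.queue('https://www.natronacounty-wy.gov/845/LegalPublic-Notices');

    await cluster.idle();   // wait for the queue to drain
    await cluster.close();  // shut down everything the cluster spawned
})();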
Thanks for checking. How do you get the browser PID from the BrowserPool within Crawlee?
This solution is against the Browser pool 🀣
lmao -- I agree. I'm thinking crawlee should help manage this
but gotta do what you gotta do.
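One possible route, sketched under the assumption that Crawlee's browser-pool postLaunchHooks receive (pageId, browserController) and that browserController.browser exposes the underlying Puppeteer Browser, would be to record PIDs from a hook:

Plain Text
// Sketch: record browser PIDs via a browser-pool post-launch hook.
// Assumes postLaunchHooks receive (pageId, browserController) and that
// browserController.browser is the underlying Puppeteer Browser instance.
const { PuppeteerCrawler } = require('crawlee');

const browserPids = new Set();

const crawler = new PuppeteerCrawler({
    browserPoolOptions: {
        postLaunchHooks: [
            (pageId, browserController) => {
                const pid = browserController.browser.process()?.pid;
                if (pid) browserPids.add(pid);
            },
        ],
    },
    requestHandler: async ({ page }) => { /* ... */ },
});

// Later, e.g. on shutdown, the collected PIDs can be passed to tree-kill.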
I'm 100% sure the issue lies in the routes. I crawled 10 million URLs with a single crawler without hitting this issue 🤣
But I tweaked the routes to be as memory-efficient as possible.
Ah, I see. None of the examples I gave above use Crawlee. They're probably not suitable for your use case, but I've been using this approach in several Actors (that run on a VPS) in production.

But as you're using Crawlee, I'd recommend lowering the concurrency until you find the OPTIMAL performance settings. This will allow Crawlee to gracefully handle the browsers regardless of how many instances are spawned.
The pages are probably super heavy, so the Crawlee autoscaling is not able to keep the memory under the limit. Maybe you could slow down the scaling. If you want to dig in, it would be better to send a reproduction to a Crawlee GitHub issue, ideally with a log showing how the current and desired concurrency changes.
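As a reference point, the tuning being suggested looks roughly like this; a sketch using Crawlee's documented maxConcurrency and autoscaledPoolOptions with placeholder values, plus the CRAWLEE_MEMORY_MBYTES environment variable for the memory limit itself.

Plain Text
// Sketch of capping concurrency and slowing down autoscaling in Crawlee.
// The option names follow Crawlee's AutoscaledPool docs; the values are examples only.
const { PuppeteerCrawler } = require('crawlee');

const crawler = new PuppeteerCrawler({
    maxConcurrency: 2,          // hard cap on parallel pages
    autoscaledPoolOptions: {
        scaleUpStepRatio: 0.02, // scale up more cautiously than the default
    },
    requestHandler: async ({ page, request }) => { /* ... */ },
});

// The memory limit itself can also be raised, e.g. by setting the
// CRAWLEE_MEMORY_MBYTES environment variable before starting the process.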