Apify and Crawlee Official Forum

Updated 4 months ago

Crawlee memory management

Hi All,

I have a Playwright crawler that exhausts its memory after a few hours and ends up running extremely slowly. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but it was my understanding that AutoscaledPool should generally deal with it anyway?

Most of my memory usage is coming from my Chromium instances. There are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB.

Here is my system state message:

Plain Text
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}
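
For anyone reading a state message like this one: as far as I understand Crawlee's system status, each `actualRatio` is the share of recent snapshots in which that resource was overloaded, and a resource counts as overloaded once that share exceeds its `limitRatio`. A minimal sketch that lists the overloaded resources from the values above (`overloadedResources` is a made-up helper, not part of Crawlee):

```typescript
// Shape of one resource entry under "systemStatus" in the state message.
interface ResourceInfo {
  isOverloaded: boolean;
  limitRatio: number;
  actualRatio: number;
}

interface SystemStatus {
  isSystemIdle: boolean;
  memInfo: ResourceInfo;
  eventLoopInfo: ResourceInfo;
  cpuInfo: ResourceInfo;
  clientInfo: ResourceInfo;
}

const RESOURCE_KEYS = ['memInfo', 'eventLoopInfo', 'cpuInfo', 'clientInfo'] as const;

// Hypothetical helper: returns the resources whose overloaded share exceeds the limit.
function overloadedResources(status: SystemStatus): string[] {
  return RESOURCE_KEYS.filter(
    (key) => status[key].actualRatio > status[key].limitRatio,
  );
}

// Values copied from the state message in the post.
const status: SystemStatus = {
  isSystemIdle: false,
  memInfo: { isOverloaded: true, limitRatio: 0.2, actualRatio: 1 },
  eventLoopInfo: { isOverloaded: false, limitRatio: 0.6, actualRatio: 0.019 },
  cpuInfo: { isOverloaded: false, limitRatio: 0.4, actualRatio: 0 },
  clientInfo: { isOverloaded: false, limitRatio: 0.3, actualRatio: 0 },
};

console.log(overloadedResources(status)); // [ 'memInfo' ]
```

Here only memory is over its limit, and it is overloaded in every recent snapshot.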


And here is my memory warning message:

Plain Text
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}

The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense with the default value for availableMemoryRatio being 0.25. The PC also has plenty of available RAM beyond what Crawlee uses, sitting at about 67% usage currently.
Why isn't AutoscaledPool scaling down or otherwise cleaning up Chromium instances to improve its memory condition?
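
The numbers in the warning can be checked with a little arithmetic. A small sketch, assuming the limit is simply total memory times the 0.25 ratio mentioned above (`memoryLimitMb` and `usagePercent` are made-up helper names):

```typescript
// Memory budget, assuming budget = total system memory * ratio.
function memoryLimitMb(totalMb: number, ratio: number): number {
  return Math.floor(totalMb * ratio);
}

// Usage relative to the budget, rounded to a whole percent.
function usagePercent(usedMb: number, limitMb: number): number {
  return Math.round((usedMb / limitMb) * 100);
}

// A nominal 24 GB machine; the OS usually reports slightly less,
// which would explain the 6065 MB limit in the warning.
console.log(memoryLimitMb(24 * 1024, 0.25)); // 6144

// Figures taken from the warning message.
console.log(usagePercent(7164, 6065)); // 118
```

If the budget itself is too small, Crawlee's `Configuration` has `memoryMbytes` and `availableMemoryRatio` options to change it. Note also that the state message already shows a concurrency of 1, so the pool has nothing left to scale down.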
7 comments
I think I fixed it. I don't think it was anything to do with Crawlee at all. Periodically I was opening a new Chromium context manually to handle authentication. I wasn't closing those contexts, so they were just piling up every 5 minutes.
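
One way to avoid that kind of leak is to wrap the manual context in a helper that always closes it. A minimal sketch, assuming the authentication step runs in a context created by hand; `withTemporaryContext` is a made-up name, and the two interfaces are structural stand-ins that a Playwright `Browser` and `BrowserContext` should satisfy:

```typescript
// Structural types so the sketch does not depend on a specific Playwright version.
interface ClosableContext {
  close(): Promise<void>;
}

interface ContextFactory<C extends ClosableContext> {
  newContext(): Promise<C>;
}

// Hypothetical helper: opens a context, runs the task, and closes the
// context afterwards even if the task throws.
async function withTemporaryContext<C extends ClosableContext, T>(
  browser: ContextFactory<C>,
  task: (context: C) => Promise<T>,
): Promise<T> {
  const context = await browser.newContext();
  try {
    return await task(context);
  } finally {
    await context.close();
  }
}
```

With Playwright this would be called as something like `await withTemporaryContext(browser, async (context) => { const page = await context.newPage(); /* log in, read cookies */ })`, so a periodic job cannot leave contexts behind.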
Out of interest, how did you generate that system state message?
It's just automatic, isn't it? I will double check if I have anything special. 🙂
Here is my crawler config code:

Plain Text
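  // Library imports used below; the handler factories, container,
  // requestLabels, and logger are the poster's own and are not shown.
  import { Configuration, PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';
  import type { PlaywrightCrawlerOptions } from 'crawlee';
  import { MemoryStorage } from '@crawlee/memory-storage';
  import { chromium } from 'playwright';

  const router = createPlaywrightRouter();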
  router.addHandler(
    requestLabels.spider,
    await spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.spiderBackTrack,
    await spiderBackTrackHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.article,
    await articleHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.download,
    await downloadHandlerFactory(container),
  );

  const crawlerOptions: PlaywrightCrawlerOptions = {
    launchContext: {
      launcher: chromium,
    },
    requestHandler: router,
    preNavigationHooks: [
      downloadPreNavigationHookFactory(container),
      articleImageInterceptorFactory(container),
    ],
    errorHandler: errorHandlerFactory(container),
    failedRequestHandler: failedRequestHandlerFactory(container),
    maxRequestsPerCrawl:
      body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
    useSessionPool: true,
    log: new cralweeLogger(logger.child('crawlee')),
    persistCookiesPerSession: true,
  };

  const storageClient = new MemoryStorage({
    localDataDirectory: `./storage/${message.messageId}`,
    writeMetadata: true,
    persistStorage: true,
  });

  const crawlerConfig = new Configuration({
    storageClient: storageClient,
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: true,
  });

  const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);


The only key difference is that I made my own logger that hooks into the winston logging I have been using in the wider app.
Oh I want to have my own logger. How did you implement that?
You can extend the Crawlee Log class, override the 'internal' method (iirc), and do whatever you like.
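
A rough sketch of that idea. To keep it self-contained, `BaseLog` below is a local stand-in for Crawlee's `Log` class and `Sink` stands in for a winston logger; both are assumptions, as are the level numbers, and the real `internal` signature should be checked against the installed Crawlee version:

```typescript
// Stand-in for a winston-style logger (assumption: only log(level, message, meta) is needed).
interface Sink {
  log(level: string, message: string, meta?: Record<string, unknown>): void;
}

// Numeric levels as I understand Crawlee's LogLevel enum; verify against your version.
const LEVEL_NAMES: Record<number, string> = { 1: 'error', 3: 'warn', 4: 'info', 5: 'debug' };

// Local stand-in for Crawlee's Log class: the public helpers funnel into internal().
class BaseLog {
  info(message: string, data?: Record<string, unknown>): void {
    this.internal(4, message, data);
  }

  warning(message: string, data?: Record<string, unknown>): void {
    this.internal(3, message, data);
  }

  internal(level: number, message: string, data?: Record<string, unknown>): void {
    console.log(LEVEL_NAMES[level] ?? 'info', message, data ?? {});
  }
}

// Overrides internal() so every message goes to the application's logger instead.
class ForwardingLog extends BaseLog {
  private readonly sink: Sink;

  constructor(sink: Sink) {
    super();
    this.sink = sink;
  }

  internal(level: number, message: string, data?: Record<string, unknown>): void {
    this.sink.log(LEVEL_NAMES[level] ?? 'info', message, data);
  }
}
```

An instance of the real subclass would then be passed as the `log` option, as in the config above.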