Apify and Crawlee Official Forum

Crafty
Joined August 30, 2024
Hi All,

I have a Playwright crawler that, after a few hours, exhausts its memory and slows to a crawl. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but my understanding was that AutoscaledPool should generally handle this anyway?

Most of my memory usage is coming from my Chromium instances: there are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB.

Here is my system state message:

Plain Text
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}


and here is my memory warning message

Plain Text
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}

The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense given the default value of 0.25 for availableMemoryRatio. The PC also has plenty of RAM available beyond Crawlee, sitting at about 67% usage overall.
Why isn't AutoscaledPool scaling down or otherwise cleaning up Chromium instances to improve its memory situation?
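
For anyone comparing notes: when the machine has headroom, the budget Crawlee polices can be raised rather than waiting for scale-down. A minimal sketch, assuming the memoryMbytes Configuration option and the snapshotterOptions pass-through on autoscaledPoolOptions (option names taken from the Crawlee docs, not from this thread):

TypeScript
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Assumption: memoryMbytes replaces the default 25% budget
// (availableMemoryRatio) that produced the 6065 MB limit above.
const config = new Configuration({
    memoryMbytes: 12288, // let Crawlee use up to 12 GB of the 24 GB machine
});

const crawler = new PlaywrightCrawler(
    {
        requestHandler: async ({ page }) => { /* ... */ },
        autoscaledPoolOptions: {
            // Hard cap so scale-up can never outrun browser teardown.
            maxConcurrency: 10,
            // Assumption: forwarded to the Snapshotter that logged the warning.
            snapshotterOptions: { maxUsedMemoryRatio: 0.7 },
        },
    },
    config,
);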
11 comments
Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error:

Plain Text
azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occured: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

I thought the browser installation was all pre-handled in the image?
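
The usual culprit is a Playwright version mismatch: the image only ships the browsers for the Playwright version it was built against, so a newer playwright in package.json looks for a browser revision (chromium-1129 here) that was never baked in. Pinning playwright to the exact version preinstalled in the image is the clean fix; failing that, a hedged Dockerfile sketch, not an official recipe:

Dockerfile
FROM apify/actor-node-playwright-chrome:22

COPY package*.json ./
RUN npm install --omit=dev

# If package.json resolves a newer Playwright than the image was built with,
# fetch the matching browser revision into the image's browser path.
RUN npx playwright install chromium

COPY . ./
CMD npm start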
3 comments
Hi all,

I'm scraping a weird website with file attachment links that rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

Everything between serve-file and file changes regularly.

My strategy for dealing with this is to calculate the unique key from the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?

Thanks!
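
In case a sketch makes the unique-key idea concrete: derive the key from the parts of the path that survive rotation and pass it explicitly when enqueueing. stableUniqueKey and the regex below are illustrations based only on the URL shape above:

TypeScript
// The rotating tokens sit between /serve-file/ and /file/, so cut
// that span out and key on what remains.
function stableUniqueKey(rawUrl: string): string {
    const { hostname, pathname } = new URL(rawUrl);
    return hostname + pathname.replace(/\/serve-file\/.+\/file\//, '/serve-file/file/');
}

// freshUrl stands for the newly scraped link; the explicit uniqueKey makes
// rotated copies of the same file dedupe instead of queueing as new requests.
await crawler.addRequests([
    { url: freshUrl, uniqueKey: stableUniqueKey(freshUrl) },
]);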
4 comments
Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link, and the download link changes every time you visit the page. When I navigate to the download URL manually, it works as expected (the file downloads and the tab closes). When I try to navigate to it with the Playwright crawler, however, I get a 403 error saying HMAC mismatch, but strangely the file still downloads? (I confirmed this by finding the download cache in my temp storage.) I'm not sure if this is some kind of anti-scraping functionality, but if so, why would it still download?

Here is my Crawlee config. Since it is a 403, my request handler never gets called:

Plain Text
  // Apply the stealth plugin to the Playwright Chromium launcher
  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    // Persist this run's storage under its own message ID
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      // The first errorHandler argument is the crawling context, not the request
      errorHandler: (_context, error) => {
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
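
One way to sidestep the 403 on direct navigation is to stay on the page and capture the file through Playwright's download event, since a browser-initiated click evidently passes the HMAC check. A sketch only; the selector is hypothetical and download acceptance may need enabling in your launch options:

TypeScript
// Inside the request handler: click the link and await the download
// event instead of navigating to the rotating URL directly.
const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.click('a.download-link'), // hypothetical selector
]);
await download.saveAs(`./storage/files/${download.suggestedFilename()}`);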
7 comments
Hi All,

I want to use Crawlee in Kubernetes. I want my jobs to be resumable if a pod gets evicted, so I have set up a PV for storage. This, however, poses an issue if I have multiple pods running at once. To solve it, I want to change the storage directory programmatically when instantiating Crawlee. I know I can do this through environment variables, but I'd prefer a more programmatic solution if possible.

I have looked at the constructors for the (Playwright) Crawler and the Configuration class, but I don't seem to be able to set it there.

Thanks!
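
For what it's worth, the MemoryStorage client shown in the config posted above already takes a localDataDirectory, which gives each pod its own path without touching env vars. A sketch, assuming POD_NAME is injected via the Kubernetes Downward API and the PV is mounted at /mnt/crawlee-pv:

TypeScript
import { Configuration, PlaywrightCrawler } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Each pod writes to its own subdirectory of the mounted PV, so
// concurrent pods never share (or clobber) a storage directory.
const podName = process.env.POD_NAME ?? 'local';
const config = new Configuration({
    storageClient: new MemoryStorage({
        localDataDirectory: `/mnt/crawlee-pv/${podName}`,
        persistStorage: true,
    }),
});

const crawler = new PlaywrightCrawler({ /* ... */ }, config);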
3 comments