Apify and Crawlee Official Forum

Crafty
Joined August 30, 2024
Hi All,

I have a Playwright crawler that, after a few hours, exhausts its memory and slows to a crawl. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but my understanding was that AutoscaledPool should generally handle this anyway?

Most of my memory usage is coming from my Chromium instances: there are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB.

Here is my system state message:

Plain Text
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}


and here is my memory warning message

Plain Text
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}

The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense given the default value of 0.25 for availableMemoryRatio. The PC also has plenty of RAM available beyond Crawlee, sitting at about 67% usage overall.
Why isn't AutoscaledPool scaling down or otherwise cleaning up Chromium instances to improve its memory situation?
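
For anyone comparing notes: when the machine has headroom, the budget Crawlee polices can be raised rather than waiting for scale-down. A minimal sketch, assuming the memoryMbytes Configuration option and the snapshotterOptions pass-through on autoscaledPoolOptions (option names taken from the Crawlee docs, not from this thread):

TypeScript
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Assumption: memoryMbytes replaces the default 25% budget
// (availableMemoryRatio) that produced the 6065 MB limit above.
const config = new Configuration({
    memoryMbytes: 12288, // let Crawlee use up to 12 GB of the 24 GB machine
});

const crawler = new PlaywrightCrawler(
    {
        requestHandler: async ({ page }) => { /* ... */ },
        autoscaledPoolOptions: {
            // Hard cap so scale-up can never outrun browser teardown.
            maxConcurrency: 10,
            // Assumption: forwarded to the Snapshotter that logged the warning.
            snapshotterOptions: { maxUsedMemoryRatio: 0.7 },
        },
    },
    config,
);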
11 comments
Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error:

Plain Text
azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occured: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

I thought the browser installation was all pre-handled in the image?
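
The usual culprit is a Playwright version mismatch: the image only ships the browsers for the Playwright version it was built against, so a newer playwright in package.json looks for a browser revision (chromium-1129 here) that was never baked in. Pinning playwright to the exact version preinstalled in the image is the clean fix; failing that, a hedged Dockerfile sketch, not an official recipe:

Dockerfile
FROM apify/actor-node-playwright-chrome:22

COPY package*.json ./
RUN npm install --omit=dev

# If package.json resolves a newer Playwright than the image was built with,
# fetch the matching browser revision into the image's browser path.
RUN npx playwright install chromium

COPY . ./
CMD npm start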
3 comments
Hi all,

I'm scraping a weird website with file attachment links that rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

Everything between serve-file and file changes regularly.

My strategy for dealing with this is to calculate the unique key from the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?

Thanks!
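
In case a sketch makes the unique-key idea concrete: derive the key from the parts of the path that survive rotation and pass it explicitly when enqueueing. stableUniqueKey and the regex below are illustrations based only on the URL shape above:

TypeScript
// The rotating tokens sit between /serve-file/ and /file/, so cut
// that span out and key on what remains.
function stableUniqueKey(rawUrl: string): string {
    const { hostname, pathname } = new URL(rawUrl);
    return hostname + pathname.replace(/\/serve-file\/.+\/file\//, '/serve-file/file/');
}

// freshUrl stands for the newly scraped link; the explicit uniqueKey makes
// rotated copies of the same file dedupe instead of queueing as new requests.
await crawler.addRequests([
    { url: freshUrl, uniqueKey: stableUniqueKey(freshUrl) },
]);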
4 comments
Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example:
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link, and the download link changes every time you visit the page. When I navigate to the download URL manually, it works as expected (the file downloads and the tab closes). When I try to navigate to it with the Playwright crawler, however, I get a 403 error saying HMAC mismatch, but strangely the file still downloads? (I confirmed this by finding the download cache in my temp storage.) I'm not sure if this is some kind of anti-scraping functionality, but if so, why would it still download?

Here is my Crawlee config. Since it is a 403, my request handler never gets called:

Plain Text
  // Apply the stealth plugin to the Playwright Chromium launcher
  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    // Persist this run's storage under its own message ID
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      // The first errorHandler argument is the crawling context, not the request
      errorHandler: (_context, error) => {
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
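
One way to sidestep the 403 on direct navigation is to stay on the page and capture the file through Playwright's download event, since a browser-initiated click evidently passes the HMAC check. A sketch only; the selector is hypothetical and download acceptance may need enabling in your launch options:

TypeScript
// Inside the request handler: click the link and await the download
// event instead of navigating to the rotating URL directly.
const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.click('a.download-link'), // hypothetical selector
]);
await download.saveAs(`./storage/files/${download.suggestedFilename()}`);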
7 comments
Hi All,

I want to use Crawlee in Kubernetes. I want my jobs to be resumable if a pod gets evicted, so I have set up a PV for storage. This, however, poses an issue if I have multiple pods running at once. To solve it, I want to change the storage directory programmatically when instantiating Crawlee. I know I can do this through environment variables, but I'd prefer a more programmatic solution if possible.

I have looked at the constructors for the (Playwright) Crawler and the Configuration class, but I don't seem to be able to set it there.

Thanks!
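
For what it's worth, the MemoryStorage client shown in the config posted above already takes a localDataDirectory, which gives each pod its own path without touching env vars. A sketch, assuming POD_NAME is injected via the Kubernetes Downward API and the PV is mounted at /mnt/crawlee-pv:

TypeScript
import { Configuration, PlaywrightCrawler } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Each pod writes to its own subdirectory of the mounted PV, so
// concurrent pods never share (or clobber) a storage directory.
const podName = process.env.POD_NAME ?? 'local';
const config = new Configuration({
    storageClient: new MemoryStorage({
        localDataDirectory: `/mnt/crawlee-pv/${podName}`,
        persistStorage: true,
    }),
});

const crawler = new PlaywrightCrawler({ /* ... */ }, config);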
3 comments