Apify and Crawlee Official Forum

Crafty
Hi All,

I have a pre-navigation hook that listens for requests and, if they return images, saves them to the cloud:

Plain Text
  return async (context) => {
    if (context.request.label === requestLabels.article) {
      context.page.on('request', async (req) => {
        if (req.resourceType() === 'image') {
          const response = await req.response();
          // extra processing and save to cloud
        }
      });
    }
  };
This works in 95% of cases, but there are some where the main request handler completes before each req.response() can be collected. This causes an error, since the browser context closes with the main request handler. My question is: how can I best get around this? One idea I had was to put a promise in the page userData that resolves when the page has no outstanding images, but after reading the docs I'm not sure this is possible, since userData needs to be serialisable. Has anyone else encountered this type of issue, and how did you get around it?
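For what it's worth, a rough sketch of one way around the serialisation constraint: keep the pending saves in module scope (e.g. a WeakMap keyed by the page) instead of userData, and await them at the end of the request handler. The articleImagesPreNavHook name, saveImageToCloud, and the exact context shape are assumptions here:

Plain Text
import type { Page, Request } from 'playwright';

// Shared between the hook and the request handler: in-flight image saves per page.
const pendingSaves = new WeakMap<Page, Promise<void>[]>();

export function articleImagesPreNavHook() {
  return async (context: any) => {
    if (context.request.label !== requestLabels.article) return;
    pendingSaves.set(context.page, []);
    context.page.on('request', (req: Request) => {
      if (req.resourceType() !== 'image') return;
      const save = (async () => {
        const response = await req.response();
        if (!response) return;            // request failed or the page closed
        await saveImageToCloud(response); // placeholder for the upload step
      })().catch(() => {});               // swallow errors from early page closes
      pendingSaves.get(context.page)?.push(save);
    });
  };
}

// At the end of the article request handler, before it returns:
// await Promise.all(pendingSaves.get(page) ?? []);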
2 comments
Hi All,

I'm running a Playwright crawler and am running into a bit of an issue with crawler stability. Have a look at these two log messages.


Plain Text
{
  "service": "AutoscaledPool",
  "time": "2024-10-30T16:42:17.049Z",
  "id": "cae4950d568a4b8bac375ffa5a40333c",
  "jobId": "9afee408-42bf-4194-b17c-9864db707e5c",
  "currentConcurrency": "4",
  "desiredConcurrency": "5",
  "systemStatus": "{\"isSystemIdle\":true,\"memInfo\":{\"isOverloaded\":false,\"limitRatio\":0.2,\"actualRatio\":0},\"eventLoopInfo\":{\"isOverloaded\":false,\"limitRatio\":0.6,\"actualRatio\":0},\"cpuInfo\":{\"isOverloaded\":false,\"limitRatio\":0.4,\"actualRatio\":0},\"clientInfo\":{\"isOverloaded\":false,\"limitRatio\":0.3,\"actualRatio\":0}}"
}


AutoscaledPool is trying to increase its concurrency from 4 to 5, since in its view the system was idle. Twenty seconds later, though:

Plain Text
{
  "rejection": "true",
  "date": "Wed Oct 30 2024 16:42:38 GMT+0000 (Coordinated Universal Time)",
  "process": "{\"pid\":1,\"uid\":997,\"gid\":997,\"cwd\":\"/home/myuser\",\"execPath\":\"/usr/local/bin/node\",\"version\":\"v22.9.0\",\"argv\":[\"/usr/local/bin/node\",\"/home/myuser/FIDO-Scraper-Discovery\"],\"memoryUsage\":{\"rss\":337043456,\"heapTotal\":204886016,\"heapUsed\":168177928,\"external\":30148440,\"arrayBuffers\":14949780}}",
  "os": "{\"loadavg\":[3.08,3.38,3.68],\"uptime\":312222.44}",
  "stack": "response.headerValue: Target page, context or browser has been closed\n    at Page.<anonymous> (/home/myuser/FIDO-Scraper-Discovery/dist/articleImagesPreNavHook.js:15:60)"
}

which suggests memory was much tighter than AutoscaledPool was considering, likely due to the additional RAM that Chromium was using. Crawlee was running in a Kubernetes pod with a 4 GB RAM limit. Is this behaviour intended, and how might I improve my performance? Does AutoscaledPool account for how much RAM is actually in use, or just how much the Node process uses?
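For reference, one hedged sketch of the kind of tuning that might apply here: tell Crawlee explicitly how much memory it may assume and cap concurrency, instead of letting it infer a limit from the host. The option names are my reading of the Configuration and crawler docs, so worth double-checking; 3072 MB is just an example below the 4 GB pod limit:

Plain Text
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Tell Crawlee how much memory it may use instead of letting it derive a
// limit from the host, and cap concurrency explicitly.
const config = new Configuration({
  memoryMbytes: 3072, // example value, below the 4 GB pod limit
});

const crawler = new PlaywrightCrawler(
  {
    requestHandler: router, // router as defined elsewhere in the project
    maxConcurrency: 3,      // hard ceiling regardless of AutoscaledPool's view
  },
  config,
);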
5 comments
Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error:

Plain Text
azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occured: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

I thought the browser install was all pre-handled in the image?
3 comments
Hi all,

I'm scraping a weird website which has file attachment links that rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

Everything between serve-file and file changes regularly.

My strategy to deal with this is to calculate the uniqueKey based on the 'stable' parts of the URL, roughly as sketched below. Then, when I detect the URL has changed, I can remove any queued requests with that uniqueKey and replace them with the new URL.
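Roughly what I mean by deriving the uniqueKey from the stable parts (the regex and the addRequests call are illustrative only, not tested against every link shape):

Plain Text
// Derive a uniqueKey that ignores the rotating token segment between
// /serve-file/ and /file/, so re-discovered links map onto the same request.
function stableUniqueKey(url: string): string {
  const { origin, pathname } = new URL(url);
  const match = pathname.match(/^(.*\/serve-file\/).*(\/file\/.*)$/);
  return match ? `${origin}${match[1]}*${match[2]}` : url;
}

// await crawler.addRequests([{ url, uniqueKey: stableUniqueKey(url) }]);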

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?

Thanks!
4 comments
Hi All,

I want to use Crawlee in Kubernetes. I want my jobs to be resumable if a pod gets evicted, so I have set up a PV for storage. This, however, poses an issue if I have multiple pods running at once. To solve this I want to change the storage dir programmatically when instantiating Crawlee. I know I can do this through env vars, but I'd prefer a more programmatic solution if possible.

I have looked at the constructors for the (Playwright)Crawler and the Configuration class, but I don't seem to be able to set it.
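One shape this could take, assuming the MemoryStorage client's localDataDirectory option is the right lever (POD_NAME and the /data path below are placeholders for however the pod identity gets injected):

Plain Text
import { PlaywrightCrawler, Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Give each pod its own storage directory on the shared PV.
const podName = process.env.POD_NAME ?? 'default';

const config = new Configuration({
  storageClient: new MemoryStorage({
    localDataDirectory: `/data/crawlee/${podName}`,
    persistStorage: true,
    writeMetadata: true,
  }),
  purgeOnStart: false, // keep state so an evicted pod can resume
});

const crawler = new PlaywrightCrawler({ requestHandler: router }, config);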

Thanks!
3 comments
Hi All,

I'm trying to crawl a website that has PDFs to download across different pages.

An example
https://dca-global.org/file/view/12756/interact-case-study-cedaci

On that page there is a button with a download link, and the download link changes every time you visit the page. When I navigate to the download URL manually it works as expected (the file downloads and the tab closes). When I try to navigate to it with the Playwright crawler, however, I get a 403 error saying HMAC mismatch, but strangely the file still downloads? (I confirmed this by finding the download cache in my temp storage.) I'm not sure if this is some kind of anti-scraping functionality, but if so, why would it still download?

Here is my Crawlee config. Since it is a 403, my handler never gets called:

Plain Text
  chromium.use(stealthPlugin());

  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.SPIDER,
    spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(requestLabels.ARTICLE, articleHandlerFactory(container));

  const config = new Configuration({
    storageClient: new MemoryStorage({
      localDataDirectory: `./storage/${message.messageId}`,
      writeMetadata: true,
      persistStorage: true,
    }),
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: false,
  });

  const crawler = new PlaywrightCrawler(
    {
      launchContext: {
        launcher: chromium,
      },
      requestHandler: router,
      errorHandler: (_request, error) => {
        logger.error(`${error.name}\n${error.message}`);
      },
      maxRequestsPerCrawl:
        body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
      useSessionPool: true,
      persistCookiesPerSession: true,
    },
    config,
  );
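One alternative sketch for the article handler, in case the 403 is tied to navigating directly to the rotating URL: stay on the page, click the link, and let Playwright capture the download event. The selector and handler body are assumptions, not the actual handler:

Plain Text
// Instead of enqueueing the rotating /serve-file/ URL, click the button on the
// page and capture the resulting download.
router.addHandler(requestLabels.ARTICLE, async ({ page, log }) => {
  const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.click('a[href*="/serve-file/"]'), // selector is a guess
  ]);
  const tmpPath = await download.path(); // Playwright keeps the file in a temp dir
  log.info(`Downloaded ${download.suggestedFilename()} to ${tmpPath}`);
  // read the file from tmpPath and push it wherever it needs to go
});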
7 comments
Hi All,

I have a Playwright crawler that, after a few hours, exhausts its memory and ends up going extremely slowly. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but it was my understanding that AutoscaledPool should generally deal with that anyway?

Most of my memory usage is coming from my Chromium instances: there are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB.

Here is my system state message:

Plain Text
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}


and here is my memory warning message

Plain Text
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}

The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense with the default value for availableMemoryRatio being 0.25. The PC also has plenty of available RAM above Crawlee, sitting at about 67% usage currently.
Why isn't AutoscaledPool scaling down, or otherwise clearing up Chromium instances, to improve its memory condition?
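For context, here is the sort of tuning sketch that seems relevant: retire browsers sooner and cap pages per browser so idle Chromium instances don't pile up. The option names come from my reading of the BrowserPool docs (worth double-checking), and the numbers are arbitrary:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  requestHandler: router,            // router as defined elsewhere
  maxConcurrency: 8,                 // explicit ceiling instead of relying only on AutoscaledPool
  browserPoolOptions: {
    maxOpenPagesPerBrowser: 5,       // fewer tabs per Chromium instance
    retireBrowserAfterPageCount: 20, // recycle browsers before they bloat
  },
});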
7 comments