NeoNomade | Scraping hellhound

Apify

Apify Crawlee GitHub

Apify Discord Mirror

NeoNomade | Scraping hellhound

N

NeoNomade | Scraping hellhound

Offline, last seen yesterday

Joined August 30, 2024

NNeoNomade | Scraping hellhound

·

Camoufox failing

I have a project that is using the PlaywrightCrawler from Crawlee.
If I create the template camoufox it's running perfectly, when I take the same commands from the package.json of the template and basically following the same example in my project I get the following error:

Plain Text

2025-03-13T11:58:38.513Z [Crawler] [INFO ℹ️] Finished! Total 0 requests: 0 succeeded, 0 failed.
{"terminal":true}
2025-03-13T11:58:38.513Z [Crawler] [ERROR ❌] BrowserLaunchError: Failed to launch browser. Please check the following:
- Check whether the provided executable path "/Users/dp420/.cache/camoufox/Camoufox.app/Contents/MacOS/camoufox" is correct.
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

Of course none of those 2 ideas are helping, camoufox binary is already there, and playwright install --with-deps have been already ran because the project was previously running firefox.

the entire error log is attached

2 comments

n

NNeoNomade | Scraping hellhound

·

Replace default logger

Hello, did anybody manage to completely replace the logs from Crawlee with console logs ?
If yes, can you please share your implementation ?

1 comment

C

NNeoNomade | Scraping hellhound

·

CheerioCrawler headerGenerator help

Hello !

I kept reading the docs but couldn't find a clear information about this. When we use Puppeteer or Playwright we can tweak in browserPool the fingerprintGenerator. For Cheerio we have the headerGenerator from got, how we can adjust it inside the CheerioCrawler ?

2 comments

L

f

NNeoNomade | Scraping hellhound

·

extract cookies in playwright

Hello,
In puppeteer in order to extract cookies I was doing :

Plain Text

cookiesStore = await page.cookies(page.url());

how can I achieve the same thing in Playwright ? Can't find anything in docs

1 comment

A

NNeoNomade | Scraping hellhound

·

running on ARM

Have anybody managed to deploy Cheerio or Puppeteer crawlers on ARM instances ?

4 comments

H

N

L

NNeoNomade | Scraping hellhound

·

XVFB fails on server.

I've deployed a playwright with chromium crawler on aws batch, with the default docker image.
this is the error that I'm getting, it's mandatory for this crawler to run headful because otherwise there are some buttons that I need to click that are not loading.
(Error log attached).
I've also tried to create a custom slimmer image, but I bump into the same issue with Xvfb.

16 comments

m

N

NNeoNomade | Scraping hellhound

·

Handle browser failure

I have Puppeteer scraper that is doing lots of actions on a page, at one point the browser fails.
It's a page with infinite scroll and I have to click a button and scroll down. After 70-80 interactions the browser crashes, and the request is getting retried as usual.
The main idea is that with those actions I'm collecting
urls that I wan't to navigate.
I want to somehow handle the browser crashing so I can start with those urls when the browser crashes.

2 comments

P

NNeoNomade | Scraping hellhound

·

Interception error in Puppeteer

I'm getting this error in Puppeteer but I'm not doing any interception in my script, I just create a request and add it to the crawler using crawler.addRequests, the request is a get where I just provide url and headers.

Plain Text

DEBUG Error while disabling request interception {"error":{"name":"TargetCloseError","message":"Protocol error (Network.setCacheDisabled): Target closed","stack":"TargetCloseError: Protocol error (Network.setCacheDisabled): Target closed\n at CallbackRegistry.clear (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:138:36)\n at CDPSessionImpl._onClosed (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:451:25)\n at Connection.onMessage (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:248:25)\n at WebSocket.<anonymous> (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/NodeWebSocketTransport.js:52:32)\n at callListener (project/node_modules/ws/lib/event-target.js:290:14)\n at WebSocket.onMessage (project/node_modules/ws/lib/event-target.js:209:9)\n at WebSocket.emit (node:events:365:28)\n at Receiver.receiverOnMessage (project/node_modules/ws/lib/websocket.js:1184:20)\n at Receiver.emit (node:events:365:28)\n at Receiver.dataMessage (project/node_modules/ws/lib/receiver.js:541:14)"}}

1 comment

A

NNeoNomade | Scraping hellhound

·

Crawler works locally but not on cloud

Hello, I've built a puppeteer crawler, nothing special about it.
It works locally flawless, I've tried to deploy to AWS on batch with Fargate, I get navigation timeouts after 60 seconds, switched to EC2, navigation timeouts after 60 seconds, increased navigation timeout to 120 seconds, same error.
Switched proxies between BrightData and OxyLabs, same issue.
Deployed to Apify, same issue.

I'm getting out of my mind understanding why is this happening.

17 comments

N

P

A

t

NNeoNomade | Scraping hellhound

·

Crawler only working in headed mode.

I have a Puppeteer Crawler that works almost flawless in headed mode, but if I go headless all the requests are getting 403 errors.
I was thinking that xvfb should fix this but unfortunately it doesn't. Any other ideas ?

10 comments

A

N

c

O

NNeoNomade | Scraping hellhound

·

Cheerio memory error

Hello,
I have deployed a CheerioCrawler on AWS, the machine has 2vCPU and 4gb of ram, but I get the following error:

Plain Text

WARN CheerioCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 1174 MB of 750 MB (157%). Consider increasing available memory.

What can it be ?

2 comments

P

N

NNeoNomade | Scraping hellhound

·

Pause concurrent requests ?

Hello,
I have the following issue, I have a website that I'm scraping and I need to login every 100-150 items.
The issue is, if I'm going with more than 1 concurrent requests when in needs to login it already has in progress requests, which will go wrong.
So I have a marker that I'm extracting to know when I need to login again.
I want to go with >1 concurrent requests and stop everything when that marker is found, do the login and then resume.
Could it be possible to achieve that ?

22 comments

N

P

A

NNeoNomade | Scraping hellhound

·

requestHandler timed out

Hello,
I have a quite big scraper, it goes over 200k pages and it will take approximately 12 hours, but after 6 hours, for some reason all the requests are getting this requestHandler timed out after 30 seconds.

I don't think increasing the requestHandler timeout will solve it, maybe there is something else wrong that I don't get ?

15 comments

P

A

N

O

H

NNeoNomade | Scraping hellhound

·

networkidle2 option

Hello,
In puppeteer for page.reload or page.goto you can choose the option {waitUntil: 'networkidle2'} , using Puppeteer in Crawlee, I only found that I can use that if I reload each page .

Is there any other way to configure navigation to use the {waitUntil: 'networkidle2'} from the beginning ?

1 comment

H

NNeoNomade | Scraping hellhound

·

Blocking certain requests

I'm trying to block some requests in Puppeteer but it doesn't seem to work if I run the script headed :

Plain Text

const blockedResourceTypes = ['webp', 'svg', 'mp4', 'jpeg', 'gif', 'avif', 'font']
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: false,
            devtools: true,
            defaultViewport:{ width: 1920, height: 6000 },
            args: [
                '--disable-dev-shm-usage',
            ]
        },
        useIncognitoPages: true,
    },
    proxyConfiguration,
    requestHandler: router,
    maxConcurrency: 16,
    maxRequestRetries: 15,
    maxRequestsPerMinute: 2,
    navigationTimeoutSecs: 120,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 15 times.`);
    },

    preNavigationHooks: [
        async ({ addInterceptRequestHandler }) => {
            await addInterceptRequestHandler((request) => {
                if (blockedResourceTypes.includes(request.resourceType())) {
                    return request.respond({
                        status: 200,
                        body: 'useless shit',
                    });
                }
                return request.continue();
            });
        },
    ],
});

Any ideas ?

10 comments

N

P

A

NNeoNomade | Scraping hellhound

·

--disable-dev-shm-usage

How can I run puppeteer with this tag ? (obviously inside Crawlee)

2 comments

P

N

NNeoNomade | Scraping hellhound

·

CheerioCrawler hangs with 12 million urls

Plain Text

const requestList = await RequestList.open('My-ReqList', allUrls, { persistStateKey: 'My-ReqList' });
console.log(requestList.length())
const crawler = new CheerioCrawler({
  requestList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
      log.debug(`Request ${request.url} failed 20 times.`);
  },
});
await crawler.run()

allUrls is a list that contains 12 million urls, I'm trying to load them in the CheerioCrawler, but the process hangs using 14gb of ram memory and it doesn't even logs the requestList.length().

Can anybody help, please ?

11 comments

O

N

NNeoNomade | Scraping hellhound

·

download xml.gz sitemaps.

I'm trying to parse the sitemaps from a website that has .xml.gz sitemaps, in python I could use gunzip to decompress and use them.
In crawlee we only have the "downloadListOfUrls" method, how I could make it to decompress those files before using them >?
sitemap: https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz

6 comments

A

N

NNeoNomade | Scraping hellhound

·

custom logic for status codes

Is it possible to do some custom logic based on status codes in the Puppeteer Crawler ? If yes, how ?

3 comments

m

N

R

NNeoNomade | Scraping hellhound

·

Concurrent requests and login

Hello,
I’m scraping a website where I have to choose the location (it’s an e commerce store and I have to choose the shop) .
The issue is , I’ve created a post navigation hook that is checking for the location and in case it’s not correct it’s starting the process of selecting the location again, saving the cookies and retry the request with the new cookies.
With this workflow I can only achieve concurrency 1.
Otherwise I’ll have multiple selections of the location at the same time and 99% of time it fails.
How could I somehow pause the entire process, do the selection, and retry all the unhandled requests in the queue .

Thanks ! 🙏🏼

2 comments

N

P

NNeoNomade | Scraping hellhound

·

error handling

Can we somehow throw errors that are closing the page ?
and not retrying the request?

3 comments

N

H

A

NNeoNomade | Scraping hellhound

·

Throw error that respects maxRequestRetries

Hello,
With RetryRequestError, the request gets retried an infinite times until it succeeds, what error should I throw to respect the maxRequestRetries?

4 comments

N

P

NNeoNomade | Scraping hellhound

·

TSConfig in Crawlee projects.

Plain Text

Cannot find module 'crawlee'. Did you mean to set the 'moduleResolution' option to 'nodenext', or to add aliases to the 'paths' option?ts(2792)

The linter is giving this error even on the template project.
This needs attention or can I let it like this ?

17 comments

N

P

NNeoNomade | Scraping hellhound

·

enqueueLinksByClickingElements help

This is the code :

Plain Text

await utils.puppeteer.enqueueLinksByClickingElements({
        page,
        requestQueue: RequestQueue.open(),
        selector: 'li.pagination_next',
        label: 'category',
        forefront: true

    });

This is the error :

Plain Text

 Reclaiming failed request back to the list or queue. Expected property object `requestQueue` to have keys `["fetchNextRequest","addRequest"]` in object `options`

I have imported RequestQueue from crawlee, don't understand where it goes wrong

3 comments

N

v

NNeoNomade | Scraping hellhound

·

Error when running in Docker Container

I'm deploying a Crawlee (Cheerio) project in an amazonlinux:2023 based docker container.
I get the following error:

Plain Text

> node src/main.js

DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: ee911a9c-b90e-412e-af5b-a470b0172ba8
INFO  CheerioCrawler: Starting the crawl
ERROR Memory snapshot failed.
  Error: spawn ps ENOENT
      at ChildProcess._handle.onexit (node:internal/child_process:283:19)
      at onErrorNT (node:internal/child_process:476:16)
      at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
node:internal/errors:490
    ErrorCaptureStackTrace(err);
    ^

Error: spawn ps ENOENT
    at ChildProcess._handle.onexit (node:internal/child_process:283:19)
    at onErrorNT (node:internal/child_process:476:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -2,
  code: 'ENOENT',
  syscall: 'spawn ps',
  path: 'ps',
  spawnargs: [ '-A', '-o', 'ppid,pid,stat,rss,comm' ]
}

Node.js v18.16.0

2 comments

A

N