Apify and Crawlee Official Forum

NeoNomade
Joined August 30, 2024
Hello,
In Puppeteer, in order to extract cookies I was doing:
Plain Text
cookiesStore = await page.cookies(page.url());

How can I achieve the same thing in Playwright? I can't find anything in the docs.
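In Playwright, cookies live on the browser context rather than the page, so the equivalent should be roughly this (a minimal sketch, assuming the default context created by Crawlee/Playwright):
Plain Text
// Playwright: read cookies from the page's BrowserContext
const cookiesStore = await page.context().cookies(page.url());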
1 comment
Has anybody managed to deploy Cheerio or Puppeteer crawlers on ARM instances?
4 comments
I've deployed a Playwright-with-Chromium crawler on AWS Batch, using the default Docker image.
This is the error I'm getting. It's mandatory for this crawler to run headful, because otherwise some buttons that I need to click don't load.
(Error log attached.)
I've also tried to create a custom, slimmer image, but I run into the same issue with Xvfb.
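For reference, a common way to run a headful browser on a server with no display is to wrap the start command in xvfb-run; this is only a sketch of that approach, assuming Xvfb can be installed in the image and that src/main.js is the entry point:
Plain Text
# Dockerfile sketch: run the crawler inside a virtual X server
RUN apt-get update && apt-get install -y xvfb
CMD ["xvfb-run", "-a", "node", "src/main.js"]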
16 comments
I have a Puppeteer scraper that performs lots of actions on a page, and at one point the browser fails.
It's a page with infinite scroll where I have to click a button and scroll down. After 70-80 interactions the browser crashes, and the request is retried as usual.
The main idea is that with those actions I'm collecting URLs that I want to navigate to.
I want to somehow handle the browser crash so I can continue from those URLs when it happens.
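One way to make the collected URLs survive a crash is to enqueue them incrementally instead of holding them in memory, since retried requests are deduplicated by URL in the queue. A sketch (collectVisibleUrls is a hypothetical helper):
Plain Text
// inside the requestHandler, after each batch of click/scroll interactions
const newUrls = await collectVisibleUrls(page); // hypothetical extraction helper
await crawler.addRequests(newUrls); // URLs already in the queue are deduplicated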
2 comments
I'm getting this error in Puppeteer, but I'm not doing any interception in my script. I just create a request and add it to the crawler using crawler.addRequests; the request is a GET where I only provide the URL and headers.

Plain Text
DEBUG Error while disabling request interception {"error":{"name":"TargetCloseError","message":"Protocol error (Network.setCacheDisabled): Target closed","stack":"TargetCloseError: Protocol error (Network.setCacheDisabled): Target closed\n at CallbackRegistry.clear (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:138:36)\n at CDPSessionImpl._onClosed (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:451:25)\n at Connection.onMessage (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:248:25)\n at WebSocket.<anonymous> (project/node_modules/puppeteer-core/lib/cjs/puppeteer/common/NodeWebSocketTransport.js:52:32)\n at callListener (project/node_modules/ws/lib/event-target.js:290:14)\n at WebSocket.onMessage (project/node_modules/ws/lib/event-target.js:209:9)\n at WebSocket.emit (node:events:365:28)\n at Receiver.receiverOnMessage (project/node_modules/ws/lib/websocket.js:1184:20)\n at Receiver.emit (node:events:365:28)\n at Receiver.dataMessage (project/node_modules/ws/lib/receiver.js:541:14)"}}
1 comment
Hello, I've built a Puppeteer crawler, nothing special about it.
It works flawlessly locally. I tried deploying it to AWS Batch with Fargate and got navigation timeouts after 60 seconds; I switched to EC2, navigation timeouts after 60 seconds; I increased the navigation timeout to 120 seconds, same error.
I switched proxies between BrightData and OxyLabs, same issue.
I deployed to Apify, same issue.

I'm going out of my mind trying to understand why this is happening.
17 comments
I have a Puppeteer crawler that works almost flawlessly in headed mode, but if I go headless, all the requests get 403 errors.
I was thinking that Xvfb would fix this, but unfortunately it doesn't. Any other ideas?
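One thing that may be worth trying, depending on the Puppeteer version, is Chrome's "new" headless mode, which uses the same browser build as headed mode and is harder to fingerprint than the old headless implementation. A sketch:
Plain Text
launchContext: {
    launchOptions: {
        headless: 'new', // new headless mode, if your Puppeteer version supports it
    },
},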
10 comments
Hello,
I have deployed a CheerioCrawler on AWS; the machine has 2 vCPUs and 4 GB of RAM, but I get the following error:
Plain Text
WARN CheerioCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 1174 MB of 750 MB (157%). Consider increasing available memory.


What could it be?
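The 750 MB limit is Crawlee's own memory budget, not the machine's. It can be raised via the CRAWLEE_MEMORY_MBYTES environment variable or in code; a sketch, assuming about 3 GB of the 4 GB should go to the crawler:
Plain Text
import { Configuration } from 'crawlee';

// equivalent to exporting CRAWLEE_MEMORY_MBYTES=3072 before starting the process
Configuration.getGlobalConfig().set('memoryMbytes', 3072);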
2 comments
Hello,
I have the following issue: I'm scraping a website where I need to log in every 100-150 items.
The problem is that if I go with more than 1 concurrent request, by the time it needs to log in there are already in-progress requests, which will go wrong.
So I extract a marker to know when I need to log in again.
I want to go with >1 concurrent requests, stop everything when that marker is found, do the login, and then resume.
Is it possible to achieve that?
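One way to do this with concurrency above 1 is to pause the AutoscaledPool when the marker is found, log in, and then resume; pause() waits for in-flight requests to finish first. A sketch, assuming the marker check happens in the requestHandler (doLogin is a hypothetical routine):
Plain Text
// inside the requestHandler
if (loginMarkerFound) {
    await crawler.autoscaledPool.pause(); // no new requests start; running ones finish
    await doLogin(page);                  // hypothetical login routine
    crawler.autoscaledPool.resume();
}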
22 comments
Hello,
I have quite a big scraper; it goes over 200k pages and takes approximately 12 hours, but after 6 hours, for some reason, all the requests start failing with "requestHandler timed out after 30 seconds".

I don't think increasing the requestHandler timeout will solve it; maybe there is something else wrong that I'm not getting?
15 comments
Hello,
In Puppeteer, for page.reload or page.goto you can pass the option {waitUntil: 'networkidle2'}. Using Puppeteer in Crawlee, I've only found that I can use it if I reload each page.

Is there any other way to configure navigation to use {waitUntil: 'networkidle2'} from the beginning?
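In PuppeteerCrawler the goto options can be adjusted for every navigation from a pre-navigation hook, so something along these lines should apply networkidle2 from the start (a sketch):
Plain Text
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        gotoOptions.waitUntil = 'networkidle2'; // applied to the initial page.goto()
    },
],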
1 comment
I'm trying to block some requests in Puppeteer, but it doesn't seem to work if I run the script headed:
Plain Text
const blockedResourceTypes = ['webp', 'svg', 'mp4', 'jpeg', 'gif', 'avif', 'font']
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: false,
            devtools: true,
            defaultViewport:{ width: 1920, height: 6000 },
            args: [
                '--disable-dev-shm-usage',
            ]
        },
        useIncognitoPages: true,
    },
    proxyConfiguration,
    requestHandler: router,
    maxConcurrency: 16,
    maxRequestRetries: 15,
    maxRequestsPerMinute: 2,
    navigationTimeoutSecs: 120,
    useSessionPool: true,
    failedRequestHandler({ request }) {
        log.debug(`Request ${request.url} failed 15 times.`);
    },

    preNavigationHooks: [
        async ({ addInterceptRequestHandler }) => {
            await addInterceptRequestHandler((request) => {
                if (blockedResourceTypes.includes(request.resourceType())) {
                    return request.respond({
                        status: 200,
                        body: 'useless shit',
                    });
                }
                return request.continue();
            });
        },
    ],
});


Any ideas?
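One detail worth noting: Puppeteer's request.resourceType() returns categories such as 'image', 'media' and 'font' rather than file extensions, so a variant along these lines may be closer to what the interception handler actually sees (a sketch):
Plain Text
const blockedResourceTypes = ['image', 'media', 'font'];
await addInterceptRequestHandler((request) => {
    if (blockedResourceTypes.includes(request.resourceType())) {
        return request.abort();
    }
    return request.continue();
});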
10 comments
How can I run Puppeteer with this tag? (Obviously inside Crawlee.)
2 comments
Plain Text
const requestList = await RequestList.open('My-ReqList', allUrls, { persistStateKey: 'My-ReqList' });
console.log(requestList.length())
const crawler = new CheerioCrawler({
  requestList,
  proxyConfiguration,
  requestHandler: router,
  minConcurrency: 32,
  maxConcurrency: 256,
  maxRequestRetries: 20,
  navigationTimeoutSecs: 6,
  loggingInterval: 30,
  useSessionPool: true,
  failedRequestHandler({ request }) {
      log.debug(`Request ${request.url} failed 20 times.`);
  },
});
await crawler.run()


allUrls is a list containing 12 million URLs. I'm trying to load them into the CheerioCrawler, but the process hangs using 14 GB of RAM and doesn't even log requestList.length().

Can anybody help, please?
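With 12 million entries, materialising the whole RequestList up front is what consumes the memory. One workaround is to skip the RequestList and feed the default request queue in batches, so persistence and deduplication happen on disk; a sketch with an arbitrary batch size:
Plain Text
const crawler = new CheerioCrawler({ /* same options, without requestList */ });

const BATCH = 10_000;
for (let i = 0; i < allUrls.length; i += BATCH) {
    await crawler.addRequests(allUrls.slice(i, i + BATCH));
}
await crawler.run();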
11 comments
I'm trying to parse the sitemaps of a website that has .xml.gz sitemaps; in Python I could use gunzip to decompress and use them.
In Crawlee we only have the "downloadListOfUrls" method; how could I make it decompress those files before using them?
sitemap: https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz
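If downloadListOfUrls doesn't handle the .gz files, Node's built-in zlib can decompress them, after which the <loc> entries can be extracted and enqueued. A minimal sketch (the regex-based extraction is a simplification):
Plain Text
import { gunzipSync } from 'node:zlib';

const res = await fetch('https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz');
const xml = gunzipSync(Buffer.from(await res.arrayBuffer())).toString('utf-8');
const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);
await crawler.addRequests(urls);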
6 comments
Is it possible to do some custom logic based on status codes in the PuppeteerCrawler? If yes, how?
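The navigation response is exposed on the crawling context, so status-code logic can live directly in the requestHandler; a sketch:
Plain Text
async requestHandler({ request, response, session }) {
    const status = response?.status();
    if (status === 403 || status === 429) {
        session?.retire();                                // rotate the session/proxy
        throw new Error(`Blocked with status ${status}`); // let Crawlee retry
    }
    // ... normal handling
},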
3 comments
Hello,
I'm scraping a website where I have to choose the location (it's an e-commerce store and I have to choose the shop).
The issue is, I've created a post-navigation hook that checks the location; if it's not correct, it starts the process of selecting the location again, saves the cookies, and retries the request with the new cookies.
With this workflow I can only achieve concurrency 1.
Otherwise I'll have multiple location selections at the same time, and 99% of the time it fails.
How could I somehow pause the entire process, do the selection, and then retry all the unhandled requests in the queue?

Thanks ! 🙏🏼
2 comments
Can we somehow throw errors that close the page and don't retry the request?
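Crawlee exports error classes for this; throwing a NonRetryableError from the handler fails the request without further retries (a sketch):
Plain Text
import { NonRetryableError } from 'crawlee';

// inside the requestHandler
throw new NonRetryableError('Giving up on this request');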
3 comments
Hello,
With RetryRequestError, the request gets retried an infinite number of times until it succeeds; what error should I throw to respect maxRequestRetries?
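For comparison, an ordinary Error thrown from the handler is counted against maxRequestRetries and eventually lands in failedRequestHandler (a sketch):
Plain Text
// retried at most maxRequestRetries times, then passed to failedRequestHandler
throw new Error('Page looked wrong, retrying');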
4 comments
Plain Text
Cannot find module 'crawlee'. Did you mean to set the 'moduleResolution' option to 'nodenext', or to add aliases to the 'paths' option?ts(2792)


The linter gives this error even on the template project.
Does this need attention, or can I leave it like this?
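The crawlee package resolves its types through the package "exports" field, so the usual fix is switching the TypeScript module settings to NodeNext; a sketch of the relevant part of tsconfig.json (other options stay as in the template):
Plain Text
{
  "compilerOptions": {
    "module": "NodeNext",
    "moduleResolution": "NodeNext"
  }
}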
17 comments
This is the code:
Plain Text
await utils.puppeteer.enqueueLinksByClickingElements({
        page,
        requestQueue: RequestQueue.open(),
        selector: 'li.pagination_next',
        label: 'category',
        forefront: true

    });

This is the error:
Plain Text
 Reclaiming failed request back to the list or queue. Expected property object `requestQueue` to have keys `["fetchNextRequest","addRequest"]` in object `options`


I have imported RequestQueue from crawlee; I don't understand where it goes wrong.
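One likely culprit: RequestQueue.open() returns a promise, so the options receive a pending promise instead of a queue object. Awaiting it first should give the validator the methods it expects (a sketch, otherwise identical to the code above):
Plain Text
const requestQueue = await RequestQueue.open();
await utils.puppeteer.enqueueLinksByClickingElements({
    page,
    requestQueue,
    selector: 'li.pagination_next',
    label: 'category',
    forefront: true,
});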
3 comments
I'm deploying a Crawlee (Cheerio) project in an amazonlinux:2023-based Docker container.
I get the following error:
Plain Text
> node src/main.js

DEBUG CheerioCrawler:SessionPool: No 'persistStateKeyValueStoreId' options specified, this session pool's data has been saved in the KeyValueStore with the id: ee911a9c-b90e-412e-af5b-a470b0172ba8
INFO  CheerioCrawler: Starting the crawl
ERROR Memory snapshot failed.
  Error: spawn ps ENOENT
      at ChildProcess._handle.onexit (node:internal/child_process:283:19)
      at onErrorNT (node:internal/child_process:476:16)
      at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
node:internal/errors:490
    ErrorCaptureStackTrace(err);
    ^

Error: spawn ps ENOENT
    at ChildProcess._handle.onexit (node:internal/child_process:283:19)
    at onErrorNT (node:internal/child_process:476:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -2,
  code: 'ENOENT',
  syscall: 'spawn ps',
  path: 'ps',
  spawnargs: [ '-A', '-o', 'ppid,pid,stat,rss,comm' ]
}

Node.js v18.16.0
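"spawn ps ENOENT" means the ps binary isn't present in the image; Crawlee's memory snapshotter shells out to it, and amazonlinux:2023 doesn't ship it by default. Installing procps in the Dockerfile should be enough (a sketch):
Plain Text
# amazonlinux:2023-based image
RUN dnf install -y procps-ng && dnf clean all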
2 comments
I'm trying to run this code in my default handler:
Plain Text
if (request.loadedUrl === 'url-from-where-i-get-cookies'){
        goodCokies = session.getCookies('url-from-where-i-get-cookies')
        await crawler.addRequests(['url-where-i-need-cookies'])
        return
    }
    await page.setCookie(goodCokies)


Error:
Plain Text
Reclaiming failed request back to the list or queue. Protocol error (Network.deleteCookies): Invalid parameters Failed to deserialize params.name - BINDINGS: mandatory field missing at position 2081
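A likely cause: session.getCookies() returns an array, while page.setCookie() expects individual cookie objects as separate arguments, so spreading the array may resolve the deserialization error (a sketch):
Plain Text
await page.setCookie(...goodCokies); // spread the cookie array into separate arguments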
1 comment
Hello,
I have a question regarding Puppeteer: I want to change proxies at one point during the process.

Is this achievable?
For example, I have proxy1 and proxy2; I start by using proxy1 and at one point I switch to proxy2.
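One option is to give Crawlee a ProxyConfiguration with a newUrlFunction and flip a flag when it's time to switch; a sketch where the proxy URLs and the switching condition are illustrative:
Plain Text
import { ProxyConfiguration } from 'crawlee';

let useSecondProxy = false; // flip this when it's time to switch

const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: () => (useSecondProxy ? 'http://proxy2:8000' : 'http://proxy1:8000'),
});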
7 comments
I have written this code for Puppeteer:
Plain Text
await puppeteerClickElements.enqueueLinksByClickingElements({ forefront: true, selector: 'a.js-color-change' })

But it generates this error:
Plain Text
 Reclaiming failed request back to the list or queue. Expected property `page` to be of type `object` but received type `undefined`
Expected object `page` to have keys `["goto","evaluate"]` in object `options`

Where is the mistake?
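The error suggests the call is missing the page (and a resolved requestQueue), which the helper expects to get from the crawling context; a sketch:
Plain Text
// inside a PuppeteerCrawler requestHandler
await puppeteerClickElements.enqueueLinksByClickingElements({
    page,                                          // from the crawling context
    requestQueue: await crawler.getRequestQueue(), // or an explicitly opened queue
    selector: 'a.js-color-change',
    forefront: true,
});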
6 comments