Apify

Apify and Crawlee Official Forum

Members
new_in_town
Offline, last seen last month
Joined August 30, 2024
In the PlaywrightCrawler.requestHandler() I can access 'log' because it is an argument of requestHandler()
How can I access log (or something similar?) in other places?

Example:
I want to log something before the crawler.run();

(Well, console.log works, but I would like to control the log level in one place...)
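If it helps: as far as I can tell, Crawlee exports a shared logger (`import { log } from 'crawlee'`) and `log.setLevel(log.LEVELS.DEBUG)` sets the level in one place, but that is worth verifying against the current API docs. The pattern itself, as a standalone sketch (hypothetical helper, not the Crawlee API):

```javascript
// Minimal one-place log-level control, mimicking what a shared logger gives you.
const LEVELS = { DEBUG: 10, INFO: 20, WARNING: 30, ERROR: 40 };

const logger = {
    level: LEVELS.INFO,
    setLevel(level) { this.level = level; },
    write(level, msg) { if (level >= this.level) console.log(msg); },
    debug(msg) { this.write(LEVELS.DEBUG, msg); },
    info(msg) { this.write(LEVELS.INFO, msg); },
    error(msg) { this.write(LEVELS.ERROR, msg); },
};

// Set once, e.g. before crawler.run(); every module using the logger obeys it.
logger.setLevel(LEVELS.ERROR);
logger.info('not printed'); // below ERROR: suppressed
logger.error('printed');
```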
2 comments
I checked my program (PlaywrightCrawler) against this thing: https://amiunique.org/fingerprint
Used a US residential proxy, took 3 screenshots, see below.
It seems there are some areas where Crawlee could do better (be less unique, less detectable)!

Here is the list (these items are red on the screenshots):
  • User Agent (I used fingerprint generator for this!)
  • Canvas
  • Navigator properties
  • List of fonts
  • List of plugins
  • Permissions
Some settings in my PlaywrightCrawler:
useFingerprints: true, useFingerprintCache: false, launcher: firefox

Regarding list of plugins: I use some JS code (pluginContent string) taken from here: https://discord.com/channels/801163717915574323/1059483872271798333
and inject it into the page this way:
Plain Text
    preNavigationHooks: [
        async ({ page, request }) => {
            await page.addInitScript({ content: pluginContent });
        },
    ],


Well, this code/hack simulates the presence of some PDF plugins... but I have the impression there are better solutions for plugins/fonts/permissions...
10 comments
6 months ago I created a Crawlee + Playwright + "node-beanstalk" (a JS wrapper for the Beanstalkd message queue) project. I followed the Crawlee documentation, created some... template? and started adding things to this template (no Docker image was used; I just installed things on an Ubuntu machine).
And somehow it works (and it still amazes me)))

These are the versions used at the moment (my package.json is below, feel free to take a look/criticize, I know it is not perfect):
Plain Text
   crawlee/core 3.3.1
   playwright 1.33.0

   npm:  8.19.3
   node: 16.19.0


Now I see that the latest Crawlee version is 3.5, the latest Playwright is 1.39, and maybe some other packages have been updated. It is time to update.

So, what is the proper way to update Crawlee and Playwright in such project?
Is it just this:
Plain Text
   npm update playwright
   npm update crawlee

Or something else?

I use headless Firefox; it is installed here:
~/.cache/ms-playwright/firefox-1403/
How to update it?

Disclaimer: I am not a JS developer, I am a Java developer who somehow writes JS code (lots of copy/paste, yes), so I know that dependency management is not that easy. I think it is better to ask in this forum than to create a mess in my project...
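For reference, a sketch of what usually works (not gospel): `npm update` only moves within the semver range already in package.json, so to jump from 3.3 to 3.5 you typically install the new versions explicitly, then fetch the matching Firefox build:

```shell
# Update the packages themselves (this rewrites the versions in package.json):
npm install crawlee@latest playwright@latest

# Download the Firefox build matching the newly installed Playwright version
# (it lands in ~/.cache/ms-playwright/ next to the old one):
npx playwright install firefox
```

Worth checking the Node requirement of the new versions first; Node 16 is near end-of-life, so a Node upgrade may be due as well.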
4 comments
I'm running this simple program from a server in a German datacenter with IP 167.235...
This program uses US residential proxies (rotating every 1 min).

And I see that pixelscan.net is able to detect my original IP: 167.235...
On the attached screenshot you can find it under "WebRTC address"

So how do I avoid this?

P.S.:
Another problem I see is "Plugins Length"; it is discussed here: https://discord.com/channels/801163717915574323/1059483872271798333
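The usual cause is that WebRTC STUN requests bypass the proxy. In Firefox, WebRTC can be switched off entirely through user prefs (the pref name below is the standard about:config key; how it is wired through launchContext.launchOptions is my assumption from the Playwright launch API):

```javascript
// Firefox prefs that disable WebRTC, so no STUN request can expose the real IP.
const launchOptions = {
    firefoxUserPrefs: {
        'media.peerconnection.enabled': false,
    },
};

// Sketch of the wiring (assumes `firefox` imported from 'playwright'):
// const crawler = new PlaywrightCrawler({
//     launchContext: { launcher: firefox, launchOptions },
//     // ...rest of the options as before
// });

console.log(launchOptions.firefoxUserPrefs['media.peerconnection.enabled']); // → false
```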
19 comments
I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases these sites complain about "0 plugins" or "Plugins length".

If I open these sites with the browser I use every day (Firefox on Linux, by the way, the same as used in the PlaywrightCrawler settings), these sites say "5 plugins" and the field is green.

Is it something in my code?
Can Crawlee emulate these plugin attributes?

[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

and here is part of the PlaywrightCrawler config:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});


Screenshots:
37 comments
Imagine the request queue of Crawlee (PlaywrightCrawler) containing URLs of two (or more) sites:

example.com/url1
another-site.com/url2
example.com/url3
another-site.com/url4
...

I would like to configure Crawlee to enforce a per-site interval between requests. For the above example this means:

example.com: 20 sec (or more) between requests
another-site.com: 60 sec (or more) between requests

How can I do this with Crawlee?
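I don't think Crawlee has per-domain intervals built in (newer versions have a `sameDomainDelaySecs` option, but as far as I can tell it is a single value for all domains), so one approach is a small scheduler in your own code plus a sleep at the top of the requestHandler. A sketch (all names below are hypothetical helpers, not Crawlee APIs):

```javascript
// Per-host pacing: remember when the last request to each host was scheduled
// and compute how long the next one has to wait.
const MIN_INTERVAL_MS = {
    'example.com': 20_000,
    'another-site.com': 60_000,
};
const lastScheduledAt = new Map();

function reserveSlot(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const interval = MIN_INTERVAL_MS[host] ?? 0;
    const earliest = (lastScheduledAt.get(host) ?? -Infinity) + interval;
    const startAt = Math.max(now, earliest);
    lastScheduledAt.set(host, startAt); // reserve this slot for the host
    return startAt - now; // milliseconds this request should wait
}

// In the requestHandler (sketch):
// const waitMs = reserveSlot(request.url);
// if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
```

Because each call reserves a slot before sleeping, this stays correct even with maxConcurrency > 1: concurrent handlers for the same host get successive slots.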
1 comment
I have some code using PlaywrightCrawler. I added "playwright-extra" with "stealthPlugin" to this code, exactly as in the documentation [1].

I added to my code only this:
Plain Text
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';
firefox.use(stealthPlugin());

The rest of the program remains the same as before. And I have useFingerprints: true and launcher: firefox in the code.

Well, the code works. Bot detection sites report that my crawler has 3 plugins and supports 4 MIME types, so something changed.
But I get this on stdout:
Plain Text
INFO  PlaywrightCrawler: Starting the crawler.
An error occured while executing "onPageCreated" in plugin "stealth/evasions/user-agent-override": TypeError: Cannot read properties of undefined (reading 'userAgent')
    at Proxy.<anonymous> (.../node_modules/playwright-extra/src/puppeteer-compatiblity-shim/index.ts:217:23)
    at runNextTicks (node:internal/process/task_queues:61:5)
    at processImmediate (node:internal/timers:437:9)
    at process.topLevelDomainCallback (node:domain:161:15)
    at process.callbackTrampoline (node:internal/async_hooks:128:24)
    at async Plugin.onPageCreated (.../node_modules/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js:69:8)

How bad is this?


[1] https://crawlee.dev/docs/examples/crawler-plugins
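On the error itself: the `user-agent-override` evasion reads Chromium/CDP-specific data that does not exist under Playwright + Firefox, hence the TypeError; the crawl continues, it is just that this one evasion does nothing. If I remember the puppeteer-extra-plugin-stealth API correctly, individual evasions can be removed from the plugin's `enabledEvasions` set, so a workaround sketch (worth verifying against the plugin README) could be:

```javascript
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

const stealth = stealthPlugin();
// Drop the Chromium-specific evasion that fails under Firefox; the
// user-agent side is covered by useFingerprints anyway.
stealth.enabledEvasions.delete('user-agent-override');
firefox.use(stealth);
```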
8 comments
Hi all,
what I want to achieve:

  • every request should have a unique fingerprint - this is important!
  • cookies, etc. not shared between requests
  • PlaywrightCrawler
  • no sessions - every request is independent (no login or similar)
  • Firefox
  • performance/throughput is not the number one priority

At the moment I almost have this with the hack retireBrowserAfterPageCount=2 in browserPoolOptions: this gives a unique fingerprint every two requests, which... isn't perfect (and starting a new browser instance so often looks strange)

In this thread: https://discord.com/channels/801163717915574323/1060467542616965150
a solution using the browser pool directly (without a crawler) was suggested.

I would like to have both: new fingerprint per request and PlaywrightCrawler.
Is it possible?
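For the record, the closest I can get to one-fingerprint-per-request inside PlaywrightCrawler is retiring the browser after every single page. That is heavy (one browser launch per request), but performance was declared a non-goal above. A sketch of how the options might combine (my assumption, worth testing):

```javascript
// Option fragment for PlaywrightCrawler (sketch):
const crawlerOptions = {
    browserPoolOptions: {
        useFingerprints: true,
        retireBrowserAfterPageCount: 1, // new browser (and fingerprint) for every page
    },
    launchContext: {
        // launcher: firefox,           // as in the existing setup
        useIncognitoPages: true,        // fresh context: cookies/cache not shared
    },
    useSessionPool: false,
    persistCookiesPerSession: false,
};
```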
1 comment
Actually this should be a great thing!

https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#retryOnBlocked

If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
Cloudflare Bot Management
Google Search Rate Limiting

Can we have some information about... how to use this thing?
Any prerequisites? Side effects?
Does it need some special settings in PlaywrightCrawler?
Example: I have maxRequestRetries=0 - is it OK to use retryOnBlocked in that case?
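On the maxRequestRetries=0 question, my understanding (an assumption, not verified against the source): retryOnBlocked works through the normal retry mechanism, i.e. a detected block fails the request so it can be retried later, so with zero retries a blocked request would simply fail once and never be re-attempted. A config sketch:

```javascript
// Option fragment for PlaywrightCrawler (sketch):
const crawlerOptions = {
    retryOnBlocked: true,
    // Give blocked requests some retry budget; with 0 retries, retryOnBlocked
    // presumably has nothing to work with:
    maxRequestRetries: 3,
};
console.log(crawlerOptions.retryOnBlocked && crawlerOptions.maxRequestRetries > 0); // → true
```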
1 comment
This exception happens in about 15-20% of all requests... quite often!

This line in the code:
Plain Text
content = await page.content();


throws this exception:
Plain Text
page.content: Target page, context or browser has been closed
   at (<somewhere-in-my-code>.js:170:54)
   at PlaywrightCrawler.requestHandler (<somewhere-in-my-code>.js:596:15)
   at async wrap (.../node_modules/@apify/timeout/index.js:52:21)


Is this a known issue?

Should I check (wait for) something before calling page.content()?
I already check that response.status() is less than 400 (it is actually 200, I see it in the logs).
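It is a known race: the page (or its context/browser) can be torn down between the navigation and the page.content() call, e.g. by a client-side redirect or by the pool retiring the browser. Two things that may help: `await page.waitForLoadState('domcontentloaded')` before grabbing the content, and a defensive wrapper (hypothetical helper, sketched below) so one flaky page does not fail the whole handler:

```javascript
// Retry page.content() a couple of times; swallow only the "has been closed"
// error and rethrow anything else.
async function safeContent(page, attempts = 2) {
    for (let i = 0; i < attempts; i++) {
        try {
            return await page.content();
        } catch (err) {
            if (!String(err).includes('has been closed')) throw err;
        }
    }
    return null; // caller decides: rethrow to retry the request, or log and move on
}
```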
7 comments
Got captcha and HTTP 403 when accessing wellfound.com

I get a captcha every time I access links like these (basically, any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer

Screenshot attached.


And this is not Cloudflare protection - it's some other anti-bot thing.

I am using:
  • US residential proxies from smartproxy.com
  • PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
  • headless Firefox, both as launcher and in fingerprintGeneratorOptions browsers
  • my locale is en-US, timezone is America/New_York (to match the US proxies)
  • in fingerprintGeneratorOptions devices: ['desktop']
  • in launchContext: { useIncognitoPages: true }
  • I set pluginContent in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
And still this site detects me as a robot!
Any ideas how to overcome this?

UPDATE1: the IP on screenshot is somewhere in US/Texas...

UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
11 comments
I see these messages on the console:
Plain Text
 INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,


How can I disable them?

P.S.

I already have this:
Plain Text
... new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: null,
3 comments
I already block images as described in [1] and this helps to save some bandwidth.
Next step: looking at the statistics in my proxy service, I see a significant number of requests like these:

Plain Text
https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato


Can somebody show me an example of code blocking these domains? (Better: blocking all domains from a given list.)

I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?
Prerequisites: PlaywrightCrawler, Firefox as the launcher (Chrome-specific hacks probably would not work).

(I'm not good at writing Javascript from scratch, so need some help)

[1] https://discord.com/channels/801163717915574323/1060986956961546320
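One way to do it is with page.route(), which works with Playwright's Firefox as well; the hook body below is my assumption of how it plugs into preNavigationHooks. (Crawlee also ships a playwrightUtils.blockRequests helper, if I remember correctly, but I am not sure it supports Firefox, so the manual route is the safer bet here.)

```javascript
// Domains to block; subdomains (www., connect., etc.) are matched too.
const BLOCKED_DOMAINS = [
    'googletagmanager.com',
    'google-analytics.com',
    'facebook.net',
    'fonts.googleapis.com',
];

function isBlockedUrl(url) {
    const host = new URL(url).hostname;
    return BLOCKED_DOMAINS.some((d) => host === d || host.endsWith('.' + d));
}

// In the crawler (sketch):
// preNavigationHooks: [
//     async ({ page }) => {
//         await page.route('**/*', (route) =>
//             isBlockedUrl(route.request().url()) ? route.abort() : route.continue()
//         );
//     },
// ],
```

route.abort() makes the request fail fast without ever hitting the proxy, which is what saves the bandwidth.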
1 comment