Apify

Apify and Crawlee Official Forum

Members
new_in_town
Offline, last seen last month
Joined August 30, 2024
In the PlaywrightCrawler.requestHandler() I can access 'log' because it is an argument of requestHandler()
How can I access log (or something similar?) in other places?

Example:
I want to log something before the crawler.run();

(Well, console.log works, but I would like to control the log level in one place...)
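If it helps: as far as I can tell, Crawlee exports a shared logger (`import { log } from 'crawlee'`) and `log.setLevel(log.LEVELS.DEBUG)` sets the level in one place, but that is worth verifying against the current API docs. The pattern itself, as a standalone sketch (hypothetical helper, not the Crawlee API):

```javascript
// Minimal one-place log-level control, mimicking what a shared logger gives you.
const LEVELS = { DEBUG: 10, INFO: 20, WARNING: 30, ERROR: 40 };

const logger = {
    level: LEVELS.INFO,
    setLevel(level) { this.level = level; },
    write(level, msg) { if (level >= this.level) console.log(msg); },
    debug(msg) { this.write(LEVELS.DEBUG, msg); },
    info(msg) { this.write(LEVELS.INFO, msg); },
    error(msg) { this.write(LEVELS.ERROR, msg); },
};

// Set once, e.g. before crawler.run(); every module using the logger obeys it.
logger.setLevel(LEVELS.ERROR);
logger.info('not printed'); // below ERROR: suppressed
logger.error('printed');
```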
2 comments
I checked my program (PlaywrightCrawler) against this thing: https://amiunique.org/fingerprint
Used a US residential proxy, took 3 screenshots, see below.
It seems there are some areas where Crawlee could do better (be less unique, less detectable)!

Here is the list (these items are red on the screenshots):
  • User Agent (I used fingerprint generator for this!)
  • Canvas
  • Navigator properties
  • List of fonts
  • List of plugins
  • Permissions
Some settings in my PlaywrightCrawler:
useFingerprints: true, useFingerprintCache: false, launcher: firefox

Regarding list of plugins: I use some JS code (pluginContent string) taken from here: https://discord.com/channels/801163717915574323/1059483872271798333
and inject it into the page this way:
Plain Text
    preNavigationHooks: [
        async ({ page, request }) => {
            await page.addInitScript({ content: pluginContent });
        },
    ],


Well, this code/hack simulates the presence of some PDF plugins... but I have the impression there are better solutions for plugins/fonts/permissions...
10 comments
6 months ago I created a Crawlee + Playwright + "node-beanstalk" (a JS wrapper for the Beanstalkd message queue) project. I followed the Crawlee documentation, created some... template? and started adding things to this template (no Docker image was used; I just installed things on an Ubuntu machine).
And somehow it works (and it still amazes me)))

These are the versions used at the moment (my package.json is below, feel free to take a look/criticize, I know it is not perfect):
Plain Text
   crawlee/core 3.3.1
   playwright 1.33.0

   npm:  8.19.3
   node: 16.19.0


Now I see that the latest Crawlee version is 3.5, the latest Playwright is 1.39, and maybe some other packages have been updated. It is time to update.

So, what is the proper way to update Crawlee and Playwright in such project?
Is it just this:
Plain Text
   npm update playwright
   npm update crawlee

Or something else?

I use headless Firefox; it is installed here:
~/.cache/ms-playwright/firefox-1403/
How to update it?

Disclaimer: I am not a JS developer, I am a Java developer who somehow writes JS code (lots of copy/paste, yes), so I know that dependency management is not that easy. I think it is better to ask in this forum than to create a mess in my project...
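For reference, a sketch of what usually works (not gospel): `npm update` only moves within the semver range already in package.json, so to jump from 3.3 to 3.5 you typically install the new versions explicitly, then fetch the matching Firefox build:

```shell
# Update the packages themselves (this rewrites the versions in package.json):
npm install crawlee@latest playwright@latest

# Download the Firefox build matching the newly installed Playwright version
# (it lands in ~/.cache/ms-playwright/ next to the old one):
npx playwright install firefox
```

Worth checking the Node requirement of the new versions first; Node 16 is near end-of-life, so a Node upgrade may be due as well.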
4 comments
I'm running this simple program from a server in a German datacenter with IP 167.235...
This program uses US residential proxies (rotating every 1 min).

And I see that pixelscan.net is able to detect my original IP: 167.235...
On the attached screenshot you can find it under "WebRTC address"

So how do I avoid this?

P.S.:
Another problem I see is "Plugins Length"; it is discussed here: https://discord.com/channels/801163717915574323/1059483872271798333
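The usual cause is that WebRTC STUN requests bypass the proxy. In Firefox, WebRTC can be switched off entirely through user prefs (the pref name below is the standard about:config key; how it is wired through launchContext.launchOptions is my assumption from the Playwright launch API):

```javascript
// Firefox prefs that disable WebRTC, so no STUN request can expose the real IP.
const launchOptions = {
    firefoxUserPrefs: {
        'media.peerconnection.enabled': false,
    },
};

// Sketch of the wiring (assumes `firefox` imported from 'playwright'):
// const crawler = new PlaywrightCrawler({
//     launchContext: { launcher: firefox, launchOptions },
//     // ...rest of the options as before
// });

console.log(launchOptions.firefoxUserPrefs['media.peerconnection.enabled']); // → false
```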
19 comments
I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases these sites complain about "0 plugins" or "Plugins length".

If I open these sites with the browser I use every day (Firefox on Linux, by the way, the same as used in the PlaywrightCrawler settings), these sites say "5 plugins" and the field is green.

Is it something in my code?
Can Crawlee emulate these plugin attributes?

[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

and here is part of the PlaywrightCrawler config:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});


Screenshots:
37 comments
Imagine the request queue of Crawlee (PlaywrightCrawler) containing URLs of two (or more) sites:

example.com/url1
another-site.com/url2
example.com/url3
another-site.com/url4
...

I would like to configure Crawlee to enforce a per-site interval between requests. For the above example this means:

example.com: 20 sec (or more) between requests
another-site.com: 60 sec (or more) between requests

How can I do this with Crawlee?
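I don't think Crawlee has per-domain intervals built in (newer versions have a `sameDomainDelaySecs` option, but as far as I can tell it is a single value for all domains), so one approach is a small scheduler in your own code plus a sleep at the top of the requestHandler. A sketch (all names below are hypothetical helpers, not Crawlee APIs):

```javascript
// Per-host pacing: remember when the last request to each host was scheduled
// and compute how long the next one has to wait.
const MIN_INTERVAL_MS = {
    'example.com': 20_000,
    'another-site.com': 60_000,
};
const lastScheduledAt = new Map();

function reserveSlot(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const interval = MIN_INTERVAL_MS[host] ?? 0;
    const earliest = (lastScheduledAt.get(host) ?? -Infinity) + interval;
    const startAt = Math.max(now, earliest);
    lastScheduledAt.set(host, startAt); // reserve this slot for the host
    return startAt - now; // milliseconds this request should wait
}

// In the requestHandler (sketch):
// const waitMs = reserveSlot(request.url);
// if (waitMs > 0) await new Promise((r) => setTimeout(r, waitMs));
```

Because each call reserves a slot before sleeping, this stays correct even with maxConcurrency > 1: concurrent handlers for the same host get successive slots.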
1 comment
I have some code using PlaywrightCrawler. I added "playwright-extra" with "stealthPlugin" to this code, exactly as in the documentation [1].

I added to my code only this:
Plain Text
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';
firefox.use(stealthPlugin());

The rest of the program remains the same as before. And I have useFingerprints: true and launcher: firefox in the code.

Well, the code works. Bot detection sites report that my crawler has 3 plugins and supports 4 MIME types, so something changed.
But I get this on stdout:
Plain Text
INFO  PlaywrightCrawler: Starting the crawler.
An error occured while executing "onPageCreated" in plugin "stealth/evasions/user-agent-override": TypeError: Cannot read properties of undefined (reading 'userAgent')
    at Proxy.<anonymous> (.../node_modules/playwright-extra/src/puppeteer-compatiblity-shim/index.ts:217:23)
    at runNextTicks (node:internal/process/task_queues:61:5)
    at processImmediate (node:internal/timers:437:9)
    at process.topLevelDomainCallback (node:domain:161:15)
    at process.callbackTrampoline (node:internal/async_hooks:128:24)
    at async Plugin.onPageCreated (.../node_modules/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js:69:8)

How bad is this?


[1] https://crawlee.dev/docs/examples/crawler-plugins
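On the error itself: the `user-agent-override` evasion reads Chromium/CDP-specific data that does not exist under Playwright + Firefox, hence the TypeError; the crawl continues, it is just that this one evasion does nothing. If I remember the puppeteer-extra-plugin-stealth API correctly, individual evasions can be removed from the plugin's `enabledEvasions` set, so a workaround sketch (worth verifying against the plugin README) could be:

```javascript
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

const stealth = stealthPlugin();
// Drop the Chromium-specific evasion that fails under Firefox; the
// user-agent side is covered by useFingerprints anyway.
stealth.enabledEvasions.delete('user-agent-override');
firefox.use(stealth);
```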
8 comments
Hi all,
what I want to achieve:

  • every request should have a unique fingerprint - this is important!
  • cookies, etc. not shared between requests
  • PlaywrightCrawler
  • no sessions - every request is independent (no login or similar)
  • Firefox
  • performance/throughput is not the number one priority

At the moment I almost have this with the hack retireBrowserAfterPageCount=2 in browserPoolOptions: this gives a unique fingerprint every two requests, which... isn't perfect (and starting a new browser instance so often looks strange)

In this thread: https://discord.com/channels/801163717915574323/1060467542616965150
a solution using the browser pool directly (without a crawler) was suggested.

I would like to have both: new fingerprint per request and PlaywrightCrawler.
Is it possible?
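For the record, the closest I can get to one-fingerprint-per-request inside PlaywrightCrawler is retiring the browser after every single page. That is heavy (one browser launch per request), but performance was declared a non-goal above. A sketch of how the options might combine (my assumption, worth testing):

```javascript
// Option fragment for PlaywrightCrawler (sketch):
const crawlerOptions = {
    browserPoolOptions: {
        useFingerprints: true,
        retireBrowserAfterPageCount: 1, // new browser (and fingerprint) for every page
    },
    launchContext: {
        // launcher: firefox,           // as in the existing setup
        useIncognitoPages: true,        // fresh context: cookies/cache not shared
    },
    useSessionPool: false,
    persistCookiesPerSession: false,
};
```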
1 comment
Actually this should be a great thing!

https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#retryOnBlocked

If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
Cloudflare Bot Management
Google Search Rate Limiting

Can we have some information about... how to use this thing?
Any prerequisites? Side effects?
Does it need some special settings in PlaywrightCrawler?
Example: I have maxRequestRetries=0 - is it OK to use retryOnBlocked in that case?
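On the maxRequestRetries=0 question, my understanding (an assumption, not verified against the source): retryOnBlocked works through the normal retry mechanism, i.e. a detected block fails the request so it can be retried later, so with zero retries a blocked request would simply fail once and never be re-attempted. A config sketch:

```javascript
// Option fragment for PlaywrightCrawler (sketch):
const crawlerOptions = {
    retryOnBlocked: true,
    // Give blocked requests some retry budget; with 0 retries, retryOnBlocked
    // presumably has nothing to work with:
    maxRequestRetries: 3,
};
console.log(crawlerOptions.retryOnBlocked && crawlerOptions.maxRequestRetries > 0); // → true
```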
1 comment
This exception happens in about 15-20% of all requests... quite often!

This line in the code:
Plain Text
content = await page.content();


throws this exception:
Plain Text
page.content: Target page, context or browser has been closed
   at (<somewhere-in-my-code>.js:170:54)
   at PlaywrightCrawler.requestHandler (<somewhere-in-my-code>.js:596:15)
   at async wrap (.../node_modules/@apify/timeout/index.js:52:21)


Is this a known issue?

Should I check (wait for) something before calling page.content()?
I already check that response.status() is less than 400 (it is actually 200, I see it in the logs).
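It is a known race: the page (or its context/browser) can be torn down between the navigation and the page.content() call, e.g. by a client-side redirect or by the pool retiring the browser. Two things that may help: `await page.waitForLoadState('domcontentloaded')` before grabbing the content, and a defensive wrapper (hypothetical helper, sketched below) so one flaky page does not fail the whole handler:

```javascript
// Retry page.content() a couple of times; swallow only the "has been closed"
// error and rethrow anything else.
async function safeContent(page, attempts = 2) {
    for (let i = 0; i < attempts; i++) {
        try {
            return await page.content();
        } catch (err) {
            if (!String(err).includes('has been closed')) throw err;
        }
    }
    return null; // caller decides: rethrow to retry the request, or log and move on
}
```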
7 comments
Got captcha and HTTP 403 when accessing wellfound.com

I get a captcha every time I access links like these (basically, any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer

Screenshot attached.


And this is not Cloudflare protection - it's some other anti-bot thing.

I am using:
  • US residential proxies from smartproxy.com
  • PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
  • headless Firefox, both as launcher and in fingerprintGeneratorOptions browsers
  • my locale is en-US, timezone is America/New_York (to match the US proxies)
  • in fingerprintGeneratorOptions devices: ['desktop']
  • in launchContext: { useIncognitoPages: true }
  • I set pluginContent in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
And still this site detects me as a robot!
Any ideas how to overcome this?

UPDATE1: the IP on screenshot is somewhere in US/Texas...

UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
11 comments
I see these messages on the console:
Plain Text
 INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,


How can I disable them?

P.S.

I already have this:
Plain Text
... new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: null,
3 comments
I already block images as described in [1] and this helps to save some bandwidth.
Next step: looking at the statistics in my proxy service, I see a significant number of requests like these:

Plain Text
https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato


Can somebody show me an example of code blocking these domains? (Better: blocking all domains from a given list.)

I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?
Prerequisites: PlaywrightCrawler, Firefox as the launcher (Chrome-specific hacks probably would not work).

(I'm not good at writing Javascript from scratch, so need some help)

[1] https://discord.com/channels/801163717915574323/1060986956961546320
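One way to do it is with page.route(), which works with Playwright's Firefox as well; the hook body below is my assumption of how it plugs into preNavigationHooks. (Crawlee also ships a playwrightUtils.blockRequests helper, if I remember correctly, but I am not sure it supports Firefox, so the manual route is the safer bet here.)

```javascript
// Domains to block; subdomains (www., connect., etc.) are matched too.
const BLOCKED_DOMAINS = [
    'googletagmanager.com',
    'google-analytics.com',
    'facebook.net',
    'fonts.googleapis.com',
];

function isBlockedUrl(url) {
    const host = new URL(url).hostname;
    return BLOCKED_DOMAINS.some((d) => host === d || host.endsWith('.' + d));
}

// In the crawler (sketch):
// preNavigationHooks: [
//     async ({ page }) => {
//         await page.route('**/*', (route) =>
//             isBlockedUrl(route.request().url()) ? route.abort() : route.continue()
//         );
//     },
// ],
```

route.abort() makes the request fail fast without ever hitting the proxy, which is what saves the bandwidth.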
1 comment