Got captcha and HTTP 403 using PlaywrightCrawler

new_in_town

Got captcha and HTTP 403 when accessing wellfound.com

I get a captcha every time I access links like these (basically any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer

Screenshot attached.

~~and this is not Cloudflare protection - it's some other anti-bot thing~~.

I am using:

US residential proxies from smartproxy.com
PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
headless Firefox, both as launcher and in fingerprintGeneratorOptions browsers
my locale is en-US, timezone is America/New_York (to match the US proxies)
in fingerprintGeneratorOptions devices: ['desktop']
in launchContext: { useIncognitoPages: true }
I set pluginContent in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
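
For reference, the setup listed above corresponds roughly to the options object below. This is only a sketch and not the poster's actual code: the option names follow the Crawlee docs as I understand them, the proxy URL is a placeholder, and the `firefox` launcher import is left as a comment so the snippet has no dependencies. The post does not say how the timezone or `pluginContent` are set, so both are left out.

```typescript
// Plain-object sketch of the described setup. In a real project this object
// would be passed to `new PlaywrightCrawler(crawlerOptions)` from `crawlee`.

// Placeholder, NOT a real endpoint. In real code this would go into
// `proxyConfiguration: new ProxyConfiguration({ proxyUrls: [PROXY_URL] })`.
const PROXY_URL = 'http://USER:PASS@proxy.example.com:10000';

const crawlerOptions = {
    useSessionPool: false,
    persistCookiesPerSession: false,
    launchContext: {
        // launcher: firefox, // `import { firefox } from 'playwright'` in real code
        useIncognitoPages: true,
        launchOptions: { headless: true },
    },
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Crawlee's typings may expect its own enum values here
                // instead of plain strings (assumption).
                browsers: ['firefox'],
                devices: ['desktop'],
                locales: ['en-US'],
            },
        },
    },
};

console.log(JSON.stringify(crawlerOptions, null, 2), PROXY_URL.length > 0);
```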

And still this site detects me as a robot!
Any ideas how to overcome this?

UPDATE1: the IP in the screenshot is somewhere in US/Texas...

UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...

11 comments

new_in_town

UPDATE 1: the IP in the screenshot is somewhere in US/Texas

new_in_town

UPDATE 2: when I open these links in my desktop browser in incognito mode, I get this captcha too...

new_in_town

UPDATE3: it is some variation of Cloudflare after all; I just looked into the source code of this captcha HTML. For people who like to look inside: https://wellfound.com/cdn-cgi/apps/head/JIiAUxCYLtpv-hVKsQ6mzsTHfds.js

new_in_town

UPDATE4: It seems that opening wellfound.com in a desktop browser, scrolling the page down to the end, and then opening one of the above links in the same window worked once or twice, with no captcha.

It looks like this thing thinks "aha, this is normal end-user behavior".
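
The manual sequence from UPDATE4 (home page, scroll down, then the job link in the same window) could be scripted roughly like this. It is a sketch under assumptions: `PageLike` is a hand-written subset of Playwright's `Page` (`goto`, `mouse.wheel`, `waitForTimeout`), and the scroll count, scroll distance and pause length are made-up values, so a fixed number of scrolls may not actually reach the end of the page. Nothing here is confirmed to get past the protection.

```typescript
// Subset of Playwright's Page API, declared locally so the sketch compiles
// without the `playwright` package. A real Page can be passed in its place.
interface PageLike {
    goto(url: string): Promise<unknown>;
    mouse: { wheel(deltaX: number, deltaY: number): Promise<void> };
    waitForTimeout(ms: number): Promise<void>;
}

type Step =
    | { kind: 'goto'; url: string }
    | { kind: 'scroll'; deltaY: number }
    | { kind: 'pause'; ms: number };

// Builds the step list: home page first, a few scroll/pause pairs, then the target.
function buildWarmUpPlan(homeUrl: string, targetUrl: string, scrolls = 5): Step[] {
    const plan: Step[] = [{ kind: 'goto', url: homeUrl }];
    for (let i = 0; i < scrolls; i++) {
        plan.push({ kind: 'scroll', deltaY: 800 }); // 800 px per scroll is arbitrary
        plan.push({ kind: 'pause', ms: 500 }); // 500 ms pause is arbitrary
    }
    plan.push({ kind: 'goto', url: targetUrl });
    return plan;
}

// Executes the plan on ONE page, so cookies set by the home page are still
// present when the target URL is opened.
async function runPlan(page: PageLike, plan: Step[]): Promise<void> {
    for (const step of plan) {
        if (step.kind === 'goto') {
            await page.goto(step.url);
        } else if (step.kind === 'scroll') {
            await page.mouse.wheel(0, step.deltaY);
        } else {
            await page.waitForTimeout(step.ms);
        }
    }
}
```

With `useIncognitoPages: true` every page gets its own context, so the warm-up and the real request would have to happen on the same page, as they do in `runPlan`.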

I am sure many people here have already seen this protection, so let us share the experience...

new_in_town

Any hints how to fight this protection?

Oleg V.

Did you try headful mode?
+ modifying headers can have a positive impact
+ maybe don't use incognito mode

Michal

try a higher number of retries, it can help sometimes

Michal

I think the rationale is that some IPs are burnt and you need to go through the pool of IPs and find those that work with that site

Michal

and then ride those IPs
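
The "find the IPs that work, then ride them" idea can be sketched as a small scoreboard. This is an illustration, not Crawlee's API: `ProxyScoreboard` and its methods are hypothetical names, and deciding what counts as "blocked" (the 403, the captcha page) is left to the caller.

```typescript
// Hypothetical helper: remembers which proxies were blocked and which worked.
class ProxyScoreboard {
    private readonly proxies: string[];
    private readonly burnt = new Set<string>();
    private readonly good: string[] = [];
    private cursor = 0;

    constructor(proxies: string[]) {
        this.proxies = [...proxies];
    }

    // Prefer proxies that already worked (round-robin among them);
    // otherwise hand out the first proxy that is not known to be burnt.
    // Returns undefined once every proxy is burnt.
    next(): string | undefined {
        if (this.good.length > 0) {
            const proxy = this.good[this.cursor % this.good.length];
            this.cursor += 1;
            return proxy;
        }
        return this.proxies.find((p) => !this.burnt.has(p));
    }

    // Record the outcome of a request made through `proxy`.
    // A burnt proxy stays burnt, even if a later request happens to succeed.
    report(proxy: string, blocked: boolean): void {
        const index = this.good.indexOf(proxy);
        if (blocked) {
            this.burnt.add(proxy);
            if (index !== -1) this.good.splice(index, 1);
        } else if (index === -1 && !this.burnt.has(proxy)) {
            this.good.push(proxy);
        }
    }
}
```

A request that comes back blocked would be retried with whatever `next()` returns, which is where a higher retry count helps: more retries means more IPs get tried before the request is given up. As far as I know, Crawlee's session pool covers similar ground, but the original post has it switched off.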

new_in_town

> modifying headers can have a positive impact

Can you show an example? What should be changed? I mean, there are a lot of headers...

> Did you try headful mode?

I did not.
How would headful mode help? In the end it must run in headless mode anyway...
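
On the headers question, a minimal sketch of what "modifying headers" could look like with Playwright's `page.setExtraHTTPHeaders`. Nobody in the thread says which headers matter, so `Accept-Language` and `Referer` are only guesses picked for illustration, and `buildExtraHeaders` and `applyHeaders` are hypothetical helper names. Headers that contradict the generated browser fingerprint may well do more harm than good.

```typescript
// Subset of Playwright's Page API, declared locally so the sketch has no
// dependencies. A real Page can be passed in its place.
interface HeaderTarget {
    setExtraHTTPHeaders(headers: Record<string, string>): Promise<void>;
}

// Builds headers consistent with the chosen locale; the referer is optional.
function buildExtraHeaders(locale: string, referer?: string): Record<string, string> {
    const language = locale.split('-')[0]; // 'en-US' -> 'en'
    const headers: Record<string, string> = {
        'Accept-Language': `${locale},${language};q=0.9`,
    };
    if (referer) {
        headers['Referer'] = referer;
    }
    return headers;
}

// Would be called before navigation, e.g. from a preNavigationHooks entry.
async function applyHeaders(page: HeaderTarget, locale: string, referer?: string): Promise<void> {
    await page.setExtraHTTPHeaders(buildExtraHeaders(locale, referer));
}
```

For the setup in the original post this would be called as `applyHeaders(page, 'en-US', 'https://wellfound.com/')`.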

Apify and Crawlee Official Forum