Apify and Crawlee Official Forum

Got captcha and HTTP 403 using PlaywrightCrawler

Got captcha and HTTP 403 when accessing wellfound.com

I get a captcha every time I access links like these (basically any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer

Screenshot attached. This is not Cloudflare protection; it's some other anti-bot thing.

I am using:
  • US residential proxies from smartproxy.com
  • PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
  • headless Firefox, both as launcher and in fingerprintGeneratorOptions browsers
  • my locale is en-US, timezone is America/New_York (to match the US proxies)
  • in fingerprintGeneratorOptions devices: ['desktop']
  • in launchContext: { useIncognitoPages: true }
  • I set pluginContent in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
And still this site detects me as a robot!
Any ideas on how to overcome this?
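
For readers who want to reproduce the setup, here is a rough sketch of how the options listed above might be laid out. It is only an approximation of the configuration described in the post: the `locales` entry, `useFingerprints: true`, and setting headless through `launchOptions` are assumptions about how the settings were applied. The proxy, the timezone, and the pluginContent hook are left out because the post does not show them.

```javascript
// Options object mirroring the setup listed above. It would be passed to
// `new PlaywrightCrawler(crawlerOptions)` from the `crawlee` package.
const crawlerOptions = {
    // No session pool and no per-session cookie persistence, as in the post.
    useSessionPool: false,
    persistCookiesPerSession: false,
    launchContext: {
        // launcher: firefox, // from `import { firefox } from 'playwright'`
        useIncognitoPages: true,
        launchOptions: { headless: true }, // assumption: headless set here
    },
    browserPoolOptions: {
        useFingerprints: true, // assumption: fingerprints are enabled
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                devices: ['desktop'],
                locales: ['en-US'], // assumption: locale set via the generator
            },
        },
    },
    // Not shown (details are not in the post): proxyConfiguration for the
    // residential proxies, the timezone, and the preNavigationHooks entry.
};
```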

UPDATE1: the IP in the screenshot is somewhere in US/Texas...

UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
Attachment
wellfound.com-01.png
UPDATE3: it is some variation of Cloudflare after all. I just looked into the source code of this captcha HTML. For people who like to look inside: https://wellfound.com/cdn-cgi/apps/head/JIiAUxCYLtpv-hVKsQ6mzsTHfds.js
UPDATE4: Opening wellfound.com in a desktop browser, scrolling the page down to the end, and then opening one of the above links in the same window worked once or twice, with no captcha.

It looks like this thing thinks "aha, this is normal end-user behavior".
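
The UPDATE4 observation could be tried in the crawler with something like the sketch below. It is untested against wellfound.com and only mimics the manual steps: `warmUpThenVisit`, `scrollSteps`, and the scroll distances are made-up names and values, and a fixed number of wheel steps may not reach the end of a long page. The `page` argument is expected to be a Playwright page (it only uses `goto`, `mouse.wheel`, and `waitForTimeout`).

```javascript
// Hypothetical helper: visit the homepage, scroll down, then open the
// target link in the same page so that cookies from the first visit are reused.
async function warmUpThenVisit(page, homeUrl, targetUrl, scrollSteps = 10) {
    await page.goto(homeUrl);

    // Scroll down in small steps with short pauses, roughly like a person would.
    for (let i = 0; i < scrollSteps; i++) {
        await page.mouse.wheel(0, 800);
        await page.waitForTimeout(300);
    }

    // Same page, with the homepage as referer.
    return page.goto(targetUrl, { referer: homeUrl });
}
```

Inside a PlaywrightCrawler this would need the target URL to be opened by the helper rather than by the crawler's own navigation, so it probably fits better in a plain Playwright script for a first experiment.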

I am sure many people here already saw this protection, so let us share the experience...
Any hints on how to fight this protection?
Did you try headful mode?
+ modifying headers can have a positive impact
+ maybe don't use incognito mode
Try a higher number of retries; it sometimes helps.
I think the rationale is that some IPs are burnt, so you need to go through the pool of IPs, find those that work with that site, and then ride those IPs.
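
The "find the IPs that work and ride them" idea could look roughly like this. It is a minimal sketch, not something taken from Crawlee: `ProxyPicker`, `report`, `pick`, and the `maxConsecutiveBlocks` threshold are all hypothetical names. A proxy counts as burnt after too many blocked responses in a row (a 403 or a captcha page), and proxies that have already succeeded are preferred over untried ones.

```javascript
// Hypothetical helper that tracks which proxies get blocked and prefers
// the ones that have already worked for the target site.
class ProxyPicker {
    constructor(proxyUrls, maxConsecutiveBlocks = 2) {
        this.maxConsecutiveBlocks = maxConsecutiveBlocks;
        this.stats = new Map(
            proxyUrls.map((url) => [url, { successes: 0, consecutiveBlocks: 0 }]),
        );
        this.cursor = 0;
    }

    // Call after each request: wasBlocked is true for a 403 or a captcha page.
    report(proxyUrl, wasBlocked) {
        const entry = this.stats.get(proxyUrl);
        if (!entry) return;
        if (wasBlocked) {
            entry.consecutiveBlocks += 1;
        } else {
            entry.successes += 1;
            entry.consecutiveBlocks = 0;
        }
    }

    // Returns a proxy URL, or undefined when every proxy is considered burnt.
    pick() {
        const usable = [...this.stats].filter(
            ([, s]) => s.consecutiveBlocks < this.maxConsecutiveBlocks,
        );
        if (usable.length === 0) return undefined;

        // Ride proxies that have worked before; otherwise keep exploring.
        const proven = usable.filter(([, s]) => s.successes > 0);
        const candidates = proven.length > 0 ? proven : usable;
        const [url] = candidates[this.cursor % candidates.length];
        this.cursor += 1;
        return url;
    }
}
```

In Crawlee, one place this might plug in is the `newUrlFunction` option of `ProxyConfiguration`, with `report` called from the request handler and the failed-request handler. Combined with a higher retry count, blocked requests would then be retried on a different IP.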
modifying headers can have a positive impact
Can you show an example? What should be changed? I mean, there are a lot of headers...
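
Nobody in the thread says which headers were meant, so the following is only a guess at what such a change could look like: keep `Accept-Language` consistent with the en-US locale and send the site's homepage as `Referer`. The helper name `buildExtraHeaders` is made up, and whether these two headers make any difference on wellfound.com is untested.

```javascript
// Hypothetical helper: builds a small set of extra headers for a target URL.
function buildExtraHeaders(targetUrl, locale = 'en-US') {
    const { origin } = new URL(targetUrl);
    const language = locale.split('-')[0];
    return {
        // Keep the language header in line with the browser locale.
        'Accept-Language': `${locale},${language};q=0.9`,
        // Pretend the visit comes from the site's own homepage.
        Referer: `${origin}/`,
    };
}

// Possible use in a preNavigationHooks entry of PlaywrightCrawler:
//   async ({ page, request }) => {
//       await page.setExtraHTTPHeaders(buildExtraHeaders(request.url));
//   }
```

Since generated fingerprints already come with matching headers, overriding more than a couple of them could make the browser look less consistent rather than more.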

Did you try headful mode?
I did not.
How would headful mode help? In the end it must run in headless mode anyway...