Apify

Apify and Crawlee Official Forum


pass the cloudflare browser check

Does anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler?
The site I have a problem with: https://www.g2.com/
I have tried residential proxies, no proxies, Chrome and Firefox, headful and headless, but nothing works.
My regular Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows that the browser is automated.
In Apify Store there is a working scraper for g2, but it is written in Python. At least I know it is possible to do it.
19 comments
Playwright with Firefox should work; I used it to bypass Cloudflare a few months ago. Please share a run or a snapshot, and make sure you wait for some real content element: on page load the Cloudflare check will be running for a while.
Here is the run: https://console.apify.com/view/runs/UHYyD8JIrtj68ePpW but there is not much to see, it just returns 403.
I got through Cloudflare many times with Playwright, but now it looks like they have improved the protection.
Looks like some IPs are working and some are not; the content was reached in Chrome after two retries: https://console.apify.com/view/runs/MWefTdPk6wfZZ3rz5
I took your config and just changed the URL to https://www.g2.com/products/monday-com-monday-com/reviews and the number of retries to 20, but no luck: https://console.apify.com/view/runs/3RQyInzk9aQ0SEOJS
Any suggestions would be greatly appreciated.
could add it to his repository of Cloudflare sites
We chatted about this in private with , as I encountered CF blocking too...

If I understand it correctly, CF has two modes of bot protection (with kinda confusing names TBH)
  • a) Bot Management – basic
  • b) Super Bot Fight Mode – advanced
The sites I'm scraping seem to use a). The solution to that seems to be pretty easy:
  • use CheerioCrawler with the playwright:firefox Dockerfile
  • in createSessionFunction: open Firefox (via Playwright), go to the site, let Firefox solve the JavaScript challenge, and save all the cookies and request headers to the session
  • in preNavigationHooks: get the stored cookies/headers from the session and set them on gotScraping
This solution works for me both locally and on the Apify platform, without any proxies. Beware that it probably only works for sites that use the basic bot protection mode.
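To make the hand-off between the browser and the HTTP client more concrete, here is a minimal sketch of just the data-shuffling part. It assumes the cookies arrive as an array of objects with `name` and `value` fields (the shape Playwright's `context.cookies()` returns) and the browser's request headers as a plain object. The names `buildSolvedChallenge`, `applySolvedChallenge`, `SolvedChallenge` and the `SKIPPED_HEADERS` list are hypothetical and not part of Crawlee or Playwright; launching Firefox and wiring this into createSessionFunction / preNavigationHooks is left out.

```typescript
// Minimal cookie shape; Playwright cookies carry more fields, only these are used.
interface BrowserCookie {
    name: string;
    value: string;
}

// Hypothetical container for what would be saved on the session.
interface SolvedChallenge {
    cookieHeader: string;
    headers: Record<string, string>;
}

// Assumption: headers that should not be replayed verbatim by the HTTP client.
const SKIPPED_HEADERS: string[] = ['cookie', 'host', 'content-length'];

// Turn what the browser produced into something an HTTP client can reuse.
function buildSolvedChallenge(
    cookies: BrowserCookie[],
    requestHeaders: Record<string, string>,
): SolvedChallenge {
    const cookieHeader = cookies.map((c) => `${c.name}=${c.value}`).join('; ');
    const headers: Record<string, string> = {};
    for (const name of Object.keys(requestHeaders)) {
        const value = requestHeaders[name];
        if (value === undefined) continue;
        if (name.charAt(0) === ':') continue; // skip HTTP/2 pseudo-headers
        if (SKIPPED_HEADERS.indexOf(name.toLowerCase()) >= 0) continue;
        headers[name] = value; // keeps the browser's original order
    }
    return { cookieHeader, headers };
}

// Build the header object for a follow-up request from the stored data.
function applySolvedChallenge(solved: SolvedChallenge): Record<string, string> {
    const out: Record<string, string> = {};
    for (const name of Object.keys(solved.headers)) {
        const value = solved.headers[name];
        if (value !== undefined) out[name] = value;
    }
    if (solved.cookieHeader !== '') out['cookie'] = solved.cookieHeader;
    return out;
}
```

In this sketch the result of `buildSolvedChallenge` is what would go into the session's user data, and `applySolvedChallenge` produces the headers to set on the request options in the pre-navigation hook.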
What do you use for debugging with MITM?
Is it mitmproxy?
https://mitmproxy.org/
mitmproxy would probably work too, but I like nice things so I use https://proxyman.io/ 😄
It was crucial for discovering that it's the header order that causes the issue
Attachments
CleanShot_2022-11-13_at_21.28.02.png
CleanShot_2022-11-13_at_21.28.14.png
Wow, never heard about headers order messing things up
I used the same approach, but with internal fetch calls from inside the browser. IMHO this might be more reliable, since they should be doing something logically equivalent to a "heartbeat" check to see whether the web visitor is still online.
This should also work regardless of their internal protection mode: if the page context is reached, then fetch is expected to work; otherwise they (CF) would not be able to support web apps.
It was the first time for me too, but it's probably not too uncommon, as there's logic exactly for this in header-generator: https://github.com/apify/header-generator/blob/master/src/header-generator.ts#L208
Attachments
CleanShot_2022-11-16_at_07.21.122x.png
CleanShot_2022-11-16_at_07.20.592x.png
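For anyone who wants to experiment with the header-order effect, here is a minimal sketch of reordering a header object to follow a reference order, for example one copied from a real browser request captured in a MITM proxy. `orderHeaders` is a hypothetical helper, not the header-generator implementation, and it only helps if the HTTP client sends headers in object insertion order, which is an assumption to verify for your client.

```typescript
// Reorder `headers` so that names listed in `referenceOrder` come first,
// in that order (case-insensitive match); unknown headers keep their
// original relative order and go last.
function orderHeaders(
    headers: Record<string, string>,
    referenceOrder: string[],
): Record<string, string> {
    const lowerOrder = referenceOrder.map((n) => n.toLowerCase());
    const position = (name: string): number => lowerOrder.indexOf(name.toLowerCase());

    const names = Object.keys(headers);
    const known = names
        .filter((n) => position(n) >= 0)
        .sort((a, b) => position(a) - position(b));
    const unknown = names.filter((n) => position(n) < 0);

    const ordered: Record<string, string> = {};
    for (const name of known.concat(unknown)) {
        const value = headers[name];
        if (value !== undefined) ordered[name] = value;
    }
    return ordered;
}

// Illustrative order only; copy the real one from a captured browser request.
const exampleOrder = ['User-Agent', 'Accept', 'Accept-Language', 'Accept-Encoding'];
const reordered = orderHeaders(
    { 'accept-encoding': 'gzip', 'x-custom': '1', 'accept': '*/*', 'user-agent': 'UA' },
    exampleOrder,
);
console.log(Object.keys(reordered).join(', '));
```

The values and the original casing of the header names are left untouched; only the order of the keys changes.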
Thanks, good to keep in mind