Anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler? The site I have a problem with is https://www.g2.com/. I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works. My own Chrome browser passes the check both with no proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser. There is a working scraper for g2 in the Apify Store, but it is written in Python. At least I know it is possible.
Playwright with FF should work; I used it to bypass CF a few months ago. Please share a run or snapshot. Make sure you await some real content element: on page load the CF check will be running for a while.
Here is the run: https://console.apify.com/view/runs/UHYyD8JIrtj68ePpW, but there is not much to see, it just returns 403. I got through CF many times with Playwright, but now it looks like they have improved the protection.
We chatted about this in private with , as I encountered CF blocking too...
If I understand it correctly, CF has two modes of bot protection (with kinda confusing names, TBH):
a) Bot Management – basic
b) Super Bot Fight Mode – advanced
The sites I'm scraping seem to use a). The solution to that seems to be pretty easy:
use CheerioCrawler with the playwright:firefox Dockerfile
in createSessionFunction: open Firefox (via Playwright), go to the site, let Firefox solve the JavaScript challenge, and save all the cookies and request headers to the session.
in preNavigationHooks: get the stored cookies/headers from the session and set them for got-scraping
This solution works for me both locally and on the Apify platform, without any proxies. Beware that it probably only works for sites that use the basic bot-protection mode.
I used the same approach but with internal fetch calls from inside the browser; IMHO it might be more reliable, since they should be doing something logically equivalent to a "heartbeat" check to see whether the web visitor is still online.
This should also work regardless of their internal protection mode: if the page context is reached, then fetch is expected to work, otherwise they (CF) would not be able to support web apps.