Anybody know how to pass the Cloudflare browser check with Crawlee's PlaywrightCrawler? The site I have a problem with is https://www.g2.com/. I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works. My own Chrome browser passes the check both with no proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser. There is a working scraper for g2 in the Apify Store, but it is written in Python. At least I know it is possible.
Playwright with FF should work; I used it to bypass CF a few months ago. Please share a run or snapshot. Make sure you await some real content element: on page load the CF check will be running for a while.
Here is the run: https://console.apify.com/view/runs/UHYyD8JIrtj68ePpW, but there is not much to see, it just returns 403. I got through CF many times with Playwright, but now it looks like they have improved the protection.
We chatted about this in private with , as I encountered CF blocking too...
If I understand it correctly, CF has two modes of bot protection (with kinda confusing names, TBH):
a) Bot Management – basic
b) Super Bot Fight Mode – advanced
The sites I'm scraping seem to use a). The solution to that seems to be pretty easy:
use CheerioCrawler with the playwright:firefox Dockerfile
in createSessionFunction: open Firefox (via Playwright), go to the site, let Firefox solve the JavaScript challenge, and save all the cookies and request headers to the session.
in preNavigationHooks: get the stored cookies/headers from the session and set them for got-scraping
This solution works for me both locally and on the Apify platform, without any proxies. Beware that it probably only works for sites that use the basic bot-protection mode.
I used the same approach but with internal fetch calls from inside the browser; IMHO it might be more reliable, since they should be doing something logically equivalent to a "heartbeat" check to see whether the web visitor is still online.
This should also work regardless of their internal protection mode: if the page context is reached, then fetch is expected to work, otherwise they (CF) would not be able to support web apps.