Replicate XHR requests to wait for cheerio page to load further

curioussoul

Dear all, after trying out browser-based data extraction, I need to enhance the crawler and make it lightweight. But CheerioCrawler doesn't work for sites that have Cloudflare security, because it just grabs the very first HTML response (which comes from Cloudflare) and that's it.

Is there any way to wait for the page to finish loading? Also, any suggestions on how to find out the actual endpoint that loads the data? I am using dev tools, but it looks like the JS and the data are quite tightly coupled.

Any help would be highly appreciated.

Thanks for the great work. 🙂

2 comments

Pepa J

Hello,
I am afraid there is currently no simple way for CheerioCrawler to evaluate in-page JavaScript, which I believe the Cloudflare protection is based on.

It really depends on the page, but one solution could be to use a Playwright/Puppeteer crawler just to deal with the Cloudflare protection and then reuse the obtained cookies (and possibly the same fingerprint) in your CheerioCrawler solution. But if the page is SPA-based, or the protection requires additional communication with Cloudflare, you may need to use the browser-based crawler anyway 😕.
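To make the cookie hand-off a bit more concrete, here is a minimal sketch of the glue step only, under stated assumptions. It takes cookies in the shape Playwright's `context.cookies()` returns (only the fields used here are declared) and turns them into headers for a plain HTTP request. The helper names `domainMatches`, `toCookieHeader` and `buildRequestHeaders` are hypothetical, not part of Crawlee or Playwright, and sending the browser's User-Agent along is an assumption about what "the same fingerprint" needs at minimum.

```typescript
// Subset of the cookie shape returned by Playwright's `context.cookies()`.
interface BrowserCookie {
  name: string;
  value: string;
  domain: string;  // may start with a dot, e.g. ".example.com"
  expires: number; // Unix time in seconds, -1 for session cookies
}

// Hypothetical helper: true if a cookie set for `cookieDomain`
// should be sent to `hostname` (exact match or subdomain match).
function domainMatches(hostname: string, cookieDomain: string): boolean {
  const domain = cookieDomain.charAt(0) === '.' ? cookieDomain.slice(1) : cookieDomain;
  if (hostname === domain) return true;
  return hostname.slice(-(domain.length + 1)) === '.' + domain;
}

// Hypothetical helper: keep cookies that match the host and have not
// expired, then join them into a single Cookie header value.
function toCookieHeader(cookies: BrowserCookie[], hostname: string, nowSeconds: number): string {
  return cookies
    .filter((c) => domainMatches(hostname, c.domain))
    .filter((c) => c.expires === -1 || c.expires > nowSeconds)
    .map((c) => c.name + '=' + c.value)
    .join('; ');
}

// Hypothetical helper: headers for the follow-up plain HTTP requests.
// Assumption: the request should present the same User-Agent as the
// browser session that obtained the cookies.
function buildRequestHeaders(
  cookies: BrowserCookie[],
  hostname: string,
  userAgent: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): Record<string, string> {
  return {
    'cookie': toCookieHeader(cookies, hostname, nowSeconds),
    'user-agent': userAgent,
  };
}
```

The resulting object could then be supplied as the `headers` of the requests you enqueue for CheerioCrawler. Whether that is enough depends on the site: as noted above, if the protection needs further communication with Cloudflare, the cookies alone will not get you through.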

curioussoul

OK, thank you. Let's continue with the browser and see where I hit the limits.

Apify and Crawlee Official Forum