Replicate XHR requests to wait for cheerio page to load further

At a glance

The community member is trying to enhance a crawler to make it lightweight, but the Cheerio crawler doesn't work for sites with Cloudflare security. The community member is looking for a way to wait for the completion of the Cloudflare protection and suggestions to find the actual endpoint loading the data, as the JavaScript and data seem tightly coupled. A community member suggests using a Playwright/Puppeteer crawler to deal with the Cloudflare protection and then using the obtained cookies and fingerprint in the Cheerio crawler, but notes that if the page is SPA-based or the protection requires additional communication with Cloudflare, a browser-based crawler may still be necessary.

curioussoul

Dear all, after trying out browser-based data extraction, I need to enhance the crawler and make it lightweight. But the Cheerio crawler doesn't work for sites that have Cloudflare security, because it just grabs the very first HTML (which comes from CF) and that's it.

Is there any way to wait for its completion? Also, any suggestions for finding the actual endpoint that is loading the data? I am using dev tools, but it looks like the JS and the data are quite strongly coupled.

Any help would be highly appreciated.

Thanks for the great work. 🙂

2 comments

Pepa J

Hello,
I am afraid there is currently no simple way for CheerioCrawler to evaluate in-page JavaScript, which I believe the Cloudflare protection is based on.

It really depends on the page, but one solution could be to use a Playwright/Puppeteer crawler just to deal with the Cloudflare protection, and then reuse the obtained cookies (and possibly the same fingerprint) in your CheerioCrawler solution. But if the page is SPA-based, or the protection requires additional communication with Cloudflare, you may need to use the browser-based crawler anyway 😕 .
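To make the cookie-reuse idea a bit more concrete, here is a rough sketch of the hand-off step only. It assumes the cookies were already collected from a browser session (for example via Playwright's `context.cookies()`) and turns them into headers for plain HTTP requests. The helper names `cookieHeaderFor` and `sessionHeaders` are hypothetical and not part of Crawlee or Playwright, the domain and path matching is deliberately simplified, and the "fingerprint" is reduced to the User-Agent string, which may not be enough for every site.

```typescript
// Shape of a cookie as exported from a browser context
// (matches the fields Playwright returns from `context.cookies()`).
interface BrowserCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires?: number; // Unix time in seconds; -1 or undefined means session cookie
}

// Hypothetical helper: select the cookies that apply to `url` and
// serialize them into a single Cookie header value.
// Simplified matching: suffix match on domain, prefix match on path.
function cookieHeaderFor(
  url: string,
  cookies: BrowserCookie[],
  nowSeconds: number = Date.now() / 1000,
): string {
  const { hostname, pathname } = new URL(url);
  return cookies
    .filter((c) => {
      const domain = c.domain.replace(/^\./, "");
      const domainOk = hostname === domain || hostname.endsWith(`.${domain}`);
      const pathOk = pathname.startsWith(c.path);
      const fresh =
        c.expires === undefined || c.expires < 0 || c.expires > nowSeconds;
      return domainOk && pathOk && fresh;
    })
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
}

// Hypothetical helper: headers for follow-up HTTP requests that should
// look like they come from the same browser session.
function sessionHeaders(
  url: string,
  cookies: BrowserCookie[],
  userAgent: string,
): Record<string, string> {
  return {
    cookie: cookieHeaderFor(url, cookies),
    "user-agent": userAgent, // assumption: must match the browser that solved the challenge
  };
}

// Example with made-up values; `cf_clearance` is the cookie Cloudflare
// sets after a passed challenge.
const exampleCookies: BrowserCookie[] = [
  { name: "cf_clearance", value: "abc", domain: ".example.com", path: "/", expires: -1 },
  { name: "tracker", value: "zzz", domain: ".other.com", path: "/", expires: -1 },
];
console.log(
  sessionHeaders("https://www.example.com/products", exampleCookies, "Mozilla/5.0 (example)"),
);
```

The resulting headers would then be attached to the requests made by the HTTP-based crawler. If the clearance cookie expires or is bound to something other than the cookies and User-Agent, the requests will most likely be challenged again, which is the case where staying with the browser-based crawler makes sense.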

curioussoul

OK, thank you. Let's continue with the browser and see where I hit the limits.


Apify Discord Mirror