Apify

Apify and Crawlee Official Forum

got-scraping vs cheerioCrawler or sendRequest

I have a weird case with a URL like this: https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html
With got-scraping it returns data, but with cheerioCrawler or sendRequest in BasicCrawler I get just a small HTML page with this text: 'Pravděpodobně používáte rozšíření či blokátory, jež mohou ovlivňovat načtení této stránky. Pro správnou funkčnost prosíme, deaktivujte všechna tato rozšíření a zkuste stránku načíst znovu.' (In English: 'You are probably using extensions or blockers that may affect the loading of this page. For proper functionality, please deactivate all such extensions and try loading the page again.')
Can anybody tell me what the difference is? I thought that cheerioCrawler and sendRequest both use got-scraping internally.
Thanks
4 comments
Hi,
it seems that this is caused by the isStream: true got option that Crawlee uses, but I am not entirely sure why. I will ask internally for more info.
Hi, can you please confirm that this is what is causing your issue? If so, as a workaround I believe you should be able to set this value in a preNavigationHook.
I have tried got-scraping with isStream: true, but it gives the same result as isStream: false.
However, I have found out that it is the accept header that makes the site return the consent screen instead of data.
So for cheerioCrawler the solution is to clear that header in preNavigationHooks like this:
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // Blank out the accept header so the site returns data instead of the consent screen
        crawlingContext.request.headers['accept'] = '';
    },
],
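For anyone who wants to try this outside a full crawl, here is a minimal sketch of the same hook written as a standalone function. The name clearAcceptHeader and the guard for a missing headers object are my own assumptions for illustration, not part of Crawlee; only preNavigationHooks and request.headers come from the post above. The commented usage at the bottom assumes the crawlee package is installed.

```javascript
// Hypothetical helper name; not part of Crawlee itself.
// Blanks the accept header on the request before it is sent,
// mirroring the workaround described in the post above.
async function clearAcceptHeader(crawlingContext, gotOptions) {
    const { request } = crawlingContext;
    // Assumption: guard against a hand-built context without headers.
    if (!request.headers) {
        request.headers = {};
    }
    // Set to an empty string, exactly as in the original workaround.
    request.headers['accept'] = '';
}

// Possible usage (assumes `crawlee` is installed):
// const { CheerioCrawler } = require('crawlee');
// const crawler = new CheerioCrawler({
//     preNavigationHooks: [clearAcceptHeader],
//     requestHandler: async ({ $, request }) => {
//         console.log(request.url, $('title').text());
//     },
// });
// crawler.run(['https://www.firmy.cz/detail/13470923-veronika-vankova-mseno.html']);
```

Keeping the hook as a named function makes it easy to reuse across crawlers and to check against a fake context object before running a real crawl.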