Apify and Crawlee Official Forum

Updated 3 months ago

CheerioCrawler works for amazon.de but gets detected as a bot on amazon.com

Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the online tutorial, and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but for the US a US datacenter proxy doesn't work. Below is the configuration. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging.


const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Return true or false
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});
9 comments
When I use this proxy on my system with a real browser, it works. So I assume the proxy is fine and the only problem is the CheerioCrawler config.
Have you tried other proxies (other groups, maybe residential)? Amazon is quite well protected, and it's common for one country to be better protected than another for the "same" website.
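A minimal sketch of trying a different proxy group per marketplace, assuming Apify Proxy is in use. The `MARKETPLACES` table and `buildProxyOptions` helper are hypothetical names for illustration; `groups` and `countryCode` are the proxy configuration options being filled in.

```javascript
// Hypothetical marketplace table: the domains come from this thread,
// the structure and keys are assumptions for illustration.
const MARKETPLACES = {
    de: { domain: 'amazon.de', countryCode: 'DE' },
    us: { domain: 'amazon.com', countryCode: 'US' },
};

// Builds a proxy options object with `countryCode` and, optionally, `groups`.
// The residential group is only requested when explicitly asked for,
// so the datacenter proxy stays the default.
function buildProxyOptions(marketplace, { residential = false } = {}) {
    const entry = MARKETPLACES[marketplace];
    if (!entry) throw new Error(`Unknown marketplace: ${marketplace}`);
    const options = { countryCode: entry.countryCode };
    if (residential) options.groups = ['RESIDENTIAL'];
    return options;
}

// Usage (assumes the Apify SDK and access to Apify Proxy):
// const proxyConfiguration = await Actor.createProxyConfiguration(
//     buildProxyOptions('us', { residential: true }),
// );
```

With something like this, amazon.de could keep the datacenter proxy while amazon.com is tested against a residential one, which would show whether the proxy type is the deciding factor.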
But the point is that the same proxy works in the browser. So an HTTP call in a browser with the same proxy works, but in Cheerio it doesn't.

To me it feels like the headers, cookies, etc. sent by a browser are different from what Cheerio sends.

Are any fingerprints used when we scrape via Cheerio?
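One way to check that suspicion is to capture the request headers from the browser (for example, from its network tab) and from the crawler, then compare them. The helper below is a hypothetical sketch of such a comparison, not part of Crawlee; it only looks at header names and values, so differences in header order or at the TLS level would not show up here.

```javascript
// Compares two header objects case-insensitively and reports what differs.
function diffHeaders(browserHeaders, crawlerHeaders) {
    // Lowercase all header names and stringify values so the comparison is fair.
    const normalize = (headers) => Object.fromEntries(
        Object.entries(headers).map(([name, value]) => [name.toLowerCase(), String(value)]),
    );
    const browser = normalize(browserHeaders);
    const crawler = normalize(crawlerHeaders);

    return {
        // Headers the browser sent that the crawler did not.
        onlyInBrowser: Object.keys(browser).filter((name) => !(name in crawler)).sort(),
        // Headers the crawler sent that the browser did not.
        onlyInCrawler: Object.keys(crawler).filter((name) => !(name in browser)).sort(),
        // Headers present in both but with different values.
        different: Object.keys(browser)
            .filter((name) => name in crawler && browser[name] !== crawler[name])
            .sort(),
    };
}
```

Running this on one amazon.de request and one amazon.com request might show whether the two marketplaces are really being hit with the same header set.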
I tried the same proxy in Playwright and it works. So there must be some settings in Cheerio that differ.
CheerioCrawler uses got-scraping, and yes, it uses fingerprints.
preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
        gotOptions.headerGeneratorOptions = {
            // browsers: [
            //     {
            //         name: 'chrome',
            //         minVersion: 90,
            //         maxVersion: 100
            //     }
            // ],
            devices: ['mobile'],
            locales: ['en-US'],
            operatingSystems: ['ios', 'android'],
        };
    },
],

With these settings it's better now.
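Since the scraper targets several marketplaces, the hook above could pick the locale from the request URL instead of hard-coding `en-US`. This is a sketch only: the `LOCALES` mapping and the names `headerOptionsFor` and `localeAwareHook` are hypothetical, and whether a matching locale actually helps on amazon.de is an assumption that would need testing.

```javascript
// Hypothetical locale per marketplace host; adjust to the marketplaces you scrape.
const LOCALES = {
    'amazon.com': 'en-US',
    'amazon.de': 'de-DE',
};

// Builds header generator options for a URL, keeping the mobile settings
// from the hook posted in this thread and varying only the locale.
function headerOptionsFor(url) {
    const host = new URL(url).hostname.replace(/^www\./, '');
    const locale = LOCALES[host] || 'en-US'; // fall back to en-US for unknown hosts
    return {
        devices: ['mobile'],
        locales: [locale],
        operatingSystems: ['ios', 'android'],
    };
}

// Pre-navigation hook with the same (crawlingContext, gotOptions) signature as above.
const localeAwareHook = async ({ request }, gotOptions) => {
    gotOptions.headerGeneratorOptions = headerOptionsFor(request.url);
};

// Usage: new CheerioCrawler({ preNavigationHooks: [localeAwareHook], /* ... */ });
```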
For me, there is no problem as long as I send the user agent in the headers.
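A rough sketch of sending a fixed user agent from a pre-navigation hook. The user agent string is only an example value, and `withUserAgent` and `userAgentHook` are hypothetical names; this assumes that headers set on `gotOptions` take precedence over the generated ones.

```javascript
// Example value only; replace with the user agent you actually want to send.
const EXAMPLE_USER_AGENT =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Returns a copy of `headers` with exactly one user-agent entry.
// Any existing entry is dropped regardless of its capitalization.
function withUserAgent(headers = {}, userAgent = EXAMPLE_USER_AGENT) {
    const merged = {};
    for (const [name, value] of Object.entries(headers)) {
        if (name.toLowerCase() !== 'user-agent') merged[name] = value;
    }
    merged['user-agent'] = userAgent;
    return merged;
}

// Pre-navigation hook with the (crawlingContext, gotOptions) signature used in this thread.
const userAgentHook = async (crawlingContext, gotOptions) => {
    gotOptions.headers = withUserAgent(gotOptions.headers);
};

// Usage: new CheerioCrawler({ preNavigationHooks: [userAgentHook], /* ... */ });
```

A pinned user agent may not match the other generated headers, so it is probably worth comparing this against the header generator settings rather than combining both blindly.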
OK, for .com or for .de?