Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed an online tutorial, and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but for the US a US datacenter proxy does not. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging. Below is the configuration:
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Return true or false
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});
Have you tried other proxies (other groups, maybe residential)? Amazon is quite well protected, and it's common for one country to be better protected than another for the "same" website.
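If you end up needing a different proxy type per marketplace, one way to keep that manageable is a small lookup that picks the proxy URLs for each country. This is only a rough sketch: the `PROXY_POOLS` table, the `proxyUrlsFor` helper, and the `proxy.example` URLs are all made up for illustration, and whether the US actually needs residential proxies is an assumption you would have to verify by testing.

```javascript
// Hypothetical proxy endpoints; replace with the URLs from your own provider.
// Which marketplace needs which proxy type is an assumption, not a measured result.
const PROXY_POOLS = {
    DE: { type: 'datacenter', urls: ['http://user:pass@dc-de.proxy.example:8000'] },
    US: { type: 'residential', urls: ['http://user:pass@res-us.proxy.example:8000'] },
};

// Return the list of proxy URLs configured for a marketplace code such as 'US'.
// Throws for an unknown marketplace so a missing entry fails loudly.
function proxyUrlsFor(marketplace) {
    const pool = PROXY_POOLS[marketplace];
    if (!pool) {
        throw new Error(`No proxy pool configured for marketplace "${marketplace}"`);
    }
    return pool.urls;
}

// Usage with Crawlee (not executed here):
// const { ProxyConfiguration } = require('crawlee');
// const proxyConfiguration = new ProxyConfiguration({ proxyUrls: proxyUrlsFor('US') });
```

You could then build one crawler per marketplace, each with its own `proxyConfiguration`, and only change the entry for the country that is getting blocked.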