conducting faster scrapes with pagination and individual product scraping

harish

Hey, I was curious: when I'm scraping Amazon, what's a reasonable time frame for a scrape, considering that I collect each product link from the results page, then scrape each individual product page for the information, and also paginate through the results pages until there are none left?
I previously scraped product info straight off the product cards on the results page, but it would sometimes give dummy links that led to an unrelated Amazon page, and the product info was less accurate.
How can I increase the speed of my scrapes, especially since I want to add more and more scrapers in the future that should all run concurrently to save time? I'm aiming for quite a low scrape time of 10 to 15 seconds or lower, and it's currently taking upwards of 1 minute.
This is a Cheerio crawler.

3 comments

Andrey Bykov

Cheerio is pretty much the fastest solution (the only faster option would be Cheerio plus calling API/XHR endpoints that return structured JSON). So with this you're pretty much limited by network speed and response time. Also, keep in mind that if you send several subsequent requests for one product, that takes extra time. Otherwise, higher concurrency (which needs more available memory and CPU power) solves the problem.
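To put rough numbers on the "limited by response time, solved by concurrency" point, here is a back-of-the-envelope sketch. It assumes requests run in rounds of `concurrency` at a time and that every round takes about one average response time; the function name and all the figures are made up for illustration and are not measurements from Amazon or from Crawlee.

```typescript
// Rough model: requests run in "rounds" of `concurrency` at a time,
// and each round takes about one average response time.
function estimateCrawlSeconds(
  resultPages: number,
  productsPerPage: number,
  avgResponseSeconds: number,
  concurrency: number,
): number {
  // One request per results page plus one request per product page.
  const totalRequests = resultPages + resultPages * productsPerPage;
  const rounds = Math.ceil(totalRequests / concurrency);
  return rounds * avgResponseSeconds;
}

// Hypothetical numbers: 5 results pages, 20 products each, 1.5 s per response.
const sequential = estimateCrawlSeconds(5, 20, 1.5, 1); // 105 rounds -> 157.5 s
const parallel = estimateCrawlSeconds(5, 20, 1.5, 20); // 6 rounds -> 9 s
console.log({ sequential, parallel });
```

The model is optimistic: following "next page" links is inherently sequential, and retries or blocked requests add time, so treat the result as a lower bound rather than a prediction.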

harish

Are there any methods in Crawlee to scrape different sites, and the links within a specific site, in parallel? I see that each results page on the site is scraped one by one, and so is each product page, so is there any way to do that?

Andrey Bykov

It's done automatically out of the box. When there's spare memory/CPU capacity, the autoscaled pool starts more requests. Every page still has to be opened, but Crawlee opens these pages in parallel.
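For anyone wondering what "opens these pages in parallel" looks like in practice, below is a minimal sketch of the general idea using a fixed-size pool. It is an illustration only and not Crawlee's actual implementation: the real autoscaled pool adjusts the limit based on memory and CPU, whereas this one uses a constant. `runWithConcurrency` and `fakeFetch` are hypothetical names, and the "fetch" is just a timer. In Crawlee itself you would not write this; as far as I know, the bounds are set with the `minConcurrency` and `maxConcurrency` crawler options.

```typescript
// Illustration only: a fixed-size pool, NOT Crawlee's actual implementation.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker keeps pulling the next unstarted task until none are left.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Simulated "page fetch" that records how many fetches overlap.
let inFlight = 0;
let maxInFlight = 0;
function fakeFetch(url: string, ms: number): () => Promise<string> {
  return async () => {
    inFlight++;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, ms));
    inFlight--;
    return `scraped ${url}`;
  };
}

// Six hypothetical product pages, at most three fetched at once.
const demoTasks = ["/p/1", "/p/2", "/p/3", "/p/4", "/p/5", "/p/6"].map((url) =>
  fakeFetch(url, 20),
);
const demoRun = runWithConcurrency(demoTasks, 3).then((pages) => {
  console.log(pages.length, "pages scraped; peak parallel fetches:", maxInFlight);
  return pages;
});
```

Results come back in the original task order even though the fetches overlap, which is why the log output of a crawler can look sequential while the requests are actually running side by side.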


Apify and Crawlee Official Forum
