system design of concurrent crawlers

hharish

i have multiple crawlers - primarily playwright, per site that work on their own completely fine when i use only 1 crawler per site
i have tried running these crawlers concurrently through a scrape event that is emitted from the server that emits individual scrape events for each site to run each crawler
i face a lot of memory overloads, timed out navigations, skipping of many products, and early ending of the crawlers
each crawler essentially takes base urls or scrapes these base urls to get product urls that are then indvidually scraped through to get product page info

3 comments

hharish

Plain Text

WARN  PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 6177 MB of 4017 MB (154%). Consider increasing available memory.
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. This crawler instance is already running, you can add more requests to it via `crawler.addRequests()`.
 {"id":"vI4UdrhFP5NVjsV","url":"https://www.tentree.com/collections/kids?page=10","retryCount":1}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.05},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: Timeout 30000ms exceeded.
=========================== logs ===========================
navigating to "https://www.tentree.com/collections/womens?page=50", waiting until "networkidle"
============================================================
 {"id":"2QcFPmYcDLgjTat","url":"https://www.tentree.com/collections/womens?page=50","retryCount":2}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds. {"id":"qgjcGJ50OKucpO6","url":"https://www.kleankanteen.com/collections/all/products/party-kit","retryCount":1}

how do i design the scrapers and run them either one by one, certain aspects one by one, how to manage multiple running at once, etc. and is it doable to do that in the first place

hharish

should i run them in separate terminalsor how else should i run thrm

hharish

and finally how would i run each one in actors on apify, what plan would i choose, how would i manage memory, storage, cu's, etc. and should i run them under the same actor instance or should i run them on individual actor instances and how would you reccomed for me to effectively manage each site crawler

Add a reply

Join on Discord

Apify and Crawlee Official Forum

system design of concurrent crawlers