Apify Discord Mirror

Updated 3 months ago

Concurrency Settings vs Autoscaling Pool

At a glance

The community member is deploying a Playwright crawler on a beefy EC2 instance with min_concurrency set to 10 and max_concurrency set to 100. However, the autoscaling pool reports a current concurrency of 0 and a desired concurrency of 10. The community member passes a curl impersonate HTTP client to the Playwright crawler and asks for hints on optimizing for concurrency. A comment from another community member points out that passing an HTTP client to the Playwright crawler makes no sense, because Playwright is a browser-based crawler and does not use an HTTP client. The comment also mentions that the desired concurrency can be configured for when the crawler starts, and that the number of tasks affects how the current concurrency is calculated. The community member is pointed to a specific GitHub issue for more detail.

Nicolay

I am really curious about the mismatch between what I configure and what I see.

I am deploying it on a beefy EC2 with the following settings:

from crawlee import ConcurrencySettings

concurrency_settings = ConcurrencySettings(
    min_concurrency=10,
    max_concurrency=100,
)

But my autoscaling pool tells me:

[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.212; client_info = 0.0

I am using PlaywrightCrawler with a curl impersonate HTTP client:

return PlaywrightCrawler(
    request_handler=router,
    request_handler_timeout=timeout,
    max_request_retries=config.max_retries,
    concurrency_settings=concurrency_settings,
    http_client=http_client,
)

Are there any hints on optimizing for concurrency?

1 comment

Mantisus

Hi, it makes no sense to pass http_client to PlaywrightCrawler, because PlaywrightCrawler doesn't use it.

http_client is for HTTP based crawlers.
PlaywrightCrawler is a browser-based crawler.

You can also set desired_concurrency, which is the concurrency the crawler starts with. The number of available tasks also affects how current_concurrency is calculated. You can read more in this issue: https://github.com/apify/crawlee-python/issues/786#issuecomment-2527802437
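As a rough illustration of why the number of tasks matters, here is a simplified model in plain asyncio. It is not Crawlee's AutoscaledPool, and the names run_pool and peak_concurrency are made up for this sketch. It assumes every request takes the same short time, and it only shows that the number of tasks running at once cannot exceed the number of tasks available, whatever desired_concurrency is set to.

```python
import asyncio


async def run_pool(num_tasks: int, desired_concurrency: int) -> int:
    """Run num_tasks dummy tasks, at most desired_concurrency at a time.

    Hypothetical helper for illustration only (not part of Crawlee).
    Returns the highest number of tasks that ran at the same moment.
    """
    limiter = asyncio.Semaphore(desired_concurrency)
    running = 0
    peak = 0

    async def task() -> None:
        nonlocal running, peak
        async with limiter:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for handling one request
            running -= 1

    await asyncio.gather(*(task() for _ in range(num_tasks)))
    return peak


def peak_concurrency(num_tasks: int, desired_concurrency: int) -> int:
    """Synchronous wrapper around run_pool (also hypothetical)."""
    return asyncio.run(run_pool(num_tasks, desired_concurrency))


if __name__ == "__main__":
    # Compare an empty queue, a short queue, and a long queue.
    for queued in (0, 3, 50):
        print(queued, "tasks queued -> peak", peak_concurrency(queued, 10))
```

In this model an empty queue keeps the peak at 0 even though the limit is 10. That would be consistent with a log line showing current_concurrency = 0 next to desired_concurrency = 10, although whether an empty or slow-filling queue is the cause here is only a guess.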