Apify Discord Mirror

Nicolay
Joined December 12, 2024
I am curious about the mismatch between what I configure and what I see.

I am deploying it on a beefy EC2 with the following settings:

from crawlee import ConcurrencySettings

concurrency_settings = ConcurrencySettings(
    min_concurrency=10,
    max_concurrency=100,
)

But my autoscaling pool tells me:

[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.212; client_info = 0.0
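To compare the configured values with what the pool reports over time, the status line can be parsed into numbers. This is a minimal sketch that assumes the log format stays exactly as quoted above; `parse_pool_status` is a hypothetical helper, not part of Crawlee.

```python
import re

# The status line quoted in the post, used here as sample input.
LOG_LINE = (
    "[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; "
    "desired_concurrency = 10; cpu = 0.0; mem = 0.0; event_loop = 0.212; "
    "client_info = 0.0"
)


def parse_pool_status(line: str) -> dict[str, float]:
    """Extract every `name = number` pair from one status line (hypothetical helper)."""
    pairs = re.findall(r"(\w+) = ([0-9.]+)", line)
    return {name: float(value) for name, value in pairs}


status = parse_pool_status(LOG_LINE)

# Concurrency headroom: how far the pool's target is from the configured maximum.
configured_max = 100
headroom = configured_max - status["desired_concurrency"]
print(status["current_concurrency"], status["desired_concurrency"], headroom)
```

In the quoted line, the desired concurrency equals the configured `min_concurrency` of 10, while the current concurrency is 0.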

I am using a Playwright crawler with a curl-impersonate HTTP client:

return PlaywrightCrawler(
    request_handler=router,
    request_handler_timeout=timeout,
    max_request_retries=config.max_retries,
    concurrency_settings=concurrency_settings,
    http_client=http_client,
)

Are there any hints on optimizing for concurrency?
1 comment
Hey

I have a crawler which scrapes a lot of different websites, each with multiple URLs.

Each website has an associated ID that I need for the dataset.

I want to scrape the URLs and get the data, then immediately send it to a database so I don't have to keep it on the EC2 instance.

Is there a way to pass extra variables to @router.default_handler?

for company in valid_company_urls:
    crawler = await create_crawler(config, company)

    # Run crawler for this company's URLs
    await crawler.run(company['url'])

When I do something like this, how could I pass additional arguments to run so that they reach the handler?

I have not found anything in the docs.

Thanks for any hints!
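One library-independent way to get a per-company value into a handler is to build the handler inside a factory function so that it closes over the value. The sketch below uses only the standard library: `FakeContext`, `save_row`, and `make_handler` are hypothetical stand-ins, not Crawlee APIs, and the list plays the role of the database. Since the loop above already creates one crawler per company, a factory like this could plausibly live inside `create_crawler`. Crawlee's `Request` objects also carry a `user_data` field that may be the more direct route; the Crawlee documentation is the place to confirm the details.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class FakeContext:
    """Hypothetical stand-in for the context object a handler receives."""
    url: str


# Stands in for the remote database in this sketch.
saved_rows: list[dict[str, str]] = []


async def save_row(row: dict[str, str]) -> None:
    """Hypothetical database write; here it only appends to a list."""
    saved_rows.append(row)


def make_handler(company_id: str) -> Callable[[FakeContext], Awaitable[None]]:
    """Return a handler that remembers company_id through a closure."""

    async def handler(context: FakeContext) -> None:
        # company_id comes from the enclosing scope, not from the caller.
        await save_row({"company_id": company_id, "url": context.url})

    return handler


async def main() -> None:
    companies = [
        {"id": "c1", "url": "https://example.com/a"},
        {"id": "c2", "url": "https://example.org/b"},
    ]
    for company in companies:
        handler = make_handler(company["id"])
        # A real crawler would invoke the handler; here it is called directly.
        await handler(FakeContext(url=company["url"]))


asyncio.run(main())
print(saved_rows)
```

Each handler writes its row as soon as it runs, so nothing needs to accumulate on the instance between requests.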
3 comments