Apify and Crawlee Official Forum

Updated 4 months ago

Python crawlers running in parallel

Hi, I have a custom Python + requests Actor that works great. It's pretty simple: it works against a list of starting URLs and pulls out one piece of information per URL.

My question is: if (for example) one run of 1,000 input URLs takes an hour to complete, I would like to parallelize it 4 ways so that I can process 4,000 URLs in an hour.

What's the best way to do this? I could kick off 4 copies of the run with segmented data, but this seems like something Apify could support natively.
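For illustration, here's a rough sketch of what I mean by segmented runs, using the apify-client package. The token, Actor ID, and the startUrls input field are just placeholders for my setup:

```python
# Sketch: split the input list into chunks and start one Actor run per chunk.
# "MY_APIFY_TOKEN", "username/my-crawler", and the "startUrls" input field
# are placeholders; adapt them to your own Actor.

from apify_client import ApifyClient

client = ApifyClient("MY_APIFY_TOKEN")

urls = [f"https://example.com/page/{i}" for i in range(4000)]  # dummy input
num_runs = 4
chunk_size = len(urls) // num_runs

for i in range(num_runs):
    chunk = urls[i * chunk_size : (i + 1) * chunk_size]
    # start() returns immediately (unlike call()), so the runs execute in parallel.
    client.actor("username/my-crawler").start(run_input={"startUrls": chunk})
```

This works, but it means managing 4 runs and 4 datasets instead of 1.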

I saw that if I were using Crawlee (and therefore JS) I could use autoscaling: https://docs.apify.com/platform/actors/running/usage-and-resources . But is there a way to build a single Python-based Actor that uses more threads/CPU cores when needed?
4 comments
Hi,

We don't have similar functionality in the Python SDK yet (it's planned for the upcoming months).

But for now, you could write a simple utility using asyncio.Queue to get what you need: https://docs.python.org/3/library/asyncio-queue.html#examples
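For example, a minimal sketch of that pattern: a fixed pool of worker tasks pulling URLs from an asyncio.Queue, with the blocking requests calls wrapped in asyncio.to_thread so they can overlap on I/O. The fetch() function, NUM_WORKERS value, and example URLs are placeholders to adapt to your Actor:

```python
import asyncio
import requests

NUM_WORKERS = 4  # tune to your workload

def fetch(url: str) -> str:
    # Your existing synchronous requests logic goes here.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

async def worker(queue: asyncio.Queue, results: list) -> None:
    while True:
        url = await queue.get()
        try:
            # Run the blocking call in a thread so workers overlap on I/O.
            body = await asyncio.to_thread(fetch, url)
            results.append((url, body))
        except Exception as exc:
            print(f"{url} failed: {exc}")
        finally:
            queue.task_done()

async def main(urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    for url in urls:
        queue.put_nowait(url)

    workers = [asyncio.create_task(worker(queue, results)) for _ in range(NUM_WORKERS)]
    await queue.join()    # wait until every queued URL has been processed
    for task in workers:
        task.cancel()     # workers loop forever, so cancel them once the queue drains
    print(f"Fetched {len(results)} pages")

if __name__ == "__main__":
    asyncio.run(main(["https://example.com"] * 10))
```

Since the work is network-bound, raising NUM_WORKERS helps only up to a point; past that, the workers just wait on the same connection.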
Thank you! I will try out the queue system.
Hello! I've ported my crawler to use asyncio queues. It does work faster: with 4 tasks running it's roughly 2x the speed. But increasing the task count further doesn't really increase speed. Going to 8 tasks gives a slight boost (measured in requests/second), but 16 is the same as 8 and sometimes slower. I've tried increasing memory, but it doesn't make anything faster, just costs more. My RAM/CPU graph is pretty stable (this is at 16 tasks).

I do notice that if I boot another Actor run, it runs at the same speed, so I know I can get a speed boost by running in parallel, but I don't know how to do that with just one Actor. Are we limited in bandwidth per Actor run / Docker container? Is there a way to increase that limit so I can have just one Actor run, for simplicity?
[Attachment: image.png — RAM/CPU usage graph]