Apify Discord Mirror

Updated 5 months ago

scraping at scale

At a glance

The community member is asking how to structure a crawler when scraping possibly hundreds of different websites with varying structures, while handling multiple requests at once using Crawlee. Another community member suggests using an external message queue, such as beanstalkd, to manage the requests. They also recommend creating a configuration file (e.g., YAML or JSON) that describes where to find specific content on each website. There is no explicitly marked answer in the comments.

Useful resources
How should I structure my crawler when scraping possibly 100s of different sites with different structures, handling multiple requests at once in Crawlee
3 comments
Well, I am implementing something similar... 30-40 sites, but with a SIMILAR structure (if the structures of your sites are all different, you are implementing something like Google/Bing - a kind of generic web crawler).

  1. You might use something like an external message queue; we discussed it here and in a few other places:
https://discord.com/channels/801163717915574323/1056348705407651941

beanstalkd is just fine for these purposes.
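The producer/consumer split that an external queue gives you can be sketched with Python's in-process `queue.Queue` as a stand-in (a minimal illustration only; in production a beanstalkd tube, accessed through a client library, would sit between the producer and the crawler workers, and all names below are hypothetical):

```python
from queue import Queue

# In-process stand-in for an external job queue such as beanstalkd.
# With beanstalkd, the producer would put jobs into a tube and crawler
# workers would reserve them from it.
job_queue = Queue()

def produce(urls):
    """Producer side: enqueue one crawl job per URL."""
    for url in urls:
        job_queue.put({"url": url})

def consume(handle):
    """Crawler side: drain the queue, passing each job to a handler."""
    results = []
    while not job_queue.empty():
        job = job_queue.get()
        results.append(handle(job))
        job_queue.task_done()
    return results

produce(["https://abc123.com/", "https://xyz987.com/"])
processed = consume(lambda job: job["url"])
```

The point of the indirection is that the part of the system discovering URLs and the part scraping them scale independently and can even be separate processes.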

  2. You can create one big config file (YAML, JSON, ...) describing "where-to-find-what" on each site.
Example:

abc123.com:
  listOfTopics: h1 > div.list > div
  ...

xyz987.com:
  listOfTopics: div.bigListClass > div > p
  ...