Apify and Crawlee Official Forum

Updated 3 months ago

scraping at scale

How should I structure my crawler in Crawlee when scraping possibly hundreds of sites with different structures, while handling multiple requests at once?
3 comments
Well, I am implementing something similar: 30-40 sites, but with SIMILAR structure. (If the structure of your sites is different, you are implementing something like Google/Bing, a kind of generic web crawler.)

  1. You might use something like an external message queue. We discussed it here and in a few other places:
https://discord.com/channels/801163717915574323/1056348705407651941

beanstalkd is just fine for these purposes.

  2. You can create one big config file (YAML, JSON...) describing where to find what on each site.
Example:

abc123.com:
  listOfTopics: h1 > div.list > div
  ...

xyz987.com:
  listOfTopics: div.bigListClass > div > p
  ....
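
To make the config idea a bit more concrete, here is a rough sketch of how such a file could be used to pick the right selectors for a given URL. Only the two hostnames and the listOfTopics key come from the example above. The SiteSelectors type, the siteConfig object and the selectorsFor helper are made-up names for illustration, and the config is inlined here instead of being loaded from the YAML/JSON file.

```typescript
// Hypothetical shape of one site's entry. Only `listOfTopics` appears in the
// example config; a real file would likely carry more keys per site.
interface SiteSelectors {
  listOfTopics: string;
}

// Assumption: in practice this object would be parsed from the YAML/JSON file.
// It is inlined here so the sketch runs on its own.
const siteConfig: Record<string, SiteSelectors> = {
  "abc123.com": { listOfTopics: "h1 > div.list > div" },
  "xyz987.com": { listOfTopics: "div.bigListClass > div > p" },
};

// Look up the selectors for a URL by its hostname, ignoring a leading "www.".
// Returns undefined when the site is not described in the config.
function selectorsFor(url: string): SiteSelectors | undefined {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return siteConfig[host];
}

console.log(selectorsFor("https://www.abc123.com/forum")?.listOfTopics);
// prints "h1 > div.list > div"
```

Inside a Crawlee CheerioCrawler request handler you could then do something like selectorsFor(request.url) and pass the resulting listOfTopics string to $(), so one generic handler serves all configured sites. Treat that wiring as an assumption about your setup, not a tested recipe.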