Apify Discord Mirror

Updated 5 months ago

scraping at scale

At a glance

The community member is asking how to structure a crawler when scraping possibly hundreds of different websites with varying structures, while handling multiple requests at once using Crawlee. Another community member suggests using an external message queue, such as beanstalkd, to manage the requests. They also recommend creating a configuration file (e.g., YAML or JSON) that describes where to find specific content on each website. There is no explicitly marked answer in the comments.

Useful resources
How should I structure my crawler when scraping possibly 100s of different sites with different structures, handling multiple requests at once in Crawlee
3 comments
Well, I am implementing something similar... 30-40 sites, but with a SIMILAR structure (if the structures of your sites are all different, you are implementing something like Google/Bing - a kind of generic web crawler).

  1. You might use something like an external message queue; we discussed it here and in a few other places:
https://discord.com/channels/801163717915574323/1056348705407651941

beanstalkd is just fine for these purposes.
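The producer/consumer split that an external queue gives you can be sketched with Python's in-process `queue.Queue` as a stand-in (a minimal illustration only; in production a beanstalkd tube, accessed through a client library, would sit between the producer and the crawler workers, and all names below are hypothetical):

```python
from queue import Queue

# In-process stand-in for an external job queue such as beanstalkd.
# With beanstalkd, the producer would put jobs into a tube and crawler
# workers would reserve them from it.
job_queue = Queue()

def produce(urls):
    """Producer side: enqueue one crawl job per URL."""
    for url in urls:
        job_queue.put({"url": url})

def consume(handle):
    """Crawler side: drain the queue, passing each job to a handler."""
    results = []
    while not job_queue.empty():
        job = job_queue.get()
        results.append(handle(job))
        job_queue.task_done()
    return results

produce(["https://abc123.com/", "https://xyz987.com/"])
processed = consume(lambda job: job["url"])
```

The point of the indirection is that the part of the system discovering URLs and the part scraping them scale independently and can even be separate processes.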

  2. You can create one big config file (YAML, JSON, ...) describing "where-to-find-what" on each site.
Example:

abc123.com:
  listOfTopics: h1 > div.list > div
  ...

xyz987.com:
  listOfTopics: div.bigListClass > div > p
  ...