Any suggestions for improving the speed of the crawling run?
At a glance
The post asks for suggestions on how to speed up a web crawling run beyond simply reducing the scope of what is crawled. Community members offered the following suggestions:
1. If a headless browser is used for scraping, replace as much of the process as possible with plain HTTP calls instead of browser navigation. This can mean calling the website's internal or public JSON API, or fetching and parsing the raw HTML directly (see the sketch after the comment below).
2. Increase parallelism and concurrency; this is usually the first factor to consider (a minimal concurrency sketch follows this list).
3. In some cases, optimize the request and parsing logic itself.
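To illustrate point 2, below is a minimal sketch of bounded-concurrency fetching in Python using asyncio and aiohttp. The URL list, concurrency limit, and timeout are placeholder assumptions, not values from the original post; the same idea applies whatever crawler framework is actually in use, since most expose a maximum-concurrency setting.

```python
import asyncio
import aiohttp

# Hypothetical list of pages to crawl; replace with your own URL source.
URLS = [f"https://example.com/page/{i}" for i in range(1, 101)]
MAX_CONCURRENCY = 20  # tune to what the target site and your machine tolerate

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()

async def crawl() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # Schedule all fetches concurrently; the semaphore bounds parallelism.
        return await asyncio.gather(*(fetch(session, sem, u) for u in URLS))

if __name__ == "__main__":
    pages = asyncio.run(crawl())
    print(f"Fetched {len(pages)} pages")
```

Raising MAX_CONCURRENCY increases throughput until you hit the target site's rate limits or your machine's memory and network limits, so increase it gradually.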
There is no explicitly marked answer in the comments.
hey, aside from obvious improvements like allocating more memory or increasing the maximum concurrency, if you're using a browser for scraping, try to find ways to replace parts of the process with simple HTTP calls instead of relying on browser navigation. You can scrape the website's internal or public JSON API, or the site's plain HTML, this way. Use the browser only when it's really necessary.
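As a concrete example of the comment's advice, here is a minimal sketch of calling a site's JSON API directly with Python's requests library instead of driving a browser. The endpoint, query parameters, and response shape are assumptions for illustration; the real API can usually be found by watching the browser's network tab while the page loads.

```python
import requests

# Hypothetical endpoint: many sites load their listing data from an internal
# JSON API. Hitting it directly skips page rendering entirely.
API_URL = "https://example.com/api/v1/products"

def fetch_products(page: int) -> list[dict]:
    resp = requests.get(
        API_URL,
        params={"page": page, "limit": 100},
        headers={"User-Agent": "my-crawler/1.0"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes the API returns {"items": [...]}; adjust to the real payload shape.
    return resp.json().get("items", [])

if __name__ == "__main__":
    items = fetch_products(page=1)
    print(f"Got {len(items)} items without launching a browser")
```

A single JSON request like this typically replaces a full page load plus client-side rendering, which is where most of the per-page time goes when crawling with a browser.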