Any suggestions for improving the speed of the crawling run?
At a glance
The post asks for suggestions on how to speed up a web crawling run beyond simply reducing the scope of what is crawled. Community members offered the following suggestions:
1. If a headless browser is used for scraping, replace as much of the process as possible with plain HTTP calls instead of browser navigation. This can mean calling the website's internal or public JSON API, or fetching and parsing the raw HTML directly (see the sketch after the comment below).
2. Increase parallelism and concurrency; this is usually the first factor to consider (a minimal concurrency sketch follows this list).
3. In some cases, optimize the request and parsing logic itself.
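To illustrate point 2, below is a minimal sketch of bounded-concurrency fetching in Python using asyncio and aiohttp. The URL list, concurrency limit, and timeout are placeholder assumptions, not values from the original post; the same idea applies whatever crawler framework is actually in use, since most expose a maximum-concurrency setting.

```python
import asyncio
import aiohttp

# Hypothetical list of pages to crawl; replace with your own URL source.
URLS = [f"https://example.com/page/{i}" for i in range(1, 101)]
MAX_CONCURRENCY = 20  # tune to what the target site and your machine tolerate

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps how many requests are in flight at once.
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()

async def crawl() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # Schedule all fetches concurrently; the semaphore bounds parallelism.
        return await asyncio.gather(*(fetch(session, sem, u) for u in URLS))

if __name__ == "__main__":
    pages = asyncio.run(crawl())
    print(f"Fetched {len(pages)} pages")
```

Raising MAX_CONCURRENCY increases throughput until you hit the target site's rate limits or your machine's memory and network limits, so increase it gradually.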
There is no explicitly marked answer in the comments.
hey, aside from obvious improvements like allocating more memory or increasing the maximum concurrency, if you're using a browser for scraping, try to find ways to replace parts of the process with simple HTTP calls instead of relying on browser navigation. You can scrape the website's internal or public JSON API, or the site's plain HTML, this way. Use the browser only when it's really necessary.
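As a concrete example of the comment's advice, here is a minimal sketch of calling a site's JSON API directly with Python's requests library instead of driving a browser. The endpoint, query parameters, and response shape are assumptions for illustration; the real API can usually be found by watching the browser's network tab while the page loads.

```python
import requests

# Hypothetical endpoint: many sites load their listing data from an internal
# JSON API. Hitting it directly skips page rendering entirely.
API_URL = "https://example.com/api/v1/products"

def fetch_products(page: int) -> list[dict]:
    resp = requests.get(
        API_URL,
        params={"page": page, "limit": 100},
        headers={"User-Agent": "my-crawler/1.0"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes the API returns {"items": [...]}; adjust to the real payload shape.
    return resp.json().get("items", [])

if __name__ == "__main__":
    items = fetch_products(page=1)
    print(f"Got {len(items)} items without launching a browser")
```

A single JSON request like this typically replaces a full page load plus client-side rendering, which is where most of the per-page time goes when crawling with a browser.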