Website Content Crawler Execution Speed

At a glance

A community member reports that the Apify Website Content Crawler actor often takes longer than 2 minutes to scrape a single page. They ask whether changing the requestTimeoutSecs input parameter would speed it up, and what other options exist to control execution speed. Another member asks for the run link, notes that such a small scrape should not take that long, reruns the actor with the same input in 18 seconds, and suggests trying again in case it was a one-off issue. The original poster replies that the slowdown has happened multiple times over multiple days, and that the actor is called from Make with a 2-minute execution limit. The thread closes with an explanation that requestTimeoutSecs sets how long the actor waits on a request before it times out and retries or fails, while the concurrency fields (initialConcurrency, maxConcurrency) matter only when scraping multiple pages at once.


ddavhad

Hello, the Website Content Crawler actor (developed by Apify) is often taking longer than 2 minutes to scrape only one page.

I'm wondering: if we change the input param "requestTimeoutSecs": 60, does that force the actor to go faster? What other levers do we have to control execution speed?

I understand asynchronous runs for website crawls, but don't get why one page would take longer than 2 minutes to execute.

8 comments

!! ! !.Terry

Check your inbox.

JJameEnder

Hello, can you send me the run link in a private message? I will take a look, as this is unusual behavior; a simple 2-page scrape should not take 2 minutes.

ddavhad

Don't see a way to DM you on Discord.

But here's the link anyway: https://console.apify.com/actors/tasks/d0h4jRkVTRHLNDsJG/runs/9AYQvFj6SmbBiuloZ

Can you access it?

Thanks for sharing your thoughts

ddavhad

One page
Max crawl depth: 0
Duration: 3m22s

ddavhad

just accepted your connect request. feel free to dm.

JJameEnder

I tried running the actor with the exact same input, and it finished after 18 seconds. Can you try running it again? It's possible there was a very temporary outage, or just weird Docker behavior in the background.

ddavhad

Yeah I can see how that could be a one-off due to a slow cold start or something.

It's happened multiple times over multiple days 😦

I'm calling it from Make (with a 2-minute execution limit on the Apify synchronous actor-calling node).

Do you know what the input "requestTimeoutSecs" is supposed to do and if it would have any impact?

And "initialConcurrency" and "maxConcurrency"? (Probably useless for one-page scrapes, but double-checking as I'm unsure exactly how the implementation works behind the scenes.)
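The 2-minute limit mentioned above applies to the synchronous call, not to the run itself. One possible way around it, sketched here under assumptions rather than taken from the thread, is to start the run asynchronously and poll its status until it reaches a terminal state. The helper name `wait_for_run` and the `get_status` callable are hypothetical; in practice `get_status` would wrap whatever client or HTTP call returns the run's current status string.

```python
import time

# Status strings that mean a run has stopped and will not change again.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def wait_for_run(get_status, deadline_secs, poll_interval_secs=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the run is terminal or the deadline passes.

    get_status is a hypothetical callable returning the current status string.
    Returns the last status seen, which may be non-terminal on a deadline hit.
    """
    start = clock()
    status = get_status()
    while status not in TERMINAL_STATUSES and clock() - start < deadline_secs:
        sleep(poll_interval_secs)
        status = get_status()
    return status


# Usage with a fake status source standing in for a real status lookup.
fake_statuses = iter(["READY", "RUNNING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_run(
    lambda: next(fake_statuses),
    deadline_secs=300,
    poll_interval_secs=0,
    sleep=lambda _secs: None,  # skip real sleeping in this demo
)
print(final_status)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without waiting in real time; the caller decides how long a deadline is acceptable.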

JJameEnder

requestTimeoutSecs is a field that describes how long the actor should wait on a request before it times out and either tries again or fails.

The concurrency fields are used for scraping multiple pages at the same time. You don't really have to worry about them if you scrape only one page at a time.
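As a rough illustration of the explanation above, the sketch below builds an input for a one-page crawl and estimates the worst-case time spent waiting on a single URL. The field names `startUrls` and `maxCrawlDepth`, the helper names, and the retry count of 3 are assumptions for illustration; only requestTimeoutSecs, initialConcurrency, and maxConcurrency come from the thread. The bound covers request waiting only, not container startup or any other overhead.

```python
def build_single_page_input(url, request_timeout_secs=60):
    """Build an input dict for a one-page crawl (some field names are assumed)."""
    return {
        "startUrls": [{"url": url}],  # assumed field name
        "maxCrawlDepth": 0,           # assumed name for "Max crawl depth: 0"
        "requestTimeoutSecs": request_timeout_secs,
        # Concurrency does not matter for one page; pinned to 1 for clarity.
        "initialConcurrency": 1,
        "maxConcurrency": 1,
    }


def worst_case_request_secs(request_timeout_secs, max_retries):
    """Upper bound on waiting for one URL: the first attempt plus every retry."""
    return request_timeout_secs * (1 + max_retries)


actor_input = build_single_page_input("https://example.com")

# max_retries=3 is a hypothetical value, not a documented default.
bound = worst_case_request_secs(actor_input["requestTimeoutSecs"], max_retries=3)
print(bound)  # 240
```

Under these assumptions, a 60-second timeout with three retries can already exceed a 2-minute caller limit. Lowering requestTimeoutSecs caps how long a stalled request is allowed to hang, but it does not make a healthy request finish any sooner.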


Apify Discord Mirror
