Apify and Crawlee Official Forum

Updated 3 months ago

requestHandler timed out

Hello,
I have a quite big scraper, it goes over 200k pages and it will take approximately 12 hours, but after 6 hours, for some reason all the requests are getting this requestHandler timed out after 30 seconds.

I don't think increasing the requestHandler timeout will solve it; maybe there is something else wrong that I don't get?
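For reference, the timeout I mean is the requestHandlerTimeoutSecs option on the crawler; raising it would look roughly like this, assuming a PuppeteerCrawler (the value is just an example):
Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // per-request handler timeout in seconds (illustrative value)
    requestHandlerTimeoutSecs: 120,
    requestHandler: async ({ page }) => {
        // ...
    },
});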
15 comments
That is most probably because the website started to block you, because you have burnt through all the proxies, or it is just overloaded.
I would try slower scraping (less concurrency or some delays), something like the sketch below.
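Something along these lines, assuming a PuppeteerCrawler (the numbers are just illustrative):
Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // fewer pages open at once means less load on the target site
    maxConcurrency: 5,
    // also cap the overall request rate as a crude form of delay
    maxRequestsPerMinute: 60,
    requestHandler: async ({ page }) => {
        // ...
    },
});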
The proxy pool is kind of huge because I pay per traffic.
I don't think it blocked a few hundred thousand IPs.
I will try slower.
I'm using incognito windows and each request gets its own proxy, so the IPs repeat very rarely.
Also, maybe try to add a try-catch and take a screenshot in case of a timeout, to check what is happening.
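For example, a rough sketch, assuming the explicit wait on a (placeholder) selector is what times out:
Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // '.results' is a placeholder selector; the explicit wait is what tends to time out
            await page.waitForSelector('.results', { timeout: 15_000 });
            // ... normal extraction logic ...
        } catch (err) {
            // capture what the page looked like at the moment of the timeout
            await page.screenshot({ path: `timeout-${Date.now()}.png`, fullPage: true });
            log.warning(`Timed out on ${request.url}: ${err.message}`);
            throw err; // rethrow so the request is still retried
        }
    },
});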
great idea!
thank you.
will try a bit later, now I'm stuck with a different project
hey, how do I handle these timeout errors? I have tried using a try/catch and nested my entire route handler inside it, but it's still not triggered and I keep getting these timeout errors
Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.
do I need to do this in the postNavigationHooks? the use case is to rotate IPs in my proxy if a request times out
There is an automatic mechanism in Crawlee that stops using proxies that are being blocked, but since your requests end due to a timeout instead of an HTTP status, it might not be triggered.

I suggest you implement your own errorHandler, and in case your request ends due to a timeout during navigation, you may call session.markBad() (see https://docs.apify.com/sdk/js/docs/guides/session-management).

Another thing could be that the mechanism already works, but all the proxies from your proxy pool were already used and blocked, so rotating to new ones doesn't change much.
hey, yeah that's the issue, I have been trying to handle this, the only problem is it's a navigation error and not a request one, so routes won't handle it. The only way I figured I can handle it is in postNavigationHooks, but it only takes the crawling context as an argument and I'm not sure how to check for this specific error where the navigation itself is taking too long
Did you try to set up your own errorHandler?
Plain Text
const crawler = new PuppeteerCrawler({
    // ...
    errorHandler: async ({ page, log }, error) => {
        // ...        
    },
    requestHandler: async ({ session, page }) => {
        // ...
    },
});
no, but I'll check it out, thanks. I have a custom logging solution but not an error handler
just one thing: since crawlee is not exporting TimeoutError anywhere, do I have to manually check for it like error.name === "TimeoutError"? and if I do add it in my errorHandler, crawlee's default settings won't get affected, right? though from the documentation it seems this is specifically exposed to us for explicitly modifying the request object before we retry it
what do you mean by crawlee's default settings? I believe adding the condition for the error name and calling session.markBad() should improve the situation; of course, you can always try to test more possible scenarios.
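Just as a sketch, something like this (adjust to your setup):
Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    errorHandler: async ({ session, request, log }, error) => {
        // TimeoutError isn't exported, so match on the error name as discussed
        if (error.name === 'TimeoutError') {
            log.warning(`Timed out on ${request.url}, marking the session as bad`);
            // tell the session pool this session/proxy is misbehaving so it rotates away
            session?.markBad();
        }
    },
    requestHandler: async ({ page }) => {
        // ...
    },
});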
thanks, this is what I ended up doing, aggressively cycling through proxies. There are still some failed requests, but I'm guessing it's on the proxy's side now