Apify and Crawlee Official Forum


retryOnBlocked with HttpCrawler

Hi, I'm using HttpCrawler to scrape a static list of URLs. When I get a 403 response as a result of a Cloudflare challenge, the request is not retried even though retryOnBlocked: true is set. If I remove retryOnBlocked, however, I see my errorHandler getting invoked and the request is retried. Am I misunderstanding retryOnBlocked?
5 comments
Hi @triGun, can you provide us with a minimal reproducible example?
The errorHandler runs after every failed request; the failedRequestHandler runs only after the maximum number of retries has been exhausted. Perhaps you might want to move some logic from one to the other?
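Roughly, the order looks like this. A minimal sketch (not your code) with a handler that always fails; maxRequestRetries and the URL are arbitrary, just to show when each hook fires:

JavaScript
import { HttpCrawler, log } from "crawlee";

const crawler = new HttpCrawler({
  maxRequestRetries: 2,
  async requestHandler({ request }) {
    // Always fail, to demonstrate the retry lifecycle.
    throw new Error(`Simulated failure for ${request.url}`);
  },
  // Fires after each failed attempt that will still be retried.
  async errorHandler({ request }, error) {
    log.info(`errorHandler: retryCount=${request.retryCount}, error: ${error.message}`);
  },
  // Fires once, after all retries have been exhausted.
  async failedRequestHandler({ request }, error) {
    log.info(`failedRequestHandler: giving up on ${request.url}: ${error.message}`);
  },
});

await crawler.run(["https://example.com"]);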
@Pepa J Not sure if this reproduces it, in my case it lead to the described result:

JavaScript
import { HttpCrawler, log } from "crawlee";

// options, proxyConfiguration, config, and urls are defined elsewhere in my code.
const crawler = new HttpCrawler(
  {
    maxConcurrency: 2,
    maxRequestsPerMinute: 180,
    ...options,
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    retryOnBlocked: true,
    additionalMimeTypes: ["text/plain", "application/pdf"],
    async requestHandler({ pushData, request, response }) {
      await pushData({
        url: request.url,
        statusCode: response.statusCode,
      });
    },
    async failedRequestHandler({ pushData, request, response }) {
      log.error(`Request for URL "${request.url}" failed.`);
      await pushData({
        url: request.url,
        statusCode: response?.statusCode ?? 0,
      });
    },
    async errorHandler({ request }, { message }) {
      log.error(`Request failed with ${message}`);
      if (!request.noRetry) {
        // Exponential backoff with +/- 50% jitter before the next retry.
        const baseWaitTime = Math.pow(2, request.retryCount) * 1000;
        const jitter = baseWaitTime * (Math.random() - 0.5);
        const waitTime = baseWaitTime + jitter;
        await new Promise((resolve) => setTimeout(resolve, waitTime));
      }
    },
  },
  config,
);

await crawler.run(urls);


Nothing really special. I have two proxies in my configuration, one in tier 1 and the second in tier 2.
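For reference, the proxy setup is roughly this (a sketch with placeholder URLs; I'm assuming the tieredProxyUrls option here, not my exact config):

JavaScript
import { ProxyConfiguration } from "crawlee";

// Two tiers of proxies; the URLs are placeholders for my real ones.
const proxyConfiguration = new ProxyConfiguration({
  tieredProxyUrls: [
    ["http://tier1.proxy.example.com:8000"], // tier 1, tried first
    ["http://tier2.proxy.example.com:8000"], // tier 2, escalated to on repeated failures
  ],
});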
It depends on the implementation of the website.

If you hit the captcha even in a regular browser without a proxy, then you cannot pass it with HttpCrawler alone; you may need to use a browser-based solution like PuppeteerCrawler or PlaywrightCrawler (rough sketch below).

If you don't hit the captcha in your browser, it may be about the quality of the proxies you set up - the website serves the captcha only to visitors it finds suspicious (e.g. requests coming from proxy IPs).
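As a rough sketch of what a browser-based setup could look like (placeholder URL, options trimmed; not a drop-in replacement for your config):

JavaScript
import { PlaywrightCrawler, log } from "crawlee";

const crawler = new PlaywrightCrawler({
  // proxyConfiguration, // the same proxy setup can be reused here
  async requestHandler({ page, request, pushData }) {
    // A real browser renders the page, which can pass challenges that plain HTTP requests cannot.
    await pushData({
      url: request.url,
      title: await page.title(),
    });
  },
  async failedRequestHandler({ request }) {
    log.error(`Request for URL "${request.url}" failed.`);
  },
});

await crawler.run(["https://example.com"]);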
Getting a 403 as a result of the request is fine in itself. The problem is that with retryOnBlocked present, the request is not retried with a different proxy tier; it is only retried once I remove that property.