Apify and Crawlee Official Forum


How to retry when hit with 429

When using crawlee-js it works fine, but when using the Python version, 429 responses are not retried. Is there anything I am missing?
I am using BeautifulSoupCrawler. Please help.
10 comments
Is it still an issue? Could you please provide a short code reproduction so we can check it?
Hi, sorry for the delayed reply. Yes, when we get a 429, a 403, or anything else in the 400 range, it is not retried.
Hi, could you please show a code sample? I'm wondering how you configure the crawler (max_request_retries and max_session_rotations), and whether you handle the error cases in some additional way.

Is it possible that when you get a 429 response a re-request is executed, but it happens too quickly and all the re-requests get a 429 status as well?

A 403 response status signals that you have been blocked. I don't think a re-request should be performed in this case; a session change is more appropriate.

A 400 usually signals that the request itself is invalid, so I don't think such requests should be repeated.

In general, if you are encountering 429 responses, it seems to me you should adjust ConcurrencySettings to reduce the aggressiveness of the scraping.
Also, which HTTP client are you using?
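For illustration, here is a rough sketch of less aggressive ConcurrencySettings; the specific values are assumptions for the example and would need tuning against the target site.
Plain Text
from crawlee import ConcurrencySettings

# Assumed example values; tune them for the target site.
concurrency_settings = ConcurrencySettings(
    max_concurrency=2,        # fewer parallel requests
    max_tasks_per_minute=30,  # throttle the overall request rate
)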
@Shine
Yes, please share some code to reproduce the issue, including your scraper configuration.
Also, provide logs or other evidence showing that your requests are not being retried on a 429 response.

Without this information, it’s difficult to assist, as your case seems quite unusual. By default, such requests should be retried automatically.
Hi, the code is below:
Plain Text
from apify import Actor, Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from .routes import router
from crawlee import ConcurrencySettings

async def main() -> None:
    concurrency_settings = ConcurrencySettings(
        max_concurrency=3,
    )
    async with Actor:
        # Create a crawler.
        crawler = BeautifulSoupCrawler(
            request_handler=router,
            max_requests_per_crawl=100,
            max_request_retries=10,
            concurrency_settings=concurrency_settings,
        )

        # Run the crawler with the starting requests.
        await crawler.run(['https://example.com'])
If there is a 403 error and we try again, the page is accessible, so I want to retry on this status code as well.
Yes, it looks like retries are not performed for status codes in the 400-499 range:

https://github.com/apify/crawlee-python/blob/master/src/crawlee/basic_crawler/_basic_crawler.py#L653

I don't think it's supposed to work that way
For now, my workaround is to use
Plain Text
ignore_http_error_status_codes=[403]
and then the request handler fails to find the expected elements, so the retry happens from there.
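For completeness, a minimal sketch of that workaround, assuming a default handler; the '#content' selector and the exception are placeholders, not from the original project. The 403 response is let through via ignore_http_error_status_codes, and raising from the handler sends the request back through the crawler's normal retry logic.
Plain Text
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler(
    max_request_retries=10,
    ignore_http_error_status_codes=[403],  # a 403 no longer fails the request outright
)

@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Placeholder check: on a block page the expected element is missing.
    if context.soup.select_one('#content') is None:
        # Raising here hands the request back to the retry logic (up to max_request_retries).
        raise RuntimeError('Blocked or incomplete page, retrying')
    # ...normal extraction continues here.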
I created an issue for this problem: https://github.com/apify/crawlee-python/issues/756

I'll post here when it's resolved.