I'm getting `Page.goto: Timeout 30000ms exceeded` with the following setup:

```python
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.playwright_crawler import PlaywrightCrawler

user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 60000})
browser_pool = BrowserPool(plugins=[user_plugin])
crawler = PlaywrightCrawler(browser_pool=browser_pool)
```
```
[crawlee.playwright_crawler._playwright_crawler] ERROR Request failed and reached maximum retries
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/crawlee/basic_crawler/_context_pipeline.py", line 65, in __call__
    result = await middleware_instance.__anext__()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/crawlee/playwright_crawler/_playwright_crawler.py", line 260, in _handle_blocked_request
    selector for selector in RETRY_CSS_SELECTORS if (await context.page.query_selector(selector))
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/async_api/_generated.py", line 8064, in query_selector
    await self._impl_obj.query_selector(selector=selector, strict=strict)
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_page.py", line 414, in query_selector
    return await self._main_frame.query_selector(selector, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_frame.py", line 304, in query_selector
    await self._channel.send("querySelector", locals_to_params(locals()))
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/playwright/_impl/_connection.py", line 520, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.query_selector: Execution context was destroyed, most likely because of a navigation
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee.playwright_crawler._playwright_crawler] INFO Error analysis: total_errors=3 unique_errors=1
```
The full Actor code:

```python
from apify import Actor
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]
        proxy_configuration = ProxyConfiguration(proxy_urls=[
            'http://xxx:xxx@xxx:xxxx',
        ])

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 60000})
        browser_pool = BrowserPool(plugins=[user_plugin])

        # Create a crawler.
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=50,
            proxy_configuration=proxy_configuration,
            browser_pool=browser_pool,
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info(f'Scraping {context.request.url}...')

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
```
Try setting `request_handler_timeout` higher, as its default value is 60 seconds. Maybe the problem occurs when there is an interaction with an element and the handler is closed by the timeout.

I tried that, raising both timeouts:

```python
user_plugin = PlaywrightBrowserPlugin(browser_options={"timeout": 600000, 'headless': False})
```

```python
request_handler_timeout=timedelta(minutes=100)
```
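For reference, `request_handler_timeout` is a keyword argument of the crawler's constructor and takes a `datetime.timedelta`; a minimal sketch of where it goes (the value is illustrative):

```python
from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler

# The request handler timeout bounds how long a single handler
# invocation may run; it defaults to 60 seconds.
crawler = PlaywrightCrawler(
    request_handler_timeout=timedelta(minutes=100),
)
```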
Even with both timeouts raised, `Page.goto` still fails after 30 seconds:

```
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
Call log:
  - navigating to "https://apify.com/", waiting until "load"
```
It seems the `timeout` in `browser_options` only affects the opening of the browser, not the page navigation. I ended up setting the navigation timeout in a pre-navigation hook instead:

```python
@crawler.pre_navigation_hook
async def log_navigation_url(context: PlaywrightPreNavigationContext) -> None:
    context.log.info(f'Navigating to {context.request.url} ...')
    context.page.set_default_navigation_timeout(60000)
```
Yes, a `pre_navigation_hook` is currently the only way available to change the navigation timeout.
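Putting the thread's pieces together, a minimal end-to-end sketch (assuming the crawlee Python API used in the snippets above; import paths may differ between crawlee versions, and the hook name, handler body, and timeout values are illustrative):

```python
import asyncio
from datetime import timedelta

from crawlee.playwright_crawler import (
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
    PlaywrightPreNavigationContext,
)


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        # Bounds the total time one request handler invocation may run.
        request_handler_timeout=timedelta(minutes=5),
    )

    # browser_options={"timeout": ...} only applies to the browser launch,
    # so the navigation timeout has to be raised per page, before navigating.
    @crawler.pre_navigation_hook
    async def raise_navigation_timeout(context: PlaywrightPreNavigationContext) -> None:
        context.page.set_default_navigation_timeout(60_000)  # milliseconds

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Scraping {context.request.url}...')

    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())
```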