Apify

Apify and Crawlee Official Forum

Reduce time between "PlaywrightCrawler: Starting the crawler." and the "requestHandler"

My crawler has a long delay between the "PlaywrightCrawler: Starting the crawler." log and the actual request being handled. Could this be related to the time it takes to connect to the proxy server, or could it be something else?
4 comments
You can try profiling your code to see where the bottlenecks are.
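One lightweight way to profile this is to record timestamps at a few points and compare them. Below is a minimal sketch; `createStopwatch` is a hypothetical helper, not part of Crawlee, and the labels are placeholders. It works best as a probe with a single request, since marks from concurrent requests would interleave.

```javascript
// Hypothetical helper (not a Crawlee API): records named timestamps and
// reports the time spent between consecutive marks.
function createStopwatch(now = () => performance.now()) {
    const marks = [];
    return {
        // Store a label together with the current time.
        mark(label) {
            marks.push({ label, at: now() });
        },
        // Return one entry per pair of consecutive marks.
        report() {
            return marks.slice(1).map((m, i) => ({
                from: marks[i].label,
                to: m.label,
                ms: Math.round(m.at - marks[i].at),
            }));
        },
    };
}

// Possible usage (assumed placement, adjust to your crawler):
//   const sw = createStopwatch();
//   sw.mark('before crawler.run()');
//   preNavigationHooks: [async () => sw.mark('preNavigation')],
//   requestHandler: async () => { sw.mark('requestHandler'); /* ... */ },
//   console.table(sw.report());
```

If most of the time sits before the first pre-navigation hook, the delay is likely in browser launch or proxy setup; if it sits between the hook and the handler, it is likely in page navigation.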
What do you mean by "long delay"?

You are using a browser, so the speed is affected by many factors, including the run's memory setting, the loading of page resources, and the time required to render the data.

  • You can increase the run's memory;
  • try setting a suitable "waitUntil" event and using the blockRequests() utility function:
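    preNavigationHooks: [async (
        // Hooks receive the crawling context first, then gotoOptions.
        { blockRequests },
        gotoOptions) => {
        // Optionally block images, fonts, and other static assets:
        // await blockRequests();
        gotoOptions.waitUntil = 'domcontentloaded'; // resolves earlier than 'load'
    }]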

blockRequests:
https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests


  • Finally, you can try the Cheerio crawler for its high performance.
By a long delay I mean that it sometimes seems to take a long time before the crawler actually starts processing the requests. I'm already blocking the static assets (except the JS), because I need JS enabled to render the page.

I'll try domcontentloaded, because with the load option the page sometimes times out, since the website has a lot of scripts to load :/
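A minimal sketch of how that could look in the request handler: navigate with 'domcontentloaded', then wait only for the element that is actually needed instead of the full load event. `waitForContent` and its default timeout are hypothetical; `page.waitForSelector` is the Playwright call doing the work.

```javascript
// Hypothetical helper: after a 'domcontentloaded' navigation, wait only for
// the element the scraper needs and return how long that wait took (ms).
async function waitForContent(page, selector, timeoutMs = 15000) {
    const started = Date.now();
    // Rejects with a timeout error if the selector does not appear in time.
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return Date.now() - started;
}

// Possible usage inside requestHandler (the selector is a placeholder):
//   const waitedMs = await waitForContent(page, '#content');
//   log.info(`Content ready after ${waitedMs} ms`);
```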

One thing I'm not sure about is whether this might also be taking some time: I'm currently using page.route to intercept some of the API requests and extract the bearer token from them, but according to the docs, using page.route disables the browser's HTTP cache. Since the site has a lot of JS to load, I'd like to be able to cache those files, but I haven't managed to achieve that either.
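One possible alternative, sketched below under assumptions, is to read the token passively with page.on('request') instead of page.route. Listening to request events only observes traffic and does not enable routing, so it should not affect caching. The function names and the `urlPart` argument are hypothetical; if the Authorization header turns out to be missing from `request.headers()`, `await request.allHeaders()` may be worth trying.

```javascript
// Pulls the token out of an "Authorization: Bearer <token>" header value.
// Playwright reports header names in lower case.
function extractBearerToken(headers) {
    const value = headers['authorization'] || '';
    const match = /^Bearer\s+(.+)$/i.exec(value);
    return match ? match[1] : null;
}

// Resolves with the first bearer token seen on a request whose URL contains
// `urlPart` (a placeholder for the API path you care about).
function captureBearerToken(page, urlPart) {
    return new Promise((resolve) => {
        const onRequest = (request) => {
            if (!request.url().includes(urlPart)) return;
            const token = extractBearerToken(request.headers());
            if (token) {
                // Stop listening once the token has been found.
                page.removeListener('request', onRequest);
                resolve(token);
            }
        };
        page.on('request', onRequest);
    });
}

// Possible usage: start listening in a preNavigationHook so early requests
// are not missed, then await the promise in the requestHandler.
//   const tokenPromise = captureBearerToken(page, '/api/');
```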

Right now I'm using Firefox instead of Chrome, as it reduces the chances that the website's WAF protection detects the crawler.