Hello Playwright Community,
I'm running into a difficult memory-management issue in a high-volume web crawling application built on Playwright (via Crawlee's PlaywrightCrawler). The application scans and processes thousands of web pages, but memory usage grows sharply after roughly 2000 URLs have been processed.
Here's a brief overview of our Playwright setup:
new PlaywrightCrawler({
    autoscaledPoolOptions: {
        autoscaleIntervalSecs: 5, // How often the pool re-evaluates concurrency
        loggingIntervalSecs: null, // Disable autoscaled-pool status logging
        maxConcurrency: CONFIG.SOURCE_MAX_CONCURRENCY, // here 6
        minConcurrency: CONFIG.SOURCE_MIN_CONCURRENCY, // here 1
    },
    browserPoolOptions: {
        operationTimeoutSecs: 5,
        retireBrowserAfterPageCount: 10, // Retire a browser after it has opened 10 pages
        maxOpenPagesPerBrowser: 5,
        closeInactiveBrowserAfterSecs: 3,
    },
    launchContext: {
        launchOptions: {
            chromiumSandbox: false,
            headless: true,
        },
    },
    requestHandlerTimeoutSecs: 60,
    maxRequestRetries: 3,
    keepAlive: true, // Keep the crawler alive even when all requests are handled; useful for long-running crawls
    retryOnBlocked: false, // Do not automatically retry requests identified as blocked (e.g., by bot detection)
    requestHandler: this.requestHandler.bind(this), // Handles each request
    failedRequestHandler: this.failedRequestHandler.bind(this), // Handles each failed request
})
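For completeness, this is roughly how the crawler is driven; the URLs and batching below are illustrative placeholders rather than our real feed:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({ /* options as shown above */ });

// Because keepAlive is true, run() does not resolve once the queue drains,
// so the same long-lived crawler instance keeps receiving new batches of URLs.
const runPromise = crawler.run();
await crawler.addRequests([
    'https://example.com/page-1',
    'https://example.com/page-2',
]);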
Despite ensuring that pages are closed after each request, memory usage spikes by around 400% (to roughly 800 MB), after which the crawler becomes unresponsive. This is puzzling, since we've taken care to release resources after each page.
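For reference, our per-request handling follows roughly the pattern below; this is a simplified sketch, not the real implementation, and the extraction step is only a placeholder:

import { PlaywrightCrawlingContext } from 'crawlee';

// Simplified sketch of the request handler. Crawlee closes the page itself once
// the handler resolves, so we avoid holding references to the page object or to
// large intermediate results beyond this scope.
async function requestHandler({ request, page, log }: PlaywrightCrawlingContext): Promise<void> {
    log.info(`Processing ${request.url}`);
    const title = await page.title();
    log.info(`Extracted title: ${title}`);
    // The real extraction logic runs here and persists its results elsewhere.
}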
I am looking for insights or suggestions on how to troubleshoot and resolve this memory leak issue. Specifically: