npx playwright install chromium
install chromium headless shell

```typescript
protected async _handleFailedRequestHandler(crawlingContext: Context, error: Error): Promise<void> {
    // Always log the last error regardless if the user provided a failedRequestHandler
    const { id, url, method, uniqueKey } = crawlingContext.request;
    const message = this._getMessageFromError(error, true);

    this.log.error(`Request failed and reached maximum retries. ${message}`, { id, url, method, uniqueKey });

    if (this.failedRequestHandler) {
        await this._tagUserHandlerError(() =>
            this.failedRequestHandler?.(this._augmentContextWithDeprecatedError(crawlingContext, error), error),
        );
    }
}
```
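Note that this internal handler logs the final error and invokes the user's `failedRequestHandler`, but it never rethrows, so the failure does not propagate to the code that awaited `crawler.run()`. A minimal sketch (variable names are hypothetical, the `PlaywrightCrawler` options shown are standard) of surfacing failures to the main flow by collecting them in the handler:

```typescript
import { PlaywrightCrawler } from 'crawlee';

// Failures collected here are visible to the caller after run() resolves.
const failures: { url: string; message: string }[] = [];

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // ... normal per-page logic ...
    },
    failedRequestHandler: async ({ request }, error) => {
        // Record the failure instead of expecting it to bubble up to main.
        failures.push({ url: request.url, message: error.message });
    },
});

async function main(url: string) {
    await crawler.run([url]);
    if (failures.length > 0) {
        // Branch the single-URL flow on the recorded failure.
        console.error('Failed:', failures[0].message);
    }
}
```

Since `run()` resolves only after all requests (including retries) are finished, the `failures` array is complete by the time the caller inspects it.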
Error processing batch:

```
"errorName": "BrowserLaunchError",
"errorMessage": "Failed to launch browser. Please check the following:
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

The original error is available in the `cause` property. Below is the error received when trying to launch a browser:

browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome
```
/home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome
/home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome
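The two paths above point at a version mismatch: the installed Playwright package expects one Chromium revision (e.g. `chromium-1117`) while a different revision is on disk. This typically happens when the `playwright`/`crawlee` dependency is updated without re-running the browser download. A sketch of the usual fix (run in the project directory, as the same user that runs the crawler, so the cache path matches):

```shell
# Check which Playwright version the project resolves to
npx playwright --version

# Re-download the Chromium build matching that version
npx playwright install chromium
```

The revision number in the cache directory (`chromium-NNNN`) should then match the one the launch error asks for.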
headless: false
mode to check how things look, and I've noticed it opens new windows for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer to have new tabs open instead of separate windows.

WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers.
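Regarding the windows-vs-tabs behaviour above, one thing to try (an assumption about the cause, not verified against this setup): each new browser instance in the pool appears as a separate window in headful mode, while pages opened inside the same browser appear as tabs. Raising the per-browser page limit keeps more URLs in one browser. A sketch using standard `PlaywrightCrawler` options:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Allow many pages (tabs) per browser before a new browser (window) is launched
        maxOpenPagesPerBrowser: 20,
        // Retire browsers later, so pages keep reusing the same window
        retireBrowserAfterPageCount: 100,
    },
    launchContext: {
        launchOptions: { headless: false },
    },
    requestHandler: async ({ page }) => { /* ... */ },
});
```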
Are there playwrightUtils features that are Chromium-specific? I'd like to use blockRequests() or similar functionality with Firefox if possible.

```typescript
async processBatch(batch) {
    // requests: {
    //     url: string;
    //     userData: CrawlerUserData;
    // }[]
    const requests = this.generateRequests(batch)
    await this.crawler.addRequests(requests)
    return this.processResults(requests)
}

// ...

async processResults(requests) {
    // ...
    for (const request of requests) {
        const userData = request.userData as CrawlerUserData
        if (userData.error) {
            this.statistics.incrementErrors()
            continue
        }
        if (userData.results) {
            // ...
            await this.saveResults(userData)
        }
    }
    return batchResults
}
```
```typescript
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, log }) => {
    const userData = request.userData as CrawlerUserData
    try {
        await page.waitForLoadState('networkidle', { timeout: 5000 })
        const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)
        await analyzer.analyze(page) // executing callback
        userData.results = analyzer.results
        // Do I need to save the results here?
    } catch (error) {
        // ...
    } finally {
        // Instead of closing the page, reset it for the next use
        await page.evaluate(() => window.stop())
        await page.setContent('<html></html>')
    }
})
```
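On the "Do I need to save the results here?" question: one option (a sketch, assuming the standard Crawlee `Dataset` API; the `results` stand-in below is hypothetical) is to persist results directly inside the handler, rather than relying on `userData` still being readable later in `processResults`:

```typescript
import { Dataset, createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request }) => {
    // ... run the analyzer as in the handler above ...
    const results: unknown = null // stand-in for analyzer.results

    // Persist immediately; the default Dataset survives after the handler returns.
    await Dataset.pushData({ url: request.url, results })
})
```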
processBatch is done, e.g. all batches are added to the requestQueue, and processResults is executed (it does not have any data yet, since userData.results has not been created). So what I want to know is: do I need to move my save-results-to-DB logic into the route handler, or is there some way to stop executing this function, let the route handler run, and then move back to executing processResults?

```
Mar 7 15:20:22 ip-10-249-15-251 web: at async /var/app/current/node_modules/@crawlee/browser-pool/browser-pool.js:274:37 {
Mar 7 15:20:22 ip-10-249-15-251 web: name: 'BrowserLaunchError',
Mar 7 15:20:22 ip-10-249-15-251 web: [cause]: browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1097/chrome-linux/chrome
```
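Setting the launch error aside, the processBatch/processResults timing issue can be reproduced without Crawlee at all. In this standalone sketch (all names hypothetical), enqueueing requests returns immediately, so reading `userData.results` right afterwards finds nothing; results only exist once the queue has drained:

```typescript
type CrawlerUserData = { results?: string[] };
type Req = { url: string; userData: CrawlerUserData };

// Fake crawler: handlers run asynchronously, after addRequests() returns.
class FakeCrawler {
    private pending: Promise<void>[] = [];

    addRequests(reqs: Req[]): void {
        for (const r of reqs) {
            // Simulate the route handler filling userData later.
            this.pending.push(
                Promise.resolve().then(() => {
                    r.userData.results = [`data for ${r.url}`];
                }),
            );
        }
    }

    // Analogue of waiting for the request queue to drain.
    async idle(): Promise<void> {
        await Promise.all(this.pending);
    }
}

export async function demo(): Promise<[string[] | undefined, string[] | undefined]> {
    const crawler = new FakeCrawler();
    const requests: Req[] = [{ url: 'https://example.com', userData: {} }];

    crawler.addRequests(requests);
    const before = requests[0].userData.results; // undefined: handler has not run yet

    await crawler.idle();
    const after = requests[0].userData.results; // populated once the "queue" drained
    return [before, after];
}
```

In Crawlee terms, the "wait" corresponds to letting the crawler finish (e.g. awaiting `crawler.run()` per batch) before reading `userData`, or sidestepping the problem by saving results inside the route handler.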
```
sh-4.2$ pwd
/home/webapp/.cache/ms-playwright
sh-4.2$ ls
chromium-1105  ffmpeg-1009
```
crawlee: 3.7.2
without a caret (^), and underneath it is updated to the newest version? 1.4.1, where the newest is 1.4.2
```typescript
new PlaywrightCrawler({
    autoscaledPoolOptions: {
        autoscaleIntervalSecs: 5,
        loggingIntervalSecs: null,
        maxConcurrency: CONFIG.SOURCE_MAX_CONCURRENCY, // here 6
        minConcurrency: CONFIG.SOURCE_MIN_CONCURRENCY, // here 1
    },
    browserPoolOptions: {
        operationTimeoutSecs: 5,
        retireBrowserAfterPageCount: 10,
        maxOpenPagesPerBrowser: 5,
        closeInactiveBrowserAfterSecs: 3,
    },
    launchContext: {
        launchOptions: {
            chromiumSandbox: false,
            headless: true,
        },
    },
    requestHandlerTimeoutSecs: 60,
    maxRequestRetries: 3,
    keepAlive: true, // Keeps the crawler alive even if all requests are handled; useful for long-running crawls
    retryOnBlocked: false, // Automatically retries a request if it is identified as blocked (e.g., by bot detection)
    requestHandler: this.requestHandler.bind(this), // Function to handle each request
    failedRequestHandler: this.failedRequestHandler.bind(this), // Function to handle each failed request
})
```
`npx playwright install`, as it primarily supports Ubuntu, and I'm working with Amazon Linux on AWS.

I ran `npx playwright install chromium --with-deps --dry-run` to identify the dependencies and tried installing them using `yum` (since Amazon Linux is Fedora-based). `npx playwright install` compatibility with Amazon Linux is problematic.

`npx playwright install`? Any tips or alternative approaches that have worked in a similar setup would be greatly appreciated.

`npx playwright install chromium`
but it's failing.

PlaywrightCrawler in Crawlee, where I need to handle errors from the failedRequestHandler in a single-URL flow scenario. The errors aren't propagating to the main function, which is critical for my use case, as I manage different scenarios based on the content of individual URLs.

failedRequestHandler in Single URL Flow with PlaywrightCrawler

```typescript
// playwright-winston-proxy-logger.ts
import { Log } from 'crawlee'
import type { Logger } from 'winston'

type AdditionalData = Record<string, unknown> | null

export class WinstonLoggerProxy extends Log {
    private logger: Logger

    constructor(logger: Logger) {
        super()
        this.logger = logger
    }

    debug(message: string, data?: AdditionalData): void {
        if (data) this.logger.debug(message, data)
        else this.logger.debug(message)
    }

    info(message: string, data?: AdditionalData): void {
        if (data) this.logger.info(message, data)
        else this.logger.info(message)
    }

    warning(message: string, data?: AdditionalData): void {
        if (data) this.logger.warn(message, data)
        else this.logger.warn(message)
    }

    error(message: string, data?: AdditionalData): void {
        if (data) this.logger.error(message, data)
        else this.logger.error(message)
    }

    exception(exception: Error, message: string, data?: AdditionalData): void {
        if (data) this.logger.error(message, { exception, ...data })
        else this.logger.error(message, { exception })
    }
}
```
```typescript
// ...
private createCrawler = (): PlaywrightCrawler => {
    const loggerCrawler = new WinstonLoggerProxy(
        createLogger({ module: 'PLAYWRIGHT' })
    )

    return new PlaywrightCrawler({
        log: loggerCrawler, // Provide the custom logger proxy
        // ...
```
`<button/>` elements on the page. Here's an overview of my approach:

- `<button/>` elements into an array.
- `<button/>` elements. However, I'm encountering an issue here, as this process also re-adds buttons that were previously clicked.

```typescript
const doesContainAllParts: AlertsProximityAnalyzerComparator<Frame | Page> = async (element) => {
    try {
        const test = element.locator('body');
        const result = await test.count(); // Page closes unexpectedly here
        return result > 0;
    } catch (error) {
        console.error('Error in doesContainAllParts:', error);
        throw error;
    }
};
```
`const result = await test.count()`. Each time this line executes, the page closes, leading to the failure of the operation. The failure surfaces during the `await` operation on the locator instance.
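A guard can at least make the failure explicit rather than letting the await throw. This standalone sketch uses structural stand-ins for Playwright's types so it runs without a browser; `isClosed()` and `locator()` mirror the real `Page` methods, and `safeBodyCount` is a hypothetical helper:

```typescript
// Minimal structural stand-ins for the Playwright types used here.
interface CountableLocator { count(): Promise<number>; }
interface PageLike {
    isClosed(): boolean;
    locator(selector: string): CountableLocator;
}

// Returns 0 instead of throwing when the page has already been closed.
export async function safeBodyCount(page: PageLike): Promise<number> {
    if (page.isClosed()) {
        return 0; // page was closed (e.g. retired by the browser pool) before we could count
    }
    return page.locator('body').count();
}
```

If the count still fails mid-await, a likely cause is something outside the analyzer closing the page while it runs, for example browser-pool retirement or the handler's `finally` block; keeping the whole analysis within the handler's lifetime avoids that race.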