Apify and Crawlee Official Forum

Wojciech
Joined August 28, 2024
I noticed that npx playwright install chromium also installs the Chromium headless shell.

Those now run as separate processes instead of the full Chromium app. I think they use less CPU, but I couldn't find any information about them for Crawlee.
1 comment
As the title says.

I’m currently using Chromium, but it is CPU-heavy.

Killing the browser does not kill the process, so it’s easy to hit 100% CPU usage pretty quickly.

(I’m crawling thousands of websites, looking for different data on each.) I already try to load pure HTML without CSS, images, and other assets. That helped a lot, but the issue is still there.
3 comments
There is a function:
Plain Text
    protected async _handleFailedRequestHandler(crawlingContext: Context, error: Error): Promise<void> {
        // Always log the last error regardless if the user provided a failedRequestHandler
        const { id, url, method, uniqueKey } = crawlingContext.request;
        const message = this._getMessageFromError(error, true);

        this.log.error(`Request failed and reached maximum retries. ${message}`, { id, url, method, uniqueKey });

        if (this.failedRequestHandler) {
            await this._tagUserHandlerError(() =>
                this.failedRequestHandler?.(this._augmentContextWithDeprecatedError(crawlingContext, error), error),
            );
        }
    }


It is triggered once maxRequestRetries is exhausted. How could I override this log message with my own? I don't want to see the whole stack trace in my logs; I just want to log that there was an error and that the details can be found under some ID in the DB.

Should I disable error logs and handle them manually?
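One possible direction, sketched under assumptions: format a short, stack-free message yourself and log it from your own failedRequestHandler. The helper names summarizeFailure and formatFailure and the recordId field are hypothetical, and how the ID is generated and stored in the DB is left to the application. This sketch does not change the message Crawlee logs internally.

```typescript
// Hypothetical helper types: a one-line summary of a failed request.
interface FailureSummary {
    url: string;
    recordId: string; // assumed: ID under which the full details are stored in the DB
    reason: string;
}

function summarizeFailure(url: string, error: Error, recordId: string): FailureSummary {
    // Keep only the first line of the message; browser errors are often multi-line.
    const reason = (error.message ?? '').split('\n')[0].trim() || error.name;
    return { url, recordId, reason };
}

function formatFailure(summary: FailureSummary): string {
    return `Request failed: ${summary.url} (${summary.reason}). Details: record ${summary.recordId}`;
}

// Example usage, e.g. inside a failedRequestHandler:
const line = formatFailure(
    summarizeFailure(
        'https://example.com',
        new Error('Timeout 5000ms exceeded\n    at foo (bar.js:1:1)'),
        'abc123',
    ),
);
console.log(line);
```

The full error, including the stack, would still be written to the DB under the same ID, so nothing is lost by keeping the log line short.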
2 comments
Hey Playwright creators! 👋

I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on:

The Error:
Plain Text
Error processing batch: 
"errorName": "BrowserLaunchError",
"errorMessage": "Failed to launch browser. Please check the following:
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

The original error is available in the `cause` property. Below is the error received when trying to launch a browser:

browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome


The Situation:
  • Playwright is looking for Chromium in /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome
  • But I actually have Chromium installed at /home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome
My Question:
How the hell can I specify which Chromium version Playwright should use? 🤔

I don't want to specify this in ENV, since I want it to work out of the box and use the Chromium version that Playwright installs.

Any help would be greatly appreciated. I'm pulling my hair out over here! 😫

Thanks in advance!
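Each Playwright release expects specific browser revisions, so a mismatch like this usually means the browsers were downloaded by a different Playwright version than the one the app runs with. If pointing at an already installed build is still wanted, Playwright's launch options accept an executablePath. Below is a minimal sketch of choosing the newest chromium-<revision> directory from a directory listing. The helper name pickNewestChromiumDir is hypothetical, and running a revision other than the one Playwright expects is not guaranteed to work.

```typescript
// Hypothetical helper: given the entries of the ms-playwright cache directory,
// return the chromium-<revision> directory with the highest revision number.
function pickNewestChromiumDir(entries: string[]): string | undefined {
    let best: { name: string; revision: number } | undefined;
    for (const name of entries) {
        const match = /^chromium-(\d+)$/.exec(name);
        if (!match) continue; // skip unrelated entries such as ffmpeg-1009
        const revision = Number(match[1]);
        if (!best || revision > best.revision) best = { name, revision };
    }
    return best?.name;
}

// Example with a listing similar to the one in the cache directory:
const chosen = pickNewestChromiumDir(['chromium-1105', 'ffmpeg-1009', 'chromium-1140']);
console.log(chosen); // chromium-1140

// Assumed wiring (not executed here): list the cache directory with fs.readdirSync and pass
//   launchContext: { launchOptions: { executablePath: `${cacheDir}/${chosen}/chrome-linux/chrome` } }
```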
10 comments

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
1 comment
Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with:

  1. New Windows Instead of Tabs
    I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens new windows for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer to have new tabs open instead of separate windows.
  2. Chromium-specific Features in Firefox
    I'm getting this warning when using Firefox:
    Plain Text
    WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers.

    Are there any polyfills or workarounds for the playwrightUtils features that are Chromium-specific? I'd like to use blockRequests() or similar functionality with Firefox if possible.
Any insights or suggestions would be greatly appreciated! Thanks in advance for your help.

#playwright #firefox
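For the second point, one browser-agnostic option is Playwright's own page.route, which is not limited to Chromium. The sketch below aborts requests by resource type. The names shouldBlock and blockHeavyResources and the list of blocked types are assumptions for illustration. The PageLike and RouteLike interfaces only mirror the few Playwright methods used, so the snippet stays self-contained.

```typescript
// Resource types to skip when only the DOM is needed (an assumed, adjustable list).
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'font', 'stylesheet', 'media']);

function shouldBlock(resourceType: string): boolean {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Minimal structural types mirroring the parts of Playwright's Page and Route used here.
interface RouteLike {
    request(): { resourceType(): string };
    abort(): Promise<void>;
    continue(): Promise<void>;
}
interface PageLike {
    route(pattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Registers a handler that aborts blocked resource types and lets everything else through.
async function blockHeavyResources(page: PageLike): Promise<void> {
    await page.route('**/*', async (route) => {
        if (shouldBlock(route.request().resourceType())) {
            await route.abort();
        } else {
            await route.continue();
        }
    });
}
```

In a Crawlee setup this could be called before navigation, for example from a preNavigationHooks entry, so the handler is in place when the page starts loading.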
Hi

I have this code:
Plain Text
  async processBatch(batch){
    // requests: {
    //     url: string;
    //     userData: CrawlerUserData;
    // }[]
    const requests = this.generateRequests(batch)
    await this.crawler.addRequests(requests)

    return this.processResults(requests)
  }
...
  async processResults(requests){
    ...
    for (const request of requests) {
      const userData = request.userData as CrawlerUserData
      if (userData.error) {
        this.statistics.incrementErrors()
        continue
      }

      if (userData.results) {
        ...
        await this.saveResults(userData)
      }
    }

    return batchResults
  }


and this is my route handler:

Plain Text
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, log }) => {
  const userData = request.userData as CrawlerUserData
  try {
    await page.waitForLoadState('networkidle', { timeout: 5000 })

    const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)

    await analyzer.analyze(page) // executing callback

    userData.results = analyzer.results
    // Do I need to save the results here?
  } catch (error) {
    ...
  } finally {
    // Instead of closing the page, reset it for the next use
    await page.evaluate(() => window.stop())
    await page.setContent('<html></html>')
  }
})


The problem is that crawling only starts once all the code in processBatch has run, i.e. all batches have been added to the requestQueue and processResults has already executed (without any data, since userData.results has not been created yet). What I want to know is whether I need to move the logic that saves results to the DB into the route handler, or whether there is some way to pause this function, run the route handler, and then return to processResults.

In a reply I will paste a pseudo-algorithm of what I expect.
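Saving the results directly in the route handler is the simpler option. If processBatch really has to wait, one possible pattern is a small tracker of pending promises that the route handler resolves. This is a sketch only: ResultTracker and its methods are hypothetical names, not part of Crawlee, and the key is assumed to be something unique per request, such as the URL.

```typescript
// Hypothetical helper: lets processBatch await results that the route handler produces later.
class ResultTracker<T> {
    private pending = new Map<string, { resolve: (value: T) => void; reject: (error: Error) => void }>();

    // Call before adding the request to the queue; returns a promise for its result.
    waitFor(key: string): Promise<T> {
        return new Promise<T>((resolve, reject) => {
            this.pending.set(key, { resolve, reject });
        });
    }

    // Call from the route handler once the results are ready.
    resolve(key: string, value: T): void {
        this.pending.get(key)?.resolve(value);
        this.pending.delete(key);
    }

    // Call from the failed-request handler so the waiter does not hang forever.
    reject(key: string, error: Error): void {
        this.pending.get(key)?.reject(error);
        this.pending.delete(key);
    }

    get size(): number {
        return this.pending.size;
    }
}

// Example: the "handler" finishes after the batch code has started waiting.
const tracker = new ResultTracker<string[]>();
const waiting = tracker.waitFor('https://example.com');
setTimeout(() => tracker.resolve('https://example.com', ['alert found']), 10);
waiting.then((results) => console.log(results));
```

With this in place, processBatch would collect one waitFor promise per request, call addRequests, and then await Promise.all on the collected promises before processing the results.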
1 comment
I have a problem with Playwright after restarting my app on AWS EB, as you can see:

Plain Text
Mar  7 15:20:22 ip-10-249-15-251 web: at async /var/app/current/node_modules/@crawlee/browser-pool/browser-pool.js:274:37 {
Mar  7 15:20:22 ip-10-249-15-251 web: name: 'BrowserLaunchError',
Mar  7 15:20:22 ip-10-249-15-251 web: [cause]: browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1097/chrome-linux/chrome

My app is trying to run Chromium from the path above, but I don't have that version installed.

This is what I have:
Plain Text
sh-4.2$ pwd
/home/webapp/.cache/ms-playwright
sh-4.2$ ls
chromium-1105  ffmpeg-1009


I'm not sure why this is happening.

Is this because I'm using the meta package crawlee: 3.7.2 without a caret (^), and underneath it is updated to the newest version?
Also playwright: 1.4.1, where the newest is 1.4.2.
1 comment
Hello,

I'm exploring ways to optimize web crawling speed using Playwright. I'm curious if there's a method to navigate to new URLs without closing and reopening pages each time. Essentially, updating the URL in the address bar and initiating navigation.

Additionally, is there a way to disable the rendering of images, fonts, and stylesheets, assuming I only need access to the DOM? Any insights or tips would be greatly appreciated!
3 comments
Hello Playwright Community,

I am currently experiencing a challenging issue with memory management in a high-volume web crawling application using Playwright. Our application is designed to scan and process thousands of web pages. However, I've noticed a significant increase in memory usage after processing approximately 2000 URLs.

Here's a brief overview of our Playwright setup:

Plain Text
new PlaywrightCrawler({
      autoscaledPoolOptions: {
        autoscaleIntervalSecs: 5, 
        loggingIntervalSecs: null,
        maxConcurrency: CONFIG.SOURCE_MAX_CONCURRENCY, // here 6
        minConcurrency: CONFIG.SOURCE_MIN_CONCURRENCY, // here 1 
      },
      browserPoolOptions: {
        operationTimeoutSecs: 5, 
        retireBrowserAfterPageCount: 10, 
        maxOpenPagesPerBrowser: 5,
        closeInactiveBrowserAfterSecs: 3,
      },
      launchContext: {
        launchOptions: {
          chromiumSandbox: false,
          headless: true, 
        },
      },
      requestHandlerTimeoutSecs: 60, 
      maxRequestRetries: 3,
      keepAlive: true, // Keeps the crawler alive even if all requests are handled; useful for long-running crawls
      retryOnBlocked: false, // Automatically retries a request if it is identified as blocked (e.g., by bot detection)
      requestHandler: this.requestHandler.bind(this), // Function to handle each request
      failedRequestHandler: this.failedRequestHandler.bind(this), // Function to handle each failed request
    })


Despite ensuring that pages are closed after each crawl, the memory usage spikes by around 400% (increasing to roughly 800MB) and then Playwright becomes unresponsive. This behavior is puzzling as we've taken care to manage resources efficiently.

I am looking for insights or suggestions on how to troubleshoot and resolve this memory leak issue. Specifically:
Hello Playwright Community,

I'm facing a challenge with deploying a Playwright crawler on AWS Elastic Beanstalk, which uses Amazon Linux. The main issue arises with npx playwright install, as it primarily supports Ubuntu, and I'm working with Amazon Linux on AWS.

Attempts Made:
  1. I executed npx playwright install chromium --with-deps --dry-run to identify the dependencies and tried installing them using yum (since Amazon Linux is Fedora-based).
  2. I attempted to install Chromium through npm as a workaround. This solution worked locally, but not in Docker with the Amazon Linux image on AWS.
Issue Encountered:
  • The npx playwright install compatibility with Amazon Linux is problematic.
  • The workaround with npm installation of Chromium is not effective in the AWS environment, despite success in a local setup and docker container with amazon linux image.
Request for Assistance:
Has anyone successfully deployed Playwright on AWS Elastic Beanstalk with Amazon Linux? If so, could you share insights or steps on how you managed to resolve the compatibility issues with npx playwright install? Any tips or alternative approaches that have worked in a similar setup would be greatly appreciated.

Relevant documentation: Playwright Library - Browser Downloads

Thank you in advance for any guidance or suggestions you can provide!
Has anyone managed to install Playwright on AWS Elastic Beanstalk?

I tried npx playwright install chromium, but it's failing.
2 comments
Hey everyone 👋,

I'm facing an issue with PlaywrightCrawler in Crawlee where I need to handle errors from the failedRequestHandler in a single URL flow scenario. The errors aren't propagating to the main function, which is critical for my use case as I manage different scenarios based on the content of individual URLs.

I've provided a detailed explanation of my problem and the expected behavior on GitHub. Please take a look for more context and let me know your thoughts or if you have any suggestions.

Here's the link to the discussion: Handling Errors from failedRequestHandler in Single URL Flow with PlaywrightCrawler

Thanks for your help!
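One way to surface such errors in the calling code, sketched under assumptions: let failedRequestHandler record each failure, then rethrow after the crawler run has finished. FailureCollector is a hypothetical helper, not a Crawlee class, and the wiring in the comments is an assumption about how it would be hooked up.

```typescript
// Hypothetical collector: failedRequestHandler records failures, the caller rethrows after the run.
class FailureCollector {
    private failures: { url: string; error: Error }[] = [];

    record(url: string, error: Error): void {
        this.failures.push({ url, error });
    }

    // Throws one combined error if anything failed; otherwise does nothing.
    throwIfAny(): void {
        if (this.failures.length === 0) return;
        const details = this.failures.map((f) => `${f.url}: ${f.error.message}`).join('; ');
        throw new Error(`${this.failures.length} request(s) failed: ${details}`);
    }
}

// Assumed wiring (not executed here):
//   failedRequestHandler: ({ request }, error) => collector.record(request.url, error)
//   await crawler.run([url]);
//   collector.throwIfAny();
const collector = new FailureCollector();
collector.record('https://example.com', new Error('Navigation timed out'));

let propagated = '';
try {
    collector.throwIfAny();
} catch (error) {
    propagated = (error as Error).message;
}
console.log(propagated);
```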
3 comments
Hello, I wonder how to override the default crawler logs. This is how they look:


These logs come from the basic-crawler library: (https://github.com/apify/crawlee/blob/3ffcf56d744ac527ed8d883be3b1a62356a5930c/packages/basic-crawler/src/internals/basic-crawler.ts#L891)

I am using Playwright, and this is how I managed to override the default logs with my custom logger:

Plain Text
//playwright-winston-proxy-logger.ts
import { Log } from 'crawlee'

import type { Logger } from 'winston'

type AdditionalData = Record<string, unknown> | null

export class WinstonLoggerProxy extends Log {
  private logger: Logger

  constructor(logger: Logger) {
    super()
    this.logger = logger
  }

  debug(message: string, data?: AdditionalData): void {
    if (data) {
      this.logger.debug(message, data)
    } else {
      this.logger.debug(message)
    }
  }

  info(message: string, data?: AdditionalData): void {
    if (data) {
      this.logger.info(message, data)
    } else {
      this.logger.info(message)
    }
  }

  warning(message: string, data?: AdditionalData): void {
    if (data) {
      this.logger.warn(message, data)
    } else {
      this.logger.warn(message)
    }
  }

  error(message: string, data?: AdditionalData): void {
    if (data) {
      this.logger.error(message, data)
    } else {
      this.logger.error(message)
    }
  }

  exception(exception: Error, message: string, data?: AdditionalData): void {
    if (data) {
      this.logger.error(message, { exception, ...data })
    } else {
      this.logger.error(message, { exception })
    }
  }
}


And this is how I use it:

Plain Text
...
  private createCrawler = (): PlaywrightCrawler => {
    const loggerCrawler = new WinstonLoggerProxy(
      createLogger({ module: 'PLAYWRIGHT' })
    )

    return new PlaywrightCrawler({
      log: loggerCrawler, // Provide the custom logger proxy
...
6 comments
👋 Hello Playwright Community,

I'm currently working on a Playwright script where I need to handle various data combinations (such as name, username, surname, email, etc.) on a webpage. My challenge is that some of these data elements may be hidden behind interactive elements (like tabs or dialogs).

To tackle this, I devised a strategy that starts by interacting with all visible <button/> elements on the page. Here's an overview of my approach:

  1. Visit the Target Page: Navigate to the page where data needs to be extracted.
  2. Retrieve Visible Buttons: Gather all visible <button/> elements into an array.
  3. Initial Data Scan: Perform a scan for the required data sets.
  4. Interact with Buttons: Sequentially click each button and remove it from the array after clicking.
  5. Rescan for Data: Post-interaction, perform another scan as some buttons might reveal additional content (like dialogs or tabs).
  6. Refresh Button Array: Re-gather all visible <button/> elements. However, I'm encountering an issue here as this process also re-adds buttons that were previously clicked.
I'm seeking insights or suggestions to optimize this process, especially regarding the efficient handling of the button array to avoid redundancies. Any advice or alternative strategies would be greatly appreciated!

Thank you in advance! 🙏
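For the refresh step in the button workflow above, one option is to remember a key for every clicked button and filter each rescan against that set. The sketch below assumes a button can be identified by its id or, failing that, by its text. ButtonInfo, buttonKey, and filterUnclicked are hypothetical names, and the data is plain objects rather than live Playwright locators.

```typescript
// Hypothetical description of a button; in practice these fields could be read via Playwright locators.
interface ButtonInfo {
    text: string;
    id?: string;
    index: number; // position among visible buttons at scan time
}

// Builds a key meant to identify the same button across rescans (assumption: id or text is stable).
function buttonKey(button: ButtonInfo): string {
    return button.id ? `id:${button.id}` : `text:${button.text.trim().toLowerCase()}`;
}

// Returns only the buttons whose key has not been clicked yet.
function filterUnclicked(buttons: ButtonInfo[], clicked: Set<string>): ButtonInfo[] {
    return buttons.filter((button) => !clicked.has(buttonKey(button)));
}

const clicked = new Set<string>();
const firstScan: ButtonInfo[] = [
    { text: 'Details', index: 0 },
    { text: 'Contact', id: 'contact-tab', index: 1 },
];
for (const button of filterUnclicked(firstScan, clicked)) {
    clicked.add(buttonKey(button)); // the real code would click the button here
}

// A rescan finds the old buttons plus one that was newly revealed.
const secondScan: ButtonInfo[] = [...firstScan, { text: 'Show email', index: 2 }];
const remaining = filterUnclicked(secondScan, clicked);
console.log(remaining.map((b) => b.text)); // [ 'Show email' ]
```

Two different buttons with the same text and no id would collide under this key, so a richer key (for example, including the parent element's text) may be needed on some pages.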
Hello fellow developers,

I'm facing a consistent issue with Playwright in the Crawlee library context. Every time I perform an async operation on a locator instance, the page unexpectedly closes.

Here's the simplified code where the issue is evident:

Plain Text
const doesContainAllParts: AlertsProximityAnalyzerComparator<
  Frame | Page
> = async (element) => {
  try {
    const test = element.locator('body');
    const result = await test.count();  // Page closes unexpectedly here

    return result > 0;
  } catch (error) {
    console.error('Error in doesContainAllParts:', error);
    throw error;
  }
};


The issue specifically happens at the line const result = await test.count(). Each time this line executes, the page closes, leading to the failure of the operation.

Some key points:
  • The problem consistently occurs every time this code is executed.
  • I'm using the latest versions of Playwright and Crawlee.
  • The issue seems to be tied to the await operation on the locator instance.
I'm stumped as to why this is happening. Is this a known issue with Playwright or Crawlee, or could there be something wrong with my implementation? Any insights, suggestions, or similar experiences would be incredibly helpful.

Thanks a lot in advance for any assistance!

PS I'm adding a video with settings headless: false to show you how it looks

PPS And here is the discussion on GitHub with more details: https://github.com/apify/crawlee/discussions/2185
10 comments