
Apify and Crawlee Official Forum

Hey Playwright creators! 👋

I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on:

The Error:
Plain Text
Error processing batch: 
"errorName": "BrowserLaunchError",
"errorMessage": "Failed to launch browser. Please check the following:
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

The original error is available in the `cause` property. Below is the error received when trying to launch a browser:

browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome


The Situation:
  • Playwright is looking for Chromium in /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome
  • But I actually have Chromium installed at /home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome
My Question:
How the hell can I specify which Chromium version Playwright should use? 🤔

I don't want to specify this via an environment variable, since I want it to work out of the box and use whichever browser build the installed Playwright version expects.
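For reference, a hedged workaround (not the out-of-the-box behaviour you're after, and it assumes you launch through Crawlee's PlaywrightCrawler) is to point Playwright at the binary that is actually on disk via the executablePath launch option; the more usual fix is to re-run `npx playwright install` so the browser revision matching your installed Playwright version gets downloaded.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Point Playwright at the Chromium build that is actually installed.
            executablePath: '/home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome',
        },
    },
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});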

Any help would be greatly appreciated. I'm pulling my hair out over here! 😫

Thanks in advance!
6 comments
I do not know how to scrape a website that serves both JSON and HTML responses

My scraper needs to:
  1. Send a request and parse a JSON response that contains a list of URLs to enqueue.
  2. Scrape those URLs as HTML, using cheerio or whatever is required to do so (a minimal sketch of one possible setup follows below).
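A minimal sketch with a single CheerioCrawler, assuming the listing endpoint returns JSON with a `urls` array (the property name and the URLs are placeholders; adjust them to the actual payload):

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Allow JSON responses in addition to HTML.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, body, $, crawler, log }) {
        if (request.label === 'LIST') {
            // The listing endpoint returns JSON, so parse the raw body ourselves.
            const data = JSON.parse(body.toString());
            // `data.urls` is a placeholder for wherever the URL array lives.
            await crawler.addRequests(data.urls.map((url) => ({ url, label: 'DETAIL' })));
        } else {
            // Detail pages are HTML, so `$` (cheerio) is available here.
            log.info(`Scraped title: ${$('title').text()}`);
        }
    },
});

await crawler.run([{ url: 'https://example.com/api/list', label: 'LIST' }]);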
1 comment
Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with:

  1. New Windows Instead of Tabs
    I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens new windows for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer to have new tabs open instead of separate windows.
  2. Chromium-specific Features in Firefox
    I'm getting this warning when using Firefox:
    Plain Text
    WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers.

    Are there any polyfills or workarounds for the playwrightUtils features that are Chromium-specific? I'd like to use blockRequests() or similar functionality with Firefox if possible.
Any insights or suggestions would be greatly appreciated! Thanks in advance for your help.
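Regarding point 2: there is no drop-in polyfill I know of, but a possible cross-browser workaround is Playwright's own page.route(), which works in Firefox as well as Chromium. A rough sketch, assuming a Crawlee PlaywrightCrawler (tune the blocked resource types to your needs):

Plain Text
import { firefox } from 'playwright';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    preNavigationHooks: [
        async ({ page }) => {
            // Abort heavy resource types before navigation; route() works in Firefox too.
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                return ['image', 'font', 'media'].includes(type)
                    ? route.abort()
                    : route.continue();
            });
        },
    ],
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});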

#playwright #firefox
Hi, my problem is that crawler.run(['https://keepa.com/#!product/4-B07GS6ZB7T', 'https://keepa.com/#!product/4-B0BZSWWK48']) only scrapes the first URL. I think this is because Crawlee considers them the same URL; if I replace the "#" with a "?" it works. Is there any way to make it work with URLs like this?
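One way around this is to give each request an explicit uniqueKey so the fragment-only difference isn't normalized away during deduplication (there may also be a keepUrlFragment flag on the request options; worth checking the RequestOptions docs). A minimal sketch:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});

// Explicit uniqueKeys keep both URLs in the queue even though they differ
// only in the hash fragment.
await crawler.run([
    { url: 'https://keepa.com/#!product/4-B07GS6ZB7T', uniqueKey: 'keepa-4-B07GS6ZB7T' },
    { url: 'https://keepa.com/#!product/4-B0BZSWWK48', uniqueKey: 'keepa-4-B0BZSWWK48' },
]);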
2 comments

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
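A minimal sketch of wiring a failedRequestHandler into a crawler that uses that router (the './routes.js' import path and the retry count are assumptions):

Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestRetries: 2,
    // Called after a request has exhausted its retries, so the crawl
    // logs the failure and moves on instead of stalling.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
});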
1 comment
Hi, for the last couple of days I have been on a quest to evade detection for a project, and it has proved quite challenging. As I researched the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need special setup or anything?
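For what it's worth, a common Chromium-level mitigation (this assumes a Chromium-based crawl and is separate from whatever fingerprint-suite does or doesn't patch) is forcing WebRTC to use only the proxied connection via a launch flag:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Prevents WebRTC from using non-proxied UDP, a common source of IP leaks.
            args: ['--force-webrtc-ip-handling-policy=disable_non_proxied_udp'],
        },
    },
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});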
1 comment
Checking on this page, Crawlee Playwright is detected as a bot due to CDP.

https://www.browserscan.net/bot-detection

This is a known issue, also discussed on:

https://github.com/berstend/puppeteer-extra/issues/899

Wondering if Crawlee can come up with a solution?
4 comments
Hi

I have this code:
Plain Text
  async processBatch(batch){
    // requests: {
    //     url: string;
    //     userData: CrawlerUserData;
    // }[]
    const requests = this.generateRequests(batch)
    await this.crawler.addRequests(requests)

    return this.processResults(requests)
  }
...
  async processResults(requests){
    ...
    for (const request of requests) {
      const userData = request.userData as CrawlerUserData
      if (userData.error) {
        this.statistics.incrementErrors()
        continue
      }

      if (userData.results) {
        ...
        await this.saveResults(userData)
      }
    }

    return batchResults
  }


and this is my route handler:

Plain Text
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, log }) => {
  const userData = request.userData as CrawlerUserData
  try {
    await page.waitForLoadState('networkidle', { timeout: 5000 })

    const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)

    await analyzer.analyze(page) // executing callback

    userData.results = analyzer.results
    // Do I need to save the results here?
  } catch (error) {
    ...
  } finally {
    // Instead of closing the page, reset it for the next use
    await page.evaluate(() => window.stop())
    await page.setContent('<html></html>')
  }
})


The problem is that the crawling only starts after all the code in processBatch has finished: all batches are added to the request queue and processResults runs immediately, with no data, because userData.results has not been created yet. What I want to know is whether I need to move my save-results-to-DB logic into the route handler, or whether there is a way to pause this function, let the route handler run, and then resume processResults.
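One pattern that might fit here (a sketch, assuming the route handler can push to the default Dataset and that processBatch runs one batch at a time): run the crawler after enqueuing the batch and read the results from the Dataset once run() resolves, instead of reading the mutated request objects.

Plain Text
import { Dataset } from 'crawlee'

// In the route handler, push the results instead of only mutating userData:
//   await Dataset.pushData({ url: request.url, results: analyzer.results })

// In processBatch (method shown in isolation; drop it into your class), run the
// crawler - run() resolves once every added request has been handled - and only
// then read what the handler produced:
async processBatch(batch) {
  const requests = this.generateRequests(batch)
  await this.crawler.run(requests)
  const { items } = await Dataset.getData()
  return this.processResults(items)
}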

In a reply I will paste a pseudo-algorithm of what I expect.
1 comment
When I use Puppeteer and the fingerprint injector with the generator, some redirects make the Firefox/Chromium page get stuck.
After these redirects the page stops logging my interceptors (they just log the URL) and stops responding to resizing.

If I manually create a new page in the same browser and follow the link with the redirects, it's fine.
Without the injector and generator everything works fine too.

I'm not sure I can share those links for certain reasons, but the problem is clearly related to the fingerprint injector.
The main link has pointers to the other links.
Plain Text
import puppeteer from 'puppeteer';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

// Launch Firefox (requires a recent Puppeteer with multi-browser support).
const browser = await puppeteer.launch({
    browser: "firefox",
    headless: false,
    devtools: true,
    args: ['--disable-web-security', '--allow-running-insecure-content'],
});

// Generate a random fingerprint and attach it to a freshly opened page.
const fingerprintGenerator = new FingerprintGenerator();
const fingerprint = fingerprintGenerator.getFingerprint();
const injector = new FingerprintInjector();

const page = await browser.newPage();
await injector.attachFingerprintToPuppeteer(page, fingerprint);
await page.goto("...");
4 comments
Hello all.
I've been trying to build an app that triggers a scraping job when the API is hit.

The initial endpoint hits a Crawlee router which has two handlers: one for scraping the URL-list pages and the other for scraping the details from each detail page. (The URL-list handler also enqueues the next URL-list page back to itself.)

I'm saving the data from each of these scrapes in a key-value store, but I want a way to save all the data in the KV store related to a particular job into a database.
The attached screenshots are the MRE snippets from my code.
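A rough sketch of collecting everything for one job out of the key-value store, assuming the records were stored under keys prefixed with the job id and that `jobId` and `saveToDb` are your own (hypothetical) variables and persistence function:

Plain Text
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// Walk every key in the store and persist the records belonging to this job.
await store.forEachKey(async (key) => {
    if (!key.startsWith(`job-${jobId}-`)) return;
    const record = await store.getValue(key);
    await saveToDb(jobId, key, record);
});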
3 comments
I'm working on a news web crawler and setting purgeOnStart=false so that I don't scrape duplicate news. However, sometimes I get the message "All requests from the queue have been processed, the crawler will shut down." and the crawler doesn't run. Any suggestions to fix this issue?
6 comments
It keeps returning 403 even with a rotating proxy pool.

Source code:
Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import proxy from './proxy_config.js';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [`http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`]
});
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    proxyConfiguration,
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to `./storage/datasets/default` directory.
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the crawling queue.
        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,

    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://nopecha.com/demo/cloudflare']);

// Export the whole dataset to a single file in `./result.csv`.
await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);
3 comments
The default delimiter is "," but I want to use "|" instead. In the DatasetExportToOptions that I can use with the Dataset.exportToCSV method, there is no way to define the delimiter; the options cover other things. Is there another solution for this?
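A workaround sketch: read the dataset items and build the CSV by hand with a "|" delimiter. (No quoting or escaping is done here, so add it if your values can contain "|".)

Plain Text
import { writeFile } from 'node:fs/promises';
import { Dataset } from 'crawlee';

// Read everything from the default dataset and join it with a custom delimiter.
const { items } = await Dataset.getData();
const header = Object.keys(items[0] ?? {});
const lines = [
    header.join('|'),
    ...items.map((item) => header.map((key) => String(item[key] ?? '')).join('|')),
];
await writeFile('./result.csv', lines.join('\n'));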
Is Express better than plain Node with Crawlee? Or is there really not much of a difference?

Are there any shortcomings of Express versus plain Node with Crawlee or the Apify SDK?
2 comments
Hi, I'm new to PuppeteerCrawler. I'm trying to create a simple script to save a webpage as a PDF. For this purpose, I created a new Actor from the Crawlee - Puppeteer - TypeScript template in Apify. This is my main.ts code:
Plain Text
import { Actor } from 'apify';
import { PuppeteerCrawler, Request } from 'crawlee';

await Actor.init();

interface Input {
    urls: Request[];
}

const { urls = ['https://www.google.com/'] } = await Actor.getInput<Input>() ?? {};

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        const pdfFileName = 'testFile';
        const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });

        console.log('pdfFileName: ', pdfFileName);
        console.log('pdfBuffer: ', pdfBuffer);
        
        await Actor.setValue(pdfFileName, pdfBuffer, { contentType: 'application/pdf' });
    },
});

await crawler.addRequests(urls);
await crawler.run();

await Actor.exit();


It seems that Actor.setValue doesn't want to consume the sent PDF buffer. What am I doing wrong?
Thanks
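One guess, offered as an assumption rather than a confirmed fix: recent Puppeteer versions return a Uint8Array rather than a Node Buffer from page.pdf(), and wrapping the result in a Buffer before passing it to Actor.setValue() may help:

Plain Text
// Inside requestHandler:
const pdfData = await page.pdf({ format: 'A4', printBackground: true });
// Wrap explicitly in a Buffer so setValue receives the type it expects.
await Actor.setValue('testFile', Buffer.from(pdfData), { contentType: 'application/pdf' });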
2 comments
Hello there!

Besides reducing the scope of what is being crawled (for example, the number of pages), what can we do to speed up the run?

Any suggestions are welcome; I'm simply curious.
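A few of the usual knobs, sketched with example values (this assumes a Chromium-based PlaywrightCrawler, since Crawlee's blockRequests() helper is Chromium-only): raise concurrency, shorten navigation timeouts, and block heavy resources.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 20,          // upper bound for the autoscaled pool
    navigationTimeoutSecs: 30,   // give up on slow pages sooner
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Skip images and stylesheets to cut page load time (Chromium only).
            await blockRequests({ extraUrlPatterns: ['*.png', '*.jpg', '*.jpeg', '*.css'] });
        },
    ],
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});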
3 comments
Hi everyone! Hope you're all doing well. I have a small question about Crawlee.

My use case is a little simpler than a crawler; I just want to scrape a single URL every few seconds.

To do this, I create a RequestList with just one URL and start the crawler. Sometimes the crawler returns HTTP errors and fails. However, I don't mind, as I'm going to run the crawler again after a few seconds, and I'd prefer the errors to be ignored rather than the requests automatically reclaimed.

Is there a way of doing this?
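A minimal sketch of that behaviour: set maxRequestRetries to 0 so a failed request is never reclaimed, and swallow the failure in failedRequestHandler (CheerioCrawler is used here as an example; the same options apply to the other crawlers):

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 0, // no retries, so failures are not put back in the queue
    failedRequestHandler: async ({ request, log }) => {
        log.warning(`Ignoring failure for ${request.url}`);
    },
    async requestHandler({ $, log }) {
        log.info(`Title: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);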
3 comments
I'm scraping pages of a website as part of a content migration. Some of those pages make a few POST requests to Algolia (3-4 requests) on the client side, and I need to intercept those requests because I need some data that is sent in the request body. One important thing to note is that I don't know which pages make the requests and which don't. Because of that, I need a way to wait for all the external requests FOR EACH PAGE and only start crawling the page HTML after that. That way, if I've waited for all the requests and still haven't intercepted an Algolia request, it means that specific page didn't call Algolia.

I created a solution that seemed to work at first. However, after crawling the pages a few times, I noticed that sometimes the Algolia data wouldn't show up in the dataset for a few pages, even though I could confirm in the browser that those pages do make the Algolia request. So my guess is that it finishes crawling the page HTML before intercepting the Algolia request (?). Ideally, it would only start crawling the HTML AFTER all the external requests have finished. I used Puppeteer because I found addInterceptRequestHandler in the docs, but I could use Playwright if that's easier. Can someone help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long).
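A sketch of one way to do this with Playwright (names like `algoliaBodies` are placeholders): start listening for requests in a preNavigationHook, so calls fired during the initial navigation are captured too, then give the page a grace period before reading the collected bodies.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Attach the listener before navigation so early requests are captured.
            request.userData.algoliaBodies = [];
            page.on('request', (req) => {
                if (req.url().includes('algolia')) {
                    request.userData.algoliaBodies.push(req.postData());
                }
            });
        },
    ],
    async requestHandler({ page, request, pushData }) {
        // Wait for the network to settle so late Algolia calls are captured too.
        await page.waitForLoadState('networkidle', { timeout: 15_000 }).catch(() => {});
        await pushData({ url: request.url, algoliaBodies: request.userData.algoliaBodies });
    },
});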
2 comments
Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error:

Plain Text
azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occured: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

I thought all the browser installation was pre-handled in the image?
3 comments
This is my code to launch the browser in headless: false mode. I manually input the URL and try to pass the CAPTCHA challenges, but the challenges keep failing.
This is the code:

Plain Text
const { launchPuppeteer } = require('crawlee');
const puppeteerExtra = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

// Use the stealth plugin
puppeteerExtra.use(stealthPlugin());

const main = async () => {
    // Launch the browser without running any crawl
    const browser = await launchPuppeteer({
        // !!! You need to specify this option to tell Crawlee to use puppeteer-extra as the launcher !!!
        launcher: puppeteerExtra,
        launchOptions: {
            // Other puppeteer options work as usual
            headless: false,
        },
    });

    // Create and navigate new page 
    console.log('Open target page');
    const page = await browser.newPage();

    // Now you can play around with the browser
    console.log('Browser launched. You can now interact with it.');
    
    // Keep the script running
    await new Promise((resolve) => {
        console.log('Press Ctrl+C to exit.');
    });
};

main().catch(console.error);


This is the test URL
https://www.nivod.cc/
3 comments
Hi all,

I'm scraping a weird website that has file attachment links which rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

everything between serve-file and file changes regularly.

My strategy for dealing with this is to calculate the unique key from the "stable" parts of the URL. Then, when I detect the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL (a sketch of the unique-key part is below).
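A sketch of the unique-key part (`crawler` and `rotatingUrl` are placeholders): derive the key from the origin and the file name after /file/, which are the parts that don't rotate.

Plain Text
// Build a uniqueKey from the stable pieces of the rotating URL.
function stableUniqueKey(rotatingUrl) {
    const { origin, pathname } = new URL(rotatingUrl);
    const stableTail = pathname.split('/file/')[1] ?? pathname;
    return `${origin}/serve-file/.../file/${stableTail}`;
}

await crawler.addRequests([
    { url: rotatingUrl, uniqueKey: stableUniqueKey(rotatingUrl) },
]);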

My question is: if a request has hit its retry limit and has been "blacklisted" from the request queue, how can I remove it so the new URL can be processed?

Thanks!
4 comments
Hi everyone, I'm trying to use the CLI to create a new JavaScript Crawlee + Cheerio project.
However, I get the error:
Plain Text
 Error: EINVAL: invalid argument, mkdir 'C:\Crawlee-latest\Crawlee\my-new-actor\C:'

It seems that it automatically adds a "C:" at the end of my current terminal path, which I believe makes it fail. The installation of the project is not complete, as I'm missing modules (like cheerio, etc.); I only have crawlee and apify in the project's package.json file.


Currently using node@20.17.0 64-bit, apify-cli@0.20.6, apify@3.2.5, crawlee@3.11.2 and Windows 11.
3 comments
Has anybody tried downloading the HTML file of a URL using Crawlee? I was wondering if Crawlee has the capability to download the HTML of a URL. I've just started using Crawlee and I'm really loving the experience.
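Yes, this is doable; a minimal sketch with CheerioCrawler that stores the raw HTML of each page in the default key-value store (storage/key_value_stores/default when running locally):

Plain Text
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

const crawler = new CheerioCrawler({
    async requestHandler({ request, body }) {
        // Key-value store keys only allow a limited character set, so sanitize the URL.
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
        await store.setValue(key, body.toString(), { contentType: 'text/html' });
    },
});

await crawler.run(['https://crawlee.dev']);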
4 comments