
Apify and Crawlee Official Forum

Hey Playwright creators! 👋

I'm running into a frustrating issue with Playwright and Chromium, and I could really use some help. Here's what's going on:

The Error:
Plain Text
Error processing batch: 
"errorName": "BrowserLaunchError",
"errorMessage": "Failed to launch browser. Please check the following:
- Try installing the required dependencies by running `npx playwright install --with-deps` (https://playwright.dev/docs/browsers).

The original error is available in the `cause` property. Below is the error received when trying to launch a browser:

browserType.launchPersistentContext: Executable doesn't exist at /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome


The Situation:
  • Playwright is looking for Chromium in /home/webapp/.cache/ms-playwright/chromium-1117/chrome-linux/chrome
  • But I actually have Chromium installed at /home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome
My Question:
How the hell can I specify which Chromium version Playwright should use? 🤔

I don't want to specify this via an environment variable, since I want it to work out of the box and use whichever browser build the installed Playwright version expects.
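For reference, a hedged workaround (not the out-of-the-box behaviour you're after, and it assumes you launch through Crawlee's PlaywrightCrawler) is to point Playwright at the binary that is actually on disk via the executablePath launch option; the more usual fix is to re-run `npx playwright install` so the browser revision matching your installed Playwright version gets downloaded.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Point Playwright at the Chromium build that is actually installed.
            executablePath: '/home/webapp/.cache/ms-playwright/chromium-1140/chrome-linux/chrome',
        },
    },
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});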

Any help would be greatly appreciated. I'm pulling my hair out over here! 😫

Thanks in advance!
6 comments
I do not know how to scrape a website that serves both JSON and HTML responses

My scraper needs to:
  1. Send a request and parse a JSON response that contains a list of URLs to enqueue.
  2. Scrape those URLs as HTML, using cheerio or whatever is required to do so (a minimal sketch of one possible setup follows below).
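A minimal sketch with a single CheerioCrawler, assuming the listing endpoint returns JSON with a `urls` array (the property name and the URLs are placeholders; adjust them to the actual payload):

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Allow JSON responses in addition to HTML.
    additionalMimeTypes: ['application/json'],
    async requestHandler({ request, body, $, crawler, log }) {
        if (request.label === 'LIST') {
            // The listing endpoint returns JSON, so parse the raw body ourselves.
            const data = JSON.parse(body.toString());
            // `data.urls` is a placeholder for wherever the URL array lives.
            await crawler.addRequests(data.urls.map((url) => ({ url, label: 'DETAIL' })));
        } else {
            // Detail pages are HTML, so `$` (cheerio) is available here.
            log.info(`Scraped title: ${$('title').text()}`);
        }
    },
});

await crawler.run([{ url: 'https://example.com/api/list', label: 'LIST' }]);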
1 comment
Hey Playwright community! I've been using Firefox with Playwright because it uses less CPU, but I've run into a couple of issues I'd love some help with:

  1. New Windows Instead of Tabs
    I'm running Firefox in headless: false mode to check how things look, and I've noticed it opens new windows for each URL instead of opening new tabs. Is there a way to configure this behavior? I'd prefer to have new tabs open instead of separate windows.
  2. Chromium-specific Features in Firefox
    I'm getting this warning when using Firefox:
    Plain Text
    WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers.

    Are there any polyfills or workarounds for the playwrightUtils features that are Chromium-specific? I'd like to use blockRequests() or similar functionality with Firefox if possible.
Any insights or suggestions would be greatly appreciated! Thanks in advance for your help.
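Regarding point 2: there is no drop-in polyfill I know of, but a possible cross-browser workaround is Playwright's own page.route(), which works in Firefox as well as Chromium. A rough sketch, assuming a Crawlee PlaywrightCrawler (tune the blocked resource types to your needs):

Plain Text
import { firefox } from 'playwright';
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: { launcher: firefox },
    preNavigationHooks: [
        async ({ page }) => {
            // Abort heavy resource types before navigation; route() works in Firefox too.
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                return ['image', 'font', 'media'].includes(type)
                    ? route.abort()
                    : route.continue();
            });
        },
    ],
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});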

#playwright #firefox
Hi, my problem is that crawler.run(['https://keepa.com/#!product/4-B07GS6ZB7T', 'https://keepa.com/#!product/4-B0BZSWWK48']) only scrapes the first URL. I think this is because Crawlee considers them the same URL; if I replace the "#" with a "?" it works. Is there any way to make it work with URLs like this?
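One way around this is to give each request an explicit uniqueKey so the fragment-only difference isn't normalized away during deduplication (there may also be a keepUrlFragment flag on the request options; worth checking the RequestOptions docs). A minimal sketch:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});

// Explicit uniqueKeys keep both URLs in the queue even though they differ
// only in the hash fragment.
await crawler.run([
    { url: 'https://keepa.com/#!product/4-B07GS6ZB7T', uniqueKey: 'keepa-4-B07GS6ZB7T' },
    { url: 'https://keepa.com/#!product/4-B0BZSWWK48', uniqueKey: 'keepa-4-B0BZSWWK48' },
]);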
2 comments

Router Class

I recently read a blog post about Playwright web scraping (https://blog.apify.com/playwright-web-scraping/#bonus-routing) and implemented its routing concept in my project. However, I'm encountering an issue with handling failed requests. Currently, when a request fails, the application stalls instead of proceeding to the next request. Do you have any suggestions for implementing a failedRequestHandler to address this problem?
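A minimal sketch of wiring a failedRequestHandler into a crawler that uses that router (the './routes.js' import path and the retry count are assumptions):

Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestRetries: 2,
    // Called after a request has exhausted its retries, so the crawl
    // logs the failure and moves on instead of stalling.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Request ${request.url} failed too many times: ${error.message}`);
    },
});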
1 comment
Hi, for the last couple of days I have been on a quest to evade detection for a project, and it has proved quite challenging. As I researched the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need special setup or anything?
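For what it's worth, a common Chromium-level mitigation (this assumes a Chromium-based crawl and is separate from whatever fingerprint-suite does or doesn't patch) is forcing WebRTC to use only the proxied connection via a launch flag:

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // Prevents WebRTC from using non-proxied UDP, a common source of IP leaks.
            args: ['--force-webrtc-ip-handling-policy=disable_non_proxied_udp'],
        },
    },
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});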
1 comment
Checking on this page, Crawlee Playwright is detected as a bot due to CDP.

https://www.browserscan.net/bot-detection

This is a known issue, also discussed on:

https://github.com/berstend/puppeteer-extra/issues/899

Wondering if Crawlee can come up with a solution?
4 comments
Hi

I have this code:
Plain Text
  async processBatch(batch){
    // requests: {
    //     url: string;
    //     userData: CrawlerUserData;
    // }[]
    const requests = this.generateRequests(batch)
    await this.crawler.addRequests(requests)

    return this.processResults(requests)
  }
...
  async processResults(requests){
    ...
    for (const request of requests) {
      const userData = request.userData as CrawlerUserData
      if (userData.error) {
        this.statistics.incrementErrors()
        continue
      }

      if (userData.results) {
        ...
        await this.saveResults(userData)
      }
    }

    return batchResults
  }


and this is my route handler:

Plain Text
import { createPlaywrightRouter } from 'crawlee'

export const router = createPlaywrightRouter()

router.addDefaultHandler(async ({ page, request, log }) => {
  const userData = request.userData as CrawlerUserData
  try {
    await page.waitForLoadState('networkidle', { timeout: 5000 })

    const analyzer = new AlertsProximityAnalyzer(userData, callbackCheckingIfDataExist)

    await analyzer.analyze(page) // executing callback

    userData.results = analyzer.results
    // Do I need to save the results here?
  } catch (error) {
    ...
  } finally {
    // Instead of closing the page, reset it for the next use
    await page.evaluate(() => window.stop())
    await page.setContent('<html></html>')
  }
})


The problem is that the crawling only starts after all the code in processBatch has finished: all batches are added to the request queue and processResults runs immediately, with no data, because userData.results has not been created yet. What I want to know is whether I need to move my save-results-to-DB logic into the route handler, or whether there is a way to pause this function, let the route handler run, and then resume processResults.
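One pattern that might fit here (a sketch, assuming the route handler can push to the default Dataset and that processBatch runs one batch at a time): run the crawler after enqueuing the batch and read the results from the Dataset once run() resolves, instead of reading the mutated request objects.

Plain Text
import { Dataset } from 'crawlee'

// In the route handler, push the results instead of only mutating userData:
//   await Dataset.pushData({ url: request.url, results: analyzer.results })

// In processBatch (method shown in isolation; drop it into your class), run the
// crawler - run() resolves once every added request has been handled - and only
// then read what the handler produced:
async processBatch(batch) {
  const requests = this.generateRequests(batch)
  await this.crawler.run(requests)
  const { items } = await Dataset.getData()
  return this.processResults(items)
}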

In a reply I will paste a pseudo-algorithm of what I expect.
1 comment
When I use Puppeteer and the fingerprint injector with the generator, some redirects make the Firefox/Chromium page get stuck.
After these redirects the page stops logging my interceptors (they just log the URL) and stops responding to resizing.

If I manually create a new page in the same browser and follow the link with the redirects, it's fine.
Without the injector and generator everything works fine too.

I'm not sure I can share those links for certain reasons, but the problem is clearly related to the fingerprint injector.
The main link has pointers to the other links.
Plain Text
import puppeteer from 'puppeteer';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

// Launch Firefox (requires a recent Puppeteer with multi-browser support).
const browser = await puppeteer.launch({
    browser: "firefox",
    headless: false,
    devtools: true,
    args: ['--disable-web-security', '--allow-running-insecure-content'],
});

// Generate a random fingerprint and attach it to a freshly opened page.
const fingerprintGenerator = new FingerprintGenerator();
const fingerprint = fingerprintGenerator.getFingerprint();
const injector = new FingerprintInjector();

const page = await browser.newPage();
await injector.attachFingerprintToPuppeteer(page, fingerprint);
await page.goto("...");
4 comments
Hello all.
I've been trying to build an app that triggers a scraping job when the API is hit.

The initial endpoint hits a Crawlee router which has two handlers: one for scraping the URL-list pages and the other for scraping the details from each detail page. (The URL-list handler also enqueues the next URL-list page back to itself.)

I'm saving the data from each of these scrapes in a key-value store, but I want a way to save all the data in the KV store related to a particular job into a database.
The attached screenshots are the MRE snippets from my code.
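A rough sketch of collecting everything for one job out of the key-value store, assuming the records were stored under keys prefixed with the job id and that `jobId` and `saveToDb` are your own (hypothetical) variables and persistence function:

Plain Text
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// Walk every key in the store and persist the records belonging to this job.
await store.forEachKey(async (key) => {
    if (!key.startsWith(`job-${jobId}-`)) return;
    const record = await store.getValue(key);
    await saveToDb(jobId, key, record);
});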
3 comments
I'm working on a news web crawler and setting purgeOnStart=false so that I don't scrape duplicate news. However, sometimes I get the message "All requests from the queue have been processed, the crawler will shut down." and the crawler doesn't run. Any suggestions to fix this issue?
6 comments
It keeps returning 403 even with a rotating proxy pool.

Source code:
Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import proxy from './proxy_config.js';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [`http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`]
});
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    proxyConfiguration,
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to `./storage/datasets/default` directory.
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the crawling queue.
        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,

    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://nopecha.com/demo/cloudflare']);

// Export the whole dataset to a single file in `./result.csv`.
await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);
3 comments
The default delimiter is "," but I want to use "|" instead. In the DatasetExportToOptions that I can use with the Dataset.exportToCSV method, there is no way to define the delimiter; the options cover other things. Is there another solution for this?
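A workaround sketch: read the dataset items and build the CSV by hand with a "|" delimiter. (No quoting or escaping is done here, so add it if your values can contain "|".)

Plain Text
import { writeFile } from 'node:fs/promises';
import { Dataset } from 'crawlee';

// Read everything from the default dataset and join it with a custom delimiter.
const { items } = await Dataset.getData();
const header = Object.keys(items[0] ?? {});
const lines = [
    header.join('|'),
    ...items.map((item) => header.map((key) => String(item[key] ?? '')).join('|')),
];
await writeFile('./result.csv', lines.join('\n'));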
Is Express better than plain Node with Crawlee? Or is there really not much of a difference?

Are there any shortcomings of Express versus plain Node with Crawlee or the Apify SDK?
2 comments
Hi, I'm new to PuppeteerCrawler. I'm trying to create a simple script to save a webpage as a PDF. For this purpose, I created a new Actor from the Crawlee - Puppeteer - TypeScript template in Apify. This is my main.ts code:
Plain Text
import { Actor } from 'apify';
import { PuppeteerCrawler, Request } from 'crawlee';

await Actor.init();

interface Input {
    urls: Request[];
}

const { urls = ['https://www.google.com/'] } = await Actor.getInput<Input>() ?? {};

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        const pdfFileName = 'testFile';
        const pdfBuffer = await page.pdf({ format: 'A4', printBackground: true });

        console.log('pdfFileName: ', pdfFileName);
        console.log('pdfBuffer: ', pdfBuffer);
        
        await Actor.setValue(pdfFileName, pdfBuffer, { contentType: 'application/pdf' });
    },
});

await crawler.addRequests(urls);
await crawler.run();

await Actor.exit();


It seems that Actor.setValue doesn't want to consume the sent PDF buffer. What am I doing wrong?
Thanks
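One guess, offered as an assumption rather than a confirmed fix: recent Puppeteer versions return a Uint8Array rather than a Node Buffer from page.pdf(), and wrapping the result in a Buffer before passing it to Actor.setValue() may help:

Plain Text
// Inside requestHandler:
const pdfData = await page.pdf({ format: 'A4', printBackground: true });
// Wrap explicitly in a Buffer so setValue receives the type it expects.
await Actor.setValue('testFile', Buffer.from(pdfData), { contentType: 'application/pdf' });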
2 comments
Hello there!

Besides reducing the scope of what is being crawled (for example, the number of pages), what can we do to speed up the run?

Any suggestions are welcome; I'm simply curious.
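A few of the usual knobs, sketched with example values (this assumes a Chromium-based PlaywrightCrawler, since Crawlee's blockRequests() helper is Chromium-only): raise concurrency, shorten navigation timeouts, and block heavy resources.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 20,          // upper bound for the autoscaled pool
    navigationTimeoutSecs: 30,   // give up on slow pages sooner
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Skip images and stylesheets to cut page load time (Chromium only).
            await blockRequests({ extraUrlPatterns: ['*.png', '*.jpg', '*.jpeg', '*.css'] });
        },
    ],
    async requestHandler({ page, log }) {
        log.info(`Loaded ${page.url()}`);
    },
});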
3 comments
Hi everyone! Hope you're all doing well. I have a small question about Crawlee.

My use case is a little simpler than a crawler; I just want to scrape a single URL every few seconds.

To do this, I create a RequestList with just one URL and start the crawler. Sometimes the crawler returns HTTP errors and fails. However, I don't mind, as I'm going to run the crawler again after a few seconds, and I'd prefer the errors to be ignored rather than the requests automatically reclaimed.

Is there a way of doing this?
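A minimal sketch of that behaviour: set maxRequestRetries to 0 so a failed request is never reclaimed, and swallow the failure in failedRequestHandler (CheerioCrawler is used here as an example; the same options apply to the other crawlers):

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 0, // no retries, so failures are not put back in the queue
    failedRequestHandler: async ({ request, log }) => {
        log.warning(`Ignoring failure for ${request.url}`);
    },
    async requestHandler({ $, log }) {
        log.info(`Title: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);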
3 comments
I'm scraping pages of a website as part of a content migration. Some of those pages make a few POST requests to Algolia (3-4 requests) on the client side, and I need to intercept those requests because I need some data that is sent in the request body. One important thing to note is that I don't know which pages make the requests and which don't. Because of that, I need a way to wait for all the external requests FOR EACH PAGE and only start crawling the page HTML after that. That way, if I've waited for all the requests and still haven't intercepted an Algolia request, it means that specific page didn't call Algolia.

I created a solution that seemed to work at first. However, after crawling the pages a few times, I noticed that sometimes the Algolia data wouldn't show up in the dataset for a few pages, even though I could confirm in the browser that those pages do make the Algolia request. So my guess is that it finishes crawling the page HTML before intercepting the Algolia request (?). Ideally, it would only start crawling the HTML AFTER all the external requests have finished. I used Puppeteer because I found addInterceptRequestHandler in the docs, but I could use Playwright if that's easier. Can someone help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long).
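A sketch of one way to do this with Playwright (names like `algoliaBodies` are placeholders): start listening for requests in a preNavigationHook, so calls fired during the initial navigation are captured too, then give the page a grace period before reading the collected bodies.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Attach the listener before navigation so early requests are captured.
            request.userData.algoliaBodies = [];
            page.on('request', (req) => {
                if (req.url().includes('algolia')) {
                    request.userData.algoliaBodies.push(req.postData());
                }
            });
        },
    ],
    async requestHandler({ page, request, pushData }) {
        // Wait for the network to settle so late Algolia calls are captured too.
        await page.waitForLoadState('networkidle', { timeout: 15_000 }).catch(() => {});
        await pushData({ url: request.url, algoliaBodies: request.userData.algoliaBodies });
    },
});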
2 comments
Hi folks, I pulled the latest revision of the actor-node-playwright-chrome:22 Docker image, but when I tried to run the project I got the "install browsers" Playwright error:

Plain Text
azure:service-bus:receiver:warning [connection-1|streaming:discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24] Abandoning the message with id '656b7051a08b4b759087c40d0ecef687' on the receiver 'discovery-8ffea0b6-f055-c04e-88ae-f31f039f2c24' since an error occured: browserType.launch: Executable doesn't exist at /home/myuser/pw-browsers/chromium-1129/chrome-linux/chrome
╔═════════════════════════════════════════════════════════════════════════╗
║ Looks like Playwright Test or Playwright was just installed or updated. ║
║ Please run the following command to download new browsers:              ║
║                                                                         ║
║     npx playwright install                                              ║
║                                                                         ║
║ <3 Playwright Team                                                      ║
╚═════════════════════════════════════════════════════════════════════════╝

I thought all the browser installation was pre-handled in the image?
3 comments
This is my code to launch the browser in headless: false mode. I manually input the URL and try to pass the CAPTCHA challenges, but the challenges keep failing.
This is the code:

Plain Text
const { launchPuppeteer } = require('crawlee');
const puppeteerExtra = require('puppeteer-extra');
const stealthPlugin = require('puppeteer-extra-plugin-stealth');

// Use the stealth plugin
puppeteerExtra.use(stealthPlugin());

const main = async () => {
    // Launch the browser without running any crawl
    const browser = await launchPuppeteer({
        // !!! You need to specify this option to tell Crawlee to use puppeteer-extra as the launcher !!!
        launcher: puppeteerExtra,
        launchOptions: {
            // Other puppeteer options work as usual
            headless: false,
        },
    });

    // Create and navigate new page 
    console.log('Open target page');
    const page = await browser.newPage();

    // Now you can play around with the browser
    console.log('Browser launched. You can now interact with it.');
    
    // Keep the script running
    await new Promise((resolve) => {
        console.log('Press Ctrl+C to exit.');
    });
};

main().catch(console.error);


This is the test URL
https://www.nivod.cc/
3 comments
Hi all,

I'm scraping a weird website that has file attachment links which rotate every 5 minutes or so, e.g.

https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf

everything between serve-file and file changes regularly.

My strategy for dealing with this is to calculate the unique key from the "stable" parts of the URL. Then, when I detect the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL (a sketch of the unique-key part is below).
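A sketch of the unique-key part (`crawler` and `rotatingUrl` are placeholders): derive the key from the origin and the file name after /file/, which are the parts that don't rotate.

Plain Text
// Build a uniqueKey from the stable pieces of the rotating URL.
function stableUniqueKey(rotatingUrl) {
    const { origin, pathname } = new URL(rotatingUrl);
    const stableTail = pathname.split('/file/')[1] ?? pathname;
    return `${origin}/serve-file/.../file/${stableTail}`;
}

await crawler.addRequests([
    { url: rotatingUrl, uniqueKey: stableUniqueKey(rotatingUrl) },
]);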

My question is: if a request has hit its retry limit and has been "blacklisted" from the request queue, how can I remove it so the new URL can be processed?

Thanks!
4 comments
Hi everyone, I'm trying to use the CLI to create a new JavaScript Crawlee + Cheerio project.
However, I get the error:
Plain Text
 Error: EINVAL: invalid argument, mkdir 'C:\Crawlee-latest\Crawlee\my-new-actor\C:'

It seems that it automatically adds a "C:" at the end of my current terminal path, which I believe makes it fail. The installation of the project is not complete, as I'm missing modules (like cheerio, etc.); I only have crawlee and apify in the project's package.json file.


Currently using node@20.17.0 64-bit, apify-cli@0.20.6, apify@3.2.5, crawlee@3.11.2 and Windows 11.
3 comments
Has anybody tried downloading the HTML file of a URL using Crawlee? I was wondering if Crawlee has the capability to download the HTML of a URL. I've just started using Crawlee and I'm really loving the experience.
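Yes, this is doable; a minimal sketch with CheerioCrawler that stores the raw HTML of each page in the default key-value store (storage/key_value_stores/default when running locally):

Plain Text
import { CheerioCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

const crawler = new CheerioCrawler({
    async requestHandler({ request, body }) {
        // Key-value store keys only allow a limited character set, so sanitize the URL.
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
        await store.setValue(key, body.toString(), { contentType: 'text/html' });
    },
});

await crawler.run(['https://crawlee.dev']);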
4 comments