Apify

Apify and Crawlee Official Forum

Members
LeMoussel
Joined August 30, 2024
With apify push [1], my project is uploaded to the Apify cloud and an Actor is built from it.
In the project directory, there are some files that I don't want to push to the Apify cloud.
Is it possible to specify files that should not be pushed to the Apify cloud, like a .gitignore file?

[1] https://docs.apify.com/cli#push-the-actor-to-the-apify-cloud
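A hedged note rather than a definitive answer: the Apify CLI documentation describes excluding files from the push via the project's .gitignore; this is worth confirming against the apify push docs for the CLI version in use. If it holds, a .gitignore in the project root would look like:

```gitignore
# Example entries to keep local files out of the pushed Actor build
# (assumes the CLI honors .gitignore on push -- verify in the apify push docs)
node_modules/
.env
*.log
local-data/
```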
2 comments
L
A
For some Actors, it's possible to view the source code.
For example, jirimoravcik/gpt2-text-generation (https://apify.com/jirimoravcik/gpt2-text-generation).
But when I click on the Source code tab, I get a blank page => https://apify.com/jirimoravcik/gpt2-text-generation/source-code
Is this normal?
2 comments
L
I suspect that Crawlee performs a breadth-first crawl by default with enqueueLinks(...).
Is there an option to perform a depth-first crawl?
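A conceptual sketch of why the default order is breadth-first, and the knob that flips it: the request queue is processed first-in-first-out, and RequestQueue.addRequest() accepts a forefront option that puts a request at the head of the queue (assumption: confirm availability, and whether enqueueLinks exposes it, for your Crawlee version). The snippet below is plain JavaScript, no Crawlee imports:

```javascript
// A FIFO frontier yields breadth-first order; pushing newly found links to the
// FRONT of the frontier yields depth-first order. In Crawlee, the analogous
// knob is the `forefront` option of RequestQueue.addRequest().

// Tiny in-memory "site": url -> links found on that page.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': [],
  '/a1': [],
  '/a2': [],
};

function crawl(startUrl, { forefront }) {
  const frontier = [startUrl];
  const visited = [];
  while (frontier.length > 0) {
    const url = frontier.shift(); // always process the head of the queue
    visited.push(url);
    const links = site[url] ?? [];
    if (forefront) {
      frontier.unshift(...links); // depth-first: children jump the queue
    } else {
      frontier.push(...links); // breadth-first: children wait their turn
    }
  }
  return visited;
}

console.log(crawl('/', { forefront: false })); // [ '/', '/a', '/b', '/a1', '/a2' ]
console.log(crawl('/', { forefront: true }));  // [ '/', '/a', '/a1', '/a2', '/b' ]
```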
3 comments
t
L
If I run the Example Usage (https://crawlee.dev/api/playwright-crawler#example-usage) with this URL: https://httpbin.org/status/404, I get this output:
2022-10-11 08:52:58.931 WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://httpbin.org/status/404
=========================== logs ===========================
navigating to "https://httpbin.org/status/404", waiting until "load"
============================================================
{"id":"sOcDKee4CooEnLF","url":"https://httpbin.org/status/404","retryCount":1}
2022-10-11 08:53:03.429 ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://httpbin.org/status/404
=========================== logs ===========================
navigating to "https://httpbin.org/status/404", waiting until "load"
============================================================
    at gotoExtended (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\utils\playwright-utils.js:149:17)
    at PlaywrightCrawler._navigationHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\playwright-crawler.js:105:52)
    at PlaywrightCrawler._handleNavigation (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\browser\internals\browser-crawler.js:268:51)
    at async PlaywrightCrawler._runRequestHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\browser\internals\browser-crawler.js:215:17)
    at async PlaywrightCrawler._runRequestHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\playwright-crawler.js:102:9)
....
Is it possible to disable this Crawlee/Playwright log?
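One candidate approach, assuming the crawlee package exports the shared log object and LogLevel enum as shown in its logging docs (verify for the version in use): raise the log threshold so the retry WARN/ERROR lines are suppressed. Configuration-only sketch:

```javascript
import { log, LogLevel } from 'crawlee';

// OFF silences Crawlee's logger entirely; LogLevel.ERROR would keep only errors.
log.setLevel(LogLevel.OFF);
```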
2 comments
t
L
With Playwright, it is possible to set the timeout for every method that accepts a timeout setting using browserContext.setDefaultTimeout(timeout).
If you want a different timeout for navigations than for other methods, for example when simulating slow connection speeds, you can also set browserContext.setDefaultNavigationTimeout(timeout).

How can I do this with Crawlee/Playwright?

Can I use navigationTimeoutSecs? (https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#navigationTimeoutSecs)
1 comment
A
Are there any plans to extend Crawlee logger?
Ref: https://crawlee.dev/api/core/class/Logger

I found this to set the skipTime option:
Plain Text
const Apify = require('apify');

const { utils: { log } } = Apify;

log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});

but it doesn't seem to work. I get: Uncaught TypeError: Cannot read properties of undefined (reading 'log')

My code:
Plain Text
import Apify from 'apify'
const { utils: { log } } = Apify;

log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});

// https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            stealth: true,
            viewport: { width: 600, height: 300 }
        },
    },
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title: ${title} Url: ${request.loadedUrl}`);
    }
});
1 comment
t
I want to monitor all my Crawlee crawlers. To do this, I looked for a dashboard to control the crawlers.
I only found this distributed web crawler management platform: Crawlab (https://github.com/crawlab-team/crawlab).

Do you know of any others?
9 comments
L
t
A
There is the method page.setViewportSize() (https://playwright.dev/docs/api/class-page#page-set-viewport-size)
to resize the Playwright page viewport.

With Crawlee/PlaywrightCrawler, how can I set the size of the browser window?

const crawler = new PlaywrightCrawler({
    // Stop crawling after 5 pages
    maxRequestsPerCrawl: 5,
    // https://crawlee.dev/api/playwright-crawler/interface/PlaywrightLaunchContext
    launchContext: {
        // https://crawlee.dev/api/playwright-crawler/interface/PlaywrightLaunchContext#launchOptions
        launchOptions: {
            stealth: true,
            headless: false,
        },
        ?????
    },
.....
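A hedged sketch of one way to do it: Crawlee's browser crawlers accept preNavigationHooks that receive the Playwright page, so a hook can call page.setViewportSize() before each navigation (assumption: confirm the hook signature for your PlaywrightCrawler version; note this resizes the page viewport, not the OS window). The hook is plain data, so the snippet exercises it with a fake page object instead of a real browser:

```javascript
// Candidate approach: set the viewport in a preNavigationHook.
const crawlerOptions = {
  maxRequestsPerCrawl: 5,
  preNavigationHooks: [
    async ({ page }) => {
      // Real Playwright API on Page; resizes the viewport before navigation.
      await page.setViewportSize({ width: 1280, height: 720 });
    },
  ],
};

// Demonstrate what the hook does using a fake `page`:
const calls = [];
const fakePage = { setViewportSize: async (size) => calls.push(size) };
crawlerOptions.preNavigationHooks[0]({ page: fakePage });
console.log(calls); // [ { width: 1280, height: 720 } ]
```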
1 comment
L
I would like to dynamically add a string (which describes a JS function) to the preNavigationHooks array in CheerioCrawlerOptions [1]

Plain Text
const crawlerOptions = {
...  preNavigationHooks: [],
};
const jsFunction = "async ({ page, request }) => { log.info(`preNavigationHook ${request.url}`); }";
crawlerOptions.preNavigationHooks.push( ??? WHAT ???)
const myCrawler = new CheerioCrawler(crawlerOptions);


If I do crawlerOptions.preNavigationHooks.push(jsFunction);, when I run the crawler, I get this error:
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. TypeError: hook is not a function
    at CheerioCrawler._executeHooks (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\basic\internals\basic-crawler.js:834:23)
    at CheerioCrawler._handleNavigation (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\http\internals\http-crawler.js:326:20)
    at CheerioCrawler._runRequestHandler (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\http\internals\http-crawler.js:286:24)

[1] https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#preNavigationHooks
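The "hook is not a function" error happens because Crawlee calls each hook as a function, and a string is not callable. One sketch of a workaround, assuming the hook source is trusted: evaluate the string into a real function before pushing it. The names (jsFunction, crawlerOptions) mirror the post; log.info is swapped for console.log here so the snippet is self-contained:

```javascript
// CAUTION: eval executes arbitrary code -- only use with trusted strings.
const jsFunction =
  "async ({ request }) => { console.log(`preNavigationHook ${request.url}`); }";

// Wrap in parentheses and use indirect eval so the arrow-function expression
// is evaluated in global scope and returned as a value.
const hook = (0, eval)('(' + jsFunction + ')');

const crawlerOptions = { preNavigationHooks: [] };
crawlerOptions.preNavigationHooks.push(hook); // a real function, not a string

console.log(typeof crawlerOptions.preNavigationHooks[0]); // 'function'
// Smoke-test the hook with a fake crawling context:
hook({ request: { url: 'https://example.com' } });
```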
3 comments
Are there any plans to integrate/use Hero [1] with Crawlee?

[1] https://github.com/ulixee/hero
When I do await crawler.run(['https://crawlee.dev'], { userData: { depth: 0 } });
I get this error:
Uncaught ArgumentError: Did not expect property userData to exist, got [object Object] in object options

How can I set userData in the options?
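A hedged sketch of the likely fix: the second argument of crawler.run() configures the run itself, while userData belongs on each request, so passing request objects instead of bare URL strings should work (assumption: confirm against the crawler run() docs for the Crawlee version in use). Plain data below, no crawler started:

```javascript
// userData rides on each request object rather than on the run options.
const requests = [
  { url: 'https://crawlee.dev', userData: { depth: 0 } },
];

// With a real crawler this would be: await crawler.run(requests);
console.log(requests[0].userData.depth); // 0
```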
4 comments