Apify

Apify and Crawlee Official Forum

Members
LeMoussel
Joined August 30, 2024
With apify push [1], my project is uploaded to the Apify cloud and an Actor is built from it.
In the project directory, there are some files that I don't want to push to the Apify cloud.
Is it possible to specify files that should not be pushed to the Apify cloud, like a .gitignore file?

[1] https://docs.apify.com/cli#push-the-actor-to-the-apify-cloud
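A hedged note rather than a definitive answer: the Apify CLI documentation describes excluding files from the push via the project's .gitignore; this is worth confirming against the apify push docs for the CLI version in use. If it holds, a .gitignore in the project root would look like:

```gitignore
# Example entries to keep local files out of the pushed Actor build
# (assumes the CLI honors .gitignore on push -- verify in the apify push docs)
node_modules/
.env
*.log
local-data/
```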
2 comments
L
A
For some Actors, it's possible to view the source code.
For example, jirimoravcik/gpt2-text-generation (https://apify.com/jirimoravcik/gpt2-text-generation).
But when I click on the Source code tab, I get a blank page => https://apify.com/jirimoravcik/gpt2-text-generation/source-code
Is this normal?
2 comments
L
I suspect that Crawlee performs a breadth-first crawl by default with enqueueLinks(...).
Is there an option to perform a depth-first crawl?
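A conceptual sketch of why the default order is breadth-first, and the knob that flips it: the request queue is processed first-in-first-out, and RequestQueue.addRequest() accepts a forefront option that puts a request at the head of the queue (assumption: confirm availability, and whether enqueueLinks exposes it, for your Crawlee version). The snippet below is plain JavaScript, no Crawlee imports:

```javascript
// A FIFO frontier yields breadth-first order; pushing newly found links to the
// FRONT of the frontier yields depth-first order. In Crawlee, the analogous
// knob is the `forefront` option of RequestQueue.addRequest().

// Tiny in-memory "site": url -> links found on that page.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': [],
  '/a1': [],
  '/a2': [],
};

function crawl(startUrl, { forefront }) {
  const frontier = [startUrl];
  const visited = [];
  while (frontier.length > 0) {
    const url = frontier.shift(); // always process the head of the queue
    visited.push(url);
    const links = site[url] ?? [];
    if (forefront) {
      frontier.unshift(...links); // depth-first: children jump the queue
    } else {
      frontier.push(...links); // breadth-first: children wait their turn
    }
  }
  return visited;
}

console.log(crawl('/', { forefront: false })); // [ '/', '/a', '/b', '/a1', '/a2' ]
console.log(crawl('/', { forefront: true }));  // [ '/', '/a', '/a1', '/a2', '/b' ]
```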
3 comments
t
L
If I run the Example Usage (https://crawlee.dev/api/playwright-crawler#example-usage) with this URL: https://httpbin.org/status/404, I get this output:
2022-10-11 08:52:58.931 WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://httpbin.org/status/404
=========================== logs ===========================
navigating to "https://httpbin.org/status/404", waiting until "load"
============================================================
{"id":"sOcDKee4CooEnLF","url":"https://httpbin.org/status/404","retryCount":1}
2022-10-11 08:53:03.429 ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://httpbin.org/status/404
=========================== logs ===========================
navigating to "https://httpbin.org/status/404", waiting until "load"
============================================================
    at gotoExtended (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\utils\playwright-utils.js:149:17)
    at PlaywrightCrawler._navigationHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\playwright-crawler.js:105:52)
    at PlaywrightCrawler._handleNavigation (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\browser\internals\browser-crawler.js:268:51)
    at async PlaywrightCrawler._runRequestHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\browser\internals\browser-crawler.js:215:17)
    at async PlaywrightCrawler._runRequestHandler (c:\Users\HERNOUX-06523\Desktop\Dev\NodeJS\test-crawlee\node_modules\@crawlee\playwright\internals\playwright-crawler.js:102:9)
....
Is it possible to disable this Crawlee/Playwright log?
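One candidate approach, assuming the crawlee package exports the shared log object and LogLevel enum as shown in its logging docs (verify for the version in use): raise the log threshold so the retry WARN/ERROR lines are suppressed. Configuration-only sketch:

```javascript
import { log, LogLevel } from 'crawlee';

// OFF silences Crawlee's logger entirely; LogLevel.ERROR would keep only errors.
log.setLevel(LogLevel.OFF);
```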
2 comments
t
L
With Playwright, it is possible to set the timeout for every method that accepts a timeout setting using browserContext.setDefaultTimeout(timeout).
If you want a different timeout for navigations than for other methods, for example when simulating slow connection speeds, you can also set browserContext.setDefaultNavigationTimeout(timeout).

How can I do this with Crawlee/Playwright?

Can I use navigationTimeoutSecs? (https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#navigationTimeoutSecs)
1 comment
A
Are there any plans to extend Crawlee logger?
Ref: https://crawlee.dev/api/core/class/Logger

I found this to set the skipTime option:
Plain Text
const Apify = require('apify');

const { utils: { log } } = Apify;

log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});

but it doesn't seem to work. I get: Uncaught TypeError: Cannot read properties of undefined (reading 'log')

My code:
Plain Text
import Apify from 'apify'
const { utils: { log } } = Apify;

log.setOptions({
    logger: new log.LoggerText({ skipTime: false }),
});

// https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            stealth: true,
            viewport: { width: 600, height: 300 }
        },
    },
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title: ${title} Url: ${request.loadedUrl}`);
    }
});
1 comment
t
I want to monitor all my Crawlee crawlers. To do this, I looked for a dashboard to control the crawlers.
I only found this distributed web crawler management platform: Crawlab (https://github.com/crawlab-team/crawlab).

Do you know of any others?
9 comments
L
t
A
There is the method page.setViewportSize() (https://playwright.dev/docs/api/class-page#page-set-viewport-size)
to resize the Playwright page viewport.

With Crawlee/PlaywrightCrawler, how can I set the size of the browser window?

const crawler = new PlaywrightCrawler({
    // Stop crawling after 5 pages
    maxRequestsPerCrawl: 5,
    // https://crawlee.dev/api/playwright-crawler/interface/PlaywrightLaunchContext
    launchContext: {
        // https://crawlee.dev/api/playwright-crawler/interface/PlaywrightLaunchContext#launchOptions
        launchOptions: {
            stealth: true,
            headless: false,
        },
        ?????
    },
.....
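A hedged sketch of one way to do it: Crawlee's browser crawlers accept preNavigationHooks that receive the Playwright page, so a hook can call page.setViewportSize() before each navigation (assumption: confirm the hook signature for your PlaywrightCrawler version; note this resizes the page viewport, not the OS window). The hook is plain data, so the snippet exercises it with a fake page object instead of a real browser:

```javascript
// Candidate approach: set the viewport in a preNavigationHook.
const crawlerOptions = {
  maxRequestsPerCrawl: 5,
  preNavigationHooks: [
    async ({ page }) => {
      // Real Playwright API on Page; resizes the viewport before navigation.
      await page.setViewportSize({ width: 1280, height: 720 });
    },
  ],
};

// Demonstrate what the hook does using a fake `page`:
const calls = [];
const fakePage = { setViewportSize: async (size) => calls.push(size) };
crawlerOptions.preNavigationHooks[0]({ page: fakePage });
console.log(calls); // [ { width: 1280, height: 720 } ]
```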
1 comment
L
I would like to dynamically add a string (which describes a JS function) to the preNavigationHooks array in CheerioCrawlerOptions [1]

Plain Text
const crawlerOptions = {
...  preNavigationHooks: [],
};
const jsFunction = "async ({ page, request }) => { log.info(`preNavigationHook ${request.url}`); }";
crawlerOptions.preNavigationHooks.push( ??? WHAT ???)
const myCrawler = new CheerioCrawler(crawlerOptions);


If I do crawlerOptions.preNavigationHooks.push(jsFunction);, when I run the crawler, I get this error:
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. TypeError: hook is not a function
    at CheerioCrawler._executeHooks (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\basic\internals\basic-crawler.js:834:23)
    at CheerioCrawler._handleNavigation (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\http\internals\http-crawler.js:326:20)
    at CheerioCrawler._runRequestHandler (D:\Developpement\NodeJS\Nowis_Scraper\node_modules\@crawlee\http\internals\http-crawler.js:286:24)

[1] https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#preNavigationHooks
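The "hook is not a function" error happens because Crawlee calls each hook as a function, and a string is not callable. One sketch of a workaround, assuming the hook source is trusted: evaluate the string into a real function before pushing it. The names (jsFunction, crawlerOptions) mirror the post; log.info is swapped for console.log here so the snippet is self-contained:

```javascript
// CAUTION: eval executes arbitrary code -- only use with trusted strings.
const jsFunction =
  "async ({ request }) => { console.log(`preNavigationHook ${request.url}`); }";

// Wrap in parentheses and use indirect eval so the arrow-function expression
// is evaluated in global scope and returned as a value.
const hook = (0, eval)('(' + jsFunction + ')');

const crawlerOptions = { preNavigationHooks: [] };
crawlerOptions.preNavigationHooks.push(hook); // a real function, not a string

console.log(typeof crawlerOptions.preNavigationHooks[0]); // 'function'
// Smoke-test the hook with a fake crawling context:
hook({ request: { url: 'https://example.com' } });
```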
3 comments
Are there any plans to integrate/use Hero [1] with Crawlee?

[1] https://github.com/ulixee/hero
When I do await crawler.run(['https://crawlee.dev'], { userData: { depth: 0 } });
I get this error:
Uncaught ArgumentError: Did not expect property userData to exist, got [object Object] in object options

How can I set userData in the options?
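A hedged sketch of the likely fix: the second argument of crawler.run() configures the run itself, while userData belongs on each request, so passing request objects instead of bare URL strings should work (assumption: confirm against the crawler run() docs for the Crawlee version in use). Plain data below, no crawler started:

```javascript
// userData rides on each request object rather than on the run options.
const requests = [
  { url: 'https://crawlee.dev', userData: { depth: 0 } },
];

// With a real crawler this would be: await crawler.run(requests);
console.log(requests[0].userData.depth); // 0
```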
4 comments