got@14.4.2 that requires Node v20 or higher. When I try to build my Actor on the Apify platform it fails due to Node version 18 not being recent enough. How do I specify the Node version?

await Dataset.pushData(result)
log.info(`Result: ${JSON.stringify(result)}`)
and it is correctly logged out, so I would expect it to be stored.

null and thus is defaulting to epoch 0 (1st Jan 1970).

enqueueLinks to pass some URLs to another handler. This works fine when I do:

await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    strategy: "same-domain",
    userData,
})
transformRequestFunction the label is overridden and it queues the links back to the handler from which it is being queued:

await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    transformRequestFunction: (request) => {
        request.userData = {
            ...userData,
            zoomTarget: request.url.match(zoomRegex)?.[1],
        }
        return request
    },
    strategy: "same-domain",
    userData,
})
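A possible explanation for the snippet above, offered as an assumption rather than confirmed behaviour: Crawlee keeps a request's label in userData.label, so assigning a brand-new object to request.userData inside transformRequestFunction would discard the label that enqueueLinks had already set. The sketch below merges into the existing userData instead. RequestOptionsLike, withZoomTarget and the zoomRegex pattern are hypothetical stand-ins (the original regex is not shown), not Crawlee types.

```typescript
// Simplified stand-in for the object passed to transformRequestFunction
// (hypothetical; not Crawlee's real type).
interface RequestOptionsLike {
    url: string
    userData?: Record<string, unknown>
}

// Hypothetical pattern; the zoomRegex from the question is not shown.
const zoomRegex = /,(\d+)z/

// Adds zoomTarget while keeping whatever is already in request.userData,
// including a label stored there.
function withZoomTarget(
    request: RequestOptionsLike,
    userData: Record<string, unknown>,
): RequestOptionsLike {
    request.userData = {
        ...userData,
        ...request.userData, // spread last so existing keys such as label survive
        zoomTarget: request.url.match(zoomRegex)?.[1],
    }
    return request
}
```

In the question's code this would correspond to spreading request.userData inside the transform function, or re-assigning the label explicitly after replacing userData.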
label being overridden when the only property of request being changed is the zoomTarget?

enqueueLinks with a selector to add more URLs to the queue, with good results - it's nearly instant on pages with around 100 links. I now need to modify the unique ID so I'm looping through the results on the page - even though the links are already loaded, this is very slow. Is there a way to do this faster, while still getting the additional attributes?

const processResults = async (locator: Locator) => {
    const queue: {
        [key: string]: {
            name: string | null
            address: string | null
        }
    } = {}
    for await (const result of await locator.all()) {
        try {
            const resultLinkLocator = result.locator(`a[aria-label]`)
            const addressShortLocator = result.locator(
                `span[aria-hidden]:has-text("·") + span:not([role="img"])`
            )
            const name = await resultLinkLocator.getAttribute(
                "aria-label",
                {
                    timeout: 5_000,
                }
            )
            log.info(`Result name: ${name}`)
            const address = await addressShortLocator.textContent({
                timeout: 5_000,
            })
            const url = await resultLinkLocator.getAttribute("href", {
                timeout: 5_000,
            })
            if (!url) {
                log.info(`No url found for result ${name}`)
                continue
            }
            queue[url] = {
                name,
                address,
            }
        } catch (e: any) {
            log.info(`Error queueing result. Error: ${e}`)
        }
    }
    return queue
}

const urls = Object.keys(linkQueue)
await enqueueLinks({
    label: "PLACE_DETAIL",
    urls: urls,
    transformRequestFunction: (request) => {
        request.uniqueKey = `${linkQueue[request.url].name}|${
            linkQueue[request.url].address ?? location
        }`
        return request
    },
    strategy: "same-domain",
    userData,
})
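One plausible cause of the slowness, stated as an assumption: the loop awaits each result in turn, so every round trip to the browser (and every 5-second timeout on a result with no matching address element) adds up sequentially. A minimal sketch of running the per-result work concurrently follows. processResultsConcurrently and extractOne are hypothetical names; extractOne stands in for the Playwright calls made per result.

```typescript
// Fields collected for each result; mirrors the shape used in the question.
type Entry = { name: string | null; address: string | null }
type Extracted = Entry & { url: string | null }

// `extractOne` is a hypothetical callback standing in for the per-result
// Playwright reads (two getAttribute calls and one textContent call).
async function processResultsConcurrently<T>(
    items: T[],
    extractOne: (item: T) => Promise<Extracted>,
): Promise<Record<string, Entry>> {
    // Start every extraction at once; allSettled keeps a single failure
    // from rejecting the whole batch.
    const settled = await Promise.allSettled(items.map((item) => extractOne(item)))
    const queue: Record<string, Entry> = {}
    for (const outcome of settled) {
        if (outcome.status !== "fulfilled") continue // skip results that threw
        const { url, name, address } = outcome.value
        if (!url) continue // skip results without a link
        queue[url] = { name, address }
    }
    return queue
}
```

With Playwright, items would be the array returned by await locator.all(), and extractOne would wrap the same getAttribute and textContent calls as in the question. Whether this helps depends on where the time is actually spent, so it is worth measuring before and after.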
requestHandlerTimeoutSecs like so:

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    browserPoolOptions: {
        maxOpenPagesPerBrowser: 4,
    },
    requestHandlerTimeoutSecs: 3600,
})
preNavigationHooks to block images, which works for the initial page load but does not block images loaded after a click on the page (i.e. XHR requests). How can these be blocked?

preNavigationHooks: [
    async ({ page, blockRequests }) => {
        await page.setViewportSize({ width: 1920, height: 1080 })
        await blockRequests({
            urlPatterns: [".jpg", ".png", ".gif", ".svg"],
        })
    },
],
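One possible reason, offered as an assumption: blockRequests matches on URL patterns, so images served from URLs without a file extension would not match ".jpg" and similar. A hedged alternative is to register a Playwright route and abort by resource type instead of by URL. In the sketch below, RouteLike and RequestLike are minimal hypothetical types covering only the Playwright members used (request(), resourceType(), abort(), continue()), so the block compiles without importing Playwright; blockImages is a hypothetical helper name.

```typescript
// Minimal structural types for the parts of Playwright's Route and Request
// used here (hypothetical; a real Route satisfies this shape).
interface RequestLike {
    resourceType(): string
}
interface RouteLike {
    request(): RequestLike
    abort(): Promise<void>
    continue(): Promise<void>
}

// Resource types to drop; extend with "media" or "font" if desired.
const BLOCKED_RESOURCE_TYPES = new Set(["image"])

// Route handler: abort blocked resource types, let everything else through.
async function blockImages(route: RouteLike): Promise<void> {
    if (BLOCKED_RESOURCE_TYPES.has(route.request().resourceType())) {
        await route.abort()
    } else {
        await route.continue()
    }
}

// Assumed wiring inside the crawler options from the question:
// preNavigationHooks: [
//     async ({ page }) => {
//         await page.route("**/*", blockImages)
//     },
// ],
```

Because the route stays registered on the page, it would also apply to requests triggered after a click, not only to the initial navigation.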
npm run build on version 3.11.1 of Crawlee, I get the error below:

node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7:29 - error TS2305: Module '"cheerio"' has no exported member 'Element'.

7 import { type Cheerio, type Element } from 'cheerio';
                              ~~~~~~~

Found 1 error in node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7
await crawler.addRequests([ "https://www.foo.bar/page", ])
RequestQueueOperationOptions with addRequests, not the same EnqueueLinksOptions that I can use with enqueueLinks().