Apify and Crawlee Official Forum

cdslash
Offline, last seen 4 months ago
Joined August 28, 2024
I have upgraded to the latest version of Crawlee (v3.11.1), which has a dependency on got@14.4.2 that requires Node v20 or higher. When I try to build my Actor on the Apify platform, the build fails because Node 18 is not recent enough. How do I specify the Node version?
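A minimal sketch of one possible fix, assuming the Actor is built from one of the standard apify/actor-node base images: point the FROM line of the Dockerfile at a Node 20 variant.
Plain Text
# Dockerfile (sketch) – switch the base image to a Node 20 variant
FROM apify/actor-node:20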
1 comment
I have an actor that works perfectly locally - storing output in JSON files in the default dataset - but when I run it on the Apify platform, the run shows no results even though it completes successfully and results were found.

I am attempting to store the results with:
Plain Text
await Dataset.pushData(result)

I have also checked that the object is correctly produced using log.info(`Result: ${JSON.stringify(result)}`); it is logged correctly, so I would expect it to be stored.

Is there any way to debug this, since there are no errors?
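A possible debugging sketch, assuming the run uses the default dataset: open the dataset explicitly after pushing and log its item count, which should show whether the data is landing somewhere other than expected. The result variable below refers to the same object as in the snippet above.
Plain Text
// Sketch: confirm the push actually reaches the default dataset
import { Dataset, log } from "crawlee"

const dataset = await Dataset.open() // default dataset
await dataset.pushData(result)
const info = await dataset.getInfo()
log.info(`Default dataset now contains ${info?.itemCount} items`)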
1 comment
Build duration for all actors is showing an incorrect value in the Apify dashboard. Based on the elapsed time shown, it looks like it's using a start time of 0 or null and thus defaulting to the Unix epoch (1 Jan 1970).
1 comment
I have an actor that returns a simple list of IDs. It's possible that during a run, concurrent processes can overlap and produce duplicate results. Is there any accepted way of avoiding this?

At the most basic level I'd hoped that I could do something simple like using the returned ID as the key in the dataset (i.e. a duplicate result would overwrite the same entry, so no duplicate would be created), but this doesn't seem to work, presumably because each result is actually a separate JSON file in the dataset.

I've also thought about opening the dataset and getting the full list of IDs, then only pushing IDs not present - this could work but adds overhead and also seems to introduce the possibility of race conditions.

So, is there any way to push only unique values to the dataset?
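One possible approach, sketched below under the assumption that all IDs fit in memory and the run happens in a single Node process: keep a shared Set of IDs that have already been pushed and skip duplicates before calling pushData. Because the check-and-add is synchronous, concurrent handlers in the same event loop cannot interleave between the check and the insert.
Plain Text
import { Dataset } from "crawlee"

// Hypothetical helper: push an ID only if it has not been seen in this run
const seenIds = new Set<string>()

async function pushUniqueId(id: string) {
    if (seenIds.has(id)) return // duplicate – skip
    seenIds.add(id) // synchronous, so no race between check and add
    await Dataset.pushData({ id })
}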
1 comment
I am using enqueueLinks to pass some URLs to another handler. This works fine when I do:
Plain Text
await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    strategy: "same-domain",
    userData,
})

but as soon as I add a transformRequestFunction the label is overridden and the links are enqueued back to the handler from which they are being enqueued:
Plain Text
await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    transformRequestFunction: (request) => {
        request.userData = {
            ...userData,
            zoomTarget: request.url.match(zoomRegex)?.[1],
        }
        return request
    },
    strategy: "same-domain",
    userData,
})

Why is the label being overridden when the only property of the request being changed is zoomTarget?
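One likely explanation, offered as an assumption rather than a confirmed answer: enqueueLinks stores the label on each request's userData, so assigning a brand-new object to request.userData inside transformRequestFunction discards it. A sketch of a possible fix is to merge into the existing userData instead of replacing it:
Plain Text
transformRequestFunction: (request) => {
    // Merge into the existing userData (which may already carry the label)
    // instead of replacing the whole object – a sketch, not a confirmed fix
    request.userData = {
        ...request.userData,
        ...userData,
        zoomTarget: request.url.match(zoomRegex)?.[1],
    }
    return request
},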
5 comments
I have previously used enqueueLinks with a selector to add more URLs to the queue, with good results - it's nearly instant on pages with around 100 links. I now need to modify the unique key, so I'm looping through the results on the page - even though the links are already loaded, this is very slow. Is there a way to do this faster while still getting the additional attributes?

Plain Text
const processResults = async (locator: Locator) => {
  const queue: {
      [key: string]: {
          name: string | null
          address: string | null
      }
  } = {}
  for await (const result of await locator.all()) {
      try {
          const resultLinkLocator = result.locator(`a[aria-label]`)
          const addressShortLocator = result.locator(
              `span[aria-hidden]:has-text("·") + span:not([role="img"])`
          )
          const name = await resultLinkLocator.getAttribute(
              "aria-label",
              {
                  timeout: 5_000,
              }
          )
          log.info(`Result name: ${name}`)
          const address = await addressShortLocator.textContent({
              timeout: 5_000,
          })
          const url = await resultLinkLocator.getAttribute("href", {
              timeout: 5_000,
          })

          if (!url) {
              log.info(`No url found for result ${name}`)
              continue
          }
          queue[url] = {
              name,
              address,
          }
      } catch (e: any) {
          log.info(`Error queueing result. Error: ${e}`)
      }
  }
  return queue
}
const urls = Object.keys(linkQueue)
await enqueueLinks({
    label: "PLACE_DETAIL",
    urls: urls,
    transformRequestFunction: (request) => {
        request.uniqueKey = `${linkQueue[request.url].name}|${
            linkQueue[request.url].address ?? location
        }`
        return request
    },
    strategy: "same-domain",
    userData,
})
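A sketch of one way this might be sped up, assuming the same locator and DOM structure: gather all names, addresses and hrefs in a single locator.evaluateAll() round trip instead of awaiting each attribute separately. Note that the selector inside evaluateAll runs in the browser, so the Playwright-only :has-text() pseudo-class is dropped here.
Plain Text
// Sketch: one protocol round trip for all results instead of three per result
const entries = await locator.evaluateAll((results) =>
    results.map((el) => {
        const link = el.querySelector("a[aria-label]")
        return {
            url: link?.getAttribute("href") ?? null,
            name: link?.getAttribute("aria-label") ?? null,
            // simplified selector – the :has-text() part is Playwright-only
            address:
                el.querySelector('span[aria-hidden] + span:not([role="img"])')
                    ?.textContent ?? null,
        }
    })
)

const linkQueue: Record<string, { name: string | null; address: string | null }> = {}
for (const { url, name, address } of entries) {
    if (!url) continue // skip results without a link
    linkQueue[url] = { name, address }
}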
1 comment
I can set an overall timeout for a crawler with requestHandlerTimeoutSecs like so:
Plain Text
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    browserPoolOptions: {
        maxOpenPagesPerBrowser: 4,
    },
    requestHandlerTimeoutSecs: 3600,
})


Is there a way to set a timeout per page, rather than an overall timeout? E.g. if an exception is caught and the crawl then hangs, I want to time out after 30 seconds on that page, but have a much longer timeout for the run overall.
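A sketch of one possible approach, assuming Playwright's own per-page timeouts are sufficient: set page.setDefaultTimeout() and page.setDefaultNavigationTimeout() in a preNavigationHook so individual page operations fail after 30 seconds, while keeping the long requestHandlerTimeoutSecs for the handler as a whole.
Plain Text
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    requestHandlerTimeoutSecs: 3600, // overall limit per request handler
    preNavigationHooks: [
        async ({ page }) => {
            page.setDefaultTimeout(30_000) // locator actions, clicks, waits
            page.setDefaultNavigationTimeout(30_000) // goto / navigations
        },
    ],
})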
1 comment
I am using preNavigationHooks to block images, which works for the initial page load but does not block images loaded after a click on the page (i.e. XHR requests). How can these be blocked?

Plain Text
preNavigationHooks: [
    async ({ page, blockRequests }) => {
        await page.setViewportSize({ width: 1920, height: 1080 })
        await blockRequests({
            urlPatterns: [".jpg", ".png", ".gif", ".svg"],
        })
    },
],
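A sketch of an alternative, assuming Playwright request interception is acceptable here: page.route() intercepts every matching request the page makes, including those triggered by clicks or XHR after the initial load.
Plain Text
preNavigationHooks: [
    async ({ page }) => {
        await page.setViewportSize({ width: 1920, height: 1080 })
        // Abort image requests whenever they are made, not only on first load
        await page.route("**/*.{jpg,jpeg,png,gif,svg}", (route) => route.abort())
    },
],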
2 comments
After running npm run build on version 3.11.1 of Crawlee, I get the error below:

Plain Text
node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7:29 - error TS2305: Module '"cheerio"' has no exported member 'Element'.

7 import { type Cheerio, type Element } from 'cheerio';
                              ~~~~~~~


Found 1 error in node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7

I have seen other posts in channels here with the same issue, but there doesn't seem to be a known workaround or solution. Can anyone advise what has broken in the latest version? This error seems to be coming from the package itself, not from user code. I am not using any Cheerio crawling.
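A possible stopgap, offered only under the assumption that the error stems from a mismatch between the cheerio typings Crawlee expects and the version installed: enable skipLibCheck so TypeScript stops type-checking declaration files in node_modules, which suppresses errors originating from dependency .d.ts files.
Plain Text
// tsconfig.json (sketch) – suppresses type errors coming from node_modules
{
    "compilerOptions": {
        "skipLibCheck": true
    }
}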
6 comments
I am adding a page as the initial crawl target, but would like to add a label to ensure it routes to the correct processor. Is there a way to do this?
Plain Text
await crawler.addRequests([
    "https://www.foo.bar/page",
])

It seems I can only add RequestQueueOperationOptions with addRequests, not the same EnqueueLinksOptions that I can use with enqueueLinks().
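A sketch of one possible way to do this, assuming addRequests accepts request objects as well as plain URL strings: set the label directly on each request object. The label name below is hypothetical.
Plain Text
await crawler.addRequests([
    { url: "https://www.foo.bar/page", label: "PAGE_DETAIL" }, // hypothetical label
])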
2 comments