Apify and Crawlee Official Forum

cdslash
Offline, last seen 4 months ago
Joined August 28, 2024
I have upgraded to the latest version of Crawlee (v3.11.1), which has a dependency on got@14.4.2 that requires Node v20 or higher. When I try to build my Actor on the Apify platform, the build fails because Node 18 is not recent enough. How do I specify the Node version?
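A minimal sketch of one possible fix, assuming the Actor is built from one of the standard apify/actor-node base images: point the FROM line of the Dockerfile at a Node 20 variant.
Plain Text
# Dockerfile (sketch) – switch the base image to a Node 20 variant
FROM apify/actor-node:20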
1 comment
I have an actor that works perfectly locally - storing output in JSON files in the default dataset - but when I run it on the Apify platform, the run shows no results even though it completes successfully and results were found.

I am attempting to store the results with:
Plain Text
await Dataset.pushData(result)

I have also checked that the object is correctly produced using log.info(`Result: ${JSON.stringify(result)}`); it is logged correctly, so I would expect it to be stored.

Is there any way to debug this, since there are no errors?
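A possible debugging sketch, assuming the run uses the default dataset: open the dataset explicitly after pushing and log its item count, which should show whether the data is landing somewhere other than expected. The result variable below refers to the same object as in the snippet above.
Plain Text
// Sketch: confirm the push actually reaches the default dataset
import { Dataset, log } from "crawlee"

const dataset = await Dataset.open() // default dataset
await dataset.pushData(result)
const info = await dataset.getInfo()
log.info(`Default dataset now contains ${info?.itemCount} items`)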
1 comment
Build duration for all actors is showing an incorrect value in the Apify dashboard. Based on the elapsed time shown, it looks like it's using a start time of 0 or null and thus defaulting to the Unix epoch (1 Jan 1970).
1 comment
I have an actor that returns a simple list of IDs. It's possible that during a run, concurrent processes can overlap and produce duplicate results. Is there any accepted way of avoiding this?

At the most basic level I'd hoped that I could do something simple like using the returned ID as the key in the dataset (i.e. a duplicate result would overwrite the same entry, so no duplicate would be created), but this doesn't seem to work, presumably because each result is actually a separate JSON file in the dataset.

I've also thought about opening the dataset and getting the full list of IDs, then only pushing IDs not present - this could work but adds overhead and also seems to introduce the possibility of race conditions.

So, is there any way to push only unique values to the dataset?
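One possible approach, sketched below under the assumption that all IDs fit in memory and the run happens in a single Node process: keep a shared Set of IDs that have already been pushed and skip duplicates before calling pushData. Because the check-and-add is synchronous, concurrent handlers in the same event loop cannot interleave between the check and the insert.
Plain Text
import { Dataset } from "crawlee"

// Hypothetical helper: push an ID only if it has not been seen in this run
const seenIds = new Set<string>()

async function pushUniqueId(id: string) {
    if (seenIds.has(id)) return // duplicate – skip
    seenIds.add(id) // synchronous, so no race between check and add
    await Dataset.pushData({ id })
}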
1 comment
I am using enqueueLinks to pass some URLs to another handler. This works fine when I do:
Plain Text
await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    strategy: "same-domain",
    userData,
})

but as soon as I add a transformRequestFunction the label is overridden and the links are enqueued back to the handler from which they are being enqueued:
Plain Text
await enqueueLinks({
    label: "LIST_PLACES",
    urls: searchSquareLinks,
    transformRequestFunction: (request) => {
        request.userData = {
            ...userData,
            zoomTarget: request.url.match(zoomRegex)?.[1],
        }
        return request
    },
    strategy: "same-domain",
    userData,
})

Why is the label being overridden when the only property of the request being changed is zoomTarget?
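One likely explanation, offered as an assumption rather than a confirmed answer: enqueueLinks stores the label on each request's userData, so assigning a brand-new object to request.userData inside transformRequestFunction discards it. A sketch of a possible fix is to merge into the existing userData instead of replacing it:
Plain Text
transformRequestFunction: (request) => {
    // Merge into the existing userData (which may already carry the label)
    // instead of replacing the whole object – a sketch, not a confirmed fix
    request.userData = {
        ...request.userData,
        ...userData,
        zoomTarget: request.url.match(zoomRegex)?.[1],
    }
    return request
},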
5 comments
I have previously used enqueueLinks with a selector to add more URLs to the queue, with good results - it's nearly instant on pages with around 100 links. I now need to modify the unique key, so I'm looping through the results on the page - even though the links are already loaded, this is very slow. Is there a way to do this faster while still getting the additional attributes?

Plain Text
const processResults = async (locator: Locator) => {
  const queue: {
      [key: string]: {
          name: string | null
          address: string | null
      }
  } = {}
  for await (const result of await locator.all()) {
      try {
          const resultLinkLocator = result.locator(`a[aria-label]`)
          const addressShortLocator = result.locator(
              `span[aria-hidden]:has-text("·") + span:not([role="img"])`
          )
          const name = await resultLinkLocator.getAttribute(
              "aria-label",
              {
                  timeout: 5_000,
              }
          )
          log.info(`Result name: ${name}`)
          const address = await addressShortLocator.textContent({
              timeout: 5_000,
          })
          const url = await resultLinkLocator.getAttribute("href", {
              timeout: 5_000,
          })

          if (!url) {
              log.info(`No url found for result ${name}`)
              continue
          }
          queue[url] = {
              name,
              address,
          }
      } catch (e: any) {
          log.info(`Error queueing result. Error: ${e}`)
      }
  }
  return queue
}
const urls = Object.keys(linkQueue)
await enqueueLinks({
    label: "PLACE_DETAIL",
    urls: urls,
    transformRequestFunction: (request) => {
        request.uniqueKey = `${linkQueue[request.url].name}|${
            linkQueue[request.url].address ?? location
        }`
        return request
    },
    strategy: "same-domain",
    userData,
})
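A sketch of one way this might be sped up, assuming the same locator and DOM structure: gather all names, addresses and hrefs in a single locator.evaluateAll() round trip instead of awaiting each attribute separately. Note that the selector inside evaluateAll runs in the browser, so the Playwright-only :has-text() pseudo-class is dropped here.
Plain Text
// Sketch: one protocol round trip for all results instead of three per result
const entries = await locator.evaluateAll((results) =>
    results.map((el) => {
        const link = el.querySelector("a[aria-label]")
        return {
            url: link?.getAttribute("href") ?? null,
            name: link?.getAttribute("aria-label") ?? null,
            // simplified selector – the :has-text() part is Playwright-only
            address:
                el.querySelector('span[aria-hidden] + span:not([role="img"])')
                    ?.textContent ?? null,
        }
    })
)

const linkQueue: Record<string, { name: string | null; address: string | null }> = {}
for (const { url, name, address } of entries) {
    if (!url) continue // skip results without a link
    linkQueue[url] = { name, address }
}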
1 comment
I can set an overall timeout for a crawler with requestHandlerTimeoutSecs like so:
Plain Text
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    browserPoolOptions: {
        maxOpenPagesPerBrowser: 4,
    },
    requestHandlerTimeoutSecs: 3600,
})


Is there a way to set a timeout per page, rather than an overall timeout? E.g. if an exception is caught and the crawl then hangs, I want to time out after 30 seconds on that page, but have a much longer timeout for the run overall.
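A sketch of one possible approach, assuming Playwright's own per-page timeouts are sufficient: set page.setDefaultTimeout() and page.setDefaultNavigationTimeout() in a preNavigationHook so individual page operations fail after 30 seconds, while keeping the long requestHandlerTimeoutSecs for the handler as a whole.
Plain Text
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    requestHandlerTimeoutSecs: 3600, // overall limit per request handler
    preNavigationHooks: [
        async ({ page }) => {
            page.setDefaultTimeout(30_000) // locator actions, clicks, waits
            page.setDefaultNavigationTimeout(30_000) // goto / navigations
        },
    ],
})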
1 comment
I am using preNavigationHooks to block images, which works for the initial page load but does not block images loaded after a click on the page (i.e. XHR requests). How can these be blocked?

Plain Text
preNavigationHooks: [
    async ({ page, blockRequests }) => {
        await page.setViewportSize({ width: 1920, height: 1080 })
        await blockRequests({
            urlPatterns: [".jpg", ".png", ".gif", ".svg"],
        })
    },
],
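A sketch of an alternative, assuming Playwright request interception is acceptable here: page.route() intercepts every matching request the page makes, including those triggered by clicks or XHR after the initial load.
Plain Text
preNavigationHooks: [
    async ({ page }) => {
        await page.setViewportSize({ width: 1920, height: 1080 })
        // Abort image requests whenever they are made, not only on first load
        await page.route("**/*.{jpg,jpeg,png,gif,svg}", (route) => route.abort())
    },
],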
2 comments
After running npm run build on version 3.11.1 of Crawlee, I get the error below:

Plain Text
node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7:29 - error TS2305: Module '"cheerio"' has no exported member 'Element'.

7 import { type Cheerio, type Element } from 'cheerio';
                              ~~~~~~~


Found 1 error in node_modules/.pnpm/@crawlee+playwright@3.11.1_playwright@1.46.1/node_modules/@crawlee/playwright/internals/adaptive-playwright-crawler.d.ts:7

I have seen other posts in channels here with the same issue, but there doesn't seem to be a known workaround or solution. Can anyone advise what has broken in the latest version? This error seems to be coming from the package itself, not from user code. I am not using any Cheerio crawling.
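A possible stopgap, offered only under the assumption that the error stems from a mismatch between the cheerio typings Crawlee expects and the version installed: enable skipLibCheck so TypeScript stops type-checking declaration files in node_modules, which suppresses errors originating from dependency .d.ts files.
Plain Text
// tsconfig.json (sketch) – suppresses type errors coming from node_modules
{
    "compilerOptions": {
        "skipLibCheck": true
    }
}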
6 comments
I am adding a page as the initial crawl target, but would like to add a label to ensure it routes to the correct processor. Is there a way to do this?
Plain Text
await crawler.addRequests([
    "https://www.foo.bar/page",
])

It seems I can only add RequestQueueOperationOptions with addRequests, not the same EnqueueLinksOptions that I can use with enqueueLinks().
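A sketch of one possible way to do this, assuming addRequests accepts request objects as well as plain URL strings: set the label directly on each request object. The label name below is hypothetical.
Plain Text
await crawler.addRequests([
    { url: "https://www.foo.bar/page", label: "PAGE_DETAIL" }, // hypothetical label
])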
2 comments