Apify and Crawlee Official Forum

Jerome
Joined August 29, 2024
Hey, in the dataset schema view, we can either show all data or apply a transformation like "flatten".
Is it possible to do both: extract the more important fields into their own columns, but still keep an object field with the full data (some fields are less important and potentially empty)?
When I try this at the moment, the data is shown as undefined.
I'm attaching my code and the resulting views as screenshots; I hope this helps explain what I'm describing.
Relevant docs I've used for reference: https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema
2 comments
Hi, I couldn't find it in the docs: do you have a list of the countries in which each type of proxy is available?
I'm trying to create something similar to the drop-down menu on the actor page for selecting the proxy country.
2 comments
In most applications (like github), adding a tag like "latest" to a new version/release removes that tag from the previous one, so that only one version/release is tagged as "latest".
This is not the case on Apify, where all my versions are tagged as "latest", which makes the tag meaningless.

So I have 2 questions:
  • Would you consider automating the removal of previous tags on the platform?
  • What do you recommend to achieve this behavior using the API?
2 comments
Hi, I'm trying to understand how to bump the version of my actors when deploying programmatically.

On the one hand, it's not possible using the API (https://docs.apify.com/platform/actors/development/actor-definition/actor-json#reference):
"Actor name, version, buildTag, and environmentVariables are currently only used when you deploy your Actor using the Apify CLI and not when deployed, for example, via GitHub integration. There, it serves for informative purposes only."

On the other hand, you recommend not using the CLI for deployment (https://docs.apify.com/academy/deploying-your-code/deploying#with-apify-cli):
"The apify push command should only really be used for quickly pushing and testing Actors on the platform during development. If you are ready to make your Actor public, use a Git repository instead, as you will reap the benefits of using Git and others will be able to contribute to the project."

Creating a new build increments the PATCH version, but I also want to set the MAJOR and MINOR versions.
Is there a way I'm missing?
5 comments
Hi, my actors communicate with an external service I'm developing and I want to minimise request time, which means deploying that service as close as possible to where the actors run.
Can you share info about where the actors are running (region, cloud provider)?
2 comments
Jerome

Output schema

Hi, I'm having 2 issues with output schema definition:
  • I can't get transformations.unwind to work; the table only shows undefined values. Is there a working example I can check somewhere?
  • When I use transformations.flatten instead, it seems the table can't show fields in array or object format; it shows undefined instead. Am I missing something here?
For reference, my result is formatted like this, and I want to unwind/flatten the data part:
[
  {
    "metadata": {
      "timestamp": "2024-07-08T09:51:31.942Z",
      "run_id": "lWmckTaBAlbeKTM33"
    },
    "data": {
      "url": "some_url.com",
      "title": "some title",
      "attributes": ["a", "b"]
    }
  },
  {...}
]

And I'd like the table to show the columns:
  • "metadata" as object (this works)
  • "url" as link (works only when using flatten, not unwind)
  • "title" as text (same)
  • "attributes" as array (shows undefined)
Here are the docs I've been using: https://docs.apify.com/platform/actors/development/actor-definition/output-schema
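To make the intended result concrete, here is a minimal sketch of a generic "flatten" applied to the sample item above. It is not Apify's implementation; it only assumes that flattening turns nested objects into dot-separated keys and leaves arrays as whole values. The `flatten` function and the `row` variable are hypothetical names used for illustration.

```typescript
// Illustrative only: a generic "flatten" that turns nested objects into
// dot-separated keys. This is NOT Apify's implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function flatten(value: { [key: string]: Json }, prefix = ""): { [key: string]: Json } {
  const out: { [key: string]: Json } = {};
  for (const [key, child] of Object.entries(value)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (child !== null && typeof child === "object" && !Array.isArray(child)) {
      // Recurse into nested objects; arrays are kept whole as leaf values.
      Object.assign(out, flatten(child, path));
    } else {
      out[path] = child;
    }
  }
  return out;
}

// Sample item from the post.
const item = {
  metadata: { timestamp: "2024-07-08T09:51:31.942Z", run_id: "lWmckTaBAlbeKTM33" },
  data: { url: "some_url.com", title: "some title", attributes: ["a", "b"] },
};

// Flatten only the "data" part and keep "metadata" as a single object column.
const row = { metadata: item.metadata, ...flatten({ data: item.data }) };
console.log(row);
```

Under that assumption the columns would be keyed as `data.url`, `data.title` and `data.attributes`, with `attributes` still an array.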
9 comments
This is probably a long shot, but could you provide more information on how the log is displayed when running an actor?
Is it the Docker terminal, or just an observability view of the logs?
I would like to have a small TUI to monitor the crawl. It works locally, but the Apify logs don't show anything.
Is there a way to use the log screen as a TUI?
2 comments
I'm not using JS or Python, so I have to interact with the Apify API directly.
I need to store arbitrary data in request items in the queue. I've seen I can use the userData field when posting a request, but when getting a request from the head (https://docs.apify.com/api/v2#/reference/request-queues/queue-head/get-head) the response does not contain this userData field. Instead I get the request ID from this response and have to make a second API call to get the details of that specific request (https://docs.apify.com/api/v2#/reference/request-queues/request/get-request).
That's 2 API calls for 1 item from the queue, is there a better way?
Why doesn't the Get head endpoint return the complete request (including userData)?
I probably missed something there, thanks for your help.
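For reference, the two-call flow described above can be sketched as follows. This does not reduce the number of calls; it only issues the per-request lookups concurrently. The response shapes are assumptions based on the two linked endpoints, and `getJson` is a hypothetical helper supplied by the caller (for example, a thin wrapper around an HTTP client that adds the API token).

```typescript
// Minimal response shapes (assumed); only `id` and `userData` are relied on.
interface HeadResponse { data: { items: { id: string }[] } }
interface RequestResponse { data: { id: string; url: string; userData?: Record<string, unknown> } }

// Hypothetical helper injected by the caller: performs an authenticated GET
// and returns the parsed JSON body.
type GetJson = (url: string) => Promise<unknown>;

const API = "https://api.apify.com/v2";

async function headWithUserData(queueId: string, limit: number, getJson: GetJson) {
  // Call 1: get the head of the queue (items without userData).
  const head = (await getJson(`${API}/request-queues/${queueId}/head?limit=${limit}`)) as HeadResponse;
  // Call 2..n: fetch each full request by ID, concurrently rather than one by one.
  const details = await Promise.all(
    head.data.items.map(
      (item) => getJson(`${API}/request-queues/${queueId}/requests/${item.id}`) as Promise<RequestResponse>,
    ),
  );
  return details.map((d) => d.data);
}
```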
4 comments
Hi, I'm developing an actor in Rust and I'm trying to access resource utilisation (CPU and memory). I've seen that Crawlee uses os.cpus() from Node, but I'm looking for a Rust equivalent.
I can make it work locally by mounting the Docker socket on the container (docker run -v /var/run/docker.sock:/var/run/docker.sock <image_tag>), but it does not work on the Apify platform.
Are there any resources/pointers I could check on how Apify runs an actor's container, and on how I could read its resource utilisation?
Any help would be appreciated.
7 comments
Using the new adaptive Playwright crawler, is it possible to programmatically decide when to render JS?
For example using HTTP crawling by default, but if some condition is met (for example, finding the word 'captcha' in the loaded url), switch to JS rendering and try to unblock the page.

A similar question, for which I didn't find any answer in the docs: how does the AdaptivePlaywrightCrawler decide whether to render JS or not?
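To illustrate the behaviour I have in mind, here is a rough sketch of the "HTTP first, browser on demand" logic, written independently of Crawlee. It is not the AdaptivePlaywrightCrawler API; `looksBlocked`, `fetchAdaptive`, `viaHttp` and `viaBrowser` are hypothetical names, and the captcha check is only an example condition.

```typescript
// A fetcher takes a URL and resolves to the page HTML.
type Fetcher = (url: string) => Promise<string>;

// Hypothetical predicate: treat any mention of "captcha" in the URL or the
// body as a sign that the plain HTTP response is not usable.
function looksBlocked(url: string, html: string): boolean {
  return /captcha/i.test(url) || /captcha/i.test(html);
}

async function fetchAdaptive(
  url: string,
  viaHttp: Fetcher,
  viaBrowser: Fetcher,
): Promise<{ html: string; rendered: boolean }> {
  // Cheap path first: plain HTTP request.
  const html = await viaHttp(url);
  if (!looksBlocked(url, html)) return { html, rendered: false };
  // Condition met: retry the same URL with JS rendering.
  return { html: await viaBrowser(url), rendered: true };
}
```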
4 comments
I want to extract the text from all <li> elements inside an unordered list <ul>.
Trying await page.locator("div.my_class > ul > li").textContent(); causes an error: strict mode violation: locator('div.my_class > ul > li') resolved to x elements. The presence of multiple elements is expected since this is a list.
Playwright itself doesn't seem to have an issue with selectors that return multiple elements, and I did find the strictSelectors parameter in the Crawlee docs, but didn't manage to set it to false (if that is even the solution).
In Scrapy, item.add_css("list", "div.my_class > ul > li::text") returns a list of the text for each list item, which is what I'm looking for.
Does anyone know how to solve this?
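One possible approach, sketched under the assumption that `page` is a regular Playwright page: Playwright locators have an `allTextContents()` method that returns the text of every matching element, so it does not hit the strict mode check that `textContent()` does. The `PageLike` and `LocatorLike` interfaces and the `listItemTexts` helper below are hypothetical; they only describe the small part of the Playwright API used here so the snippet compiles without the library.

```typescript
// Structural subset of Playwright's Page and Locator used by this sketch.
interface LocatorLike { allTextContents(): Promise<string[]> }
interface PageLike { locator(selector: string): LocatorLike }

// Returns the trimmed text of every element matching the selector.
async function listItemTexts(
  page: PageLike,
  selector = "div.my_class > ul > li",
): Promise<string[]> {
  // allTextContents() accepts multiple matches, unlike textContent().
  const texts = await page.locator(selector).allTextContents();
  return texts.map((t) => t.trim());
}
```

Inside a request handler this would be called as `const items = await listItemTexts(page);`.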
2 comments