Apify and Crawlee Official Forum

Honza Javorek
Joined August 30, 2024
Ever since I added a browser to my stack, builds take two minutes or more. I'm trying to make them more efficient, but I'm no expert in setting up the image, so I'd appreciate any help. This is what I do right now:
Plain Text
FROM apify/actor-python:3.12
ARG ACTOR_PATH_IN_DOCKER_CONTEXT

RUN rm -rf /usr/src/app/*
WORKDIR /usr/src/app

COPY . ./

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install chromium --with-deps

RUN python3 -m compileall -q ./jg/plucker

ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}"
CMD ["poetry", "run", "plucker", "--debug", "crawl", "--apify"]

Is there a way to make it faster?
I'm trying to integrate my Scrapy actor with Playwright, so I tried to figure out the actual format of the proxy input from Apify so that I could pass it on to Playwright.

I printed out what my spider gets and this is what it prints:
Plain Text
APIFY_PROXY_SETTINGS: {'apifyProxyGroups': [], 'useApifyProxy': True}

Empty list. Is that expected or a bug? Does it mean that my scraper runs without proxies despite the fact I have the "Datacenter" option turned on? I'm really confused now.

The spider has some problems with getting blocked. If I believed I was using proxies but in fact there were none, it's no surprise I'm getting blocked.
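For illustration, here is a minimal sketch of how the printed settings could be turned into the proxy dict that Playwright accepts. It assumes the Apify Proxy connection details as I understand them from the docs (host `proxy.apify.com`, port 8000, username `auto` when no groups are listed, `groups-A+B` otherwise, password taken from the `APIFY_PROXY_PASSWORD` environment variable). If that reading is right, an empty `apifyProxyGroups` list would mean automatic group selection rather than no proxy at all, but treat this as an assumption to verify. The function name is made up for this example.

```python
from typing import Optional

# Assumed Apify Proxy endpoint; verify against the current Apify docs.
APIFY_PROXY_HOST = "proxy.apify.com"
APIFY_PROXY_PORT = 8000


def playwright_proxy_from_apify(settings: dict, password: str) -> Optional[dict]:
    """Translate Apify proxy input into a Playwright-style proxy dict.

    Returns None when the input says the Apify proxy should not be used.
    """
    if not settings.get("useApifyProxy"):
        return None
    groups = settings.get("apifyProxyGroups") or []
    # Assumption: no explicit groups means the "auto" username.
    username = "groups-" + "+".join(groups) if groups else "auto"
    return {
        "server": f"http://{APIFY_PROXY_HOST}:{APIFY_PROXY_PORT}",
        "username": username,
        "password": password,
    }


# The settings exactly as printed by the spider above.
proxy = playwright_proxy_from_apify(
    {"apifyProxyGroups": [], "useApifyProxy": True},
    password="<value of APIFY_PROXY_PASSWORD>",
)
# With Playwright this dict could be passed as: chromium.launch(proxy=proxy)
```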
3 comments
I've got some scrapers failing, but I only learned about it because my production had no data. Why? Because I relied on emails from Apify, and they didn't come. Why? Because you disabled them without telling me 😀

C'mon. I'm fine with you disabling an email feature you want to revamp, but then please email me about the change, because I don't spend my life on the notifications tab to learn about it from a little notification box 🤦‍♂️

I don't consider the notifications perfect, but they are the only global way to get notified that actors failed to finish. The alternative is to set up an alert, but then I have to manually click through all ten of my actors to set it up (error-prone), and, more importantly, there is no way to set up an alert for a failed actor. In the metric dropdown, there is no "exit status is 0" option or equivalent, as far as I can see.

Hence, after your change, I'm left with no monitoring of my actors unless I immediately go and write some automated code that checks the actors through your API, which kind of defeats the reason I recently migrated my scrapers to Apify: to get these things out of the box, as part of your platform.
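To make the "automated code checking the actors over your API" concrete, a rough sketch of such a check could look like the one below. It assumes the Apify API v2 endpoint for an actor's last run (`/v2/acts/{actorId}/runs/last`) and the run statuses `FAILED`, `TIMED-OUT` and `ABORTED`. The helper names and the actor IDs are placeholders invented for the example, not anything from the platform.

```python
import json
import urllib.request

# Run statuses treated as "the actor failed to finish" (assumed from the API docs).
FAILED_STATUSES = {"FAILED", "TIMED-OUT", "ABORTED"}


def fetch_last_run(actor_id: str, token: str) -> dict:
    """Fetch the last run of one actor from the Apify API (network call)."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs/last"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]


def find_failures(last_runs: dict) -> list:
    """Return the sorted IDs of actors whose last run did not succeed."""
    return sorted(
        actor_id
        for actor_id, run in last_runs.items()
        if run.get("status") in FAILED_STATUSES
    )


# Offline example with made-up actor IDs; in real use each value would come
# from fetch_last_run(actor_id, token).
example_runs = {
    "user~jobs": {"status": "SUCCEEDED"},
    "user~meetups": {"status": "FAILED"},
}
print(find_failures(example_runs))
```

Run from a daily cron job or CI schedule, something like this could send a single email listing the failed actors, which is roughly what the disabled notification used to do.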
2 comments
I found https://apify.com/change-log and I'd like to subscribe using RSS. Is that possible?

Reverse-engineering the page, I see it makes requests to https://cms.apify.com/api/change-log-items?pagination[limit]=-1&populate=deep, but I'm not sure what cms.apify.com is and whether it can give me a good old RSS or Atom feed.
1 comment
There are these two settings (see attachment) which I realized I didn't understand correctly at first. I thought I could uncheck the report about all my actor runs and get only failed actor runs by leaving the other checkbox checked. But it seems that "Actor Issues" is something completely different, probably related to the actor marketplace.

To get notified about failed actors, I either have to set up monitoring on all of them manually (e.g. an alert every time there are 0 items in the dataset), or I have to check the top checkbox and get notified about all runs, even the successful ones (which is noisy if I have a daily schedule).

So I suggest there could be a checkbox which notifies me only when the actors fail, while not sending anything when the actors are successful.
4 comments
Honza Javorek

Broken links

FYI, these links on https://docs.apify.com/platform/storage/dataset are broken (see image).
2 comments
When I click the Automatic builds radio button, I get an error. Using a different browser doesn't help. The browser console says:

Plain Text
ERROR Failed to handle request 'POST - /github-app/setup-webhook/...' {"request":{"method":"post","url":"https://console-backend.apify.com/github-app/setup-webhook/...","headers":{"x-idempotency-key":"...","Authorization":"Bearer ..."},"params":{},"data":{"version":"0.0","enabled":true}},"response":{"status":"error","statusCode":500,"isClientSafe":true,"errorCode":"internal-server-error","errorMessage":"Cannot destructure property 'token' of '(intermediate value)' as it is null.","path":"/github-app/setup-webhook/..."}}
  [object Object]
18 comments
Cheers, I just noticed that my custom scraper makes quite a different number of requests locally and through Apify, even though the code, URLs, parameters, everything is the same. The same Scrapy spider produces 720 items locally, but 370 through Apify.

Does anyone have a clue what the root cause could be, or where to look? I can't see anything just from the logs. The only clue I noticed is that on Apify the scraper makes no POST requests, but that probably isn't enough to debug the root cause 🤔

Is there a way I can raise logging or something on Apify? How can I best approach this? How to debug this?
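One way to approach this, sketched here under assumptions: Scrapy dumps a stats dictionary at the end of every crawl, and setting `LOG_LEVEL = "DEBUG"` in the Scrapy settings makes each request visible in the log. Comparing the stats of the local run with those of the Apify run could narrow down where the two diverge. The item counts below are the ones from this post; the request counts are invented purely to show the shape of the comparison.

```python
def diff_stats(local: dict, remote: dict) -> dict:
    """Return {key: (local_value, remote_value)} for every stat that differs.

    A key missing on one side is treated as 0.
    """
    keys = sorted(set(local) | set(remote))
    return {
        key: (local.get(key, 0), remote.get(key, 0))
        for key in keys
        if local.get(key, 0) != remote.get(key, 0)
    }


# Hypothetical stats copied from the "Dumping Scrapy stats" log line of each run.
local_stats = {
    "item_scraped_count": 720,
    "downloader/request_method_count/GET": 40,   # invented number
    "downloader/request_method_count/POST": 35,  # invented number
}
apify_stats = {
    "item_scraped_count": 370,
    "downloader/request_method_count/GET": 40,   # invented number
}

for key, (local_value, remote_value) in diff_stats(local_stats, apify_stats).items():
    print(f"{key}: local={local_value} apify={remote_value}")
```

With these example numbers only the POST counter and the item count differ, which would point at whatever code path schedules the POST requests.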
6 comments
Any idea why everything is logged twice? Is this a known issue of the Scrapy template, or is it a desired behavior?
Plain Text
...
[scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<Spider 'startupjobs' at 0x1072ab890>"})
[scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<Spider 'startupjobs' at 0x1072ab890>", "message": "Spider closed (finished)"})
[twisted] INFO  (TCP Port 6023 Closed)
[twisted] INFO  (TCP Port 6023 Closed) ({"message": "(TCP Port 6023 Closed)"})
[apify] INFO  Exiting actor ({"exit_code": 0})
[apify] INFO  Exiting actor ({"exit_code": 0, "message": "Exiting actor"})
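A common cause of output like this in Python logging is a record being handled twice: once by a handler attached to the logger itself and once more by a handler on a parent logger after propagation. Whether that is what happens in the Scrapy template is only a guess, but the pattern (same message, slightly different formatting) would fit. The sketch below reproduces the effect with made-up logger names and shows that turning off propagation leaves a single line.

```python
import io
import logging


def emit_once(propagate: bool) -> list:
    """Log one message through a child logger and return the emitted lines."""
    stream = io.StringIO()
    parent = logging.getLogger("demo")         # stands in for the root logger
    child = logging.getLogger("demo.scrapy")   # stands in for a library logger
    parent.handlers.clear()
    child.handlers.clear()
    parent.setLevel(logging.INFO)
    parent.propagate = False  # keep the demo away from the real root logger

    parent_handler = logging.StreamHandler(stream)
    parent_handler.setFormatter(
        logging.Formatter("[%(name)s] %(levelname)s %(message)s (parent handler)")
    )
    child_handler = logging.StreamHandler(stream)
    child_handler.setFormatter(
        logging.Formatter("[%(name)s] %(levelname)s %(message)s (child handler)")
    )
    parent.addHandler(parent_handler)
    child.addHandler(child_handler)

    child.propagate = propagate
    child.info("Spider closed (finished)")
    return stream.getvalue().splitlines()


print(len(emit_once(propagate=True)))   # two handlers see the record
print(len(emit_once(propagate=False)))  # only the child's handler sees it
```

If this is the cause, the fix would be to either remove one of the two handlers or set `propagate = False` on the logger that has its own handler.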
9 comments
I'm not that far along in my proof of concept, and perhaps I'm asking about something that would become clear later, but one question keeps coming up as I try to architect my future solution.

I think I'd like to have many actors in one repository so that it's easy to manage and contribute to them. But is it possible to connect an Apify actor to such a (monorepo?) architecture?

E.g. I could have a Python package with a few scrapers to scrape jobs, then another package with a few scrapers to scrape meetup.com events, etc., keeping them separated by topic and by setup (schedulers, proxies, etc.). But I'd like to have all scrapers in one repo.

In theory, would it be possible to have one repo with many actors, have many actor projects created in the Apify GUI, and then somehow specify that a particular actor is run in a certain way (command line parameter, environment variable, subfolder, you name it)? What would be an idiomatic way to solve this?
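As one possible shape for the "environment variable" option, here is a small sketch. It relies on the `ACTOR_PATH_IN_DOCKER_CONTEXT` variable that the Dockerfile in the first post on this page already forwards into the container, and it assumes each actor lives in its own subfolder whose name matches the spider to run. The folder layout, the function name and the fallback behavior are all hypothetical.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def spider_name_from_env(
    environ: Optional[Mapping[str, str]] = None,
    default: Optional[str] = None,
) -> str:
    """Pick the spider to run from the actor's folder inside the repo."""
    environ = os.environ if environ is None else environ
    actor_path = environ.get("ACTOR_PATH_IN_DOCKER_CONTEXT", "").strip()
    # The last path component is taken as the spider name (assumed convention).
    name = PurePosixPath(actor_path).name if actor_path else ""
    if name:
        return name
    if default is None:
        raise RuntimeError("ACTOR_PATH_IN_DOCKER_CONTEXT is not set and no default given")
    return default


# Hypothetical layout: one folder per actor under jg/plucker/scrapers/.
example_env = {"ACTOR_PATH_IN_DOCKER_CONTEXT": "jg/plucker/scrapers/startupjobs/"}
print(spider_name_from_env(example_env))
```

A single shared entry point could then call this helper and start the matching spider, so every actor created in the Apify GUI points to the same repository but a different subfolder.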
4 comments
Honza Javorek

Using Poetry

What would be the recommended way to use Poetry to install dependencies instead of plain pip? Does anyone have this working in their Apify Dockerfile?
3 comments