Honza Javorek

Making my Docker image more efficient

Ever since I added a browser to my stack, the builds have taken two minutes or more. I'm trying to make the builds more efficient, but I'm no expert in setting up the image, so I'd appreciate any help. This is what I do right now:

Plain Text

FROM apify/actor-python:3.12
ARG ACTOR_PATH_IN_DOCKER_CONTEXT

RUN rm -rf /usr/src/app/*
WORKDIR /usr/src/app

COPY . ./

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install chromium --with-deps

RUN python3 -m compileall -q ./jg/plucker

ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}"
CMD ["poetry", "run", "plucker", "--debug", "crawl", "--apify"]

Is there a way to make it faster?

3 comments

Honza Javorek

Getting no proxies on input - expected or a bug?

I'm trying to integrate my Scrapy actor with Playwright, so I tried to figure out the actual format of the proxy input from Apify, so that I could pass it over to Playwright.

I printed out what my spider gets and this is what it prints:

Plain Text

APIFY_PROXY_SETTINGS: {'apifyProxyGroups': [], 'useApifyProxy': True}

Empty list. Is that expected or a bug? Does it mean that my scraper runs without proxies despite the fact I have the "Datacenter" option turned on? I'm really confused now.

The spider has some problems with getting blocked. If I think I'm using proxies but in fact there are none, then it's no surprise I'm getting blocked.
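
For context, here is a minimal sketch of how the printed settings might be translated into the `proxy` option that Playwright accepts when launching a browser. The `playwright_proxy` helper is hypothetical. The `proxy.apify.com:8000` address, the `groups-…` username format, and the reading of an empty group list as automatic selection (`auto`) are assumptions based on how Apify Proxy connection URLs are usually documented, not something confirmed in this thread.

```python
from typing import Optional


def playwright_proxy(settings: dict, password: str) -> Optional[dict]:
    """Translate Apify-style proxy input into a Playwright `proxy` dict.

    Hypothetical helper: host, port and username format are assumptions.
    """
    if not settings.get("useApifyProxy"):
        return None  # Apify Proxy not requested in the input
    groups = settings.get("apifyProxyGroups") or []
    # Assumption: an empty group list means automatic group selection.
    username = "groups-" + "+".join(groups) if groups else "auto"
    return {
        "server": "http://proxy.apify.com:8000",
        "username": username,
        "password": password,
    }
```

The returned dict could then be passed as `proxy=` to `browser_type.launch()`, with the password read from wherever the actor keeps its proxy credentials.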

3 comments

Honza Javorek

We have disabled the system which sent notifications about finished runs

I've got some scrapers failing, but I only found out because my production had no data. Why? Because I've relied on emails from Apify, and they didn't come. Why? Because you've disabled them without telling me 😀

C'mon. I'm fine with you disabling an email feature you want to revamp, but then please email me about the change, because I don't spend my life on the notifications tab to learn about the change from a little notification box 🤦‍♂️

I don't consider the notifications perfect, but they are the only global way to get notified that actors failed to finish. The alternative is to set up an alert, but then I have to manually click through all ten of my actors to set it up (prone to errors), and, more importantly, there is no way to set up an alert for a failed actor. In the metric dropdown there is no "exit status is 0" option or equivalent, as far as I can see.

Hence, after your change, I'm left with no monitoring over my actors unless I immediately go and put together some automated code checking the actors over your API, which kind of defeats the reason why I recently migrated my scrapers to Apify - to get these things out of the box, as part of your platform.
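
To illustrate the kind of "automated code" I mean, here is a minimal sketch of the filtering step only. It assumes the runs have already been fetched from the API as a list of dicts with `status` and `actId` keys; that shape, the `failed_runs` helper, and the choice of which statuses count as failures are all assumptions made for illustration.

```python
# Statuses treated as failures in this sketch (an assumption, adjust as needed).
FAILURE_STATUSES = {"FAILED", "TIMED-OUT", "ABORTED"}


def failed_runs(runs: list) -> list:
    """Return only the runs whose status counts as a failure."""
    return [run for run in runs if run.get("status") in FAILURE_STATUSES]


# Hypothetical sample data standing in for an API response.
sample = [
    {"actId": "jobs", "status": "SUCCEEDED"},
    {"actId": "meetups", "status": "FAILED"},
    {"actId": "events", "status": "TIMED-OUT"},
]
for run in failed_runs(sample):
    print(f"Actor {run['actId']} ended with status {run['status']}")
```

Fetching the runs and sending the notification are left out on purpose, since those are exactly the parts I hoped the platform would handle.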

2 comments

Honza Javorek

Subscribing to changelog using RSS?

I found https://apify.com/change-log and I'd like to subscribe using RSS. Is that possible?

Reverse-engineering it, I see the page makes requests to https://cms.apify.com/api/change-log-items?pagination[limit]=-1&populate=deep, but I'm not sure what cms.apify.com is and whether it's able to give me a good old RSS or Atom feed.

1 comment

Honza Javorek

Suggestion: Notify me only when my actor fails

There are these two settings (see attachment) which I realized I didn't understand correctly at first. I thought I could uncheck the reports about all my actor runs and get only failed actor runs by leaving the other checkbox checked. But it seems that "Actor Issues" is something completely different, probably related to the actor marketplace.

To get notified about failed actors, I either have to set up monitoring on all of them manually (e.g. alert every time there are 0 items in the dataset), or I have to check the top checkbox and get notified about all runs, even the successful ones (which is noisy if I have a daily schedule).

So I suggest there could be a checkbox which notifies me only when the actors fail, while not sending anything when the actors are successful.

4 comments

Honza Javorek

Broken links

FYI, these links on https://docs.apify.com/platform/storage/dataset are broken (see image).

2 comments

Honza Javorek

Auto builds don't work at all, getting HTTP 500 errors from Apify both in UI and webhooks

When I click on the Automatic builds radio button, I get an error. Using a different browser doesn't help. The browser console says:

Plain Text

ERROR Failed to handle request 'POST - /github-app/setup-webhook/...' {"request":{"method":"post","url":"https://console-backend.apify.com/github-app/setup-webhook/...","headers":{"x-idempotency-key":"...","Authorization":"Bearer ..."},"params":{},"data":{"version":"0.0","enabled":true}},"response":{"status":"error","statusCode":500,"isClientSafe":true,"errorCode":"internal-server-error","errorMessage":"Cannot destructure property 'token' of '(intermediate value)' as it is null.","path":"/github-app/setup-webhook/..."}}
  [object Object]

18 comments

Honza Javorek

Scrapy integration silently throws away redirects

Cheers, I just noticed that my custom scraper makes a quite different number of requests locally and through Apify, while the code, URLs, parameters, everything is the same. The same Scrapy spider produces 720 items locally, but 370 through Apify.

Does anyone have a clue what the root cause could be, or where to look? Just from the logs I can't see anything. The only clue I noticed is that on Apify the scraper makes no POST requests, but that probably isn't enough to debug the root cause 🤔

Is there a way I can raise the logging level or something on Apify? How can I best approach this? How do I debug this?

6 comments

Honza Javorek

Everything logged twice?

Any idea why everything is logged twice? Is this a known issue with the Scrapy template, or is it the desired behavior?

Plain Text

...
[scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<Spider 'startupjobs' at 0x1072ab890>"})
[scrapy.core.engine] INFO  Spider closed (finished) ({"spider": "<Spider 'startupjobs' at 0x1072ab890>", "message": "Spider closed (finished)"})
[twisted] INFO  (TCP Port 6023 Closed)
[twisted] INFO  (TCP Port 6023 Closed) ({"message": "(TCP Port 6023 Closed)"})
[apify] INFO  Exiting actor ({"exit_code": 0})
[apify] INFO  Exiting actor ({"exit_code": 0, "message": "Exiting actor"})
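
One common way to end up with doubled lines in Python's standard `logging` is a handler attached to a logger and another handler attached to one of its ancestors, with propagation left on. Whether that is what happens in the template is only a guess; the sketch below just reproduces the general pattern with made-up logger names (`demo`, `demo.engine`) and a hypothetical `ListHandler` that collects messages in a list.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that appends each message to a list."""

    def __init__(self, sink: list):
        super().__init__()
        self.sink = sink

    def emit(self, record: logging.LogRecord) -> None:
        self.sink.append(record.getMessage())


def emitted_lines(propagate: bool) -> list:
    """Log one message with handlers on both a child and its parent logger."""
    sink = []
    parent = logging.getLogger("demo")
    child = logging.getLogger("demo.engine")
    for logger in (parent, child):
        logger.handlers.clear()  # keep repeated calls independent
        logger.setLevel(logging.INFO)
        logger.addHandler(ListHandler(sink))
    # With propagation on, the record also reaches the parent's handler.
    child.propagate = propagate
    child.info("Spider closed (finished)")
    return sink


print(len(emitted_lines(propagate=True)))
print(len(emitted_lines(propagate=False)))
```

With propagation on, the message is collected twice; with it off, once. In the output above, the two copies are formatted slightly differently, which would be consistent with two separate handlers, but that remains speculation.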

9 comments

Honza Javorek

More actors in one repository

I'm not that far into my proof of concept, and perhaps I'm asking about something which would become clear later, but one question keeps coming up as I try to architect my future solution.

I think I'd like to have many actors in one repository so it's easy to manage and contribute to them. But is it possible to connect an Apify actor with such a (monorepo?) architecture?

E.g. I could have a Python package with a few scrapers to scrape jobs, then another package with a few scrapers to scrape meetup.com events, etc., separated by topic and by setup (schedulers, proxies, etc.). But I'd like to have all scrapers in one repo.

In theory, would it be possible to have one repo with many actors, have many actor projects created in the Apify GUI, and then somehow specify that a particular actor is run in a certain way (command line parameter, environment variable, subfolder, you name it)? What would be an idiomatic way to solve this?
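
To make the environment variable idea concrete, here is a minimal sketch of how a single entry point might pick which scraper to run. It borrows the `ACTOR_PATH_IN_DOCKER_CONTEXT` name from the Dockerfile in my other thread; the assumption that its value is a path whose last segment names the actor, the `actor_name` helper, and the example directory names are all hypothetical.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def actor_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Derive the actor to run from a path stored in an environment variable."""
    env = os.environ if env is None else env
    path = env.get("ACTOR_PATH_IN_DOCKER_CONTEXT", "")
    if not path:
        raise RuntimeError("ACTOR_PATH_IN_DOCKER_CONTEXT is not set")
    # Assumption: the last path segment is the name of the actor directory.
    return PurePosixPath(path).name


# Hypothetical layout: one directory per actor inside the repository.
print(actor_name({"ACTOR_PATH_IN_DOCKER_CONTEXT": "actors/startupjobs"}))
```

The returned name could then be mapped to a spider or a command line argument, so every actor project shares the same image and differs only in that one value.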

4 comments

Honza Javorek

Using Poetry

What would be the recommended way to use Poetry to install dependencies instead of pure pip? Does anyone have this working in their Apify Dockerfile?

3 comments

Apify Discord Mirror
