Apify

Apify and Crawlee Official Forum

I am building a scraper for an Australian website, but Apify uses US IPs, so all the requests are getting blocked. Does Apify support using an Australian IP, or do I have to supply my own?
Ever since I added a browser to my stack, the builds take two minutes or more. I'm trying to make the builds more efficient, but I'm no expert in setting up the image, so I'd appreciate any help. This is what I do right now:
FROM apify/actor-python:3.12
ARG ACTOR_PATH_IN_DOCKER_CONTEXT

RUN rm -rf /usr/src/app/*
WORKDIR /usr/src/app

COPY . ./

RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing Poetry:" \
 && pip install --no-cache-dir poetry~=1.7.1 \
 && echo "Installing dependencies:" \
 && poetry config cache-dir /tmp/.poetry-cache \
 && poetry config virtualenvs.in-project true \
 && poetry install --only=main --no-interaction --no-ansi \
 && rm -rf /tmp/.poetry-cache \
 && echo "All installed Python packages:" \
 && pip freeze \
 && echo "Installing Playwright dependencies:" \
 && poetry run playwright install chromium --with-deps

RUN python3 -m compileall -q ./jg/plucker

ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}"
CMD ["poetry", "run", "plucker", "--debug", "crawl", "--apify"]

Is there a way to make it faster?
I'm trying to buy some additional RAM but could not find any option for it. How can I buy it?
3 comments
2024-10-12T21:25:30.211Z ACTOR: Pulling Docker image of build IDpW06CSNrW8Wjb9L from repository.
2024-10-12T21:25:30.433Z ACTOR: Starting Docker container.
2024-10-12T21:25:31.942Z   File "/usr/src/app/src/main.py", line 76, in main
2024-10-12T21:25:31.944Z     site = web.TCPSite(runner, '0.0.0.0', Actor.config.standby_port)
2024-10-12T21:25:31.945Z                                           ^^^^^^^^^^^^
2024-10-12T21:25:31.947Z   File "/usr/local/lib/python3.12/site-packages/apify/_actor.py", line 65, in __init__
2024-10-12T21:25:31.949Z     self._configuration = configuration or Configuration.get_global_configuration()
2024-10-12T21:25:31.950Z                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-12T21:25:31.952Z   File "/usr/local/lib/python3.12/site-packages/crawlee/configuration.py", line 216, in get_global_configuration
2024-10-12T21:25:31.954Z     service_container.set_configuration(cls())
2024-10-12T21:25:31.955Z                                         ^^^^^
2024-10-12T21:25:31.957Z   File "/usr/local/lib/python3.12/site-packages/pydantic_settings/main.py", line 152, in __init__
2024-10-12T21:25:31.958Z     super().__init__(
2024-10-12T21:25:31.960Z   File "/usr/local/lib/python3.12/site-packages/pydantic/main.py", line 209, in __init__
2024-10-12T21:25:31.962Z     validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
2024-10-12T21:25:31.963Z                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-10-12T21:25:31.965Z pydantic_core._pydantic_core.ValidationError: 1 validation error for Configuration
2024-10-12T21:25:31.967Z actor_timeout_at
2024-10-12T21:25:31.968Z   Input should be a valid datetime or date, input is too short [type=datetime_from_date_parsing, input_value='', input_type=str]
2024-10-12T21:25:31.970Z     For further information visit https://errors.pydantic.dev/2.9/v/datetime_from_date_parsing
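
The traceback above suggests that `actor_timeout_at` was populated with an empty string, which pydantic cannot parse as a datetime. A minimal workaround sketch follows, assuming the value comes from an environment variable; the names `ACTOR_TIMEOUT_AT` and `APIFY_TIMEOUT_AT` are assumptions, since the log does not name the variable. The helper `drop_empty_env_vars` is hypothetical and not part of any SDK.

```python
import os


def drop_empty_env_vars(names, environ=None):
    """Remove variables that are set to an empty string; return the removed names."""
    environ = os.environ if environ is None else environ
    removed = []
    for name in names:
        if environ.get(name) == "":
            del environ[name]
            removed.append(name)
    return removed


# Assumed variable names; adjust to whatever your run environment actually sets.
SUSPECT_VARS = ["ACTOR_TIMEOUT_AT", "APIFY_TIMEOUT_AT"]

if __name__ == "__main__":
    # Call this before the Actor configuration is first created.
    print(drop_empty_env_vars(SUSPECT_VARS))
```

This only hides the symptom. If the platform is expected to set a real timeout, the empty value itself is worth reporting.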
2 comments
I used to be able to put an image in the README complete with width and height:

<img src="https://example.com/image.jpg" width="200" height="300">

But now it seems the width and height are ignored, which makes my README uglier than ever before.
2 comments
Hello guys, I have a question. Is it worth it to be an Apify Actor developer? I mean, is it profitable? I developed 2 Actors with 100 users, but they still don't earn much at all.
3 comments
Hi guys! I am wondering whether I can put multiple usernames into the username field of the input for this Actor, or whether I need to loop through them to generate outputs for multiple usernames.
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "username": [
        "zelenskyy_official"
    ],
    "resultsLimit": 30
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("xMc5Ga1oCONPmWJIa").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
1 comment
When I call the start method in actor.py of the apify_client, the HTTP request method is 'POST', but it throws the error 'This API end-point can only be accessed using the following HTTP methods: OPTIONS, GET, PUT, DELETE'. Has anyone encountered this issue? I would like to know how to resolve it.
1 comment
As far as I can see, when a user manually stops the Actor while it is running, the success rate decreases. I think the platform counts this as a failure. Do manual stops by the user affect the success rate? Is there any way to prevent this, or will there be a solution for it?
1 comment
As the title says.

I'm currently using Chromium, but it is CPU-heavy.

Killing the browser does not kill the process, so it's easy to reach 100% CPU usage pretty quickly.

(I'm crawling thousands of websites, looking for different data on each.) I already tried loading pure HTML without CSS, images, and other assets. That helped a lot, but the issue is still there.
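
One common way to make sure no stray processes survive is to start the child in its own process group and signal the whole group. The sketch below assumes a POSIX system and assumes you launch the process yourself; browser automation libraries usually manage the browser process internally, so this may not apply directly. The helper names `start_in_own_group` and `kill_process_group` are hypothetical.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_process_group(proc, grace_seconds=2.0):
    """Send SIGTERM to the whole group, then SIGKILL to whatever is left."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            # The child is the group leader, so its PID is also the group ID.
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # no process is left in the group
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            continue  # leader ignored the signal; escalate
    return proc.wait()


if __name__ == "__main__":
    # Stand-in for a browser: a child process that sleeps for a minute.
    child = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", kill_process_group(child))
```

Signalling the group rather than a single PID also reaches helper processes that the child spawned, and those are usually what keeps the CPU busy after the parent is gone.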
2 comments
My Actor saves data using the Prisma client. But when I run the Actor, the crawler says that prisma generate needs to be run, even though I ran it after the build.

Any tips on how to solve it? The files are all in /myuser/ folder.

my package.json: https://pastebin.com/KqMYk7Ae
Apify build log: https://pastebin.com/71LrxCWN
Apify run log: https://pastebin.com/fg2dUW0C
2 comments
I introduced a new version of my actor, but how do I make it the latest version? I assume that the README is also taken from the latest version?
I’m not sure if anyone else has experienced this or if it's just me.

When I don’t use an Actor for a long time, about a month or so, I sometimes notice that when I revisit the Actor console, the input fields are pre-filled by someone else, and I don’t know who. This is my own Actor.

These inputs may contain sensitive data like passwords, as shown in the screenshot.

I can't reproduce this issue, but my guess is that when my saved inputs 'expire,' they are automatically replaced by someone else’s inputs.

I hope someone from @Apifyteam can look into this.

Thank You!
7 comments
The table headers are blocking the dropdown.
2 comments
Can anybody advise how to set the DEFAULT memory when creating an Actor? When I set "minMemoryMbytes": 128, "maxMemoryMbytes": 256, the Actor always complains that the prefilled 1 GB is too much. I would like to know how to change that default 1 GB in Run options to something lower. I tried setting "memoryMbytes": 128, but that doesn't work. Thanks
I am unclear about the advantages of storing scraping results in Apify's native platform storage options. Why would I do that instead of just posting each result to my MongoDB collection as it comes in?
1 comment
I have created 5 Actor codebases. Each is stored in its own private repo under my company's organization on GitHub.

I have successfully instantiated 4 Actors from these in my organization account on the Apify platform. They run and work fine. 💪

However, the 5th does not show up when I go to "Create New Actor from GitHub". The others do. My company account on Apify is the $49/month Starter package.

What's confusing is that when I go to my (free) Personal Apify account, all 5 GitHub repos show up as I would expect. I am able to create all 5 Actors and run them.

I have tried many things, but I can't figure out how to get it to work under my Organization account.

If anyone has any idea on what to try, I'd be super grateful. I have no idea how to get this fixed!
How can I easily scrape the email addresses present on an input list of websites? I cannot find this function, even though it is probably a well-sought-after one. And how can I limit the output to only the email addresses found, per domain?
Can Actors handle Rust code?
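
The per-domain email output described above can be sketched with a regular expression over already-downloaded page text. This is a minimal illustration rather than an existing Apify feature: the helpers `extract_emails` and `emails_by_domain` are hypothetical, the pattern is deliberately simple and will miss obfuscated addresses, and fetching the pages is left out.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Simplified pattern: local part, "@", one or more domain labels, alphabetic TLD.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def extract_emails(text):
    """Return the unique, lower-cased email addresses found in text, sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


def emails_by_domain(pages):
    """Group found emails by the hostname of the page they appeared on.

    pages is an iterable of (url, page_text) pairs.
    """
    grouped = defaultdict(set)
    for url, text in pages:
        host = urlparse(url).hostname or url
        grouped[host].update(extract_emails(text))
    return {host: sorted(found) for host, found in grouped.items()}


if __name__ == "__main__":
    sample = [
        ("https://example.com/contact", "Write to sales@example.com today."),
        ("https://example.org/", "Support: support@example.org."),
    ]
    print(emails_by_domain(sample))
```

Returning only the grouped addresses keeps the output limited to emails per domain; everything else on the page is discarded.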
Hi, I'm new to Apify. Apologies if this question has been answered before or if I'm posting in the wrong channel.

I'm running a task actor via API and retrieving the results once I receive a webhook indicating the run has completed. I want to pass additional parameters that are not part of the input and then retrieve them along with the results.

Is there a way to achieve this? If so, how?
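
One possible approach is to attach an ad-hoc webhook to the run and put the extra values into its payload template, so that they come back in the webhook request. The sketch below only builds the base64-encoded value for a `webhooks` query parameter. The field names `eventTypes`, `requestUrl`, and `payloadTemplate`, the event name, and the `{{resource}}` placeholder are written from memory of the Apify webhook documentation, so treat them as assumptions to verify. `build_adhoc_webhooks_param` and the `extra` key are hypothetical.

```python
import base64
import json


def build_adhoc_webhooks_param(request_url, extra):
    """Return a base64 string describing one ad-hoc webhook that carries extra data."""
    # The template is a string, not plain JSON: {{...}} placeholders are
    # expected to be filled in by the platform when the webhook fires.
    payload_template = (
        '{"resource": {{resource}}, "eventType": {{eventType}}, "extra": '
        + json.dumps(extra)
        + "}"
    )
    webhooks = [
        {
            "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
            "requestUrl": request_url,
            "payloadTemplate": payload_template,
        }
    ]
    return base64.b64encode(json.dumps(webhooks).encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Hypothetical receiver URL and extra parameters.
    print(build_adhoc_webhooks_param("https://example.com/hook", {"orderId": 42}))
```

The receiver would then read the `extra` object from the webhook body and fetch the results using the run details in `resource`.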
Hi team 👋 I'd like to know how the success % displayed on the Actor's page is calculated and how often it gets updated.
2 comments
I've developed an actor which I would like to publish as it is finished. However, in order to scrape sufficient data, proxies would be necessary. How does this work and who pays for the proxies when the actor is published? I'm currently doing local development on a free account.
1 comment