Apify and Crawlee Official Forum

Hi, I've seen mentions of a "pay per event" pricing model (https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event and https://apify.com/mhamas/pay-per-event-example), but I can't find how to use it for one of my Actors; I only see the rental and pay-per-result options.
How can we use this pay-per-event pricing model?
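For context, a pay-per-event Actor charges for events programmatically from inside its own code. A minimal sketch with the JavaScript SDK, assuming an Actor.charge() call like the one the linked pay-per-event example Actor demonstrates, and a hypothetical 'result-item' event configured in the Actor's monetization settings:
Plain Text
import { Actor } from 'apify';

await Actor.init();

// ... do the actual scraping / processing work here ...

// Charge the user for one occurrence of the 'result-item' event.
// The event name must match one defined in the Actor's pricing setup.
await Actor.charge({ eventName: 'result-item', count: 1 });

await Actor.exit();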
8 comments
I’m working on a project using PlaywrightCrawler to scrape links from a dynamic JavaScript-rendered website. The challenge is that the <a> tags don’t have href attributes, so I need to click on them and capture the resulting URLs.

  • Delayed Link Rendering: Links are dynamically rendered with JavaScript, often taking time due to a loader. How can I ensure all links are loaded before clicking?
  • Navigation Issues: Some links don’t navigate as expected or fail when trying to open in a new context.
  • Memory Overload: I get the warning "Memory is critically overloaded" during crawls.
I've attached images of my code (it was too long to paste here).

How can I handle these issues more efficiently, especially for dynamic and JavaScript-heavy sites?
I would appreciate any help
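A minimal sketch of one way to handle the click-and-capture pattern with Crawlee's PlaywrightCrawler - the '.loader' and 'a.card-link' selectors and the start URL are placeholders, and concurrency is lowered to ease the memory pressure:
Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 2, // fewer parallel browser pages = less memory pressure
  requestHandler: async ({ page, log }) => {
    // Wait for the loader to disappear and for the dynamic links to render.
    await page.waitForSelector('.loader', { state: 'hidden' });
    await page.waitForSelector('a.card-link');

    const count = (await page.$$('a.card-link')).length;
    for (let i = 0; i < count; i++) {
      // Re-query on every iteration - navigating invalidates old handles.
      const link = (await page.$$('a.card-link'))[i];
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
        link.click(),
      ]);
      log.info(`Link ${i} resolved to ${page.url()}`);
      await page.goBack({ waitUntil: 'domcontentloaded' });
      await page.waitForSelector('a.card-link'); // links render again after going back
    }
  },
});

await crawler.run(['https://example.com']);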
2 comments
Hello, I would like to ask whether any Apify tool can, for example, find a similar image - https://i.postimg.cc/KzRHFKQc/55.jpg - and extract the product names from the matching links to a CSV. Could we use Google Lens? I want to use this to automatically name antique products.

Thanks for all the information and help! 👋
1 comment
Hi,
I need help with finding an Actor, or configuring the Website Content Crawler, to extract all the URLs from a site but not the content of those URLs. I want to filter the URLs by keywords to find the ones I'm looking for, but I don't need the content of the pages.

Thanks for your help
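If the Website Content Crawler doesn't expose such a setting directly, a small custom Actor can do it. A rough sketch with CheerioCrawler, where 'pricing' is a placeholder keyword - only matching URLs are stored, never the page content:
Plain Text
import { CheerioCrawler, Dataset } from 'crawlee';

const KEYWORD = 'pricing'; // placeholder - the keyword to filter URLs by

const crawler = new CheerioCrawler({
  requestHandler: async ({ request, enqueueLinks }) => {
    // Store only the URL, never the page content.
    if (request.loadedUrl.toLowerCase().includes(KEYWORD)) {
      await Dataset.pushData({ url: request.loadedUrl });
    }
    // Keep discovering more URLs on the same domain.
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://example.com']);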

I have created a scraper but am having issues publishing it to the Store. I opened my account 2 days ago and would like to start earning money with my scraper.

When I build the Actor and run it, I get the following error:
2025-01-10T18:47:43.475Z Traceback (most recent call last):
2025-01-10T18:47:43.476Z File "<frozen runpy>", line 198, in _run_module_as_main
2025-01-10T18:47:43.477Z File "<frozen runpy>", line 88, in _run_code
2025-01-10T18:47:43.478Z File "/usr/src/app/src/main.py", line 3, in <module>
2025-01-10T18:47:43.479Z from .main import main
2025-01-10T18:47:43.479Z File "/usr/src/app/src/main.py", line 9, in <module>
2025-01-10T18:47:43.480Z from apify import Actor
2025-01-10T18:47:43.481Z File "/usr/local/lib/python3.12/site-packages/apify/__init__.py", line 7, in <module>
2025-01-10T18:47:43.482Z from apify._actor import Actor
2025-01-10T18:47:43.483Z File "/usr/local/lib/python3.12/site-packages/apify/_actor.py", line 16, in <module>
2025-01-10T18:47:43.483Z from crawlee import service_container
2025-01-10T18:47:43.484Z ImportError: cannot import name 'service_container' from 'crawlee' (/usr/local/lib/python3.12/site-packages/crawlee/__init__.py)

I did not change anything in my Dockerfile:
FROM apify/actor-python:3.12
COPY requirements.txt ./
...


In requirements.txt I install the following module:
apify ~= 2.0.0

Anyone else facing the same issue?
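For what it's worth, the traceback suggests a version mismatch: apify 2.0.0 imports service_container from crawlee, and a newer crawlee release appears to have removed it. Assuming that's the cause, pinning a compatible crawlee alongside the SDK in requirements.txt (the version bound below is an assumption, not official guidance), or upgrading to the latest apify, should fix the build:
Plain Text
apify ~= 2.0.0
crawlee < 0.5.0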
3 comments
I want to use the AdaptivePlaywrightCrawler, but it seems like it wants to crawl the entire web.
Here is my code.

const crawler = new AdaptivePlaywrightCrawler({
  renderingTypeDetectionRatio: 0.1,
  maxRequestsPerCrawl: 50,
  async requestHandler({ request, enqueueLinks, parseWithCheerio, querySelector, log, urls }) {
    console.log(request.url, request.uniqueKey);
    await enqueueLinks();
  },
});
crawler.run(['https://crawlee.dev']);
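One hedged guess: enqueueLinks() with no options may follow more than intended; constraining it with a strategy or glob patterns (the values below are placeholders) keeps the crawl on the target site:
Plain Text
await enqueueLinks({
  strategy: 'same-hostname',          // stay on the start URL's hostname
  globs: ['https://crawlee.dev/**'],  // or allow-list URL patterns explicitly
});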
3 comments
Hi everyone! I am creating a crawler using Crawlee for Python. I noticed that the Parsel crawler makes requests at a much higher frequency than the BeautifulSoup crawler. Is there a way to make the Parsel crawler slower, so we can better avoid getting blocked? Thanks!
5 comments
I'm attempting to validate that the proxy works and am not having luck; should I expect the following to work?

Plain Text
~ λ curl --proxy http://proxy.apify.com:8000  -U 'groups-RESIDENTIAL,country-US:apify_proxy_redacted' -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"  https://httpbin.org/ip
curl: (56) CONNECT tunnel failed, response 403
3 comments
my account: https://apify.com/wudizhangzhi
actors: https://console.apify.com/actors/KAkfFaz8JVdvOQQ5F/source

Error: Operation failed! (You currently don’t have the necessary permissions to publish an Actor. This is expected behavior. Please contact support for assistance in resolving the issue.)

@Saurav Jain
2 comments
Are there actually any resources on building a scraper with Crawlee other than the ones in the docs?
Where do I set all the browser context options, for example?

Plain Text
import * as playwright from "playwright";

const launchPlaywright = async () => {
  const browser = await playwright["chromium"].launch({
    headless: true,
    args: ["--disable-blink-features=AutomationControlled"],
  });

  const context = await browser.newContext({
    viewport: { width: 1280, height: 720 },
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    geolocation: { longitude: 7.8421, latitude: 47.9978 },
    permissions: ["geolocation"],
    locale: "en-US",
    storageState: "playwright/auth/user.json",
  });
  return await context.newPage();
};
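In Crawlee itself, launch options go into launchContext and per-page tweaks can be applied in preNavigationHooks. A minimal sketch (context options such as geolocation or storageState would need browser pool hooks, which are not shown here):
Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // Passed to playwright.chromium.launch() under the hood.
  launchContext: {
    launchOptions: {
      headless: true,
      args: ['--disable-blink-features=AutomationControlled'],
    },
  },
  // Runs before every navigation - a place for per-page adjustments.
  preNavigationHooks: [
    async ({ page }) => {
      await page.setViewportSize({ width: 1280, height: 720 });
    },
  ],
  requestHandler: async ({ page, request }) => {
    // scraping logic goes here
  },
});

await crawler.run(['https://example.com']);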
2 comments
Guys, I'm new to Apify and I want to publish my newly built job scraper, but when setting up monetization there are two options, business ID and personal ID - where do I get these?
1 comment
Hi everyone,
I recently ran a Google Maps scraper (https://apify.com/compass/crawler-google-places) to collect place data, and I've discovered that there are many more places available than what was initially collected in my first run.
Current Situation:
  • Successfully completed an initial scrape
  • Have collected data for X places
  • Discovered there are significantly more places available
  • Already have a dataset from the first run
Questions:
Is it possible to increase the place limit on my existing run configuration?
If I need to create a new run, what's the best way to:
  • Import/merge my existing scraped data
  • Avoid duplicating places already collected
  • Continue from where the previous run stopped
Any guidance on the most efficient approach would be greatly appreciated.
Thanks in advance!
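On the merging and deduplication part, one approach is to pull both datasets with apify-client and keep only the places that are not in the first run. A sketch assuming the items carry a placeId field (the token and dataset IDs are placeholders):
Plain Text
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); // placeholder

// Dataset from the first run and dataset from the new, larger run (placeholders).
const { items: oldItems } = await client.dataset('OLD_DATASET_ID').listItems();
const { items: newItems } = await client.dataset('NEW_DATASET_ID').listItems();

// Keep only places that were not collected in the first run.
const seen = new Set(oldItems.map((item) => item.placeId));
const freshPlaces = newItems.filter((item) => !seen.has(item.placeId));

console.log(`${freshPlaces.length} new places not present in the first run.`);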
4 comments
Hello, I'm seeing (https://playwright.dev/python/docs/library#incompatible-with-selectoreventloop-of-asyncio-on-windows) that there is an incompatibility between Playwright and Windows' SelectorEventLoop, which Crawlee seems to require. Can you confirm whether it is possible to use a PlaywrightCrawlingContext in a Windows environment? I'm running into an asyncio NotImplementedError when trying to run the crawler, which suggests to me that there might be an issue. Thanks for the help.
1 comment
How can I automatically export the data acquired by the Instagram Profile Scraper (Apify) to Google Sheets every time the Actor runs, while only importing specific fields (such as Full Name and Biography)? I just started my journey in coding and it's incredible! Thank you so much in advance.
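One low-code option, assuming the scraper's output uses fields named fullName and biography (check the actual field names in your dataset), is to point Google Sheets' IMPORTDATA at the Actor's last-run dataset export URL with a fields filter; Google Sheets refreshes the import periodically:
Plain Text
=IMPORTDATA("https://api.apify.com/v2/acts/YOUR~ACTOR-NAME/runs/last/dataset/items?format=csv&fields=fullName,biography&token=YOUR_API_TOKEN")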
1 comment
What is the best way to compile the datasets from multiple Actors and their individual tasks into a single dataset? Each task has its own set of runs, each producing its own dataset.

I found this very confusing; Zapier etc. don't really process the data the way it's needed - it requires additional transformation. I thought the platform would have an option for this, since under Storage there is an option to create your own dataset, but interestingly there seems to be no way to internally link any existing datasets to it... Could you explain and advise? Thanks
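If there really is no way to link existing datasets in the UI, merging can be scripted with apify-client: create one named dataset and push the items from each run's dataset into it. A rough sketch (token, dataset name, and source IDs are placeholders):
Plain Text
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); // placeholder

// Open (or create) the single named dataset that will hold everything.
const { id: mergedId } = await client.datasets().getOrCreate('merged-results');

// Dataset IDs produced by the individual task runs (placeholders).
const sourceDatasetIds = ['DATASET_ID_1', 'DATASET_ID_2'];

for (const datasetId of sourceDatasetIds) {
  const { items } = await client.dataset(datasetId).listItems();
  await client.dataset(mergedId).pushItems(items);
}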
8 comments
I am building a LinkedIn email scraper Actor and am running into an issue; could anyone help me with this? Scraped data:
{
  name: 'Join LinkedIn',
  title: 'Not found',
  email: 'Not found',
  location: 'Not found'
}
INFO PuppeteerCrawler: All requests from the queue have been processed, the crawler will shut down.
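"Join LinkedIn" as the scraped name usually means the crawler was served LinkedIn's logged-out wall instead of the profile. A hedged sketch of detecting that and retiring the session so the request is retried (the selector is a placeholder); note that most LinkedIn profile data sits behind a login, so authenticated cookies may be needed regardless:
Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  useSessionPool: true,
  requestHandler: async ({ page, session }) => {
    const name = await page.$eval('h1', (el) => el.textContent.trim());
    if (name === 'Join LinkedIn') {
      // We got the logged-out wall, not the profile:
      // drop this session and let the request be retried.
      session.retire();
      throw new Error('Hit the LinkedIn login wall, retrying with a new session.');
    }
    // ... extract title, email and location here ...
  },
});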
3 comments
I need to scrape content from multiple pages on a social network (x.com) that requires auth. Where should I implement the login mechanism so that it happens before the URLs are followed and is persisted for as long as it is valid?
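One common pattern is to log in once before the crawl starts, capture the cookies, and inject them via a preNavigationHook so every request runs authenticated. A rough sketch (the login steps themselves and the start URL are placeholders):
Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { chromium } from 'playwright';

// Log in once up front and capture the authenticated cookies.
const browser = await chromium.launch();
const loginPage = await browser.newPage();
await loginPage.goto('https://x.com/login');
// ... fill in credentials and submit here ...
const cookies = await loginPage.context().cookies();
await browser.close();

const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page }) => {
      // Reuse the authenticated cookies for every crawled page.
      await page.context().addCookies(cookies);
    },
  ],
  requestHandler: async ({ page, request }) => {
    // scrape the authenticated pages here
  },
});

await crawler.run(['https://x.com/some-profile']);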
1 comment
Hello 👋

Apify already takes 20% of the revenue earned, so it doesn't make much sense for a developer to bear 100% of the bank transfer fees. These fees should be split between Apify and the developer (SHA/BEN/OUR), if not covered entirely by Apify, given they already take a 20% cut.

FYI: This is how several other platforms operate to incentivize developers.

For example, if someone earns $125, Apify takes 20%, leaving $100. After bank transfer fees (around $40-50), the developer ends up with only about $55 (before local taxes, so you can imagine...).

This setup is definitely not encouraging for developers.
12 comments
When running apify create and installing a template, I got the following error:
Error: connect ETIMEDOUT 20.205.243.166:443
6 comments
Hey everyone. :perfecto: :crawlee:
Currently, I am working on scraping a website where new content (pages) is added frequently (say, a blog). When I run my scraper it scrapes all pages successfully, but when I run it again tomorrow (after new pages have been added to the site), it starts scraping everything from scratch.

I would be thankful if you could give me some advice, ideas, solutions, or examples out there of efficiently re-scraping without crawling the entire site again.

Thank you in advance. 🙏🏻
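One approach that works well in Crawlee is a named request queue: named storages persist between runs, so URLs that were already handled are skipped automatically, and only the listing page needs a fresh uniqueKey so it gets re-crawled each run. A sketch with placeholder URLs:
Plain Text
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named queues are persisted between runs (they are not purged on start),
// so previously handled article URLs stay deduplicated.
const requestQueue = await RequestQueue.open('blog-incremental');

const crawler = new CheerioCrawler({
  requestQueue,
  requestHandler: async ({ request, enqueueLinks, pushData, $ }) => {
    await enqueueLinks({ globs: ['https://example.com/blog/**'] });
    await pushData({ url: request.loadedUrl, title: $('title').text() });
  },
});

await crawler.run([{
  url: 'https://example.com/blog',
  // A per-run uniqueKey makes the listing page itself re-crawlable,
  // while already-scraped article pages are skipped by the queue.
  uniqueKey: `https://example.com/blog#${new Date().toISOString().slice(0, 10)}`,
}]);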
3 comments
Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session continuity requirements so I need to ensure that for main page A, all requests to sub pages linked from there, A-1, A-2, A-3, etc. (as well as A-1-1, A-1-2, etc.,) are made within the same session as the original request.

Thanks as always.
17 comments
Hi there. Whenever I try to use residential proxies ('HTTP://groups-RESIDENTIAL:/...') I run into this error:

httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)

The 'auto' group seems to work fine. Can anyone tell me what I'm doing wrong here?

Thanks!
7 comments
Hani · instagram

Hi, how can I connect Apify with Firebase and Instagram to extract event info from Instagram photo posts with OCR and publish it automatically to my website in Firebase?
1 comment
I'm trying to run Crawlee in production and want to scale to a cluster of worker nodes that are ready to crawl pages on request. How can I achieve this?

The RequestQueue basically writes requests to files and doesn't use any queueing system. I couldn't find any docs on how I could use a Redis queue or something similar.
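Crawlee doesn't ship a Redis-backed queue out of the box; the local RequestQueue is indeed file-based. One way to share work across several worker processes, assuming the Apify platform is an option, is to have every worker open the same named request queue in the cloud (forceCloud below is an option of the Apify SDK's openRequestQueue, used here on that assumption):
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Every worker process opens the *same* named queue on the Apify platform,
// which then acts as the shared, distributed backlog instead of local files.
const requestQueue = await Actor.openRequestQueue('shared-crawl-queue', { forceCloud: true });

const crawler = new CheerioCrawler({
  requestQueue,
  requestHandler: async ({ request, pushData, $ }) => {
    await pushData({ url: request.loadedUrl, title: $('title').text() });
  },
});

// A separate producer process can keep adding requests to the same queue.
await crawler.run();
await Actor.exit();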
7 comments