Apify and Crawlee Official Forum

Where can we see Actor reviews?
I get the following error: The session has been lost.
Hi, I have a (noob) question. I want to crawl many different URLs from different pages, so they need their own crawler implementations (some can share one). How can I achieve this in Crawlee so that they run in parallel and can all be executed with a single command, or also in isolation?

Input, example repos, etc. would be highly appreciated.
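A minimal sketch of one approach, assuming one crawler per site and a named RequestQueue for each so they don't collide on the default storage (all names and URLs below are placeholders):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Each crawler gets its own named queue so parallel runs don't share state.
const queueA = await RequestQueue.open('site-a');
const queueB = await RequestQueue.open('site-b');

const crawlerA = new CheerioCrawler({
    requestQueue: queueA,
    requestHandler: async ({ $ }) => { /* site-A-specific parsing */ },
});

const crawlerB = new CheerioCrawler({
    requestQueue: queueB,
    requestHandler: async ({ $ }) => { /* site-B-specific parsing */ },
});

// One command runs everything in parallel; call run() on a single crawler for isolation.
await Promise.all([
    crawlerA.run(['https://site-a.example/start']),
    crawlerB.run(['https://site-b.example/start']),
]);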
Just like the code below in crawlee-js:

launchContext: {
    // Native Puppeteer options
    launchOptions: {
        args: ['--disable-features=TrackingProtection3pcd', '--disable-web-security', '--no-sandbox'],
    },
},
Hi, I'm trying to plan out some automations using quite a few scrapers. If I have 10 different Actors I want to run on the first of every month, each scraping the entire previous month as a recurring task, how would I do it? My biggest question is how to configure the dates to be relative. The rest I think I can figure out using integrations with Make. TIA!
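A sketch of the relative-date part, assuming you trigger the runs yourself via apify-client; the dateFrom/dateTo input fields are hypothetical and must match whatever inputs the specific Actor actually accepts:

import { ApifyClient } from 'apify-client';

// Compute the previous calendar month at run time, so a job scheduled for
// the 1st of each month always covers the month before.
const now = new Date();
const firstOfLastMonth = new Date(now.getFullYear(), now.getMonth() - 1, 1);
const lastOfLastMonth = new Date(now.getFullYear(), now.getMonth(), 0);

// Format as YYYY-MM-DD in local time.
const toIso = (d) =>
    `${d.getFullYear()}-${String(d.getMonth() + 1).padStart(2, '0')}-${String(d.getDate()).padStart(2, '0')}`;

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
await client.task('username~my-monthly-task').call({
    dateFrom: toIso(firstOfLastMonth), // hypothetical input field
    dateTo: toIso(lastOfLastMonth),    // hypothetical input field
});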
I asked the AI bot to transfer me to a human, and it's been a week, and no human has responded to me yet.

Since the AI is somewhat useful, I want to return to chatting with the AI, but I cannot create new chats, end chats, or return to the AI bot. I'm literally stuck here in the support chat.
Hello 👋

I reached out to Apify on LinkedIn about something, and they instructed me to contact support via email. I did, but I haven't heard back from them (2 days now).

Here's the email I used to contact support: aziz50607080@gmail.com

Thank you 🙏
I keep getting this error message:
"The Actor hit an OOM (out of memory) condition. You can resurrect it with more memory to continue where you left off."

It keeps resurrecting from failed status and then running into the same issue. However, out of 32 GB of memory it only uses 1-2 GB before the error appears.

It's failing no matter what I try.

Any help would be awesome.


I have a list of business industries that I need to scrape from Google Maps, based in Ontario. One thing to note is that I'm targeting small/medium-sized businesses only, and NO franchise businesses. I'm new to Apify, so I'm not sure if there's a way to filter out franchise businesses from a scrape.
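If there is no built-in franchise filter, one hedged post-processing sketch is to screen the scraped items against a hand-maintained list of franchise brand names (the list, the datasetItems variable, and the title field are all assumptions about the scraper's output, not its documented schema):

// Hypothetical list of franchise brands to exclude.
const FRANCHISE_NAMES = ['mcdonald', 'subway', 'tim hortons'];

const isFranchise = (place) =>
    FRANCHISE_NAMES.some((name) => (place.title ?? '').toLowerCase().includes(name));

// datasetItems: the items downloaded from the scraper's dataset.
const independents = datasetItems.filter((place) => !isFranchise(place));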
I want to adapt Crawlee's log format. From my research, it seems I need to use the CrawleeLogFormatter API. However, I couldn't find any usage examples for this API. Could you explain how to use it?
Hi all,

I have a pre-navigation hook that listens for requests and, if they return images, saves them to the cloud:

Plain Text
  return async (context) => {
    if (context.request.label == requestLabels.article) {
      context.page.on('request', async (req) => {
        if (req.resourceType() == 'image') {
          const response = await req.response();
          // extra processing and save to cloud
        }
      });
    }
  };

This works in 95% of cases; however, there are some where the main request handler completes before each req.response() can be collected. This causes an error, since the browser context closes with the main request handler. My question is: how can I best get around this? One idea I had was to put a promise in the page userData that resolves when the page has no outstanding images. After reading the docs, however, I'm not sure that is possible, since userData needs to be serialisable. Has anyone else encountered this type of issue, and how did you get around it?
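A sketch of one workaround: since userData must stay JSON-serializable, keep the pending promises outside the request, for example in a module-level Map keyed by the request's uniqueKey, and await them at the end of the request handler before the page closes (all names here are placeholders):

const pendingImages = new Map();

// Pre-navigation hook: collect one promise per image response.
const preNavHook = async (context) => {
    if (context.request.label !== requestLabels.article) return;
    const pending = [];
    pendingImages.set(context.request.uniqueKey, pending);
    context.page.on('request', (req) => {
        if (req.resourceType() !== 'image') return;
        pending.push((async () => {
            const response = await req.response();
            // extra processing and save to cloud
        })().catch(() => {})); // one failed image shouldn't fail the page
    });
};

// At the end of the requestHandler, before returning:
// await Promise.all(pendingImages.get(request.uniqueKey) ?? []);
// pendingImages.delete(request.uniqueKey);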
Hello,
We have had a developer account running for more than a year, with several Actors/clients. Since Monday, we can't access any page listing payouts/statistics about our Actors' usage. Is this normal? What can we do? Thanks
I'm looking at the generative-bayesian-network package, part of the fingerprint suite: https://www.npmjs.com/package/generative-bayesian-network
However, I can't find any documentation whatsoever on this package. It looks interesting and I want to figure out how to use it. Are there docs anywhere for this?
Hello Apify Support Team,

I hope this message finds you well. I would like to inquire if it's possible to use Apify to scrape data from the following page: https://fr.iherb.com/specials. Specifically, my objective is to:

Scrape all listed products on the specials page.
Click on each product link.
Extract detailed information from each individual product page.
Could you please let me know if this is feasible with Apify? Additionally, any guidance on how to set up this specific flow, or if there are existing actors suitable for this task, would be highly appreciated.

Thank you in advance for your assistance.
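That flow maps naturally onto a single crawler with two labels; a minimal sketch, assuming the product links can be matched by a CSS selector (the selector and the extracted field below are guesses, not iHerb's actual markup):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks, pushData }) => {
        if (request.label === 'PRODUCT') {
            // Step 3: extract details from an individual product page.
            await pushData({
                url: request.url,
                title: await page.locator('h1').first().textContent(),
            });
            return;
        }
        // Steps 1-2: enqueue every product link found on the specials page.
        await enqueueLinks({
            selector: 'a[href*="/pr/"]', // hypothetical product-link pattern
            label: 'PRODUCT',
        });
    },
});

await crawler.run(['https://fr.iherb.com/specials']);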
Hi, I'm using the Instagram scraper for the first time, and I'm confused as to what childPosts are. I understand that in a carousel post there are multiple media files, that those are children, and that they can have their own captions, but what do childPosts/2/likesCount, childPosts/1/firstComment, and childPosts/2/ownerId refer to? They would have the same number of likes, number of comments, and owner as the parent, right?
Does Crawlee support SOCKS5 proxies with authentication?

I am building a crawler based on Crawlee with Playwright, and it needs to use SOCKS5 proxies with authentication, but I can't find anything about that in the Crawlee documentation.

Playwright supports SOCKS5 proxies with authentication, but I don't know how to use them in Crawlee:
https://playwright.dev/docs/api/class-browser

Could somebody tell me? Thanks
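A sketch of the usual wiring, assuming the proxy accepts a socks5:// URL with inline credentials (the URL is a placeholder; whether the underlying browser honours SOCKS5 authentication depends on the browser and Crawlee version, so verify against a known-good proxy first):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Hypothetical proxy; credentials go inline in the URL.
    proxyUrls: ['socks5://user:pass@proxy.example.com:1080'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});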
Hi,

when I use this URL https://api.apify.com/v2/actor-tasks/<TASK_ID>/runs?token=<YOUR_API_TOKEN> with the credentials from my Apify dashboard, I get the following error:
{
    "error": {
        "type": "page-not-found",
        "message": "We have bad news: there is no API endpoint at this URL. Did you specify it correctly?"
    }
}
I did everything as described in https://docs.apify.com/api/v2#section/Authentication

Can you help me and tell me what I am doing wrong? #apify-platform
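For reference, a hedged sketch of a request that should resolve, assuming the task is referenced either by its alphanumeric ID or as "username~task-name"; a page-not-found error usually means the <TASK_ID> segment matches neither form:

// Node 18+: fetch is built in; the task reference below is a placeholder.
const res = await fetch(
    `https://api.apify.com/v2/actor-tasks/username~my-task/runs?token=${process.env.APIFY_TOKEN}`,
);
console.log(await res.json());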
Hi people, I am having this problem with Docker on the platform: it runs recursively and fails. I can't find an error, and every single file of the project seems to be OK. Any idea?

  • Pulling Docker image of build XXXXX from repository
  • Creating Docker container
  • Starting Docker container
  • Pulling Docker image of build XXXXX from repository
  • Creating Docker container
  • Starting Docker container
  • Pulling Docker image of build XXXXX from repository
  • Creating Docker container
  • Starting Docker container
  • ERROR: We've encountered an unexpected system error. If the issue persists, please contact support.
Hi guys! I'm new here, so my question may be quite stupid.

Quick context: I need to modify an "Instagram Scraper" Actor to use my specific Instagram accounts for parsing. I've heard that it's possible to duplicate Actors on Apify to change the code.

Please tell me whether it's possible to duplicate Actors on Apify (or somehow get access to the source code of one to modify), and why this "Duplicate" button is blocked.

Thanks
For some reason, when I push data to user_data (str type in this case) and then read it in another handler, I get different values.
In this case the error is on tab_number. When I push tab_number to user_data, the values seem to be good (ranging from 1 to 100), but when I read tab_number in tab_handler I get a different value.
For example, for values from 1 to 19 I get tab_number 1 instead of the correct one: tab_number pushed to user_data: "19", tab_number read from user_data: "1".
I cannot find the error. Here is the code:

@router.handler('tabs')
async def tabs_handler(context: PlaywrightCrawlingContext) -> None:
    tab_id = context.request.user_data["tab_id"]

    await context.page.wait_for_selector('#tabsver > ul > li > a')
    tabs = await context.page.locator('#tabsver > ul > li > a').all()

    for tab in tabs:
        tab_name = await tab.text_content()
        tab_number = tab_name.replace("Tab number ", "").strip()
        if tab_name:
            await context.enqueue_links(
                selector=f'#tabsver > ul > li > a:has-text("{tab_name}")',
                label="tab",
                user_data={"tab_id": tab_id, "tab_number": tab_number},
            )


@router.handler('tab')
async def tab_handler(context: PlaywrightCrawlingContext) -> None:
    tab_id = context.request.user_data["tab_id"]
    tab_number = context.request.user_data["tab_number"]
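A hedged diagnosis: Playwright's :has-text() matches by substring, so the selector for "Tab number 1" also matches "Tab number 10" through "Tab number 19". Those links get enqueued first with tab_number "1", and later enqueue_links calls for the same URLs are deduplicated, so the correct user_data never replaces the first. If that is the cause, an exact-match selector avoids the overlap:

# Sketch: :text-is() matches the element's text exactly, so
# "Tab number 1" no longer also captures "Tab number 10".."Tab number 19".
await context.enqueue_links(
    selector=f'#tabsver > ul > li > a:text-is("{tab_name}")',
    label="tab",
    user_data={"tab_id": tab_id, "tab_number": tab_number},
)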
Hi, I'm using the HttpCrawler to scrape a static list of URLs. However, when I get a 403 response as a result of a Cloudflare challenge, the request is not retried with retryOnBlocked: true. If I remove retryOnBlocked, I see my errorHandler getting invoked and the request is retried. Do I understand retryOnBlocked wrong?
I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works. Everything that worked with Crawlee works with Hero.

Why I migrated: I could not handle the over-engineered Crawlee API any more (and the bugs related to it). It was just too many APIs (different APIs!) for my simple case. Hero's API is about 5 times simpler.

In both cases (Crawlee and Hero) I am using only the scraping library: no additional (cloud) services, no Docker containers.

I am not manipulating the DOM, not doing retries, not doing any complex things in TypeScript. I am just accessing the URL (in some cases URL1 and then URL2 to pretend I'm a normal user), grabbing the rendered HTML, and that's it. All the HTML manipulation (extracting the data from the HTML) is done in a completely different program (written in a different programming language, not TypeScript). The retry logic, again, is implemented in that other program.

I use the beanstalkd message queue (see https://github.com/beanstalkd/beanstalkd/) between that "different program" and the scraper, so I just replaced the Crawlee-based scraper with a Hero-based scraper without touching other parts of the system. Usage of beanstalkd has already been discussed in this forum: use search to find those discussions.

Goodbye, Crawlee.
My crawler with PlaywrightCrawler works just fine, but I have an issue when adding a proxy!
This is the code:
Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

const startUrls = ['http://quotes.toscrape.com/js/'];

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, parseWithCheerio }) => {
        await page.waitForSelector("div.quote span.text", { "timeout": 60000 });
        const $ = await parseWithCheerio()

        const quotes = $("div.quote span.text")
        quotes.each((_, element) => { console.log($(element).text()) });
    },
});

await crawler.run(startUrls);


However, when I add my proxy port I always get timeout errors!

Plain Text
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ["url-to-proxy-port-im-using"]
})

// and then add it to the crawler
const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  ...


Also, the same code with the proxy configuration works with CheerioCrawler!
Can anyone help with this issue?
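A quick isolation test, as a sketch: drive Playwright directly with the same proxy to see whether it is the browser, rather than Crawlee, that times out (the server and credentials below are placeholders; browsers want the proxy scheme, host and port in server, with credentials passed separately):

import { chromium } from 'playwright';

const browser = await chromium.launch({
    proxy: {
        server: 'http://proxy.example.com:8000', // hypothetical
        username: 'user',
        password: 'pass',
    },
});
const page = await browser.newPage();
await page.goto('http://quotes.toscrape.com/js/', { timeout: 60000 });
console.log(await page.title());
await browser.close();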
On the integration editor menu, "Edit Integration" points to a wrong URL (a non-existent Actor).
When there is an error in the response handler, it does not show an error; it fails unnoticed, and Crawlee will retry 9 more times.

To illustrate, take the following syntax error (a missing closing bracket after "main"):
element = await context.page.query_selector('div[id="main"')

It should be:
element = await context.page.query_selector('div[id="main"]')

However, instead of failing, Crawlee proceeds and keeps trying to query the page, and it does not show any message about where it failed.
Is there any way to troubleshoot these kinds of issues?
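One troubleshooting sketch, assuming the handler shape from Crawlee's Python router (the 'detail' label is a placeholder): wrap the handler body so selector errors surface in the log immediately instead of disappearing into silent retries.

# Log the exception yourself, then re-raise so Crawlee still counts the retry.
@router.handler('detail')
async def detail_handler(context) -> None:
    try:
        element = await context.page.query_selector('div[id="main"]')
        # ... rest of the handler ...
    except Exception:
        context.log.exception('response handler failed')
        raise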
Good evening! I'm having problems with my new account. I signed up using my Google account, and Apify simply doesn't work. This error occurs for everything I try to do. Has anyone else experienced this?