Apify and Crawlee Official Forum

Hi guys! I'm new here, so my question may be quite stupid

Quick context - I need to modify an "Instagram Scraper" actor to use my specific Instagram accounts for parsing. I've heard that it's possible to duplicate actors on Apify to change the code.

Please tell me if it's possible to duplicate actors on Apify (or somehow get access to the source code of one to modify) and why this "Duplicate" button is blocked.

Thanks
1 comment
For some reason, when I push data to user_data (str type in this case) and then read those user_data variables in another handler, I get different values.
In this case the problem is with tab_number. When I push tab_number to user_data, the values seem to be good (ranging from 1 to 100). But when I read tab_number in tab_handler, I get a different value.
For example, for values from 1 to 19 I get tab_number 1 instead of the correct one: tab_number pushed to user_data: "19", tab_number read from user_data: "1".
I cannot find the error. Here is the code:

@router.handler('tabs')
async def tabs_handler(context: PlaywrightCrawlingContext) -> None:
    tab_id = context.request.user_data["tab_id"]

    await context.page.wait_for_selector('#tabsver > ul > li > a')

    tabs = await context.page.locator('#tabsver > ul > li > a').all()

    for tab in tabs:
        tab_name = await tab.text_content()
        tab_number = tab_name.replace("Tab number ", "").strip()
        if tab_name:
            await context.enqueue_links(
                selector=f'#tabsver > ul > li > a:has-text("{tab_name}")',
                label="tab",
                user_data={"tab_id": tab_id, "tab_number": tab_number},
            )


@router.handler('tab')
async def tab_handler(context: PlaywrightCrawlingContext) -> None:
    tab_id = context.request.user_data["tab_id"]
    tab_number = context.request.user_data["tab_number"]

Hi all, in order to promote my Actor, I started a small Google Ads campaign. However, to track the visitors on the Actor page https://apify.com/invideoiq/video-transcript-scraper, I need to add a <script> HTML tag. But I don't think I can do that in the README or the Actor settings. Maybe in the description? I appreciate your help.
Hi, I'm using the HttpCrawler to scrape a static list of URLs. However, when I get a 403 response as a result of a Cloudflare challenge, the request is not retried with retryOnBlocked: true. If I remove retryOnBlocked, I see my errorHandler getting invoked and the request is retried. Do I understand retryOnBlocked wrong?
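For reference, here is a minimal sketch of the second setup described above, without retryOnBlocked: the 403 surfaces as a failed request, errorHandler retires the session, and Crawlee's normal retry logic re-enqueues it. The start URL and retry count are placeholders.

Plain Text
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    // Without retryOnBlocked, a 403 surfaces as a failed request that is retried
    // up to maxRequestRetries times; errorHandler runs before each retry.
    maxRequestRetries: 5,
    errorHandler: async ({ request, session, log }, error) => {
        log.warning(`${request.url} failed (${error.message}); retiring session before retry`);
        session?.retire(); // rotate to a fresh session (and proxy, if one is configured)
    },
    requestHandler: async ({ request, body, log }) => {
        log.info(`Scraped ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://example.com/']); // placeholder for the static URL list
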
I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works. Everything that worked with Crawlee - works with Hero.

Why I migrated: I could not handle the over-engineered Crawlee API any more (and the bugs related to it).
It was just too many APIs (different APIs!) for my simple case.
Hero's API is about 5 times simpler.

In both cases (Crawlee and Hero) I am using only the scraping library, no additional (cloud) services, no Docker containers.

I am not manipulating the DOM, not doing retries, not doing any complex things in TypeScript. I am just accessing the URL (in some cases URL1 and then URL2 to pretend I'm a normal user), grabbing the rendered HTML, and that's it. All the HTML manipulation (extracting the data from the HTML) is done in a completely different program (written in a different programming language, not in TypeScript).
Retry logic: again, this is implemented in that different program.

I use beanstalkd (see https://github.com/beanstalkd/beanstalkd/) as the message queue between that "different program" and the scraper. So I just replaced the Crawlee-based scraper with a Hero-based scraper without touching other parts of the system. Usage of beanstalkd was already discussed in this forum: use search to find those discussions.

Goodbye Crawlee.
My crawler with PlaywrightCrawler works just fine, but I have an issue when adding a proxy!
This is the code:

Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from "crawlee";

const startUrls = ['http://quotes.toscrape.com/js/'];

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, parseWithCheerio }) => {
        await page.waitForSelector("div.quote span.text", { "timeout": 60000 });
        const $ = await parseWithCheerio()

        const quotes = $("div.quote span.text")
        quotes.each((_, element) => { console.log($(element).text()) });
    },
});

await crawler.run(startUrls);


However, when I add my proxy port, I always get timeout errors!

Plain Text
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ["url-to-proxy-port-im-using"]
})

// and then add it to the crawler
const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  ...


Also, the same code with the proxy configuration works with CheerioCrawler!
Can anyone help with this issue?
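A minimal sketch for narrowing this down (the proxy URL is a placeholder): log which proxy was actually used via proxyInfo and raise the navigation timeout, since a browser loading pages through a slow proxy often needs more time than an HTTP-only crawler.

Plain Text
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://user:pass@proxy.example.com:8000'], // placeholder proxy URL
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    navigationTimeoutSecs: 120, // give the proxied browser more time to load the page
    requestHandler: async ({ page, proxyInfo, log }) => {
        log.info(`Loaded ${page.url()} via proxy ${proxyInfo?.url}`);
        await page.waitForSelector('div.quote span.text', { timeout: 60000 });
    },
});

await crawler.run(['http://quotes.toscrape.com/js/']);
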
1 comment
In the integration editor menu, "Edit Integration" points to a wrong URL (a non-existent Actor).
6 comments
When there is an error in the response handler, it does not show an error; it fails unnoticed and Crawlee will retry 9 more times.

To illustrate, take the following syntax error:
element = await context.page.query_selector('div[id="main"')

It should be with a closing bracket after "main":
element = await context.page.query_selector('div[id="main"]')

However, instead of failing, Crawlee proceeds and keeps trying to query the page, and it does not show any message about where it failed.
Is there any way to troubleshoot these kind of issues?
Good evening! I'm having problems with my new account. I signed up using my Google account, and Apify simply doesn't work. This error occurs for everything I try to do. Has anyone else experienced this?
8 comments
It refused with an error. Actor ID: rzyXiN9Abtpy4tkXs
8 comments
I am trying to scrape an e-commerce site and would like to scrape only 20 items. How can I stop the process once this many items have been scraped?
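A minimal sketch, assuming the JavaScript version of Crawlee (the selector and start URL are placeholders): count the items you push and abort the autoscaled pool once the limit is reached, since maxRequestsPerCrawl limits requests rather than scraped items.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const MAX_ITEMS = 20;
let itemCount = 0;

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, pushData, log }) => {
        if (itemCount >= MAX_ITEMS) return; // enough items collected, skip the rest

        // '.product-title' is a placeholder selector for a single product's data.
        const title = await page.locator('.product-title').first().textContent();
        await pushData({ url: request.url, title });
        itemCount += 1;

        if (itemCount >= MAX_ITEMS) {
            log.info(`Collected ${itemCount} items, aborting the crawl.`);
            await crawler.autoscaledPool?.abort(); // stop processing further requests
        }
    },
});

await crawler.run(['https://example-shop.com/products']); // placeholder start URL
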
2 comments
I'm writing my first Actor using Crawlee and Playwright crawler to scrape website https://sreality.cz.

I wrote a crawler using as much as possible from the examples in the documentation. It works like this:

  1. Start on the first page of search, for example this one.
  2. Skip ad dialog, if it shows.
  3. Find all links to next pages and add them to the queue with enqueueLinks().
  4. Find all links to individual items (apartments, houses, whatever) and add them to the queue with enqueueLinks().
  5. If next page to process is an item page, scrape the data and save with pushData(). Otherwise, if it's another page, repeat from 3.
In theory, this is all I need to scrape the entire search result list. However, what I experience is that it enqueues all the links (around 185) but only processes around 30 of them before finishing. Very strange.

I tried to set maxRequestsPerCrawl: 1000, didn't help.

Maybe I'm missing something but I don't see why it would just stop after around 30 pages. Is there another config somewhere that controls this?

Even more strange, it then logs the final statistics, where it says something like "requestsFinished":119. A number that doesn't make sense at all: less than the number of actually enqueued links, but a lot more than the number of actually processed pages.
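A small debugging sketch, assuming the default request queue (the start URL is a placeholder): print the queue info after the run and compare its counts with the crawler's final statistics, to see how many requests were actually enqueued, handled, and left pending.

Plain Text
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

const crawler = new PlaywrightCrawler({
    requestQueue,
    maxRequestsPerCrawl: 1000,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url} (label: ${request.label})`);
        // ... pagination and item handling as described above
    },
});

await crawler.run(['https://www.sreality.cz/hledani/prodej/byty']); // placeholder start URL

// Compare these counts with the "requestsFinished" number from the final log line.
const info = await requestQueue.getInfo();
console.log(info); // totalRequestCount, handledRequestCount, pendingRequestCount, ...
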
4 comments
When using crawlee-js it's working fine, but when using Python, a 429 response is not getting retried. Is there anything I am missing?
I am using BeautifulSoupCrawler. Please help.
I've emailed Apify Support about my issue (I sent it to hello@apify.com), but they haven't responded or provided any comments.
1 comment
Hi there. I want to add a secret using the Apify CLI but got an error saying "Error: Nonexistent flag". Are there special characters that need to be escaped? I'm using this command: apify secrets:add aPrivateKey '-----BEGIN PRIVATE KEY-----_base64content_-----END PRIVATE KEY-----'. Thanks for the help.
3 comments
Hi All,

I'm running a Playwright crawler and am running into a bit of an issue with crawler stability. Have a look at these two log messages:


Plain Text
{
  "service": "AutoscaledPool",
  "time": "2024-10-30T16:42:17.049Z",
  "id": "cae4950d568a4b8bac375ffa5a40333c",
  "jobId": "9afee408-42bf-4194-b17c-9864db707e5c",
  "currentConcurrency": "4",
  "desiredConcurrency": "5",
  "systemStatus": "{\"isSystemIdle\":true,\"memInfo\":{\"isOverloaded\":false,\"limitRatio\":0.2,\"actualRatio\":0},\"eventLoopInfo\":{\"isOverloaded\":false,\"limitRatio\":0.6,\"actualRatio\":0},\"cpuInfo\":{\"isOverloaded\":false,\"limitRatio\":0.4,\"actualRatio\":0},\"clientInfo\":{\"isOverloaded\":false,\"limitRatio\":0.3,\"actualRatio\":0}}"
}


The autoscaled pool is trying to increase its concurrency from 4 to 5, since in its view the system was idle. 20 seconds later, though:

Plain Text
{
  "rejection": "true",
  "date": "Wed Oct 30 2024 16:42:38 GMT+0000 (Coordinated Universal Time)",
  "process": "{\"pid\":1,\"uid\":997,\"gid\":997,\"cwd\":\"/home/myuser\",\"execPath\":\"/usr/local/bin/node\",\"version\":\"v22.9.0\",\"argv\":[\"/usr/local/bin/node\",\"/home/myuser/FIDO-Scraper-Discovery\"],\"memoryUsage\":{\"rss\":337043456,\"heapTotal\":204886016,\"heapUsed\":168177928,\"external\":30148440,\"arrayBuffers\":14949780}}",
  "os": "{\"loadavg\":[3.08,3.38,3.68],\"uptime\":312222.44}",
  "stack": "response.headerValue: Target page, context or browser has been closed\n    at Page.<anonymous> (/home/myuser/FIDO-Scraper-Discovery/dist/articleImagesPreNavHook.js:15:60)"
}

which suggests memory was much tighter than the AutoscaledPool was accounting for, likely due to the additional RAM that Chromium was using. Crawlee was running in a Kubernetes pod with a 4 GB RAM limit. Is this behaviour intended, and how might I improve my performance? Does the autoscaled pool account for how much RAM is actually in use, or just how much the Node process uses?
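A hedged configuration sketch (the numbers are illustrative, not recommendations): tell Crawlee explicitly how much memory it may use inside the pod and cap concurrency, so the autoscaling decisions reflect the Kubernetes limit rather than whatever it detects on its own.

Plain Text
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Tell Crawlee how much memory it may use inside the 4 GB pod; this can also be
// set via the CRAWLEE_MEMORY_MBYTES environment variable.
Configuration.getGlobalConfig().set('memoryMbytes', 3072);

const crawler = new PlaywrightCrawler({
    // A hard cap on concurrency leaves headroom for each Chromium instance,
    // regardless of what the autoscaled pool would otherwise try.
    maxConcurrency: 3,
    requestHandler: async ({ page, log }) => {
        log.info(`Handling ${page.url()}`);
        // ... scraping logic
    },
});

await crawler.run(['https://example.com/']); // placeholder
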
3 comments
Hello, do you have some nice tips on how to get rid of 429s? I'm not exactly sure how parallelism works here, but I'm afraid that even if I put the process to sleep, the other parallel requests still count as requests and can lead to a 429. Is there any nice tip or best practice for defending against it? 😄
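A minimal sketch of the usual throttling knobs in the JavaScript crawlers (the values are placeholders to tune per target): rather than sleeping inside a handler, limit how fast the crawler as a whole is allowed to fire requests.

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 5,        // cap on parallel requests
    maxRequestsPerMinute: 60, // global rate limit shared by all parallel workers
    sameDomainDelaySecs: 1,   // minimum delay between requests to the same domain
    maxRequestRetries: 8,     // 429s that still slip through get retried later
    requestHandler: async ({ request, $, log }) => {
        log.info(`Got ${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com/']); // placeholder
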
1 comment
Flaw in the tutorial on basic POST functionality:
https://crawlee.dev/python/docs/examples/fill-and-submit-web-form

It makes an actual POST request, but the data is not reaching the server; I tried this on various endpoints.

Two questions:
1) What is broken here and how to fix it?
2) My biggest concern using Crawlee is that I have no clue how to troubleshoot these kinds of bugs.

Where can one check what goes wrong? For example, how can one check under the hood whether curl (?) or whatever library makes the actual request is populating the payload correctly, etc.?
This framework has many benefits, but due to all the abstractions it is very hard to troubleshoot. It's probably my mistake and inexperience with the framework, but any guidance on how to troubleshoot would be great, as simple things not working without any way to troubleshoot makes using the Crawlee framework quite cumbersome.


Plain Text
import asyncio
import json

from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        response = context.http_response.read().decode('utf-8')
        context.log.info(f'Response: {response}')  # To see the response in the logs.

    # Prepare a POST request to the form endpoint.
    request = Request.from_url(
        url='https://httpbin.org/post',
        method='POST',
        payload=json.dumps(
            {
                'custname': 'John Doe',
            }
        ).encode(),
    )

    # Run the crawler with the initial list of requests.
    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())
3 comments
The site I'm scraping uses fingerprint.com bot protection. Locally my code passes the protection 95% of the time, but when running the actor on Apify it never does. How is that possible?

To pass this protection I've implemented the following measures (complete code in the next message; a trimmed sketch of the browser-configuration and proxy points is shown after this list). This was a bit of trial and error, so all feedback is welcome:
  • Browser Configuration
    • Using Firefox instead of Chrome/Chromium
    • Using incognito pages (useIncognitoPages: true)
    • Enabled fingerprint randomization (useFingerprints: true)
  • Random Viewport/Screen Properties
    • Random window dimensions (1280-1920 x 720-1080)
    • Random device scale factor (1, 1.25, 1.5, or 2)
    • Random mobile/touch settings
    • Random color scheme (light/dark)
  • Locale and Timezone Randomization
    • Random locale from 8 different options
    • Random timezone from 8 different global locations
  • Browser Property Spoofing
    • Removing navigator.webdriver flag
    • Random navigator.plugins array
    • Random navigator.platform
    • Random navigator.hardwareConcurrency (4-16)
    • Random navigator.deviceMemory (2-16GB)
    • Random navigator.languages
    • Random navigator.maxTouchPoints
  • Chrome Detection Evasion
    • Removing Chrome DevTools Protocol (CDP) detection properties (cdcadoQpoasnfa76pfcZLmcfl*)
  • Performance Timing Randomization
    • Modifying performance.getEntries() to add random timing offsets
    • Randomizing both startTime and duration of performance entries
  • Proxy Usage
    • Using residential proxies (groups: ['residential'])
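A trimmed sketch of the browser-configuration and proxy points above, assuming the Apify SDK with Crawlee's PlaywrightCrawler (option values are illustrative; the navigator and viewport spoofing would go into pre-navigation hooks):

Plain Text
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

await Actor.init();

// Residential proxies, as in the last point above.
const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'] });

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: {
        launcher: firefox,       // Firefox instead of Chrome/Chromium
        useIncognitoPages: true, // a fresh browser context per page
    },
    browserPoolOptions: {
        useFingerprints: true,   // fingerprint randomization
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'], // keep generated fingerprints consistent with the real browser
            },
        },
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Opened ${page.url()}`);
        // ... extraction logic
    },
});

await crawler.run(['https://example.com/']); // placeholder
await Actor.exit();
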
6 comments
After following the tutorial on scraping using crawlee, I cannot figure out how to add specific cookies (key-value pair) to the request. E.g., sid=1234

There is something like a session, and a session-pool, but how to reach these objects?

Also, max_pool_size of the session pool has a default of 1000; should one then iterate through the sessions in the session pool to set the session id in session.cookies (a dict)?

Imagine the code below from the tutorial: the default handler handles the incoming request and wants to enqueue requests to the category pages. Let's say these category pages require the sid cookie to be set; how can this be achieved?

Any help is very much appreciated, as no examples can be found via Google / ChatGPT / Perplexity.

Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')
    
    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )
17 comments
https://console.apify.com/view/runs/4rGZolVyfYVQesyvO
ArgumentError: Expected property maxRequestsPerCrawl to be of type number but received type string in object PlaywrightCrawlerOptions
Plain Text
maxRequestsPerCrawl: process.env.ACTOR_MAX_PAID_DATASET_ITEMS || input.limit,

Plain Text
typeof crawler.options.maxRequestsPerCrawl
'number'

???
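Possibly relevant sketch (this is an assumption about the cause, not a confirmed diagnosis): environment variables are always strings, so whenever ACTOR_MAX_PAID_DATASET_ITEMS is set, the || expression passes a string into the option. Coercing explicitly avoids the ArgumentError.

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const input = { limit: 100 }; // placeholder for the real Actor input

// process.env values are always strings (e.g. "100"), while input.limit is a number,
// so coerce the combined value before handing it to the crawler options.
const maxRequestsPerCrawl = Number(process.env.ACTOR_MAX_PAID_DATASET_ITEMS ?? input.limit);

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});
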
2 comments
I am getting this error message; what is the best way to deal with it?
Reclaiming failed request back to the list or queue. Redirected 10 times. Aborting.
Can I increase the max number of redirects for my CheerioCrawler?
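A hedged sketch of one way to raise the limit, assuming a recent Crawlee version where CheerioCrawler's pre-navigation hooks receive the underlying got request options (10 is the default redirect limit):

Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        async (_crawlingContext, gotOptions) => {
            // Allow longer redirect chains than the default of 10.
            gotOptions.maxRedirects = 20;
        },
    ],
    requestHandler: async ({ request, $, log }) => {
        log.info(`Final URL ${request.loadedUrl}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com/']); // placeholder
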
The structure I am using does not look like the best.

I am basically creating several routers and then doing something like:

Plain Text
const crawler = new PlaywrightCrawler({
  // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
  requestHandler: async (ctx) => {
    if (ctx.request.url.includes("url1")) {
      await url1Router(ctx);
    }

    if (ctx.request.url.includes("url2")) {
      await url2Router(ctx);
    }

    if (ctx.request.url.includes("url3")) {
      await url3Router(ctx);
    }
    await Dataset.exportToJSON("data.json");
  },

  // Comment this option to scrape the full website.

  //   maxRequestsPerCrawl: 20,
});


This does not seem correct. Anyone with a better way?
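One possible restructuring, sketched with a single labeled router instead of URL checks (the globs and start URL are placeholders; exporting once after the run avoids rewriting the JSON file on every request):

Plain Text
import { PlaywrightCrawler, Dataset, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks }) => {
    // Assign labels while enqueueing, so each page type lands in its own handler.
    await enqueueLinks({ globs: ['**/url1/**'], label: 'URL1' });
    await enqueueLinks({ globs: ['**/url2/**'], label: 'URL2' });
});

router.addHandler('URL1', async ({ request, pushData }) => {
    await pushData({ type: 'url1', url: request.url });
});

router.addHandler('URL2', async ({ request, pushData }) => {
    await pushData({ type: 'url2', url: request.url });
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    // maxRequestsPerCrawl: 20, // uncomment to limit the crawl while testing
});

await crawler.run(['https://example.com/']); // placeholder start URL

// Export once, after the crawl has finished.
await Dataset.exportToJSON('data.json');
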
8 comments

When using the https://api.apify.com/v2/datasets?token=*** API, it shows that the number of datasets is 0, but there are 12 datasets.
I received feedback from a user who couldn't use my Actor. He provided me with the IDs of the runs, but I can't see them; it shows:

Oops, the run you are looking for could not be found.
Is this page broken? Let us know at support@apify.com.
10 comments