Apify and Crawlee Official Forum

Jeno
Hi, for the last couple of days I have been on a quest to evade detection for a project, which has proved quite challenging. While researching the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need any special setup?
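A minimal workaround sketch, assuming the leak comes from Chromium's default WebRTC ICE behavior rather than fingerprint-suite itself: force the WebRTC IP handling policy via a standard Chromium switch. This is a browser flag, not a documented Crawlee option, so verify the result on a leak-check page.

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            args: [
                // Standard Chromium switch: never send WebRTC traffic over a
                // non-proxied UDP connection, so ICE candidates cannot expose
                // the real IP behind the proxy.
                '--force-webrtc-ip-handling-policy=disable_non_proxied_udp',
            ],
        },
    },
    async requestHandler({ page }) {
        // Let the check page finish gathering ICE candidates, then inspect.
        await page.waitForTimeout(5000);
        await page.screenshot({ path: 'webrtc-check.png', fullPage: true });
    },
});

await crawler.run(['https://browserleaks.com/webrtc']);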
1 comment
Checking on this page, Crawlee Playwright is detected as a bot due to CDP.

https://www.browserscan.net/bot-detection

This is a known issue, also discussed on:

https://github.com/berstend/puppeteer-extra/issues/899

Wondering if Crawlee can come up with a solution?
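For reference, a minimal reproduction sketch with a default Crawlee Playwright setup, so the verdict can be compared across Crawlee versions:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Give the page time to run its client-side checks, then save the verdict.
        await page.waitForTimeout(10_000);
        await page.screenshot({ path: 'bot-detection.png', fullPage: true });
        log.info('Saved verdict to bot-detection.png');
    },
});

await crawler.run(['https://www.browserscan.net/bot-detection']);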
4 comments
I have forks in my script, and if certain conditions are met I would like to stop the script. How should I do that? page.close() creates issues, especially when I run concurrently.
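One way to do this, sketched below (assuming Crawlee v3; the selector is hypothetical): instead of closing pages, abort the crawler's autoscaled pool, which lets in-flight requests finish and then resolves crawler.run(). Newer Crawlee releases also expose crawler.stop().

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // Hypothetical terminal condition; replace with your own fork logic.
        const shouldStop = await page.$('#terminal-condition');
        if (shouldStop) {
            // Winds the whole run down gracefully; no page.close() needed.
            await crawler.autoscaledPool.abort();
            return;
        }
        // ... normal processing ...
    },
});

await crawler.run(['https://example.com']);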
I am having a hard time understanding sessions and proxies. I have the following crawler setup:

JavaScript
const crawler = new PuppeteerCrawler({
    requestList,
    useSessionPool: true,
    persistCookiesPerSession: true,
    proxyConfiguration,
    requestHandler: router,
    requestHandlerTimeoutSecs: 100,
    headless: false,
    minConcurrency: 20,
    maxConcurrency: 30,
    launchContext: {
        launcher: PuppeteerExtra,
        useIncognitoPages: true
    },
})


Basically, I want to run the same task concurrently with different proxies. Unless I set useIncognitoPages: true, only one session is used concurrently, with one proxy. Is this how it should work? What is the point of having a session pool if only one session is used?
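That is expected with persistent pages: the proxy is fixed per browser process at launch, so every tab in that browser shares one proxy, and Crawlee pairs it with one session. With useIncognitoPages: true, each page gets its own incognito context, so each concurrent page can carry its own session and proxy URL. A sketch of that pairing, assuming rotating proxy URLs (the URLs are placeholders):

JavaScript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000', // placeholder
        'http://user:pass@proxy-2.example.com:8000', // placeholder
    ],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    // Roughly one session per concurrent page.
    sessionPoolOptions: { maxPoolSize: 30 },
    launchContext: { useIncognitoPages: true },
    async requestHandler({ session, proxyInfo, log }) {
        // Each concurrent page should now log a different session/proxy pair.
        log.info(`session ${session.id} via ${proxyInfo?.url}`);
    },
});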
5 comments
I am running a script that needs concurrency. I have 64 GB of RAM available and I want to use it to the max. I am running my script on a server, so not much else is running. The problem is that at around 15 GB I always get a memory-overloaded error.

I have tried:
JavaScript
config.set('memoryMbytes', 50_000)
config.set('availableMemoryRatio', 0.95)

Nothing seems to change this behavior. Anything else I can try?
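A sketch of one thing to check, assuming Crawlee v3: the limits must land on the global configuration before the crawler is constructed (or come in via the CRAWLEE_MEMORY_MBYTES / CRAWLEE_AVAILABLE_MEMORY_RATIO environment variables), otherwise the autoscaled pool never sees them:

JavaScript
import { Configuration, PuppeteerCrawler } from 'crawlee';

// Set the limits on the global config *before* new PuppeteerCrawler(...).
const config = Configuration.getGlobalConfig();
config.set('memoryMbytes', 50_000);
config.set('availableMemoryRatio', 0.95);

const crawler = new PuppeteerCrawler({
    maxConcurrency: 30,
    async requestHandler() {
        // ... your handler ...
    },
});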
1 comment