Apify and Crawlee Official Forum

Jeno
Hi, for the last couple of days I have been on a quest to evade detection for a project, which has proved quite challenging. While researching the issue, I noticed that my real IP leaks through WebRTC with a default Crawlee Playwright CLI project. I see a commit to the fingerprint-suite that I think should prevent that, but based on my tests it doesn't. Does it need any special setup?
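A minimal workaround sketch, assuming the leak comes from Chromium's default WebRTC ICE behavior rather than fingerprint-suite itself: force the WebRTC IP handling policy via a standard Chromium switch. This is a browser flag, not a documented Crawlee option, so verify the result on a leak-check page.

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            args: [
                // Standard Chromium switch: never send WebRTC traffic over a
                // non-proxied UDP connection, so ICE candidates cannot expose
                // the real IP behind the proxy.
                '--force-webrtc-ip-handling-policy=disable_non_proxied_udp',
            ],
        },
    },
    async requestHandler({ page }) {
        // Let the check page finish gathering ICE candidates, then inspect.
        await page.waitForTimeout(5000);
        await page.screenshot({ path: 'webrtc-check.png', fullPage: true });
    },
});

await crawler.run(['https://browserleaks.com/webrtc']);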
1 comment
Checking on this page, Crawlee Playwright is detected as a bot due to CDP.

https://www.browserscan.net/bot-detection

This is a known issue, also discussed on:

https://github.com/berstend/puppeteer-extra/issues/899

Wondering if Crawlee can come up with a solution?
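For reference, a minimal reproduction sketch with a default Crawlee Playwright setup, so the verdict can be compared across Crawlee versions:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Give the page time to run its client-side checks, then save the verdict.
        await page.waitForTimeout(10_000);
        await page.screenshot({ path: 'bot-detection.png', fullPage: true });
        log.info('Saved verdict to bot-detection.png');
    },
});

await crawler.run(['https://www.browserscan.net/bot-detection']);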
4 comments
I have forks in my script, and if certain conditions are met I would like to stop the script. How should I do that? page.close() creates issues, especially when I run concurrently.
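One way to do this, sketched below (assuming Crawlee v3; the selector is hypothetical): instead of closing pages, abort the crawler's autoscaled pool, which lets in-flight requests finish and then resolves crawler.run(). Newer Crawlee releases also expose crawler.stop().

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // Hypothetical terminal condition; replace with your own fork logic.
        const shouldStop = await page.$('#terminal-condition');
        if (shouldStop) {
            // Winds the whole run down gracefully; no page.close() needed.
            await crawler.autoscaledPool.abort();
            return;
        }
        // ... normal processing ...
    },
});

await crawler.run(['https://example.com']);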
I am having a hard time understanding sessions and proxies. I have the following crawler setup:

JavaScript
const crawler = new PuppeteerCrawler({
    requestList,
    useSessionPool: true,
    persistCookiesPerSession: true,
    proxyConfiguration,
    requestHandler: router,
    requestHandlerTimeoutSecs: 100,
    headless: false,
    minConcurrency: 20,
    maxConcurrency: 30,
    launchContext: {
        launcher: PuppeteerExtra,
        useIncognitoPages: true
    },
})


Basically, I want to run the same task concurrently with different proxies. Unless I set useIncognitoPages: true, only one session is used concurrently, with one proxy. Is this how it should work? What is the point of having a session pool if only one session is used?
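That is expected with persistent pages: the proxy is fixed per browser process at launch, so every tab in that browser shares one proxy, and Crawlee pairs it with one session. With useIncognitoPages: true, each page gets its own incognito context, so each concurrent page can carry its own session and proxy URL. A sketch of that pairing, assuming rotating proxy URLs (the URLs are placeholders):

JavaScript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000', // placeholder
        'http://user:pass@proxy-2.example.com:8000', // placeholder
    ],
});

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    // Roughly one session per concurrent page.
    sessionPoolOptions: { maxPoolSize: 30 },
    launchContext: { useIncognitoPages: true },
    async requestHandler({ session, proxyInfo, log }) {
        // Each concurrent page should now log a different session/proxy pair.
        log.info(`session ${session.id} via ${proxyInfo?.url}`);
    },
});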
5 comments
I am running a script that needs concurrency. I have 64 GB of RAM available and I want to use it to the max. I am running my script on a server, so not much else is running. The problem is that at around 15 GB I always get a memory-overloaded error.

I have tried:
JavaScript
config.set('memoryMbytes', 50_000)
config.set('availableMemoryRatio', 0.95)

Nothing seems to change this behavior. Anything else I can try?
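A sketch of one thing to check, assuming Crawlee v3: the limits must land on the global configuration before the crawler is constructed (or come in via the CRAWLEE_MEMORY_MBYTES / CRAWLEE_AVAILABLE_MEMORY_RATIO environment variables), otherwise the autoscaled pool never sees them:

JavaScript
import { Configuration, PuppeteerCrawler } from 'crawlee';

// Set the limits on the global config *before* new PuppeteerCrawler(...).
const config = Configuration.getGlobalConfig();
config.set('memoryMbytes', 50_000);
config.set('availableMemoryRatio', 0.95);

const crawler = new PuppeteerCrawler({
    maxConcurrency: 30,
    async requestHandler() {
        // ... your handler ...
    },
});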
1 comment