Apify and Crawlee Official Forum

Crawlee vs bot detection systems - Plugins length is not OK

I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases, these sites complain about "0 plugins" or "Plugins length".

If I open these sites with the browser I use every day (Firefox on Linux, by the way, the same as
used in the PlaywrightCrawler settings), they report "5 plugins" and the field is green.

Is it something in my code?
Can Crawlee emulate these plugins attributes?

[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

And here is the relevant part of the PlaywrightCrawler setup:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});


Screenshots:
Attachments
01-infosimples.github.io-19b9a46843518680ccc72bada5fe8b69.png
02-intoli.com-44d20f5d8ce2747086171e4aeecca746.png
03-webscraping.pro-f1fceabcc55af4353c0da1cddf3e72d7.png
37 comments
On https://bot.sannysoft.com/ it's OK with my code (different from yours). I get this:
What URL do you use to test?
Attachment
image.png
Well, here is the code I used to get the "Plugins length" error:
Plain Text
import { firefox, webkit } from 'playwright';
import { PlaywrightCrawler, Dataset, ProxyConfiguration, Request, log, sleep } from 'crawlee';
import { launchPlaywright, playwrightUtils } from 'crawlee';
import * as crypt from 'crypto';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 4,
        loggingIntervalSecs: null,

    },

    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,

    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        useIncognitoPages: true,
        launcher: firefox
    },

    async requestHandler( {request, response, page, enqueueLinks, log, proxyInfo} )
    {
        const uniqueKey = crypt.randomBytes(16).toString("hex");
        let url = new URL(request.url);
        let host = url.host;
        let scrFile = `${host}-${uniqueKey}.png`;

        log.info(`GET ${request.url}  Wait1 ...`);
        await sleep(40*1000);

        log.info(`GET ${request.url}  Wait2, Pressing Enter ...`);
        await page.keyboard.press('Enter');
        await sleep(40*1000);

        log.info(`GET ${request.url}  Writing into ${scrFile} ...`);
        await page.screenshot( {path:scrFile, fullPage:true} );
        log.info(`GET ${request.url}  DONE`);
    },
});

await crawler.run([
    "https://infosimples.github.io/detect-headless/",
    "https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html",
    "https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html"
]);
What I want to achieve: a scraper with no "red flags" on bot detection systems like the three sites above AND one that passes this check: https://nowsecure.nl/ (as far as I understand, nowsecure.nl implements a variant of Cloudflare protection).

I'm using Firefox as the launcher; it seems I can pass the nowsecure.nl check only with Firefox.
Thanks for this. Can you try with the session pool on? I'm not sure whether something is bound to that.

please look into this
I just changed useSessionPool to:

Plain Text
useSessionPool: true,


Same thing: "Plugins Length: 0".
With Chromium instead of Firefox as the launcher, there is no "Plugins length" error.
Attachment
image.png
I added this hook for Firefox as the launcher, using fingerprint-injector with Playwright [1].

With that, there are no more "Plugins length" errors.

[1] https://github.com/apify/fingerprint-suite/blob/master/docs/guides/fingerprint-injector.md#usage-with-playwright
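For reference, here is a minimal sketch of that approach with plain Playwright, following the guide in [1] (the generator options and fingerprint fields are taken from the fingerprint-suite docs and may differ slightly between versions):
Plain Text
import { firefox } from 'playwright';
import { FingerprintGenerator } from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';

// Generate a fingerprint matching the real launcher (Firefox on Linux, desktop).
const generator = new FingerprintGenerator({
    browsers: ['firefox'],
    operatingSystems: ['linux'],
    devices: ['desktop'],
});
const fingerprintWithHeaders = generator.getFingerprint();

const browser = await firefox.launch({ headless: true });

// Keep the context roughly consistent with the generated fingerprint.
const context = await browser.newContext({
    userAgent: fingerprintWithHeaders.fingerprint.navigator.userAgent,
});

// Inject the fingerprint (navigator.plugins, mimeTypes, etc.) before any page is created.
const injector = new FingerprintInjector();
await injector.attachFingerprintToPlaywright(context, fingerprintWithHeaders);

const page = await context.newPage();
await page.goto('https://infosimples.github.io/detect-headless/');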
Attachment
image.png
Great, so this can be fixed!

But for somebody who is new to JS/TS (like me)... it would be better to have some example code starting with
Plain Text
 crawler = new PlaywrightCrawler({
 ...
 });

That is possible, isn't it?
Yes, it's up to you to do the job 😉
Many thanks for the code!!!

It works, it really works!!!
Even with my ugly JS code (please suggest how to improve it), it works!!!

I put the JS code that creates the plugins into preNavigationHooks; I'm not sure this is the optimal solution...
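Roughly, the hook looks like this (just a sketch; the other crawler options are the same as in the code above, and `pluginContent` is assumed to be a string holding the navigator.plugins override script shared later in this thread):
Plain Text
const crawler = new PlaywrightCrawler({
    // ... same options as in the crawler above ...
    preNavigationHooks: [
        async ({ page, request, log }) => {
            log.info(`preNavigationHook: injecting plugins for ${request.url}`);
            // pluginContent is assumed to hold the navigator.plugins override script (see below).
            // addInitScript runs it in every new document before the page's own scripts.
            await page.addInitScript({ content: pluginContent });
        },
    ],
});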
Attachments
02-webscraping.pro-311c7989ef9ed30b6407d4498b811594.png
01-infosimples.github.io-584ec122ec20527d880ef7ec3805d68c.png
Thanks for the debugging. We will eventually check this and see how it can best be implemented in Crawlee.
By the way, when fixing "Plugins length", please also fix "0 mime types".
Several sites check the "mime types" length:

https://infosimples.github.io/detect-headless/
under "Mime"

https://browserleaks.com/javascript
search for "mimeTypes"

Attached is a screenshot from https://browserleaks.com/javascript made with the code above; you can see "mimeTypes: 0".
Attachment
browserleaks.com-mime-types.png
You can do it with this:
Plain Text
    const pluginContent = `
    Object.defineProperty(navigator, 'plugins', {
        get: () => {
            const PDFPlugin = Object.create(Plugin.prototype, {
                description: { value: 'Portable Document Format', enumerable: false },
                filename: { value: 'internal-pdf-viewer', enumerable: false },
                name: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(PluginArray.prototype, {
                length: { value: 1 },
                0: { value: PDFPlugin },
            });
        },
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: () => {
            const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
                type: { value: 'text/pdf', enumerable: false },
                suffixes: { value: 'pdf', enumerable: false },
                description: { value: 'Portable Document Format', enumerable: false },
                enabledPlugin: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(MimeTypeArray.prototype, {
                length: { value: 1 },
                0: { value: PDFMimeTypeTxt },
            });
        },
    });
    `

Attached is a screenshot from https://browserleaks.com/javascript made with the code above; you can see "mimeTypes: text/pdf, pdf, Portable Document Format".
Attachment
image.png
works like a charm!

thanks !!!
Just curious to know how you are generating those plugins. I am using Puppeteer but getting failed checks in the bot tests. See the screenshot.
Attachment
bottest.png
What is interesting: in some cases this code should go into preLaunchHooks and in some cases into prePageCreateHooks.
Don't ask me what happens there, I just played around a bit ))))

Anyway, attached is my super-mega-PlaywrightCrawler ))) producing 1 km of logs (printf debugging, yes) but demonstrating green results for "Plugins length" and "mimeTypes".
Thanks a lot. I was able to make it work using Puppeteer.
code:
Plain Text
preNavigationHooks: [
        async ({ page, request }) => {
            log.info(`preNavigationHook: GET=${request.url} START`);
            const preloadFile = fs.readFileSync('./preload.js', 'utf8');
            await page.evaluateOnNewDocument(preloadFile);
            log.info(`preNavigationHook: GET=${request.url} END`);
        }
    ],

preload.js:
Plain Text
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const PDFPlugin = Object.create(Plugin.prototype, {
            description: { value: 'Portable Document Format', enumerable: false },
            filename: { value: 'internal-pdf-viewer', enumerable: false },
            name: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(PluginArray.prototype, {
            length: { value: 1 },
            0: { value: PDFPlugin },
        });
    },
});
Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
            type: { value: 'text/pdf', enumerable: false },
            suffixes: { value: 'pdf', enumerable: false },
            description: { value: 'Portable Document Format', enumerable: false },
            enabledPlugin: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(MimeTypeArray.prototype, {
            length: { value: 1 },
            0: { value: PDFMimeTypeTxt },
        });
    },
});
I ran your script locally with proxy servers, but I still see these red flags. Any idea how you resolved them? I am trying to figure out the same thing.
Attachment
image.png
Attachment
image.png
Well, this JS code:
https://discord.com/channels/801163717915574323/1059483872271798333/1060501044456607774
fixes only "Plugins length" and "Mime types".


Nothing else.
I was able to resolve all the bot checks using this plugin: https://discord.com/channels/801163717915574323/1051917834290200608/1052147143508500490

Only the webdriver check in the fingerprint tests and the hairline feature test failed; the rest all passed.
Well... actually the code attached to this message https://discord.com/channels/801163717915574323/1059483872271798333/1060959263641567354
has a green "webdriver" flag, and many other bot checks are also green.

Yes, the hairline feature... can we ignore it?
I am not sure about the hairline feature, but I have seen many YouTube videos and a few blogs, and most of them ignore it.
Using the code provided at the following link, https://intoli.com/blog/making-chrome-headless-undetectable/, which looks as follows:
Plain Text
    const webGLContent = `
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
      // UNMASKED_VENDOR_WEBGL
      if (parameter === 37445) {
        return 'Intel Open Source Technology Center';
      }
      // UNMASKED_RENDERER_WEBGL
      if (parameter === 37446) {
        return 'Mesa DRI Intel(R) Ivybridge Mobile ';
      }

      return getParameter.call(this, parameter);
    };
    `
......
await page.addInitScript({ content: webGLContent });
......

the page returns the desired values for the renderer and vendor, like this:
Attachment
image.png
And as indicated in the article, you can also spoof the Retina/HiDPI hairline feature.
But as mentioned, "This is another test that doesn’t really make a ton of sense because the majority of people don’t have HiDPI screens and most users’ browsers won’t support this feature. "
const webGLContent = ...
Excellent! What we really need is a list of 100-200 such strings and a piece of JS code that randomly returns one of those "WebGL strings"... (in other words, this functionality should be in the next version of Crawlee).
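Something like this, as a rough sketch (the vendor/renderer pairs below are only illustrative placeholders, not a curated list):
Plain Text
// Hypothetical pool of WebGL vendor/renderer pairs; a real list would be
// collected from real machines (the 100-200 entries suggested above).
const webGLProfiles = [
    { vendor: 'Intel Open Source Technology Center', renderer: 'Mesa DRI Intel(R) Ivybridge Mobile ' },
    { vendor: 'Intel Inc.', renderer: 'Intel Iris OpenGL Engine' },
    { vendor: 'Google Inc.', renderer: 'ANGLE (Intel(R) HD Graphics 630 Direct3D11 vs_5_0 ps_5_0)' },
];

// Pick one profile at random and build the init script for it.
const { vendor, renderer } = webGLProfiles[Math.floor(Math.random() * webGLProfiles.length)];
const webGLContent = `
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function (parameter) {
        if (parameter === 37445) return '${vendor}';   // UNMASKED_VENDOR_WEBGL
        if (parameter === 37446) return '${renderer}'; // UNMASKED_RENDERER_WEBGL
        return getParameter.call(this, parameter);
    };
`;

// Then inject it as before, e.g. await page.addInitScript({ content: webGLContent });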
Thanks a lot for sharing 🙂
Great research, guys. Once our team gets more time, we will make sure all of this is implemented by default in Crawlee.
any news about this plugin problem?
Hi, there is currently a PR for this.
I am sorry, wrong thread; this one is for https://discord.com/channels/801163717915574323/1059916802446073957