Apify Discord Mirror

Updated 5 months ago

Blocking network requests with crawlee PuppeteerCrawler

At a glance

The community member is trying to block network requests from specific domains within PuppeteerCrawler, but is encountering issues. They have tried adding a request interceptor in the preNavigationHooks, but this returns an error saying the request is already handled. The comments suggest that when using multiple intercept handlers, the community member needs to check if the request has already been handled, and provide a link to the relevant Puppeteer documentation. Another comment warns that request interception can negatively impact performance for large crawls. The comments also suggest using the blockRequests method from the PuppeteerCrawlerContext, which allows blocking requests based on URL patterns.

I'm trying to block network requests from specific domains within PuppeteerCrawler but can't get it to work.

I'd like to run something like this:
Plain Text
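// Interception must be enabled before request handlers can abort or continue requests
await page.setRequestInterception(true);
page.on('request', (req) => {
    // Abort any request whose URL contains our keyword
    if (req.url().includes('bouncex')) {
        req.abort();
        return;
    }
    // Let every other request through
    req.continue();
});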

But it has to be initiated before page.goto.

I tried adding it to preNavigationHooks like so:
Plain Text
preNavigationHooks: [
    async ({ page }, goToOptions) => {
        goToOptions!.waitUntil = "networkidle2";
        goToOptions!.timeout = 3600000;
        await blocker.enableBlockingInPage(page);
        page.on('request', (req) => {
            // Abort any request whose URL contains our keyword
            if (req.url().includes('bouncex')) {
                req.abort();
                return;
            }
            // Let every other request through
            req.continue();
        });
        await page.setViewport(viewportConfig);
    },
],

But this returns Error: Request is already handled!

Is there a way to do this with PuppeteerCrawler?
3 comments
Hey, when you're using multiple intercept handlers, each handler needs to check whether the request has already been handled before resolving it: if (interceptedRequest.isInterceptResolutionHandled()) return; Take a look at this: https://pptr.dev/guides/network-interception#multiple-intercept-handlers-and-asynchronous-resolutions
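A minimal sketch of that check, for anyone landing here later. The helper name handleRequest, the Decision type, and the InterceptableRequest interface are made up for illustration; the interface only mirrors the Puppeteer HTTPRequest methods used here so the snippet stands alone without importing puppeteer. It assumes request interception is already enabled on the page.

```typescript
// Structural stand-in (an assumption, not a Puppeteer export) for the
// HTTPRequest methods this sketch relies on.
interface InterceptableRequest {
    url(): string;
    isInterceptResolutionHandled(): boolean;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

type Decision = 'skipped' | 'aborted' | 'continued';

// Hypothetical helper: resolves a request only if no other handler has done so.
function handleRequest(req: InterceptableRequest, blockedKeywords: string[]): Decision {
    // Another handler (for example an ad blocker) may already have resolved
    // this request; touching it again throws "Request is already handled!".
    if (req.isInterceptResolutionHandled()) return 'skipped';

    // Abort requests whose URL contains any blocked keyword.
    if (blockedKeywords.some((keyword) => req.url().includes(keyword))) {
        req.abort().catch(() => undefined); // ignore failures in this sketch
        return 'aborted';
    }

    // Let everything else through.
    req.continue().catch(() => undefined);
    return 'continued';
}
```

Inside the preNavigationHooks from the question, this would be wired up as page.on('request', (req) => handleRequest(req, ['bouncex'])); whether it plays well with the other blocker depends on how that blocker resolves its own requests.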
Just be aware that request interception disables the cache, which makes large crawls perform much worse.
Also, you can check the blockRequests method from PuppeteerCrawlerContext:

Plain Text
preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block any request whose URL matches one of these patterns
        await blockRequests({
            urlPatterns: [
                'yandex.ru',
                'google-analytics.com',
            ],
        });
    },
],