Apify Discord Mirror

Updated 2 years ago

Saving bandwidth using PlaywrightCrawler: how to block googletagmanager, google-analytics, etc.

At a glance

The community member is looking to block certain domains, such as Google Tag Manager, Facebook, Google Analytics, and Google Fonts, to save bandwidth. They have already blocked images and are now seeking an example of code to block these domains, potentially using the PlaywrightCrawler.preNavigationHooks. In the comments, another community member suggests using the blockRequests method from the Playwright utils, but notes that it is only available in Chromium, and for Firefox, the community member would need to use the Playwright routing, which is less optimized.

I already block images as described in [1], and this helps to save some bandwidth.
Next step: looking at the statistics in my proxy service, I see a significant number of requests like these:

https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato


Can somebody show me an example of code that blocks these domains? (Even better: one that blocks all domains from a given list.)

I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?
Prerequisites: PlaywrightCrawler, Firefox as launcher (Chrome-specific hacks probably would not work)

(I'm not good at writing JavaScript from scratch, so I need some help.)

[1] https://discord.com/channels/801163717915574323/1060986956961546320
1 comment
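You can use blockRequests from the Playwright utils: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests but it is only available in Chromium.

For Firefox you need to use Playwright routing, which is less optimized: enabling routing disables the HTTP cache, and that can backfire (resources that would have been cached may be downloaded again):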
https://playwright.dev/docs/api/class-page#page-route
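
For the Firefox case, a minimal sketch of the routing approach might look like the code below. It is not an official Crawlee recipe. The names BLOCKED_DOMAINS, isBlockedUrl and blockDomainsHook are made up for illustration, and the domain list is simply taken from the URLs in the question. It assumes the hook receives the crawling context with a Playwright page, and that page.route accepts a predicate called with a URL object.

```javascript
// Hypothetical list, built from the requests seen in the proxy statistics.
const BLOCKED_DOMAINS = [
    'googletagmanager.com',
    'connect.facebook.net',
    'google-analytics.com',
    'fonts.googleapis.com',
];

// Returns true when the URL's hostname is a blocked domain or one of its subdomains.
function isBlockedUrl(url, blockedDomains = BLOCKED_DOMAINS) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch {
        return false; // not a parseable URL, so let the request through
    }
    return blockedDomains.some(
        (domain) => hostname === domain || hostname.endsWith(`.${domain}`),
    );
}

// Pre-navigation hook: registers a route that aborts matching requests
// before the page starts loading.
async function blockDomainsHook({ page }) {
    await page.route(
        (url) => isBlockedUrl(url.href), // predicate receives a URL object
        (route) => route.abort(),
    );
}

// Wiring it up (assumes the crawlee and playwright packages are installed):
//
// const { PlaywrightCrawler } = require('crawlee');
// const { firefox } = require('playwright');
//
// const crawler = new PlaywrightCrawler({
//     launchContext: { launcher: firefox },
//     preNavigationHooks: [blockDomainsHook],
//     requestHandler: async ({ page }) => { /* your scraping code */ },
// });
```

Because routing disables the HTTP cache, it is worth comparing the proxy statistics before and after enabling the hook to confirm that it actually reduces traffic for your target sites.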