Apify Discord Mirror

Updated 2 years ago

Exclude query parameter URLs from crawl jobs

At a glance

The community member is researching ways to exclude URLs that carry query parameters, such as "https://domain[.]com/path?query1=test&query2=test2". They tried the regexps option of enqueueLinks, but it defines which URLs are allowed rather than which are excluded, so the unwanted URLs still matched. The community member is using PlaywrightCrawler via crawlee, but believes this could apply across all crawler engines. Another community member suggests using the transformRequestFunction option of enqueueLinks to skip URLs that match a specific regex. The original community member confirms that this solution works for their use case.

Hello,

I'm currently researching methods to exclude URLs with query parameters, for example: https://domain[.]com/path?query1=test&query2=test2

I've tried hooking into the enqueueLinks options like:

Plain Text
await enqueueLinks({ regexps: [ new RegExp('^'+[websiteURL]+'[^?]+') ]});

However, it seems like it still matches, because this option doesn't exclude anything; it only defines which URLs are allowed, based on the regex.
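Editor's note: one likely reason the pattern above still lets query URLs through is that it has no trailing `$`, so it only needs to match the part of the URL before the `?`. The sketch below illustrates this with plain JavaScript regexes. The `websiteURL` value is a hypothetical stand-in for the placeholder in the post, and the anchored variant is an alternative that was not part of the original thread.

```javascript
// Hypothetical stand-in for the [websiteURL] placeholder used above.
const websiteURL = 'https://domain.com';

// The pattern from the post: no end anchor, so a prefix match is enough.
const unanchored = new RegExp('^' + websiteURL + '[^?]+');

// Same pattern with `$`: the whole URL must be free of `?`.
const anchored = new RegExp('^' + websiteURL + '[^?]+$');

const withQuery = 'https://domain.com/path?query1=test&query2=test2';
const withoutQuery = 'https://domain.com/path';

console.log(unanchored.test(withQuery));  // true: matches up to "/path"
console.log(anchored.test(withQuery));    // false: "?" blocks a full match
console.log(anchored.test(withoutQuery)); // true
```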

I'm using PlaywrightCrawler via crawlee, but I think this would be something I can do across all crawler engines. Please let me know how I might achieve this, or point me to further reading. Thanks Team!
3 comments
Regexes passed like this are matching patterns, not skipping ones. To skip URLs instead, you can use the transformRequestFunction option of enqueueLinks.
https://crawlee.dev/api/core/interface/EnqueueLinksOptions#transformRequestFunction

Plain Text
transformRequestFunction: (request) => {
    // Returning null drops the request; returning it keeps it.
    if (request.url.match(mySkipRegex)) {
        return null;
    }
    return request;
}
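Editor's note: a minimal, self-contained sketch of the suggestion above, assuming the goal is to skip every URL that contains a query string. The `mySkipRegex` value and the `skipQueryUrls` name are illustrative assumptions, not part of the original reply, and the function is written as a plain callback so it can be tried without a crawler.

```javascript
// Hypothetical pattern: treat any URL containing "?" as one to skip.
const mySkipRegex = /\?/;

// Returns null to drop the request, or the unchanged request to keep it.
function skipQueryUrls(request) {
    if (mySkipRegex.test(request.url)) {
        return null;
    }
    return request;
}

// Quick check with plain objects shaped like request options.
const skipped = skipQueryUrls({ url: 'https://domain.com/path?query1=test' });
const kept = skipQueryUrls({ url: 'https://domain.com/path' });
console.log(skipped); // null
console.log(kept.url); // https://domain.com/path
```

Inside a request handler, the callback would then be passed as `await enqueueLinks({ transformRequestFunction: skipQueryUrls });`.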
Thanks Lukas! I'll try this out 🙂
Thanks again, I've tested this and it's working as I needed 🥳