Exclude query parameter URLs from crawl jobs

At a glance

The community member wants to exclude URLs that contain query parameters, such as "https://domain[.]com/path?query1=test&query2=test2", from crawl jobs. They tried the regexps option of enqueueLinks, but that option selects the URLs that are allowed rather than excluding any, so the unwanted URLs were still enqueued. The community member is using PlaywrightCrawler via crawlee, but believes the approach should apply to all crawler engines. Another community member suggests using the transformRequestFunction option of enqueueLinks to skip URLs that match a specific regex. The original community member confirms that this solution works for their use case.

cryptorex

Hello,

I'm currently researching methods to exclude URLs that contain query parameters, for example: https://domain[.]com/path?query1=test&query2=test2

I've tried hooking into the enqueueLinks options like:

await enqueueLinks({ regexps: [ new RegExp('^'+[websiteURL]+'[^?]+') ]});

However, URLs with query parameters still seem to match, because the regexps option doesn't exclude anything; it only matches the URLs that are allowed.

I'm using PlaywrightCrawler via crawlee, but I think this is something I could do across all crawler engines. Please let me know how I might achieve this, or point me to further reading. Thanks, team!

3 comments

Lukas Krivka

Regexes like this are matching ones. To skip URLs instead, use the transformRequestFunction option of enqueueLinks.
https://crawlee.dev/api/core/interface/EnqueueLinksOptions#transformRequestFunction

transformRequestFunction: (request) => {
    if (request.url.match(mySkipRegex)) {
        return null;
    }
    return request;
}
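
For anyone landing here later, a minimal standalone sketch of the idea above. The base URL, the concrete value of mySkipRegex (any URL containing a "?"), and the function name skipQueryUrls are assumptions for illustration, not something from the thread. It also shows why the pattern from the question still lets query URLs through in plain JavaScript: it has no end anchor, so it matches the part of the URL before the "?".

```javascript
// Hypothetical base URL standing in for the websiteURL variable in the question.
const websiteURL = 'https://domain.com';
const queryUrl = websiteURL + '/path?query1=test&query2=test2';

// The pattern from the question has no end anchor, so it matches the
// "https://domain.com/path" prefix and the URL still counts as a match.
const allowRegex = new RegExp('^' + websiteURL + '[^?]+');
const stillMatches = allowRegex.test(queryUrl);

// Assumed skip pattern: any URL that contains a query string.
const mySkipRegex = /\?/;

// Transform function in the shape suggested above: return null to skip
// the request, or return the request unchanged to keep it.
function skipQueryUrls(request) {
    if (mySkipRegex.test(request.url)) {
        return null;
    }
    return request;
}

// Assumed usage inside a crawler's requestHandler:
// await enqueueLinks({ transformRequestFunction: skipQueryUrls });
```

Keeping the function separate from the enqueueLinks call makes it easy to check against a few sample URLs before running a crawl.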

cryptorex

Thanks Lukas! I'll try this out 🙂

cryptorex

Thanks again! I've tested this and it's working as I needed 🥳

Apify Discord Mirror