Apify Discord Mirror

Updated last week

How to stop following delayed JavaScript redirects?

I'm using AdaptivePlaywrightCrawler with the same-domain strategy in enqueueLinks. The page I'm trying to crawl has delayed JavaScript redirects to other sites, such as Instagram. Sometimes the crawler mistakenly thinks it's still on the original domain after a redirect and resolves Instagram paths against the main domain, enqueueing URLs like example.com/account/... and example.com/member/..., which don't actually exist. How can I stop the crawler from following these delayed JavaScript redirects?
5 comments
Hi @Nth, can you send us an example of how you call enqueueLinks?
Hey @Pepa J, here it is:

router.addDefaultHandler(async (ctx) => {
    const { request, enqueueLinks, parseWithCheerio, querySelector, log, pushData, page } = ctx;
    log.info(`Running request handler for ${request.url}`);

    await enqueueLinks({
        strategy: 'same-domain',
        globs: ['http?(s)://example.com/**', 'http?(s)://**.example.com/**'],
        transformRequestFunction: (req) => {
            // Skip PDF files (check the candidate URL `req`, not the current page's `request`)
            if (req.url.endsWith('.pdf')) {
                log.warning(`* Skipping (${req.url}) - PDF`);
                return false;
            }
            return req;
        },
    });
});
No issues with PlaywrightCrawler, but it sometimes happens with AdaptivePlaywrightCrawler.
Thank you @Nth. I believe there might be a bug that only shows up on specific websites. Would it be possible to put together a minimal reproducible example with real URLs?
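
The thread ends without a confirmed fix. One possible workaround, sketched here under assumptions, is to check which host the page has actually ended up on before collecting links, and to skip enqueueing when it has left the site. The helper names below (hostnameOf, isOnDomain, wasRedirectedOffDomain) are hypothetical and not part of Crawlee. The domain comparison is a plain suffix match with no public-suffix handling, so it is a simplification of what the same-domain strategy does.

```javascript
// Hypothetical helpers (not part of Crawlee): decide whether the URL a page
// ended up on still belongs to the site the crawl is restricted to.

/** Returns the lowercased hostname of a URL string, or null if it cannot be parsed. */
function hostnameOf(url) {
    try {
        return new URL(url).hostname.toLowerCase();
    } catch {
        return null;
    }
}

/**
 * True when `hostname` equals `baseDomain` or is a subdomain of it.
 * Simplification: plain suffix comparison, no public-suffix list.
 */
function isOnDomain(hostname, baseDomain) {
    const base = baseDomain.toLowerCase();
    return hostname === base || hostname.endsWith(`.${base}`);
}

/**
 * True when `currentUrl` is no longer on `baseDomain`.
 * Unparseable URLs are treated as off-domain so that nothing gets enqueued from them.
 */
function wasRedirectedOffDomain(currentUrl, baseDomain) {
    const host = hostnameOf(currentUrl);
    return host === null || !isOnDomain(host, baseDomain);
}
```

In a request handler, this could be called with the URL the page currently reports (for example Playwright's page.url(), or Crawlee's request.loadedUrl) right before enqueueLinks, returning early when it reports an off-domain redirect. Whether either value reflects a delayed JavaScript redirect inside AdaptivePlaywrightCrawler is not confirmed in this thread, so treat that wiring as an assumption to verify.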