Apify and Crawlee Official Forum

Updated 3 months ago

Scrape redirect links gracefully?

Hey folks,

I have a site that hands out internal links which then redirect to the actual page, and I want to scrape the final (redirected) URL. I tried the sendRequest API, but the response's url only had the internal URL and not the actual link, so now I'm doing it via Playwright like this:
Plain Text
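        const context = page.context();
        // Start listening for the new tab before clicking, so the event isn't missed
        const newPagePromise = context.waitForEvent('page');
        await applyNowBtn.click();
        const newPage = await newPagePromise;
        // Fixed delay to give the redirect time to finish
        await newPage.waitForTimeout(3000);
        log.info(`url is ${newPage.url()}`);
        await newPage.close();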

The issue with this approach (aside from having to use a timeout) is that Crawlee logs these "can't set cookie" errors, which add a lot of unnecessary noise to my logs. Is there a better way to handle this?
4 comments
This discussion might be useful:
https://stackoverflow.com/questions/64657669/how-to-get-redirected-final-url-with-got-in-node-js
But each case is unique, of course.

AFAIK sendRequest uses the got package, and its followRedirect option is true by default, so it should follow HTTP redirects.
But that doesn't work if the redirect is triggered by JavaScript.
So try to figure out which type of redirect you have; see the discussion above.
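
One rough way to tell the two cases apart: request the internal link with redirects turned off (for got that would be followRedirect: false) and look at the status code and the Location header. A small sketch of that check, where `describeRedirect` and the return shape are names made up for illustration:

```javascript
// Hypothetical helper: classify a response that was fetched with redirects
// disabled (e.g. the statusCode and headers of a got response requested
// with followRedirect: false).
function describeRedirect(statusCode, headers, requestUrl) {
    const location = headers.location;
    if (statusCode >= 300 && statusCode < 400 && location) {
        // Server-side redirect: an HTTP client can follow this on its own.
        // Relative Location values are resolved against the request URL.
        return { type: 'http', target: new URL(location, requestUrl).href };
    }
    // No 3xx + Location: either there is no redirect at all, or it happens
    // client-side (JavaScript / meta refresh) and needs a closer look.
    return { type: 'none-or-client-side', target: null };
}
```

If this reports 'http', a plain HTTP client should be enough; otherwise the redirect most likely happens in the page itself.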

Also, here is an example of how to handle the redirect in a better way:
Plain Text
await Promise.all([
  // No await inside the array, so the click and the wait run concurrently
  page.click('my-button'),
  page.waitForURL(`**/preventivo${pageNumber}.html`, {
      waitUntil: 'networkidle', // or any other load state
      timeout: 20000,
  }),
]);


docs:
https://playwright.dev/docs/api/class-frame#frame-wait-for-url
I looked at the docs, and the issue with the regex approach is that we don't have any clue which site the redirect will go to. Each poster can include a different link for the redirect; the site just doesn't make the link available in the HTML. Only once you click the button is the redirect processed, and then we get the correct URL.

Thanks for the answer. got can't pick up any JS redirects, which is why it wasn't working. It looks like getting the URL from page.url() is the only way.
Maybe then just use a RegExp that matches all URLs.
Anyway, there is the waitUntil option to wait for.
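
An alternative to a catch-all pattern: as far as I know, Playwright's waitForURL also accepts a predicate function that receives a URL object, so you can wait until the new tab has left both about:blank and the internal link, without knowing the target site. A rough sketch, assuming the button opens a new tab and the final page lives on a different origin than the internal link; `getFinalUrlFromPopup`, `internalOrigin` and the 20 second timeout are made up for illustration:

```javascript
// Hypothetical helper: click a button that opens a new tab and return the
// URL that tab ends up on after its redirect.
// `page` is a Playwright Page, `button` is a Locator (or ElementHandle),
// `internalOrigin` is the origin of the site's internal redirect links.
// Assumption: the final page is on a different origin than the internal link.
async function getFinalUrlFromPopup(page, button, internalOrigin, timeout = 20000) {
    // Start listening before the click so the new tab is not missed.
    const newPagePromise = page.context().waitForEvent('page');
    await button.click();
    const newPage = await newPagePromise;
    try {
        // Resolves once the tab is on an http(s) URL outside the internal origin.
        await newPage.waitForURL(
            (url) => url.protocol.startsWith('http') && url.origin !== internalOrigin,
            { timeout, waitUntil: 'commit' },
        );
        return newPage.url();
    } finally {
        // Close the tab even if the wait timed out.
        await newPage.close();
    }
}
```

Usage would be something like `const finalUrl = await getFinalUrlFromPopup(page, applyNowBtn, new URL(page.url()).origin);`, which replaces the fixed waitForTimeout(3000) with a wait that ends as soon as the redirect lands.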

Or you can use waitForNavigation():
https://playwright.dev/docs/api/class-page#page-wait-for-navigation
It's deprecated but still works.

Also, try checking the HTML of the page you get with a plain HTTP request. Sometimes the URL of the final (redirected) page is in there somewhere (I had such a case once).
Otherwise, I guess using Playwright is the only way to catch the redirect in your case.
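
To illustrate the "check the HTML" idea, here is a minimal sketch that scans the raw HTML for two common client-side redirect patterns (a meta refresh tag and a literal location assignment). `findRedirectTarget` is a made-up helper and the patterns are assumptions; real pages vary a lot, so treat it as a starting point only:

```javascript
// Hypothetical helper: look for a client-side redirect target in raw HTML.
// Only covers a meta refresh tag (with http-equiv before content) and
// string-literal location assignments / replace() / assign() calls.
function findRedirectTarget(html, baseUrl) {
    const patterns = [
        // <meta http-equiv="refresh" content="0; url=https://...">
        /<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["'][^"']*?url=([^"'>\s]+)/i,
        // window.location = "..." or location.href = '...'
        /location(?:\.href)?\s*=\s*["']([^"']+)["']/i,
        // location.replace("...") or location.assign("...")
        /location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)/i,
    ];
    for (const pattern of patterns) {
        const match = html.match(pattern);
        if (match) {
            // Resolve relative targets against the URL the HTML came from.
            return new URL(match[1], baseUrl).href;
        }
    }
    // Nothing recognizable found; a real browser is probably needed.
    return null;
}
```

If this returns null for the body you got from sendRequest, the target is probably computed at click time and the browser route is the way to go.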
I'll check those out, thanks. And yeah, Playwright/a browser is the only way to handle this. My current way works fine; it's just that those "can't set cookies" debug logs are creating a lot of noise.