Scrape redirect links gracefully?

At a glance

The community member has a site that provides internal links that redirect to the actual page, and they want to scrape the redirected link. They tried using the sendRequest API, but the response's URL only had the internal URL, not the actual link. They then tried using Playwright, which worked, but they encountered "can't set cookie" errors in the logs, creating unnecessary noise.

The community members discussed several approaches, including using the waitForURL method in Playwright, and checking the HTML of the page to see if the final URL is available. However, there is no explicitly marked answer, as the issue seems to be specific to the community member's use case, and the best approach may depend on the type of redirection being used.


AltairSama2

Hey folks,

I have a site that hands out internal links which then redirect to the actual page, and I want to scrape the redirected link. I tried the sendRequest API, but the response's URL was only the internal URL, not the actual link, so now I'm doing it via Playwright like this:

Plain Text

        const context = page.context();
        // Start listening before the click so the new tab is not missed.
        const newPagePromise = context.waitForEvent('page');
        await applyNowBtn.click();
        const newPage = await newPagePromise;
        // Fixed delay to let the redirect finish.
        await newPage.waitForTimeout(3000);
        log.info(`url is ${newPage.url()}`);
        await newPage.close();

The issue with this approach (aside from having to use a timeout) is that Crawlee logs these "can't set cookie" errors, which add a lot of unnecessary noise to my logs. Is there a better approach to handle this?

4 comments

Oleg V.

This discussion might be useful:
https://stackoverflow.com/questions/64657669/how-to-get-redirected-final-url-with-got-in-node-js
But each case is unique, of course.

As far as I know, sendRequest uses the got package, and its followRedirect option is true by default, so it should follow HTTP redirects.
But that doesn't work if the redirect is triggered by JavaScript.
So try to work out which type of redirect you have; see the discussion above.

Also, here is an example of how to handle the redirect in a better way:

Plain Text

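await Promise.all([
  // No inner awaits: both promises must start together,
  // otherwise the click finishes before the wait is armed.
  page.click('my-button'),
  page.waitForURL(`**/preventivo${pageNumber}.html`, {
      waitUntil: 'networkidle', // or any other load state
      timeout: 20000,
  }),
]);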

docs:
https://playwright.dev/docs/api/class-frame#frame-wait-for-url
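
To tell the two redirect types apart, one rough check is to follow the link over plain HTTP and see whether the final URL differs from the internal one. The sketch below uses the standard Fetch API rather than got or sendRequest, purely to stay dependency-free; `followHttpRedirects` and the injectable `fetchImpl` parameter are hypothetical names, not part of Crawlee.

```javascript
// Hypothetical helper: follow HTTP-level redirects and report where we ended up.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function followHttpRedirects(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url, { method: 'GET', redirect: 'follow' });
  // Only the final URL matters here, so discard the body if there is one.
  if (res.body && typeof res.body.cancel === 'function') {
    await res.body.cancel();
  }
  return { finalUrl: res.url, redirected: res.redirected, status: res.status };
}
```

If `redirected` comes back false and `finalUrl` is still the internal link, the redirect is probably done by JavaScript (or a meta refresh) in the page, which a plain HTTP client will not follow.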

AltairSama2

I looked at the docs, and the issue with the regex approach is that we have no clue which site the redirect will lead to; each poster can include a different link. The site just doesn't make the link available in the HTML: only once you click the button is the redirect processed, and then we get the correct URL.

Thanks for the answer. got can't pick up JS redirects, which is why it wasn't working. It looks like getting the URL from page.url() is the only way.
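
When the destination is unknown, one option is to wait for the tab to leave the internal host instead of matching a specific URL: Playwright's waitForURL also accepts a predicate that receives a URL object. The sketch below is a rough illustration, assuming `newPage` is the page obtained from `context.waitForEvent('page')`; `leftHost`, `resolveExternalUrl` and `internalHost` are hypothetical names. It replaces the fixed timeout but is not claimed to affect the cookie log messages.

```javascript
// Build a predicate that is true once the URL is an http(s) URL on another host.
// The protocol check skips the initial about:blank of a freshly opened tab.
function leftHost(internalHost) {
  return (url) => url.protocol.startsWith('http') && url.hostname !== internalHost;
}

// Wait until the new tab has navigated away from `internalHost`,
// return its URL, and always close the tab afterwards.
async function resolveExternalUrl(newPage, internalHost, timeout = 20000) {
  try {
    await newPage.waitForURL(leftHost(internalHost), {
      waitUntil: 'domcontentloaded',
      timeout,
    });
    return newPage.url();
  } finally {
    await newPage.close();
  }
}
```

In the original snippet this would stand in for the waitForTimeout, url() and close() calls, for example `log.info(await resolveExternalUrl(newPage, new URL(page.url()).hostname))`.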

Oleg V.

Then maybe just use a RegExp that matches all URLs.
Either way, there is the waitUntil option to wait for.

Or you can use waitForNavigation():
https://playwright.dev/docs/api/class-page#page-wait-for-navigation
It's deprecated but still works.

Also, check the HTML of the page that you get with a plain HTTP request. Sometimes the URL of the final (redirected) page is somewhere in there (I had such a case once).
Otherwise, I guess using Playwright is the only way to catch the redirect in your case.
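
As a rough illustration of checking the HTML from a plain HTTP request, the sketch below looks for two common places a target URL can hide: a meta refresh tag and a simple `location` assignment in an inline script. `extractRedirectTarget` is a hypothetical helper, and the patterns are assumptions that only cover simple cases (for example, http-equiv appearing before content); real pages may build the URL differently.

```javascript
// Hypothetical helper: look for a redirect target embedded in raw HTML.
// Returns an absolute URL string, or null when nothing recognizable is found.
function extractRedirectTarget(html, baseUrl) {
  const patterns = [
    // <meta http-equiv="refresh" content="0; url=https://...">
    /<meta[^>]+http-equiv=["']?refresh["']?[^>]*content=["'][^"']*?url=([^"'>\s]+)/i,
    // window.location = "..." or location.href = '...'
    /location(?:\.href)?\s*=\s*["']([^"']+)["']/i,
    // location.replace("...") or location.assign("...")
    /location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)/i,
  ];
  for (const re of patterns) {
    const match = re.exec(html);
    // Resolve relative targets against the page the HTML came from.
    if (match) return new URL(match[1], baseUrl).href;
  }
  return null;
}
```

If this returns null for the internal link's HTML, the target is likely only produced after the click, and the browser-based approach remains necessary.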

AltairSama2

I'll check those out, thanks. And yeah, Playwright/a browser is the only way to handle this. My current way works fine; it's just that those "can't set cookie" debug logs are creating a lot of noise.


Apify Discord Mirror
