infinite_scroll | how to get the updated page

Hey, I created this simple script:


import asyncio

# Instead of BeautifulSoupCrawler let's use Playwright to be able to render JavaScript.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
import urllib.parse


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait until network activity settles so the initial results are
        # rendered, then start scrolling to trigger lazy loading.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))


But I want to read the content of the page while infinite_scroll is scrolling it, so that I can see the new content and act on it. However, await context.infinite_scroll() never returns, so I can't put any code after it to do what I want. How can I handle that? (I want to collect the links of the newly loaded YouTube videos.)

2 comments

Oleg V.

In general, infinite scroll is not reliable: it can be truly infinite, which can lead to memory leaks.

On websites with lazy-loading pagination, if API scraping is a viable option, it is a much better approach due to reliability and performance.

AFAIK, YouTube has API endpoints for scrolling results. Try checking in DevTools.

Otherwise, I'd advise creating your own scrolling logic that processes new items in a loop.
E.g. here (it's JS, but the logic remains the same):
https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results#lazy-loading-pagination
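In Python, the same idea could look roughly like the sketch below. It is only an illustration, not the linked tutorial's code: the function name `scroll_and_collect`, the `a[href*='/watch?v=']` selector, the scroll distance and the stop conditions are all assumptions to adjust. It expects a Playwright page (for example `context.page` inside the handler) and calls a hypothetical `on_new_links` coroutine each time new links show up.

```python
import asyncio


async def scroll_and_collect(page, on_new_links, max_rounds=20, idle_rounds=3, pause_ms=1000):
    """Scroll step by step and pass newly seen links to `on_new_links`.

    `page` is expected to behave like a Playwright Page. The selector below
    is an assumption about YouTube's markup and may need adjusting.
    """
    seen = set()
    idle = 0
    for _ in range(max_rounds):
        # Read every matching link currently present in the DOM.
        hrefs = await page.eval_on_selector_all(
            "a[href*='/watch?v=']",
            "els => els.map(e => e.href)",
        )
        # Keep only links not seen in earlier rounds, preserving order.
        new_links = [h for h in dict.fromkeys(hrefs) if h not in seen]
        if new_links:
            seen.update(new_links)
            idle = 0
            await on_new_links(new_links)
        else:
            idle += 1
            # Stop once several rounds in a row produced nothing new.
            if idle >= idle_rounds:
                break
        # Scroll down to trigger lazy loading, then give the page time to load.
        await page.mouse.wheel(0, 4000)
        await page.wait_for_timeout(pause_ms)
    return seen
```

Inside the request handler this would replace `await context.infinite_scroll()`, e.g. `await scroll_and_collect(context.page, handle_links)`, where `handle_links` is your own coroutine. The `max_rounds` cap keeps the loop bounded even if the page never stops loading.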

Jourdelune

Thanks for the answer! I found a workaround (context.page.on("request", track_request)), but I will try to use the API endpoint for that.
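For anyone landing here later, a minimal sketch of what such a listener could look like. The thread does not show the actual `track_request`, so the factory `make_request_tracker` and the `url_fragment` filter are assumptions for illustration; the fragment to match has to be found in DevTools.

```python
def make_request_tracker(matched_urls, url_fragment):
    """Build a listener for page.on("request", ...) that records matching URLs."""

    def track_request(request):
        # Playwright passes a Request object; only its `url` attribute is used here.
        if url_fragment in request.url:
            matched_urls.append(request.url)

    return track_request
```

The listener has to be registered before scrolling starts, e.g. `context.page.on("request", make_request_tracker(urls, fragment))` placed ahead of `await context.infinite_scroll()`, so it fires while the scroll is still running.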


Apify and Crawlee Official Forum
