Jourdelune


infinite_scroll | how to get the updated page

Hey, I created this simple script:


import asyncio
import urllib.parse

# Use Playwright instead of BeautifulSoupCrawler so that JavaScript gets rendered.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main(terms: str) -> None:
    crawler = PlaywrightCrawler(headless=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Wait until network activity settles so the first results are rendered,
        # then start scrolling to trigger loading of more results.
        await context.page.wait_for_load_state("networkidle")
        await context.infinite_scroll()

    url = f"https://www.youtube.com/results?search_query={urllib.parse.quote(terms)}&sp=EgIwAQ%253D%253D"
    await crawler.run([url])


if __name__ == "__main__":
    asyncio.run(main("music"))


But I want to read the content of the page while infinite_scroll is scrolling it, so that I can see the new content and act on it. The problem is that await context.infinite_scroll() never stops, so I can't put any code after it to run what I want. How can I manage that? (I want to get the links of the newly loaded YouTube videos.)
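One possible way around this, sketched under assumptions rather than taken from the Crawlee docs: replace the single infinite_scroll() call with a manual loop that scrolls one step, reads the links currently in the DOM, and hands only the new ones to a callback. The helper name scroll_and_collect, its parameters, and the selector a#video-title are hypothetical; the page calls used (eval_on_selector_all, mouse.wheel, wait_for_timeout) are regular Playwright Page methods, and context.page is that Page object.

```python
from typing import Any, Awaitable, Callable


async def scroll_and_collect(
    page: Any,  # a Playwright Page, e.g. context.page inside the handler
    selector: str,  # assumption: a selector matching the result links
    on_new_links: Callable[[list[str]], Awaitable[None]],
    max_rounds: int = 50,
    max_idle_rounds: int = 3,
    pause_ms: int = 1000,
) -> set[str]:
    """Scroll step by step and report links that appear after each step."""
    seen: set[str] = set()
    idle_rounds = 0
    for _ in range(max_rounds):
        # Read the href of every element currently matching the selector.
        hrefs = await page.eval_on_selector_all(
            selector, "els => els.map(el => el.href).filter(Boolean)"
        )
        # Keep page order, drop duplicates and links already reported.
        new_links = [h for h in dict.fromkeys(hrefs) if h not in seen]
        if new_links:
            seen.update(new_links)
            idle_rounds = 0
            await on_new_links(new_links)
        else:
            idle_rounds += 1
            # Stop once several scrolls in a row brought nothing new.
            if idle_rounds >= max_idle_rounds:
                break
        # Scroll down and give the page some time to load more results.
        await page.mouse.wheel(0, 10_000)
        await page.wait_for_timeout(pause_ms)
    return seen
```

Inside request_handler this would be called as await scroll_and_collect(context.page, "a#video-title", handle_links) in place of await context.infinite_scroll(), where handle_links is your own async function. The loop ends by itself after max_rounds scrolls or after max_idle_rounds scrolls without new links, so code placed after it does run.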


Jourdelune


Robots.txt

Hey, do you have any idea how to respect robots.txt? Do we have to code that ourselves?
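If it does have to be done by hand, a minimal sketch could rely on Python's standard urllib.robotparser module and filter URLs before they are passed to the crawler. This assumes the crawler does not already apply robots.txt rules for you; the helper names robots_url_for, build_parser, fetch_parser and filter_allowed are made up for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(url: str) -> str:
    """Return the robots.txt location for the site that serves `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt content that is already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_parser(url: str) -> RobotFileParser:
    """Download and parse the robots.txt of the site that serves `url`."""
    parser = RobotFileParser()
    parser.set_url(robots_url_for(url))
    parser.read()  # performs a blocking HTTP request
    return parser


def filter_allowed(
    urls: list[str], parser: RobotFileParser, user_agent: str = "*"
) -> list[str]:
    """Keep only the URLs that the rules allow for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

For example, crawler.run(filter_allowed(urls, fetch_parser(urls[0]))) would drop disallowed start URLs, assuming all of them belong to the same site. Links discovered later during the crawl would need the same check before being enqueued.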


Apify and Crawlee Official Forum