Best Crawler for Youtube JS?

Mike Bruggs

I'm looking to scrape YouTube captions. There is a timedtext URL in the page JS, but it doesn't seem to show up in the response to a standard HTTP request made with curl or similar tools.

Can you please suggest a method?

Here is my routes.ts file:

import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ request, page, log }) => {
    log.info(`Default handler triggered for: ${request.url}`);

    // Wait for the watch page to render (adjust the selector and timeout as necessary)
    await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 });

    log.info(`Page is ready for processing: ${request.url}`);

    // Execute JS in the page context to find the caption URL
    const captionUrl = await page.evaluate(() => {
        const scripts = Array.from(document.querySelectorAll('script'));
        for (const script of scripts) {
            // Slashes, dots, and the "?" must be escaped inside a regex literal
            const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
            if (match) {
                return match[0];
            }
        }
        return null;
    });

    if (captionUrl) {
        log.info(`Found caption URL: ${captionUrl}`);

        await Dataset.pushData({
            url: request.loadedUrl,
            title: await page.title(),
            captionUrl,
        });
    } else {
        log.warning(`No caption URL found on ${request.loadedUrl}`);
    }
});

2 comments

Pepa J

Hi @Mike Bruggs
Discord probably mangled your code, but I tried the following:


// await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary

console.log(`Page is ready for processing:`);

console.log("fdsfds");
// Execute JS in the page context to find the caption URL
const captionUrl = await page.evaluate(() => {
    const scripts = Array.from(document.querySelectorAll('script'));
    for (const script of scripts) {
        const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
        if (match) {
            return match[0];
        }
    }
    return null;
});

console.log("fdsfds", captionUrl);

if (captionUrl) {
    console.log(`Found caption URL: ${captionUrl}`);

    console.log("jhdgfjdfdsfds");

    console.log({
        url: request.loadedUrl,
        title: await page.title(),
        captionUrl,
    });
    console.log("g7fd8sgfdsfds");
} else {
    console.log(`No caption URL found on ${request.loadedUrl}`);
}

I ran this in my Puppeteer extension and it worked well.

karamelo

Why use Puppeteer or Playwright? They consume more time and resources. Unless you are scraping an embedded video, rendering JS is not necessary in this case, and you don't need YouTube's internal API or public API either.
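For illustration, here is a minimal sketch of the browserless approach suggested above. It assumes the timedtext URL is embedded in the raw watch-page HTML as a JSON-escaped string (so "&" shows up as "\u0026" and "/" may show up as "\/"); whether that holds for a given request is not confirmed in this thread. The function name extractCaptionUrl, the sample HTML, and the video ID abc123 are hypothetical and only serve the example. The HTML itself could come from any HTTP client.

```typescript
// Hypothetical helper: pulls the first timedtext URL out of raw HTML text.
// Assumption: the URL sits inside a JSON string, so it ends at the next
// double quote and may contain "\u0026" and "\/" escape sequences.
function extractCaptionUrl(html: string): string | null {
    // Accept both plain "/" and JSON-escaped "\/" as path separators.
    const pattern =
        /https:(?:\\\/|\/){2}www\.youtube\.com(?:\\\/|\/)api(?:\\\/|\/)timedtext\?[^"]+/;
    const match = html.match(pattern);
    if (!match) {
        return null;
    }
    // Undo the JSON escaping so the result is a usable URL.
    return match[0]
        .replace(/\\u0026/g, '&')
        .replace(/\\\//g, '/');
}

// Made-up sample input that mimics a JSON blob inside a script tag.
const sampleHtml =
    '<script>var data = {"baseUrl":"https://www.youtube.com/api/timedtext?v=abc123\\u0026lang=en"};</script>';

console.log(extractCaptionUrl(sampleHtml));
// prints: https://www.youtube.com/api/timedtext?v=abc123&lang=en
```

If the function returns null for the HTML you get back, the page was probably served without the embedded player data (for example a consent or error page), and the browser-based handler is the fallback.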


Apify and Crawlee Official Forum
