Apify and Crawlee Official Forum

Updated 3 months ago

Best Crawler for YouTube JS?

I'm looking to scrape YouTube captions. There is a timedtext URL in the page's JS, but it doesn't seem to show up in the response to a standard HTTP request (curl, etc.).

Can you please suggest a method?

Here is my routes.ts file:

import { Dataset, createPuppeteerRouter } from 'crawlee';

export const router = createPuppeteerRouter();

router.addDefaultHandler(async ({ request, page, log }) => {
    log.info(`Default handler triggered for: ${request.url}`);

    // Wait for the page to fully load
    await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary

    log.info(`Page is ready for processing: ${request.url}`);

    // Execute JS in the page context to find the caption URL
    const captionUrl = await page.evaluate(() => {
        const scripts = Array.from(document.querySelectorAll('script'));
        for (const script of scripts) {
            // Slashes, dots, and the "?" must be escaped inside the regex literal
            const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
            if (match) {
                return match[0];
            }
        }
        return null;
    });

    if (captionUrl) {
        log.info(`Found caption URL: ${captionUrl}`);

        await Dataset.pushData({
            url: request.loadedUrl,
            title: await page.title(),
            captionUrl,
        });
    } else {
        log.warning(`No caption URL found on ${request.loadedUrl}`);
    }
});
2 comments
Hi @Mike Bruggs
Discord probably mangled your code, but I tried this:
// await page.waitForSelector('ytd-watch-flexy', { timeout: 10000 }); // Adjust the selector and timeout as necessary

console.log(`Page is ready for processing:`);

console.log("fdsfds");
// Execute JS in the page context to find the caption URL
const captionUrl = await page.evaluate(() => {
    const scripts = Array.from(document.querySelectorAll('script'));
    for (const script of scripts) {
        const match = script.textContent?.match(/https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/);
        if (match) {
            return match[0];
        }
    }
    return null;
});

console.log("fdsfds", captionUrl);

if (captionUrl) {
    console.log(`Found caption URL: ${captionUrl}`);

    console.log("jhdgfjdfdsfds");

    console.log({
        url: request.loadedUrl,
        title: await page.title(),
        captionUrl,
    });
    console.log("g7fd8sgfdsfds");
} else {
    console.log(`No caption URL found on ${request.loadedUrl}`);
}

in my Puppeteer extension, and it worked well.
Why use Puppeteer or Playwright? They consume more time and resources. Unless you are scraping an embedded video, rendering JS is not necessary in this case, and you don't need YouTube's internal API or public API.
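For what it's worth, here is a minimal sketch of that browser-free approach. It assumes the timedtext URL is present in an inline script of the raw HTML and that `&` is escaped there as `\u0026`; the original poster reports not seeing it with curl, so treat both as assumptions to verify. `extractCaptionUrl` and `fetchCaptionUrl` are hypothetical helper names, and `fetch` is the global one available in Node 18+.

```typescript
// Matches a timedtext URL up to the closing quote of the JSON string it sits in.
const TIMEDTEXT_RE = /https:\/\/www\.youtube\.com\/api\/timedtext\?[^"]+/;

// Pure helper: pull the caption URL out of an HTML string, or return null.
function extractCaptionUrl(html: string): string | null {
    const match = html.match(TIMEDTEXT_RE);
    if (!match) return null;
    // Assumption: the URL sits inside inline JSON where "&" is written as \u0026.
    return match[0].replace(/\\u0026/g, '&');
}

// Plain HTTP request, no browser. The response may differ from what a real
// browser receives (for example a consent page), in which case this returns null.
async function fetchCaptionUrl(videoUrl: string): Promise<string | null> {
    const response = await fetch(videoUrl, { headers: { 'accept-language': 'en' } });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return extractCaptionUrl(await response.text());
}
```

If `fetchCaptionUrl` returns null for a video that does have captions, the plain response probably lacks the player data, and the Puppeteer route above is the fallback.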