smasher

Crawler with playwright doesn't stop

I'm developing a Playwright scraper to do some basic tasks. After it finishes with the URLs, it doesn't stop: my terminal stays stuck until I press CTRL + C.
Is there a flag I should enable?

ssmasher

What is the router for? How does it work?

I created a new crawler with the npx crawlee create project command, which generates some folders and files. One of them is router.js, which contains an instance of createPlaywrightRouter:

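// Both names are used below, so they need to be imported from crawlee.
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();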

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});

As I understand it, you create the default handler, which is kind of the "main" listener, and the "detail" route is then invoked through the label passed to enqueueLinks. This could be useful for splitting the process into more "routes"/steps, so it stays clean and decoupled.
My question is: how do I call or invoke a route without the enqueueLinks function?
I was expecting something like:
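To make that mental model concrete, here is a minimal sketch of label-based routing in plain JavaScript. It is not Crawlee's implementation: `createTinyRouter`, `runTinyCrawler`, the in-memory queue, and the fake `enqueueLinks` are hypothetical stand-ins. The only assumption is that each queued request may carry a `label`, and that the router picks the handler registered for that label, falling back to the default handler.

```javascript
// Simplified model of a label-based router (hypothetical, not Crawlee's code).
function createTinyRouter() {
    const handlers = new Map(); // label -> handler
    let defaultHandler = null;

    // The router is itself a function: it receives the context of one request
    // and forwards it to the handler registered for that request's label.
    const router = async (ctx) => {
        const handler = handlers.get(ctx.request.label) ?? defaultHandler;
        if (!handler) throw new Error(`No handler for label "${ctx.request.label}"`);
        await handler(ctx);
    };
    router.addHandler = (label, handler) => handlers.set(label, handler);
    router.addDefaultHandler = (handler) => { defaultHandler = handler; };
    return router;
}

// Stand-in for the crawler loop: take requests off a queue one at a time
// and hand each to the router until the queue is empty.
async function runTinyCrawler(router, startUrls) {
    const queue = startUrls.map((url) => ({ url })); // no label -> default handler
    const visited = [];
    while (queue.length > 0) {
        const request = queue.shift();
        visited.push(`${request.label ?? 'default'}:${request.url}`);
        await router({
            request,
            // Fake enqueueLinks: it only queues the given URLs with a label.
            enqueueLinks: async ({ urls, label }) => {
                for (const url of urls) queue.push({ url, label });
            },
        });
    }
    return visited;
}

const tinyRouter = createTinyRouter();
tinyRouter.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ urls: ['https://crawlee.dev/docs'], label: 'detail' });
});
tinyRouter.addHandler('detail', async ({ request }) => {
    console.log(`detail handler got ${request.url}`);
});
```

In this model a handler never calls another handler directly. It queues a request with a label, and the loop later dispatches that request to the matching handler.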

router.addDefaultHandler(async (ctx) => {
    await ctx.invoke('extract-meta-data')
    await ctx.invoke('extract-detail')
    await ctx.invoke('download-files')
});

Where can I see which functions this ctx supports? Or maybe I have misunderstood the router entirely.
Thanks 🙂
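As far as I know, the handler context has no `invoke` method; handlers are selected by the label of a queued request. If the goal is only to split the work on a single page into steps, one option is ordinary async functions that all receive the same context. A minimal sketch, in which `extractMetaData`, `extractDetail`, `processPage`, and the fake context at the bottom are hypothetical names for illustration. In a real project `processPage` would be passed to `router.addDefaultHandler` and would receive the real `page` and `request`:

```javascript
// Each step is a plain async function that takes the handler context.
async function extractMetaData({ page }) {
    return { title: await page.title() };
}

async function extractDetail({ request }) {
    return { url: request.loadedUrl };
}

// The handler body calls the steps in order and merges their results.
async function processPage(ctx) {
    const meta = await extractMetaData(ctx);
    const detail = await extractDetail(ctx);
    return { ...detail, ...meta };
}

// Fake context so the sketch runs without a browser (assumption for the demo).
const fakeCtx = {
    page: { title: async () => 'Crawlee' },
    request: { loadedUrl: 'https://crawlee.dev' },
};

processPage(fakeCtx).then((item) => console.log(item));
```

The steps stay decoupled and individually testable, and nothing goes back through the request queue. Labeled routes remain the better fit when a step has to run on a different URL.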

Apify Discord Mirror
