Router for what? How does it work?

I created a new crawler using the npx crawlee create project command. It creates some folders and files, including a router.js file, which contains an instance of createPlaywrightRouter:
Plain Text
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});

As I understand it, the default handler is kind of the "main" listener, and the "detail" route is later called/invoked via the enqueueLinks function. This could be interesting for splitting the process into more "routes"/steps, so it can be cleaner and more decoupled later.
My question is: how do I call or invoke a route without the enqueueLinks function?
I was expecting something like:
Plain Text
router.addDefaultHandler(async (ctx) => {
    await ctx.invoke('extract-meta-data')
    await ctx.invoke('extract-detail')
    await ctx.invoke('download-files')
});

Where can I see the functions this ctx accepts? Or maybe I have understood the router totally differently.
Thanks πŸ™‚
Hi, the idea behind the router is to have different routes for different types of requests, depending on the label you enqueue the request with. It is pretty much equivalent to this code:
Plain Text
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    requestHandler: ({ request }) => {
        // `label` is a shortcut for `request.userData.label`
        const label = request.label;

        switch (label) {
            case 'route1':
                doSomething1();
                break;
            case 'route2':
                doSomething2();
                break;
            default:
                doSomethingDefault();
                break;
        }
    },
});


If you want to create a flow like in your code snippet, each route should just enqueue the next request with the next route's label.
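For example, here is a minimal sketch of such chaining. The step labels ('extract-meta-data', 'download-files') are hypothetical, and the same URL is re-enqueued under a new label via crawler.addRequests; since the URL does not change, each request needs its own uniqueKey:
Plain Text
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ request, crawler }) => {
    // step 1 done, hand the same page over to the next step
    await crawler.addRequests([{
        url: request.url,
        label: 'extract-meta-data',
        uniqueKey: `${request.url}#meta`,
    }]);
});

router.addHandler('extract-meta-data', async ({ request, crawler }) => {
    // ... extract metadata here ...
    await crawler.addRequests([{
        url: request.url,
        label: 'download-files',
        uniqueKey: `${request.url}#files`,
    }]);
});

router.addHandler('download-files', async ({ request, log }) => {
    // ... final step of the flow ...
    log.info(`finished flow for ${request.url}`);
});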
Do you have some real example on GitHub that uses this route feature with Crawlee?
To understand the crawler's Router, you also need to understand the RequestQueue:

Let's say that the RequestQueue is just a queue of Request information. One piece of that information is the url, but another is the name of the request handler that is going to be used to process the request. The name of the request handler is provided via the label attribute.
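
For illustration, here is a minimal sketch of how a labeled Request reaches a specific handler, assuming the router from the generated project is wired in as the crawler's requestHandler:
Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js'; // or './router.js', depending on your template

const crawler = new PlaywrightCrawler({
    // the router dispatches each request to the handler
    // registered for that request's label
    requestHandler: router,
});

// this request carries the 'detail' label, so the router
// runs the 'detail' handler on it; without a label, the
// default handler would run instead
await crawler.run([
    { url: 'https://crawlee.dev/', label: 'detail' },
]);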

It is not really related to the event-oriented approach you suggest. Processing of a Request always starts and ends in one single request handler. You may of course enqueue the same request again with a different label (and a different uniqueKey), as shown in the sketch above.

So the Router only processes Requests from the RequestQueue. Once a Request is successfully processed, its lifetime ends. Once there are no Requests left in the RequestQueue, the Crawler finishes.

You may of course build your own state machine implementation inside the request handler, similar to what you suggested, but I am not sure whether that is what the original question was about.
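For completeness, a rough sketch of such an in-handler pipeline; the step functions here are hypothetical helpers, not part of the Crawlee API:
Plain Text
router.addHandler('detail', async (ctx) => {
    // run each step in order, passing the same crawling context;
    // the whole flow happens within one request handler
    const steps = [extractMetaData, extractDetail, downloadFiles];
    for (const step of steps) {
        await step(ctx);
    }
});

// hypothetical step functions
async function extractMetaData({ page, log }) {
    log.info(`meta: ${await page.title()}`);
}

async function extractDetail({ page }) {
    // ... extract details from the page ...
}

async function downloadFiles({ page }) {
    // ... download linked files ...
}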