Apify Discord Mirror

Updated 5 months ago

Router for what? How it works?

At a glance

The community member created a new crawler using the Crawlee library, which generated a router.js file with a default handler that enqueues new URLs with the 'detail' label. The community member then added a handler for the 'detail' route, which logs the page title and saves the URL and title to a dataset.

The community member's question is how to call or invoke the router without using the enqueueLinks function, and where they can see the functions that the 'ctx' object in the default handler admits.

In the comments, other community members explain that the router is used to handle different types of requests based on the label assigned to them. They suggest that to create a flow like the one in the code snippet, the community member should enqueue the next request with the next route label in each route handler.

One community member also provides more details on how the RequestQueue and Router work together, explaining that the Router only processes requests from the RequestQueue, and that the processing of a request starts and ends within a single RequestHandler.

There is no explicitly marked answer in the comments.

I created a new crawler using the npx crawlee create project command. It creates some folders and files, including a router.js file that contains an instance of createPlaywrightRouter:
Plain Text
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await Dataset.pushData({
        url: request.loadedUrl,
        title,
    });
});

As I understand it, you create the default handler, which is kind of the "main" listener, and later your 'detail' route is called/invoked through the enqueueLinks function. This could be interesting for splitting your process into more "routes"/steps, so it can be cleaner and more decoupled later.
My question is: how can I call or invoke a route without the enqueueLinks function?
I was expecting something like:
Plain Text
router.addDefaultHandler(async (ctx) => {
    await ctx.invoke('extract-meta-data')
    await ctx.invoke('extract-detail')
    await ctx.invoke('download-files')
});

Where can I see the functions this ctx admits? Or maybe I understood the router totally differently.
Thanks πŸ™‚
4 comments
Hi, the idea behind the router is to have different routes for different types of requests, depending on the label you enqueue the request with. It is pretty much the equivalent of this code:
Plain Text
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    requestHandler: ({ request }) => {
        // The label travels with the request (shorthand for request.userData.label).
        const label = request.label;

        switch (label) {
            case 'route1':
                doSomething1();
                break;
            case 'route2':
                doSomething2();
                break;
            default:
                doSomethingDefault();
                break;
        }
    },
});


If you want to create a flow like the one in your code snippet, you should just enqueue the next request with the next route label in each route handler, as in the sketch below.
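For illustration, here is a minimal sketch of such a flow. The step labels are hypothetical, borrowed from your snippet; each handler re-enqueues the same URL under the next label, with a distinct uniqueKey so the queue accepts the URL again:
Plain Text
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ request, addRequests }) => {
    // First step done; hand the same page off to the next step.
    await addRequests([{
        url: request.loadedUrl,
        label: 'extract-detail',
        uniqueKey: `${request.loadedUrl}#detail`, // without this, the queue would deduplicate the URL
    }]);
});

router.addHandler('extract-detail', async ({ request, addRequests }) => {
    // Second step done; enqueue the final step.
    await addRequests([{
        url: request.loadedUrl,
        label: 'download-files',
        uniqueKey: `${request.loadedUrl}#files`,
    }]);
});

router.addHandler('download-files', async ({ request, log }) => {
    log.info(`final step for ${request.loadedUrl}`);
});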
Do you have some real example on GitHub using this router feature with Crawlee?
To understand the crawler's Router, you also need to understand the RequestQueue:

Let's say that the RequestQueue is just a queue of Request information. One piece of that information is the url, but another is the name of the RequestHandler that is going to be used to process the request; that name is provided in the label attribute.

It is not really related to the event-oriented approach you suggest. Processing of a Request always starts and ends within one single RequestHandler. You may of course enqueue the same request again with a different label (and uniqueKey).

So the Router is only processing Requests from the RequestQueue. Once a Request is successfully processed, its lifetime ends. Once there are no Requests left in the RequestQueue, the Crawler ends; the sketch below illustrates that wiring.
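As a rough sketch of that lifecycle (assuming the standard template layout, where the router is exported from routes.js):
Plain Text
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({
    // Each Request pulled from the RequestQueue is dispatched to the
    // handler whose name matches the Request's label.
    requestHandler: router,
});

// Seed the RequestQueue; run() resolves once the queue is drained.
await crawler.run(['https://crawlee.dev']);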

You may of course build your own state-machine implementation inside the RequestHandler, similar to what was suggested above, but I am not sure if that is what the original question was about.