Apify and Crawlee Official Forum


Structure Crawlers to scrape multiple sites

Hey everyone,

What is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here and it seems the recommendation is to use a single crawler with multiple routers. The issue I'm facing with this is the request queue: you enqueue site-1 and then site-2 initially before starting the crawler, and the crawler then dynamically adds links as needed. Since the queue is FIFO, it crawls the first link, adds its extracted links to the queue, then crawls the second link and adds its links, and so on, constantly switching context between the two sites, which makes the logs a mess. Also, routers don't seem to have a URL parameter, just a label and the request, so we would basically have to define handlers for every site in a single router, right? That just bloats up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel, while keeping the logging sane for each of them.
21 comments
Not sure if I fully understand your question. You may use one requestHandler for everything and decide which function to call based on the request URL.

Plain Text
        requestHandler: async (context) => {
            const { request } = context;

            if (/mydomain1\.com/.test(request.url)) {
                await processSite1(context);
            } else if (/mydomain2\.com/.test(request.url)) {
                await processSite2(context);
            }
        }

This will not solve the logs "issue", but you might run the scraper as two different instances with different input, or implement your own logger that puts logs into different files.
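
As an illustration of that custom-logging idea (not from the thread), here is a minimal sketch that prefixes every log line with the site it belongs to. It assumes Crawlee's exported log and its child({ prefix }) helper; getSiteLabel is a hypothetical function, and processSite1/processSite2 are the per-site functions from the snippet above.

Plain Text
import { PuppeteerCrawler, log } from 'crawlee';

// Hypothetical helper: derive a short label such as "mydomain1.com" from the URL.
const getSiteLabel = (url) => new URL(url).hostname.replace(/^www\./, '');

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        const { request } = context;

        // Per-site child logger, so every line reads e.g. "[mydomain1.com] Processing ..."
        const siteLog = log.child({ prefix: getSiteLabel(request.url) });
        siteLog.info(`Processing ${request.url}`);

        if (/mydomain1\.com/.test(request.url)) {
            await processSite1(context);
        } else if (/mydomain2\.com/.test(request.url)) {
            await processSite2(context);
        }
    },
});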
Ah got it, and yeah the logs will still be an issue. Would you recommend running them serially? I want concurrent scraping within a single site, but not all of the sites running at once, because of the logging.
What I was thinking of doing was to run the scraper for one site, close it, and then reinitialize it for the second site, and so forth.
Or would it be better to put each one as a script in package.json and have npm handle it?
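
As an aside (not from the original thread), the per-script idea could look roughly like the sketch below; the scrapers/site1.js and scrapers/site2.js files and their run() exports are hypothetical, each one building and running its own crawler so only one site is scraped (and logged) at a time.

Plain Text
// run-all.js (hypothetical runner)
// Each per-site module exports an async run() that creates and runs its own crawler.
import { run as runSite1 } from './scrapers/site1.js';
import { run as runSite2 } from './scrapers/site2.js';

for (const runSite of [runSite1, runSite2]) {
    await runSite();    // await each site so crawls (and their logs) never interleave
}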
Just one last thing: can we do nested routers? E.g. in the example above, since we call different functions based on the URL, could I instead swap out the processSite* functions with a router and have it handle things on a case-by-case basis?
Or will it have to be the if/else syntax mentioned in the basic tutorial?
I would really love it if nested routers worked somehow, since the router syntax is much more palatable than a huge 400-500 LOC if/else.
Do you mean by using a create***Router() function? You should be able to create several of these. Calling create***Router() will return a function and you need to pass the context to it. Then in the requestHandler you need to decide which of these routers you want to use; I don't know what your requirements are for this.
The requirements are just as you stated: have different route handlers for different sites and, in the requestHandler, call the relevant route handler according to the site. In the snippet you posted there is await processSite1(context); I wanted to know if I could instead import the route from somewhere else and then pass the context to it, because in the docs example
it's like this
Plain Text
requestHandler: router
and I didn't find any function in the docs that does the same
And btw, what do you think about me explicitly closing the crawler and then re-initializing it for the next site?
will this kind of work?

Plain Text
await crawler.run('site:1')
await crawler.run('site:2')

or do I have to explicitly close the crawler or anything?
I think you are currently trying to create your own pattern that doesn't follow the recommendations for scrapers made with Crawlee, so the only way for you is to test it and find out. πŸ™‚
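
For what it's worth, the test could look roughly like the sketch below. It assumes crawler.run() accepts an array of start URLs and that a crawler can be run again once the previous run has finished; the URLs are placeholders.

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

// Each run is awaited, so site 2 only starts after the queue for site 1 is drained.
await crawler.run(['https://site1.example.com']);
await crawler.run(['https://site2.example.com']);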
Gotcha, thanks. Can you clarify just one last thing? https://discord.com/channels/801163717915574323/1176976837528797245/1177561697288982608

I think I pretty much got it and just need this one thing, and it should work pretty well for my use case.
Not sure if I understand, but you should be able to export and import these functions just like anything else:
Plain Text
export const myRoutes = createPuppeteerRouter();

myRoutes.addHandler('LABEL', async (context) => {
    // handle requests labeled "LABEL" here
});

and import it in another file:
Plain Text
import { myRoutes } from './routes/my-routes.js';
Let's say I have two sites, site1 and site2, and they both have separate routers, like
site1_routes and site2_routes.

Now in the crawler's requestHandler I want to do something like this:
Plain Text
requestHandler: async (context) => {
            const { request } = context;

            if (/mydomain1\.com/.test(request.url)) {
                await site1_routes(context);
            } else if (/mydomain2\.com/.test(request.url)) {
                await site2_routes(context);
            }
        }

Basically, don't define all of the possible permutations for both sites in a single router, but rather have them in separate routers that I can switch on the fly depending on the URL.
In the docs, createPlaywrightRouter does take a context, but I haven't seen it used anywhere with the context passed in explicitly.
I just tested it and it worked fine:
Plain Text
import { Actor } from 'apify';
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

await Actor.init();

const startUrls = ['https://apify.com', 'https://google.com'];

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Apify!`);
});

export const googleRouter = createPuppeteerRouter();

googleRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Google!`);
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        if (/apify\.com/.test(context.request.url)) {
            await apifyRouter(context);
        } else if (/google\.com/.test(context.request.url)) {
            await googleRouter(context);
        }
    },
});

await crawler.run(startUrls);

await Actor.exit();

Of course, you may export googleRouter and apifyRouter from different files.
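
For example, a sketch of that split, with hypothetical file names routes/apify-routes.js and routes/google-routes.js:

Plain Text
// routes/apify-routes.js (hypothetical file name)
import { createPuppeteerRouter } from 'crawlee';

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info('Hello from Apify!');
});

// main.js would then import the routers from their own files:
// import { apifyRouter } from './routes/apify-routes.js';
// import { googleRouter } from './routes/google-routes.js';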
Thanks for the code snippet. I was trying out something similar but it errored out, so I just dropped it. Really appreciate the help, thanks!