Apify and Crawlee Official Forum


Structure Crawlers to scrape multiple sites

Hey everyone,

What is the recommended way to structure your code to scrape multiple sites? I looked at a few questions here and it seems the recommendation is to use a single crawler with multiple routers. The issue I'm facing with this is the request queue: you enqueue site-1 and then site-2 initially before starting the crawler, and the crawler then dynamically adds links as needed. Since the queue is FIFO, it crawls the first link, adds its extracted links to the queue, then crawls the second link and adds its links, and so on, constantly switching context between the two sites, which makes the logs a mess. Also, routers don't seem to have a URL parameter, just a label and the request, so we would basically have to define handlers for every site in a single router, right? That just bloats up a single file.

Is there a better way I can structure this? The use case is to set up crawlers for 10+ sites and crawl them sequentially or in parallel, while keeping the logging sane for each of them.
21 comments
Not sure if I fully understand your question. You may use one requestHandler for everything and decide which function to call based on the request URL.

Plain Text
        requestHandler: async (context) => {
            const { request } = context;

            if (/mydomain1\.com/.test(request.url)) {
                await processSite1(context);
            } else if (/mydomain2\.com/.test(request.url)) {
                await processSite2(context);
            }
        }

This will not solve the logs "issue", but you might run the scraper as two different instances with different input, or implement your own logger that puts logs into different files.
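
As an illustration of that custom-logging idea (not from the thread), here is a minimal sketch that prefixes every log line with the site it belongs to. It assumes Crawlee's exported log and its child({ prefix }) helper; getSiteLabel is a hypothetical function, and processSite1/processSite2 are the per-site functions from the snippet above.

Plain Text
import { PuppeteerCrawler, log } from 'crawlee';

// Hypothetical helper: derive a short label such as "mydomain1.com" from the URL.
const getSiteLabel = (url) => new URL(url).hostname.replace(/^www\./, '');

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        const { request } = context;

        // Per-site child logger, so every line reads e.g. "[mydomain1.com] Processing ..."
        const siteLog = log.child({ prefix: getSiteLabel(request.url) });
        siteLog.info(`Processing ${request.url}`);

        if (/mydomain1\.com/.test(request.url)) {
            await processSite1(context);
        } else if (/mydomain2\.com/.test(request.url)) {
            await processSite2(context);
        }
    },
});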
Ah got it, and yeah the logs will still be an issue. Would you recommend running them serially? I want concurrent scraping within a single site, but not all of the sites running at once, because of the logging.
What I was thinking of doing was to run the scraper for one site, close it, and then reinitialize it for the second site, and so forth.
Or would it be better to put each one as a script in package.json and have npm handle it?
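
As an aside (not from the original thread), the per-script idea could look roughly like the sketch below; the scrapers/site1.js and scrapers/site2.js files and their run() exports are hypothetical, each one building and running its own crawler so only one site is scraped (and logged) at a time.

Plain Text
// run-all.js (hypothetical runner)
// Each per-site module exports an async run() that creates and runs its own crawler.
import { run as runSite1 } from './scrapers/site1.js';
import { run as runSite2 } from './scrapers/site2.js';

for (const runSite of [runSite1, runSite2]) {
    await runSite();    // await each site so crawls (and their logs) never interleave
}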
Just one last thing: can we do nested routers? E.g. in the example above, since we call different functions based on the URL, could I instead swap out the processSite* functions with a router and have it handle things on a case-by-case basis?
Or will it have to be the if/else syntax mentioned in the basic tutorial?
I would really love it if nested routers worked somehow, since the router syntax is much more palatable than a huge 400-500 LOC if/else.
Do you mean by using a create***Router() function? You should be able to create several of these. Calling create***Router() will return a function and you need to pass the context to it. Then in the requestHandler you need to decide which of these routers you want to use; I don't know what your requirements are for this.
The requirements are just as you stated: have different route handlers for different sites and, in the requestHandler, call the relevant route handler according to the site. In the snippet you posted there is await processSite1(context); I wanted to know if I could instead import the route from somewhere else and then pass the context to it, because in the docs example
it's like this
Plain Text
requestHandler: router
and I didn't find any function in the docs that does the same
And btw, what do you think about me explicitly closing the crawler and then re-initializing it for the next site?
will this kind of work?

Plain Text
await crawler.run('site:1')
await crawler.run('site:2')

or do I have to explicitly close the crawler or anything?
I think you are currently trying to create your own pattern that doesn't follow the recommendations for scrapers made with Crawlee, so the only way for you is to test it and find out. πŸ™‚
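
For what it's worth, the test could look roughly like the sketch below. It assumes crawler.run() accepts an array of start URLs and that a crawler can be run again once the previous run has finished; the URLs are placeholders.

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});

// Each run is awaited, so site 2 only starts after the queue for site 1 is drained.
await crawler.run(['https://site1.example.com']);
await crawler.run(['https://site2.example.com']);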
Gotcha, thanks. Can you clarify just one last thing? https://discord.com/channels/801163717915574323/1176976837528797245/1177561697288982608

I think I pretty much got it and just need this one thing, and it should work pretty well for my use case.
Not sure if I understand, but you should be able to export and import these functions just like anything else:
Plain Text
export const myRoutes = createPuppeteerRouter();

myRoutes.addHandler('LABEL', async (context) => {
    // handle requests labeled "LABEL" here
});

and import it in another file:
Plain Text
import { myRoutes } from './routes/my-routes.js';
Let's say I have two sites, site1 and site2, and they both have separate routers, like
site1_routes and site2_routes.

Now in the crawler's requestHandler I want to do something like this:
Plain Text
requestHandler: async (context) => {
            const { request } = context;

            if (/mydomain1\.com/.test(request.url)) {
                await site1_routes(context);
            } else if (/mydomain2\.com/.test(request.url)) {
                await site2_routes(context);
            }
        }

Basically, don't define all of the possible permutations for both sites in a single router, but rather have them in separate routers that I can switch on the fly depending on the URL.
In the docs, createPlaywrightRouter does take a context, but I haven't seen it used anywhere with the context passed in explicitly.
I just tested it and it worked fine:
Plain Text
import { Actor } from 'apify';
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

await Actor.init();

const startUrls = ['https://apify.com', 'https://google.com'];

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Apify!`);
});

export const googleRouter = createPuppeteerRouter();

googleRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Google!`);
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        if (/apify\.com/.test(context.request.url)) {
            await apifyRouter(context);
        } else if (/google\.com/.test(context.request.url)) {
            await googleRouter(context);
        }
    },
});

await crawler.run(startUrls);

await Actor.exit();

Of course, you may export googleRouter and apifyRouter from different files.
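
For example, a sketch of that split, with hypothetical file names routes/apify-routes.js and routes/google-routes.js:

Plain Text
// routes/apify-routes.js (hypothetical file name)
import { createPuppeteerRouter } from 'crawlee';

export const apifyRouter = createPuppeteerRouter();

apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info('Hello from Apify!');
});

// main.js would then import the routers from their own files:
// import { apifyRouter } from './routes/apify-routes.js';
// import { googleRouter } from './routes/google-routes.js';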
Thanks for the code snippet. I was trying out something similar but it errored out, so I just dropped it. Really appreciate the help, thanks!