Not sure if I fully understand your question. You may use one requestHandler for everything and decide which function you want to use based on the request url:
requestHandler: async (context) => {
    const { request } = context;
    if (/mydomain1\.com/.test(request.url)) {
        await processSite1(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await processSite2(context);
    }
}
This will not solve the logs "issue", but you could run the scraper as two different instances with different inputs, or implement your own logger to write logs into different files.
ah got it, and yeah the logs will still be an issue. would you recommend running them serially? like I want concurrent scraping within a site, but not have all of the sites run at once due to logging
what I was thinking of doing was to run the scraper for a site, close the scraper and then reinitialize it for the second site and so forth
or would it be better to put it as a script in package.json and have npm handle it?
just one last thing, can we do nested routers? e.g. in the example above, since we have different funcs based on the url, can I instead swap out the processSite* funcs with routers and have each handle things on a case by case basis? or will it have to be if/else based syntax as mentioned in the basic tutorial?
I would really love if nested routers would somehow work, since the router syntax is much more palatable than a huge 400-500 LOC if/else
Do you mean by using a create***Router() function? You should be able to create several of these. Calling create***Router() will return a function, and you need to pass the context to it. Then in the requestHandler you need to decide which of these routers you want to use; I don't know what your requirements are for this.
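To make the "a router is just a function you call with context" idea concrete, here is a plain-JavaScript sketch (no Crawlee imports — createMiniRouter is a hypothetical stand-in mimicking the shape of what create***Router() returns, not Crawlee's actual implementation):

```javascript
// Hypothetical stand-in for a Crawlee router: a callable that dispatches on
// request.label, with an addHandler-style registry.
const createMiniRouter = () => {
    const handlers = new Map();
    const router = async (context) => {
        const label = context.request.label ?? 'DEFAULT';
        const handler = handlers.get(label);
        if (!handler) throw new Error(`No handler for label: ${label}`);
        return handler(context);
    };
    router.addHandler = (label, fn) => handlers.set(label, fn);
    router.addDefaultHandler = (fn) => handlers.set('DEFAULT', fn);
    return router;
};

const site1Routes = createMiniRouter();
site1Routes.addDefaultHandler(async ({ request }) => `site1 default: ${request.url}`);
site1Routes.addHandler('DETAIL', async ({ request }) => `site1 detail: ${request.url}`);
```

The requestHandler can then await site1Routes(context) exactly like any other async function.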
requirements are just as you stated: have different route handlers for different sites, and in the requestHandler call the relevant route handler according to the site. in the snippet you linked, await processSite1(context); I wanted to know if I could instead import the route from some other place and then pass in the context to it, because in the doc example
and I didn't find any functions in the docs which are doing the same
and btw what do you think about me explicitly closing the crawler and then re-initializing it for the next site?
will this kind of work?
await crawler.run('site:1')
await crawler.run('site:2')
or do I have to explicitly close the crawler or anything?
I think you are currently trying to create your own pattern that doesn't follow the recommendations for scrapers made with Crawlee - so the only way for you is to test it and find out. 🙂
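On the sequential question from earlier (await crawler.run('site:1') then await crawler.run('site:2')), here is a tiny plain-JS sketch of what awaiting the runs back to back gives you — runFor is a hypothetical stand-in for crawler.run(), just to show the ordering:

```javascript
// Sketch: two awaited run() calls execute strictly one after the other,
// so the second site only starts after the first has fully finished.
// runFor is a stand-in for crawler.run(); site names are illustrative.
const order = [];
const runFor = async (site) => {
    order.push(`start:${site}`);
    // ... crawling would happen here ...
    order.push(`done:${site}`);
};

const main = async () => {
    await runFor('site1');
    await runFor('site2');
    return order;
};
```

Whether the same crawler instance can be re-run like this in Crawlee is exactly the thing to test, as suggested above.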
not sure if I understand, but you should be able to export and import these functions just like anything else
export const myRoutes = createPuppeteerRouter();
myRoutes.addHandler("LABEL", async (context) => {
    // handle requests labeled "LABEL" here
});
and import it in another file
import { myRoutes } from './routes/my-routes.js';
Let's say I have two sites, site1 and site2, and they both have separate routes, site1_routes and site2_routes. Now in the crawler requestHandler I want to do something like this:
requestHandler: async (context) => {
    const { request } = context;
    if (/mydomain1\.com/.test(request.url)) {
        await site1_routes(context);
    } else if (/mydomain2\.com/.test(request.url)) {
        await site2_routes(context);
    }
}
basically don't define all of the possible permutations for both the sites in a single route, but rather have them in separate routes which I can switch on the fly depending on which url it is
In the docs createPlaywrightRouter does take in a context option, but I haven't seen it used with explicitly passing in context anywhere
I just tested it, works fine:

import { Actor } from 'apify';
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';

await Actor.init();

const startUrls = ['https://apify.com', 'https://google.com'];

export const apifyRouter = createPuppeteerRouter();
apifyRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Apify!`);
});

export const googleRouter = createPuppeteerRouter();
googleRouter.addDefaultHandler(async ({ log }) => {
    log.info(`Hello from Google!`);
});

const crawler = new PuppeteerCrawler({
    requestHandler: async (context) => {
        if (/apify\.com/.test(context.request.url)) {
            await apifyRouter(context);
        } else if (/google\.com/.test(context.request.url)) {
            await googleRouter(context);
        }
    },
});

await crawler.run(startUrls);
await Actor.exit();
Of course you may export googleRouter and apifyRouter from different files.
thanks for the code snippet, I was trying out something similar but it errored out so I just dropped it. really appreciate the help, thanks!