Apify Discord Mirror

Updated 5 months ago

How to launch PlaywrightCrawler inside BasicCrawler?

At a glance

The community member has code that uses a BasicCrawler to fetch data from a URL and then passes the URLs from the response to a PlaywrightCrawler. However, the community member is experiencing an issue where the BasicCrawler is called again for the URLs passed to the PlaywrightCrawler.

The comments suggest that this is happening because both crawlers are using the same default RequestQueue. The solution proposed is to create a separate RequestQueue for one of the crawlers.

Additionally, the community members discuss ways to limit the number of tabs in a window, and how to add a delay between requests using the postNavigationHooks option in the PlaywrightCrawler constructor.

There is no explicitly marked answer in the comments.

Useful resources
So I have this code:

Plain Text
// Assumed imports: CookieJar from tough-cookie (the cookie jar type that
// sendRequest accepts), destr for lenient JSON parsing, and crawlee itself.
import { BasicCrawler } from 'crawlee';
import { CookieJar } from 'tough-cookie';
import { destr } from 'destr';

const cookieJar = new CookieJar();
export const basicCrawler = new BasicCrawler({
    async requestHandler({ sendRequest, request, log }) {
        try {
            const res = await sendRequest({
                url: request.url,
                method: 'GET',
                cookieJar,
            });
            const json = destr(res.body);
            const urls = json.map((v) => v.url);
            await playCrawler.run(urls);
        } catch (error) {
            console.log(error);
        }
    },
});

//code for playwright crawler here

I start the crawler by calling basicCrawler.run(['url']);
The problem is that it seems to call the basicCrawler again for the URLs I pass to playCrawler. How is that possible?
13 comments
Also, the try/catch inside basicCrawler is triggered for errors from playCrawler.
So you are trying to run PlaywrightCrawler inside the handler of BasicCrawler? What is the use case for this? This is quite a wild construction.
Maybe you are using the same default RequestQueue for both crawlers.
The use case would be calling an HTTP API and running PlaywrightCrawler on its results.
What I don't understand is that the URLs I pass to playwrightCrawler are queued to basicCrawler as well.
How is that possible?
That is because there is only one default RequestQueue related to the run. Since you didn't specify any requestQueue in the constructor for either crawler, they are both using the same default one. You need to create a separate named RequestQueue for one of those crawlers.
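The suggestion above could be sketched like this (a sketch under assumptions, not from the thread; the queue name 'playwright-queue' and the handler body are made up for illustration):

```javascript
// Sketch: give the PlaywrightCrawler its own named RequestQueue so the URLs
// it enqueues don't land in the default queue shared with the BasicCrawler.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// 'playwright-queue' is an arbitrary name; any name other than the default works.
const playwrightQueue = await RequestQueue.open('playwright-queue');

export const playCrawler = new PlaywrightCrawler({
    requestQueue: playwrightQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}`);
        // ... scrape the page here ...
    },
});
```

With this setup, basicCrawler.run(['url']) keeps using the default queue, while playCrawler.run(urls) only touches its own named queue, so the two crawlers no longer pick up each other's requests.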
Is there a way to limit the number of tabs in a window? Like using different windows with one tab each?
Plain Text
   maxConcurrency: 4,
   useSessionPool: true,
   browserPoolOptions: {
       maxOpenPagesPerBrowser: 2,
   }

In the PlaywrightCrawler constructor is probably what you are looking for. It should use two browsers, each with two tabs.
Thanks a lot. Is there a way to put a delay between the two requests? Currently Crawlee opens the URLs almost at the same time.
You would probably need to implement your own logic in a pre-navigation hook: https://docs.apify.com/sdk/js/docs/2.3/typedefs/puppeteer-crawler-options#prenavigationhooks
This is a very poor implementation, but you may get the idea:

Plain Text
function increment() {
    this.number = (this.number || 0) + 1;
    return this.number; // fixed: was `return number`, which is a ReferenceError
}


and then in the PlaywrightCrawler constructor define something like:
Plain Text
postNavigationHooks: [
    async ({page}) => {
        await page.waitForTimeout(increment() * 1_000); // 1 000ms = 1s - the number would be increasing with each request.
    },
]
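The counter above can also be written as a self-contained closure, which avoids relying on `this` (undefined in ES modules and strict mode). This is a sketch, not from the thread; the name makeDelayCounter is made up:

```javascript
// Closure-based counter: each call returns the next delay in milliseconds,
// growing by `stepMs` per call, so successive requests are staggered.
function makeDelayCounter(stepMs = 1000) {
    let count = 0;
    return () => {
        count += 1;
        return count * stepMs;
    };
}

const nextDelay = makeDelayCounter();
console.log(nextDelay()); // 1000
console.log(nextDelay()); // 2000
```

Inside the hook you would then call `await page.waitForTimeout(nextDelay());`, giving the first request a 1 s delay, the second 2 s, and so on.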
What's the use case of storing results in separate files?