Apify Discord Mirror

Updated 5 months ago

How to launch PlaywrightCrawler inside BasicCrawler?

At a glance

The community member has code that uses a BasicCrawler to fetch data from a URL and then passes the URLs from the response to a PlaywrightCrawler. However, the community member is experiencing an issue where the BasicCrawler is called again for the URLs passed to the PlaywrightCrawler.

The comments suggest that this is happening because both crawlers are using the same default RequestQueue. The solution proposed is to create a separate RequestQueue for one of the crawlers.

Additionally, the community members discuss ways to limit the number of tabs in a window, and how to add a delay between requests using the postNavigationHooks option in the PlaywrightCrawler constructor.

There is no explicitly marked answer in the comments.

Useful resources
So I have this code:

Plain Text
// Assumed imports: CookieJar from tough-cookie (the cookie jar type that
// sendRequest accepts), destr for lenient JSON parsing, and crawlee itself.
import { BasicCrawler } from 'crawlee';
import { CookieJar } from 'tough-cookie';
import { destr } from 'destr';

const cookieJar = new CookieJar();
export const basicCrawler = new BasicCrawler({
    async requestHandler({ sendRequest, request, log }) {
        try {
            const res = await sendRequest({
                url: request.url,
                method: 'GET',
                cookieJar,
            });
            const json = destr(res.body);
            const urls = json.map((v) => v.url);
            await playCrawler.run(urls);
        } catch (error) {
            console.log(error);
        }
    },
});

//code for playwright crawler here

I start the crawler by calling basicCrawler.run(['url']);
The problem is that it seems to call the basicCrawler again for the URLs I pass to playCrawler. How is that possible?
13 comments
Also, the try/catch inside basicCrawler is triggered for errors from playCrawler.
So you are trying to run PlaywrightCrawler inside the handler of BasicCrawler? What is the use case for this? This is quite a wild construction.
Maybe you are using the same default RequestQueue for both crawlers.
The use case would be calling an HTTP API and running PlaywrightCrawler on its results.
What I don't understand is that the URLs I pass to playwrightCrawler are queued to basicCrawler as well.
How is that possible?
That is because there is only one default RequestQueue related to the run. Since you didn't specify any requestQueue in the constructor for either crawler, they are both using the same default one. You need to create a separate named RequestQueue for one of those crawlers.
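The suggestion above could be sketched like this (a sketch under assumptions, not from the thread; the queue name 'playwright-queue' and the handler body are made up for illustration):

```javascript
// Sketch: give the PlaywrightCrawler its own named RequestQueue so the URLs
// it enqueues don't land in the default queue shared with the BasicCrawler.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// 'playwright-queue' is an arbitrary name; any name other than the default works.
const playwrightQueue = await RequestQueue.open('playwright-queue');

export const playCrawler = new PlaywrightCrawler({
    requestQueue: playwrightQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Processing ${request.url}`);
        // ... scrape the page here ...
    },
});
```

With this setup, basicCrawler.run(['url']) keeps using the default queue, while playCrawler.run(urls) only touches its own named queue, so the two crawlers no longer pick up each other's requests.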
Is there a way to limit the number of tabs in a window? Like using different windows with one tab each?
Plain Text
   maxConcurrency: 4,
   useSessionPool: true,
   browserPoolOptions: {
       maxOpenPagesPerBrowser: 2,
   }

In the PlaywrightCrawler constructor is probably what you are looking for. It should use two browsers, each with two tabs.
Thanks a lot. Is there a way to put a delay between the two requests? Currently Crawlee opens the URLs almost at the same time.
You would probably need to implement your own logic in a pre-navigation hook: https://docs.apify.com/sdk/js/docs/2.3/typedefs/puppeteer-crawler-options#prenavigationhooks
This is a very poor implementation, but you may get the idea:

Plain Text
function increment() {
    this.number = (this.number || 0) + 1;
    return this.number; // fixed: was `return number`, which is a ReferenceError
}


and then in the PlaywrightCrawler constructor define something like:
Plain Text
postNavigationHooks: [
    async ({page}) => {
        await page.waitForTimeout(increment() * 1_000); // 1 000ms = 1s - the number would be increasing with each request.
    },
]
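The counter above can also be written as a self-contained closure, which avoids relying on `this` (undefined in ES modules and strict mode). This is a sketch, not from the thread; the name makeDelayCounter is made up:

```javascript
// Closure-based counter: each call returns the next delay in milliseconds,
// growing by `stepMs` per call, so successive requests are staggered.
function makeDelayCounter(stepMs = 1000) {
    let count = 0;
    return () => {
        count += 1;
        return count * stepMs;
    };
}

const nextDelay = makeDelayCounter();
console.log(nextDelay()); // 1000
console.log(nextDelay()); // 2000
```

Inside the hook you would then call `await page.waitForTimeout(nextDelay());`, giving the first request a 1 s delay, the second 2 s, and so on.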
What's the use case of storing results in separate files?