PlayWrightCrawler new request results are bleeding into...

At a glance

The community member is running a web crawler function called crawl that is being called by a setInterval function, simulating a cron job. The issue is that when multiple jobs are running, the results are getting mixed together. The community member has tried using useState and a custom callback, but the results are still not isolated.

In the comments, another community member suggests creating multiple request queues or request lists, one for each crawler, to prevent the results from mixing. The community member tries this approach and it seems to have worked, but they will do more testing. Another community member also suggests cleaning up the named queues afterwards using await rQueue.drop().

There is no explicitly marked answer, but the community members have provided a solution to the issue.

cryptorex

Hello, first some code:

crawl function

    async function crawl (jobId, websiteURL, cb) {
      const crawler = new crawlee.PlaywrightCrawler({
        // Use the requestHandler to process each of the crawled pages.
        async requestHandler({ request, page, enqueueLinks, log }) {
          // Collect the src of every <img> on the page, skipping duplicates.
          const element = await page.$$eval('img', as => as.map(a => a.src));
          if (element.length > 0) {
            for (const img of element) {
              if (cb.indexOf(img) === -1) {
                cb.push(img);
              }
            }
          }

          // Extract links from the current page
          // and add them to the crawling queue.
          await enqueueLinks();
        },
        sessionPoolOptions: { persistStateKey: jobId, persistStateKeyValueStoreId: jobId },
      });

      await crawler.run([websiteURL]);
      await crawler.teardown();

      return cb;
    }

setInterval calls this function

    async function fetchImagesUrls (uid, jobId, websiteURL) {
      console.log("Fetching images...");

      const results = await crawl(jobId, websiteURL, cb = []);
      console.log(results);

      return results;
    }

Background: I'm calling fetchImagesUrls from a setInterval function that simulates a 'cron job'. I purposely make setInterval pick up Job#1 (details are fetched from a DB); then, once Job#1 has started, I make Job#2 available for processing.

Behavior: Job#1 and Job#2 are now running from two different calls; however, their results are getting mixed into each other.

I've tried useState() and my own callback (as shown here) - is there a way to make new calls be isolated to their own results set?

I understand I might be missing something regarding JS fundamentals, but some guidance would be much appreciated. Thanks!

6 comments

cryptorex

Other stuff I tried

Injecting the jobId as a key into the cb array, pushing each job's results under that key, and returning the results from the cb array via the corresponding key, like: { 'jobId': ['url1', 'url2', 'url2'] }

Lukas Krivka

You need to create multiple request queues or request lists, one for each crawler. Then the results won't mix
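
One way to picture why this helps: as far as I understand it, a crawler that is not given a requestQueue falls back to the default queue, so two crawlers running at the same time end up draining the same list of URLs. The toy model below is only an illustration. Plain arrays stand in for request queues, the names runWorker, sharedQueueDemo and perJobQueueDemo are made up, and it does not use Crawlee at all.

```javascript
// Toy model only: plain arrays stand in for request queues.
async function runWorker(queue, results) {
  // Each worker processes whatever is in the queue it was handed.
  while (queue.length > 0) {
    const url = queue.shift();
    await Promise.resolve(); // yield, so concurrently running workers interleave
    results.push(url);
  }
  return results;
}

// Both jobs push their URLs into ONE queue, and both workers drain it.
async function sharedQueueDemo() {
  const shared = [];
  shared.push('https://a.example/1', 'https://a.example/2'); // job A
  shared.push('https://b.example/1', 'https://b.example/2'); // job B
  const [a, b] = await Promise.all([runWorker(shared, []), runWorker(shared, [])]);
  return { a, b };
}

// Each job gets its own queue, so a worker only ever sees its own URLs.
async function perJobQueueDemo() {
  const queueA = ['https://a.example/1', 'https://a.example/2'];
  const queueB = ['https://b.example/1', 'https://b.example/2'];
  const [a, b] = await Promise.all([runWorker(queueA, []), runWorker(queueB, [])]);
  return { a, b };
}
```

In the shared case at least one worker ends up holding URLs that belong to the other job; with one queue per job that cannot happen.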

cryptorex

thanks! that seemed easy, and I think it worked. I can see the storage -> request_queues now has the assigned jobId (uuid)

so I added this:

const rQueue = await crawlee.RequestQueue.open(jobId);

and passed it into my crawl function, then to the crawler init object as requestQueue: rQueue and I think it worked!
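
Put together, the change described here might look roughly like the sketch below. It is not the poster's final code. The name crawlJob is hypothetical, the Set replaces the cb array purely for deduplication, and the third parameter is an assumption added so the module can be swapped out (it defaults to the installed crawlee package in a CommonJS setup). The sessionPoolOptions from the original post are left out for brevity.

```javascript
// Sketch only: one named RequestQueue per job, results kept local to the call.
// `crawlJob` is a hypothetical name; `crawlee` defaults to the installed package.
async function crawlJob(jobId, websiteURL, crawlee = require('crawlee')) {
  const images = new Set(); // created per call, so jobs never share it

  // A queue named after the job, instead of the default queue.
  const requestQueue = await crawlee.RequestQueue.open(jobId);

  const crawler = new crawlee.PlaywrightCrawler({
    requestQueue,
    async requestHandler({ page, enqueueLinks }) {
      // Collect the src of every <img> on the page.
      const sources = await page.$$eval('img', (imgs) => imgs.map((img) => img.src));
      for (const src of sources) images.add(src);

      // Links found here go into this crawler's own queue.
      await enqueueLinks();
    },
  });

  await crawler.run([websiteURL]);
  return [...images];
}
```

Called as `await crawlJob(jobId, websiteURL)`, each job then reads from and writes to its own queue.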

I will do more testing but thanks again for your guidance!

Lukas Krivka

You will just need to clean the named queues afterwards. await rQueue.drop()
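
A minimal sketch of that cleanup, assuming the queue is opened per job under the jobId as in the snippet earlier in the thread. The helper name withJobQueue and its work callback are made up for illustration, and the crawlee parameter is an assumption that defaults to the installed package in a CommonJS setup. The try/finally is there so the named queue is dropped even if the crawl throws.

```javascript
// Sketch only: open a named queue for one job, and always drop it afterwards.
// `withJobQueue` and `work` are hypothetical names.
async function withJobQueue(jobId, work, crawlee = require('crawlee')) {
  const rQueue = await crawlee.RequestQueue.open(jobId);
  try {
    // `work` receives the queue and runs the crawl with it.
    return await work(rQueue);
  } finally {
    // Remove the named queue so finished jobs do not pile up in storage.
    await rQueue.drop();
  }
}
```

The callback would typically build the crawler with requestQueue: rQueue, run it, and return the collected results.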

cryptorex

ok thanks Lukas!

Apify Discord Mirror

PlayWrightCrawler new request results are bleeding into old requests. RequestQueue issue?