Apify and Crawlee Official Forum

Long-running scraper, 500+ pages for each crawl

Hello,

I have a Playwright crawler that listens to DB changes. Whenever a page gets added, I want it to scrape and enqueue 500 links for that whole scraping process, but multiple things can be added to the DB at the same time. I've tried keepAlive, but the maxRequests limit is hard to manage if we just keep adding URLs to the same crawler.

My question is: what's the best way to create a Playwright crawler that will automatically handle the processing of 500 pages for each start?
21 comments
The Apify approach is to do multiple runs, so running Crawlee in the Apify cloud is considered the best way so far by many people πŸ˜‰
Haha, good upsell, but I don't want to use that πŸ˜› Any other suggestions? cc

I was thinking of overriding the isFinished function.
You can use it partially, i.e. by running Node.js processes on your server with a named request queue and dataset in the cloud, or dockerize and run instances entirely on your own host. I think the approach will be the same in any case: you need an environment for multiple runs.
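Roughly, each worker process would open the same named storages; this is just a minimal sketch, and the names are placeholders:

TypeScript
import { Dataset, RequestQueue } from 'crawlee'

// processes pointed at the same storage (a shared local storage dir, or the
// Apify platform via the Apify SDK / storage configuration) and using the same
// names share one queue and one result dataset
const requestQueue = await RequestQueue.open('my-shared-queue')
const dataset = await Dataset.open('my-shared-results')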
Is there no way to have one crawler handle multiple request queues?
But what is the problem you are solving? If it is processing speed, then you need a bigger server or more servers (Apify, AWS, etc.).

Otherwise, I don't see a problem with keepAlive: true; you can also use the forefront option of the request queue to prioritize some requests over others.
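Roughly like this, just as a minimal sketch (the handler body and the URL/userData shape are placeholders, not your actual code):

TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee'

const requestQueue = await RequestQueue.open()

const crawler = new PlaywrightCrawler({
  requestQueue,
  keepAlive: true, // don't shut down when the queue drains; wait for more work
  async requestHandler({ request, log }) {
    log.info(`Scraping ${request.url}`)
    // ...scrape the page and enqueue follow-up links here...
  },
})

const runPromise = crawler.run() // settles only once the crawler is stopped

// later, e.g. from your DB trigger — forefront pushes these ahead of the backlog
await requestQueue.addRequests(
  [{ url: 'https://example.com', userData: { placeId: 'abc' } }],
  { forefront: true },
)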
Thanks for commenting. I basically have addRequests listening to a DB trigger, so whenever new rows come in, we add them to the request queue, but each database row will need to do 500 crawls.
So I need to control maxRequestsPerCrawl per database row, i.e. either have a crawler per row, or somehow have a RequestQueue per DB id, but that means the crawler would need to manage multiple queues.

Hope that makes sense, thanks for the help.
You can just track that in an arbitrary state object using useState.
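For example, roughly like this (a sketch only; the per-place counters and the 500 cap are assumptions based on what you described):

TypeScript
import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  keepAlive: true,
  async requestHandler({ request, crawler, enqueueLinks }) {
    // shared state object, auto-persisted by Crawlee across restarts/migrations
    const counts = await crawler.useState<Record<string, number>>({})
    const placeId = request.userData.place.id

    counts[placeId] = (counts[placeId] ?? 0) + 1

    // keep enqueueing only while this DB row is still under its own budget
    if (counts[placeId] < 500) {
      await enqueueLinks({ userData: request.userData })
    }
  },
})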
I'm getting so many of these:
ERROR PlaywrightCrawler: Request failed and reached maximum retries. elementHandle.textContent: Target page, context or browser has been closed
Any way to debug this? Or can I ask one of you to look at my code and give me some advice? Willing to pay $
Either you are running out of memory and the page crashed, or you don't await some code and the page was already closed when you tried to get the text.
I imagine it has to do with the hackiness of how I'm starting the Playwright crawler.
TypeScript
export const createCrawler = async (place_id: string, pdf_crawler: BasicCrawler) => {
  // one named request queue per place / DB row
  const request_queue = await RequestQueue.open(place_id)
  return new PlaywrightCrawler({ requestQueue: request_queue /* ...rest of the options */ })
}
Then, to start it, I do:
TypeScript
  // one crawler pair per DB row ("place")
  const pdf_crawler = await createPDFCrawler(place.id,)
  const web_crawler = await createWebCrawler(place.id, pdf_crawler)

  console.log('Starting crawlers', place.name, place.id)

  // one Crawlee Request per URL stored on the queued run document
  const requests = queuedRunDocument['urls'].map((url) => {
    return new CrawleeRequest({
      url: url,
      userData: {
        place: { name: place.name, id: place.id, url },
      },
    })
  })

  await web_crawler.addRequests(requests)

  // wrap run() so the caller can react when this place's crawl settles
  const web_promise = new Promise((resolve, reject) => {
    web_crawler
      .run()
      .then(() => {
        console.log('web crawler finished', place.id)
        resolve(true)
      })
      .catch((e) => {
        console.log('web crawler error', e)
        reject(e)
      })
  })
Error: Object with guid handle@dc8fe92256cc3997e03d3b2bf1e26da6 was not bound in the connection

elementHandle.evaluate: Target page, context or browser has been closed

I get these errors.
Probably using Apify would solve my problems, but I'm scared it will get expensive.
Do you have an example of this?
Yeah, your code feels like it has some unhandled promises; it's probably missing an await somewhere.
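For example, a hypothetical handler like this reproduces that "Target page, context or browser has been closed" error, because the un-awaited call can still be running after the handler returns and the page gets closed:

TypeScript
import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, log }) {
    // ❌ fire-and-forget: the promise may still be pending when the handler
    //    finishes and Crawlee closes the page
    // page.textContent('h1').then((t) => log.info(t ?? ''))

    // βœ… await every page / elementHandle call before the handler returns
    const title = await page.textContent('h1')
    log.info(title ?? 'no title')
  },
})

await crawler.run(['https://example.com'])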