Apify and Crawlee Official Forum

Long-running scraper, 500+ pages for each crawl

Hello,

I have a Playwright crawler that listens to DB changes. Whenever a page gets added, I want it to scrape and enqueue 500 links for that whole scraping process, but multiple things can be added to the DB at the same time. I've tried keepAlive, but the maxRequests limit is hard to manage if we just keep adding URLs to the same crawler.

My question is: what's the best way to create a Playwright crawler that will automatically handle the processing of 500 pages for each start?
21 comments
The Apify approach is to do multiple runs, so running Crawlee in the Apify cloud is considered the best way so far by many people πŸ˜‰
Haha, good upsell, but I don't want to use that πŸ˜› Any other suggestions? cc

I was thinking of overriding the isFinished function.
You can use it partially, i.e. by running Node.js processes on your server with a named request queue and dataset in the cloud, or dockerize and run instances entirely on your own host. I think the approach will be the same in any case: you need an environment for multiple runs.
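Roughly, each worker process would open the same named storages; this is just a minimal sketch, and the names are placeholders:

TypeScript
import { Dataset, RequestQueue } from 'crawlee'

// processes pointed at the same storage (a shared local storage dir, or the
// Apify platform via the Apify SDK / storage configuration) and using the same
// names share one queue and one result dataset
const requestQueue = await RequestQueue.open('my-shared-queue')
const dataset = await Dataset.open('my-shared-results')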
Is there no way to have one crawler handle multiple request queues?
But what is the problem you are solving? If it is processing speed, then you need a bigger server or more servers (Apify, AWS, etc.).

Otherwise, I don't see a problem with keepAlive: true; you can also use the forefront option of the request queue to prioritize some requests over others.
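Roughly like this, just as a minimal sketch (the handler body and the URL/userData shape are placeholders, not your actual code):

TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee'

const requestQueue = await RequestQueue.open()

const crawler = new PlaywrightCrawler({
  requestQueue,
  keepAlive: true, // don't shut down when the queue drains; wait for more work
  async requestHandler({ request, log }) {
    log.info(`Scraping ${request.url}`)
    // ...scrape the page and enqueue follow-up links here...
  },
})

const runPromise = crawler.run() // settles only once the crawler is stopped

// later, e.g. from your DB trigger — forefront pushes these ahead of the backlog
await requestQueue.addRequests(
  [{ url: 'https://example.com', userData: { placeId: 'abc' } }],
  { forefront: true },
)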
Thanks for commenting. I basically have addRequests listening to a DB trigger, so whenever new rows come in, we add them to the request queue, but each database row will need to do 500 crawls.
So I need to control maxRequestsPerCrawl per database row, i.e. either have a crawler per row, or somehow have a RequestQueue per DB id, but that means the crawler would need to manage multiple queues.

Hope that makes sense, thanks for the help.
You can just track that in an arbitrary state object using useState.
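For example, roughly like this (a sketch only; the per-place counters and the 500 cap are assumptions based on what you described):

TypeScript
import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  keepAlive: true,
  async requestHandler({ request, crawler, enqueueLinks }) {
    // shared state object, auto-persisted by Crawlee across restarts/migrations
    const counts = await crawler.useState<Record<string, number>>({})
    const placeId = request.userData.place.id

    counts[placeId] = (counts[placeId] ?? 0) + 1

    // keep enqueueing only while this DB row is still under its own budget
    if (counts[placeId] < 500) {
      await enqueueLinks({ userData: request.userData })
    }
  },
})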
I'm getting so many of these:
ERROR PlaywrightCrawler: Request failed and reached maximum retries. elementHandle.textContent: Target page, context or browser has been closed
Any way to debug this? Or can I ask one of you to look at my code and give me some advice? Willing to pay $
Either you are running out of memory and the page crashed, or you don't await some code and the page was already closed when you tried to get the text.
I imagine it has to do with the hackiness of how I'm starting the Playwright crawler.
TypeScript
export const createCrawler = async (place_id: string, pdf_crawler: BasicCrawler) => {
  // one named request queue per place / DB row
  const request_queue = await RequestQueue.open(place_id)
  return new PlaywrightCrawler({ requestQueue: request_queue /* ...rest of the options */ })
}
Then, to start it, I do:
TypeScript
  // one crawler pair per DB row ("place")
  const pdf_crawler = await createPDFCrawler(place.id,)
  const web_crawler = await createWebCrawler(place.id, pdf_crawler)

  console.log('Starting crawlers', place.name, place.id)

  // one Crawlee Request per URL stored on the queued run document
  const requests = queuedRunDocument['urls'].map((url) => {
    return new CrawleeRequest({
      url: url,
      userData: {
        place: { name: place.name, id: place.id, url },
      },
    })
  })

  await web_crawler.addRequests(requests)

  // wrap run() so the caller can react when this place's crawl settles
  const web_promise = new Promise((resolve, reject) => {
    web_crawler
      .run()
      .then(() => {
        console.log('web crawler finished', place.id)
        resolve(true)
      })
      .catch((e) => {
        console.log('web crawler error', e)
        reject(e)
      })
  })
Error: Object with guid handle@dc8fe92256cc3997e03d3b2bf1e26da6 was not bound in the connection

elementHandle.evaluate: Target page, context or browser has been closed

I get these errors.
Probably using Apify would solve my problems, but I'm scared it will get expensive.
Do you have an example of this?
Yeah, your code feels like it has some unhandled promises; it's probably missing an await somewhere.
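For example, a hypothetical handler like this reproduces that "Target page, context or browser has been closed" error, because the un-awaited call can still be running after the handler returns and the page gets closed:

TypeScript
import { PlaywrightCrawler } from 'crawlee'

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, log }) {
    // ❌ fire-and-forget: the promise may still be pending when the handler
    //    finishes and Crawlee closes the page
    // page.textContent('h1').then((t) => log.info(t ?? ''))

    // βœ… await every page / elementHandle call before the handler returns
    const title = await page.textContent('h1')
    log.info(title ?? 'no title')
  },
})

await crawler.run(['https://example.com'])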