Apify and Crawlee Official Forum

Updated 4 months ago

Large threaded Kubernetes scrape = "Target page, context or browser has been closed"

At a glance
Ironically, I just posted a similar issue (https://discord.com/channels/801163717915574323/1255531330704375828/1255531330704375828), but I wanted to provide some additional context to see if they're related.

I'm:
  • Running a node app with worker threads (usually 32 of them)
  • Running multiple containers in kubernetes
Each thread:
  • Grabs 5 domains from my postgres DB (of 5 million!)
  • Loops through each domain
  • Creates a new PlaywrightCrawler with unique-named storages (to prevent collision / global deletion from crawlers in other threads)
  • Queues the domain's home page
  • Controllers then queue up some additional pages based on what's found on the home page
  • The results are processed in real time and pushed to the database (since we don't want to wait until all 5M are complete)
  • The thread-specific storages are then deleted using drop() (a sketch of this per-thread pattern follows the list)
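A minimal sketch of that per-thread pattern, assuming the thread already has its ID and a batch of domains in hand (crawlBatch, the storage naming scheme, and the commented-out saveToPostgres call are illustrative placeholders, not from the original post):

TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Hypothetical inputs: a unique thread identifier and the batch of domains
// pulled from Postgres.
async function crawlBatch(threadId: number, domains: string[]) {
    for (const domain of domains) {
        // Unique-named storage so crawlers in other threads cannot collide
        // with (or purge) this one.
        const queueName = `queue-${threadId}-${domain.replace(/[^a-zA-Z0-9-]/g, '-')}`;
        const requestQueue = await RequestQueue.open(queueName);

        const crawler = new PlaywrightCrawler({
            requestQueue,
            async requestHandler({ request, page, enqueueLinks }) {
                // Queue additional pages found on the home page.
                await enqueueLinks({ strategy: 'same-domain' });
                // Push results to the database in real time (placeholder).
                // await saveToPostgres(request.url, await page.title());
            },
        });

        // Queue the domain's home page (assuming `domain` is a full URL) and run.
        await crawler.run([domain]);

        // Drop the thread-specific storage once this domain is done.
        await requestQueue.drop();
    }
}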
The Problem
This works flawlessly... for about 60 minutes. After that, I get plagued with Target page, context or browser has been closed. It first presents itself at roughly the one-hour mark and then increases in frequency until I'm getting more failed records than successful ones (at which point I kill or restart the cluster).

What I've tried:
  • browserPoolOptions like retireBrowserAfterPageCount: 100 and closeInactiveBrowserAfterSecs: 200
  • await crawler.teardown(); in hopes that this would clear any sort of cache/memory that could be stacking up (see the sketch at the end of this post)
  • A cron to restart my cluster 🀣
  • Ensuring the EBS volumes are not running out of space (they're 20GB each and seem to be 50% full when crashing)
  • Ensuring the pods have plenty of memory (running EC2s with 64 GB memory and 16 CPUs / 32 threads). They seem to handle the load in the first hour just fine.
I suspect there's a leak or store not being cleared out since it happens gradually?
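For reference, the browser-retirement options and the teardown call from the list above slot in roughly like this (the values are the ones I tried; the handler is just a stub):

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Retire a browser once it has served this many pages...
        retireBrowserAfterPageCount: 100,
        // ...and close browsers that have been idle this long.
        closeInactiveBrowserAfterSecs: 200,
    },
    async requestHandler({ page }) {
        // Crawl logic goes here.
    },
});

await crawler.run(['https://example.com']);

// Explicit teardown afterwards, in the hope of releasing any cached state.
await crawler.teardown();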
23 comments
thanks for joining me in my... puzzle? haha. I'll elaborate a bit more.

  • The environment is Docker running the crawler on Node.js 20+ with Crawlee 3.9.2
  • Docker image is mcr.microsoft.com/playwright:v1.42.1-amd64
  • Running on 12 cores / 48 GB memory with CRAWLEE_MEMORY_MBYTES=32768
  • Also using named storages (key-value stores and queues)
  • For us it doesn't seem to be a one-hour mark: I've had some instances fail after 2 days and 45,000 requests; this one yesterday was 10k requests and about 6 hours.
  • We are utilizing proxyConfiguration with about 25 proxies.
Settings we're using:

Plain Text
    maxRequestRetries: 3,
    maxConcurrency: 15,
    maxRequestsPerMinute: 150,
    maxRequestsPerCrawl: 35000,
    requestHandlerTimeoutSecs: 180,
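Wired into a crawler together with our proxy list, those settings look roughly like this (the proxy URLs and the handler body are placeholders):

TypeScript
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder URLs standing in for the ~25 proxies mentioned above.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestRetries: 3,
    maxConcurrency: 15,
    maxRequestsPerMinute: 150,
    maxRequestsPerCrawl: 35000,
    requestHandlerTimeoutSecs: 180,
    async requestHandler({ request, page }) {
        // Site-specific extraction logic goes here.
    },
});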
Almost identical setup! I'm going to keep trying different configurations/ideas (and will share back if I find anything). I wonder if your variation in results is a clue of any sort....

Do you store fairly consistent amounts of data in each request? If so, crashing at vastly different points would point me away from memory/storage issues and almost more towards... site-specific errors?

I'm trying to understand a bit more about how Crawlee initiates browsers / clears storage. You'd think if the page was no longer available, just that request would fail and the next one would open a fresh browser and be just fine.

When you start to see the error does it only happen once or does it plague all the threads eventually until the process is basically useless?

Also, we're calling const crawler = new PlaywrightCrawler() inside our loop (ie. it's not a single crawler that stays alive for the entire thread). Is that your approach too?
Plain Text
Do you store fairly consistent amounts of data in each request? If so, crashing at vastly different points would point me away from memory/storage issues and almost more towards... site-specific errors?


It's not stored in a database until the crawler completes. It's only image URL and page URL data. So it's stored in memory (RAM), because we are doing some deduplication logic, and then sent to the database (Firebase RTDB) upon crawler completion.
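A rough sketch of that flow: deduplicated image/page URLs kept in a Map in RAM and only flushed once the crawler finishes (flushToDatabase is a stand-in for the Firebase RTDB write, not our actual code):

TypeScript
import { PlaywrightCrawler } from 'crawlee';

// Deduplicate in RAM: one Set of image URLs per page URL.
const results = new Map<string, Set<string>>();

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        const imageUrls = await page.$$eval('img', (imgs) => imgs.map((img) => img.src));
        const seen = results.get(request.url) ?? new Set<string>();
        for (const url of imageUrls) seen.add(url);
        results.set(request.url, seen);
    },
});

await crawler.run(['https://example.com']);

// Nothing is persisted until the crawl completes.
await flushToDatabase(results);

// Stand-in for the Firebase RTDB write, e.g. db.ref('crawls/...').set(serialized).
async function flushToDatabase(data: Map<string, Set<string>>) {
    // ...
}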

Plain Text
When you start to see the error does it only happen once or does it plague all the threads eventually until the process is basically useless?


It only happens once and crashes.


Plain Text
Also, we're calling const crawler = new PlaywrightCrawler() inside our loop  (ie. it's not a single crawler that stays alive for the entire thread). Is that your approach too?


No, we're calling it for each site that gets submitted. It's a customer flow, so a URL is submitted and we have a process listening for new submissions. These submissions trigger the crawler function. So it's a single crawler for each URL, I would say.
Plain Text
I'm trying to understand a bit more about how Crawlee initiates browsers / clears storage. You'd think if the page was no longer available, just that request would fail and the next one would open a fresh browser and be just fine.


Are you using postNavigationHooks and/or preNavigationHooks? We're using both. It seems to me that the browser gets closed during one of these - I can't seem to pinpoint the original trigger for this error.
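For anyone following along, those hooks are passed to the crawler like this (the hook bodies below are illustrative, not our actual ones); if the browser has already been retired or closed by the time a hook touches page, Playwright throws exactly this "Target page, context or browser has been closed" error:

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }, gotoOptions) => {
            // Runs before page.goto().
            if (gotoOptions) gotoOptions.timeout = 60_000;
            await page.setViewportSize({ width: 1280, height: 800 });
        },
    ],
    postNavigationHooks: [
        async ({ page, request }) => {
            // Runs after navigation, before the requestHandler.
            console.log(`Navigated to ${request.url}, title: ${await page.title()}`);
        },
    ],
    async requestHandler({ page }) {
        // Main extraction logic.
    },
});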
You had asked for more context in a separate thread. Ever seen this before?
We're not using pre/post hooks but that's a good point... hmm....
hehe, then I might say there is something in the requestHandler... because we recently updated our crawler logic for thoroughness. Previously we were not doing any page.evaluate(), page.waitForTimeout(3000), or page scrolling in the requestHandler - so it might be there. πŸ€”
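Roughly the kind of handler being described, for context (illustrative only): a page.evaluate(), a fixed wait, and a scroll, any of which will throw "Target page, context or browser has been closed" if the underlying browser has already gone away:

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // Run a script in the page context.
        const imageUrls = await page.evaluate(() =>
            Array.from(document.images, (img) => img.src),
        );

        // Fixed wait, as mentioned above.
        await page.waitForTimeout(3000);

        // Scroll to trigger lazy-loaded content.
        await page.mouse.wheel(0, 5000);

        await pushData({ url: page.url(), imageUrls });
    },
});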
I'm going to look into ours! Btw, who'd you choose for proxies?
right now we are using https://instantproxies.com/
I guess we are outta luck? πŸ˜„
i'm running things at a similar scale
we do about a million scraped pages a month
and I run into all kinds of issues.
  1. make sure you're calling await page.close() at the end of each one.
  2. no missing awaits or things that might tell Crawlee to close the page.
  3. I've had issues specifically w/ some Chrome flags I had enabled that would make this worse. Are you using Chrome flags, and are you using headless: new, true, or false? (See the launchContext sketch below.)
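For point 3, Chrome flags and the headless mode go through launchContext in Crawlee; a sketch (the flag shown is just a common example, not a recommendation):

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            // With Crawlee + Playwright this is a boolean; true is the default.
            headless: true,
            // Any extra Chromium flags end up here; worth auditing if pages
            // start closing unexpectedly.
            args: ['--disable-dev-shm-usage'],
        },
    },
    async requestHandler({ page }) {
        // ...
    },
});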
btw love the concept of your company, hope it works! attribution is important.
thanks for the details - I'm not setting headless, so I think it defaults to true, and I'm not using any Chrome flags
quick update: interestingly, we've updated to Crawlee 3.11.0 and the crashes seem to have gone away
it hasn't crashed for weeks
Yeah, there was some bug in old crawlee versions.
It's always a good practice to use the latest version (now it's 3.11.3)