Apify Discord Mirror

Joshua Perk
Offline, last seen 5 months ago
Joined August 30, 2024
Ironically, @cryptorex just posted a similar issue (https://discord.com/channels/801163717915574323/1255531330704375828/1255531330704375828), but I wanted to provide some additional context to see if the two are related.

I'm:
  • Running a node app with worker threads (usually 32 of them)
  • Running multiple containers in kubernetes
Each thread:
  • Grabs 5 domains from my postgres DB (of 5 million!)
  • Loops through each domain
  • Creates a new PlaywrightCrawler with unique-named storages (to prevent collision / global deletion from crawlers in other threads)
  • Queues the domain's home page
  • Controllers then queue up some additional pages based on what's found on the home page
  • The results are processed in real time and pushed to the database (since we don't want to wait until all 5M are complete)
  • The thread-specific storages are then deleted using drop() (rough sketch of the whole flow below)
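
For concreteness, here's a minimal sketch of what each thread roughly does; fetchDomains is a placeholder for my postgres query and the handler body is elided:

TypeScript
import { PlaywrightCrawler, RequestQueue, Dataset } from 'crawlee';
import { threadId } from 'node:worker_threads';

// Placeholder for the postgres query that reserves five domains for this thread.
async function fetchDomains(count: number): Promise<string[]> {
    return []; // the real implementation reads from the 5M-row table
}

async function processBatch(): Promise<void> {
    for (const domain of await fetchDomains(5)) {
        const safeName = domain.replace(/\./g, '-'); // normalize the domain for use in storage names
        // Unique-named storages so crawlers in other threads can't collide with
        // (or globally delete) this thread's queue and dataset.
        const requestQueue = await RequestQueue.open(`queue-${threadId}-${safeName}`);
        const dataset = await Dataset.open(`results-${threadId}-${safeName}`);

        const crawler = new PlaywrightCrawler({
            requestQueue,
            async requestHandler({ enqueueLinks }) {
                // Queue additional pages found on the home page, process results
                // in real time, and push them straight to the database.
            },
        });

        await crawler.run([`https://${domain}`]);

        // Delete the thread-specific storages once the domain is done.
        await requestQueue.drop();
        await dataset.drop();
    }
}
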
The Problem
This works flawlessly... for about 60 minutes. After that, I get plagued with Target page, context or browser has been closed. It first shows up around the one-hour mark and then increases in frequency until I'm getting more failed records than successful ones (at which point I kill or restart the cluster).

What I've tried:
  • browserPoolOptions like retireBrowserAfterPageCount: 100 and closeInactiveBrowserAfterSecs: 200
  • await crawler.teardown(); in hopes that this would clear any sort of cache/memory that could be stacking up
  • A cron to restart my cluster 🤣
  • Ensuring the EBS volumes are not running out of space (they're 20GB each and seem to be 50% full when crashing)
  • Ensuring the pods have plenty of memory (running EC2s with 64 GB memory and 16 CPUs / 32 threads); they seem to handle the load in the first hour just fine
I suspect there's a leak or store not being cleared out since it happens gradually?
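
For reference, the pool options and teardown from the list above look roughly like this (values as stated; the handler is elided):

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Recycle each browser after it has served 100 pages.
        retireBrowserAfterPageCount: 100,
        // Close browsers that have sat idle for 200 seconds.
        closeInactiveBrowserAfterSecs: 200,
    },
    async requestHandler({ page }) {
        // per-page logic
    },
});

await crawler.run(['https://example.com']);
// Attempted cleanup between batches, in case state is stacking up.
await crawler.teardown();
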
23 comments
I'm wrapping my head around how to architect my use case. Essentially I have an array of different domains:
Plain Text
[ 'acme.com', 'foo.com', 'bar.com', 'helloworld.org' ]


I want to look for different things as I guide my crawler through the domain. For example:

  1. On the home page/root, I want to find any links that look similar to: /pricing, /security, /careers, and /blog.
  2. I then want to perform different tasks on each of these potential pages. For example:
    a. On the pricing page, pass the innerHTML to ChatGPT to classify their pricing model
    b. On the security page, search for the word "SOC2"
    c. On the careers page, queue up to 100 links, and further process the individual job postings
    d. On the blog page, count the number of articles
I'm not looking for someone to help specifically with a–d, but rather to understand best practices for structuring "context aware" tasks on different pages (a rough sketch of what I'm imagining is below).
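
For concreteness, here's the kind of structure I'm imagining with Crawlee's request labels and router; the label names and glob patterns are my own placeholders:

TypeScript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// On the home page/root, find the interesting sections and label them.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['**/pricing*'], label: 'PRICING' });
    await enqueueLinks({ globs: ['**/security*'], label: 'SECURITY' });
    await enqueueLinks({ globs: ['**/careers*'], label: 'CAREERS' });
    await enqueueLinks({ globs: ['**/blog*'], label: 'BLOG' });
});

// Each label gets its own "context aware" handler.
router.addHandler('PRICING', async ({ page }) => {
    const html = await page.content(); // e.g. pass to ChatGPT to classify the pricing model
});

router.addHandler('SECURITY', async ({ page }) => {
    const hasSoc2 = (await page.content()).includes('SOC2');
});

router.addHandler('CAREERS', async ({ enqueueLinks }) => {
    await enqueueLinks({ limit: 100, label: 'JOB_POSTING' });
});

router.addHandler('JOB_POSTING', async ({ page }) => {
    // process the individual job posting
});

router.addHandler('BLOG', async ({ page }) => {
    const articleCount = await page.locator('article').count();
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['acme.com', 'foo.com', 'bar.com', 'helloworld.org'].map((d) => `https://${d}`));
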
1 comment