Apify Discord Mirror

Updated 5 months ago

Has anyone found a solution to run Crawlee inside a REST API on demand?

At a glance

The community member has a Node.js API that starts a crawler, but they are having trouble managing the request queue to handle additional and concurrent API calls. They need to run the crawler in their own cloud due to constraints. The community members discuss various approaches using Apify, Crawlee, and puppeteer-cluster. Suggestions include running multiple crawlers concurrently, manipulating the crawler.autoscaledPool, and adjusting the autoscaledPoolOptions.isFinishedFunction to keep the crawler running even when the queue is empty. The community members also discuss hosting options and the challenge of meeting a 5-second response-time requirement for on-demand crawling.

I have managed to get parts of it working, such that I have a Node.js API that starts my crawler.
I have yet to manage the request queue to handle additional and concurrent API calls, so I would just like to know whether someone has had any luck implementing such a solution.
My particular use case for this API requires running in my own cloud instead of on Apify.
23 comments
It depends on the data case. I created several small Cheerio actors, runnable under 128 MB of RAM; for a single data request the run is done in 4-5 seconds. I still prefer to read data from the dataset, but it can actually be delivered as a regular API: Apify.main(async () => { ... return finalJSONData; })
Then you can POST to https://api.apify.com/v2/acts/your~actor/run-sync?token=... and get the data in the response.
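For reference, a minimal sketch of calling that run-sync endpoint from Node.js; the actor name, token, and input fields are placeholders:

JavaScript
// POST to the Apify run-sync endpoint and read the actor's output.
// 'your~actor' and YOUR_TOKEN are placeholders for your own values.
const res = await fetch(
  'https://api.apify.com/v2/acts/your~actor/run-sync?token=YOUR_TOKEN',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ startUrl: 'https://example.com' }), // actor input
  },
);
const data = await res.json(); // the actor's output
console.log(data);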
Interesting, however my use case requires running the Crawlee crawler in my own cloud because of some constraints, so I will try to work with it a bit more. To scale the API, I am not sure whether Crawlee supports running multiple crawlers at the same time or whether I should start a separate instance.
I need to do on-demand crawling with PlaywrightCrawler within 5 seconds and scale when more people hit the API, so it is a bit of an edge case.
You can run more crawlers at the same time; you just need to assign a new queue or list to each and clean up afterwards if needed (see the sketch below).

You can also keep a crawler running and just keep filling the queue. For that you need to manipulate the crawler.autoscaledPool.
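A minimal sketch of the multiple-crawlers approach, assuming Crawlee's CheerioCrawler; the queue names and URLs are illustrative:

JavaScript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Each crawl gets its own named queue so concurrent runs don't interfere.
async function runCrawl(queueName, urls) {
  const requestQueue = await RequestQueue.open(queueName);
  await requestQueue.addRequests(urls.map((url) => ({ url })));

  const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, $ }) => {
      console.log(`${queueName}: ${request.url} -> ${$('title').text()}`);
    },
  });

  await crawler.run();
  await requestQueue.drop(); // clean up the queue afterwards
}

// Two independent crawls running side by side.
await Promise.all([
  runCrawl('job-1', ['https://example.com']),
  runCrawl('job-2', ['https://example.org']),
]);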
Hmm, from past experience I would suggest creating the cheapest DigitalOcean instance for $5 a month and running the crawler endlessly in an ExpressJS wrapper; otherwise the 5-second challenge will become a pretty big issue.
If you feel confident about DevOps you can get a server twice as big from contabo.com for the same price, or check other hosting options.
Thanks, that is a good suggestion. I am using Azure Container Apps or Kubernetes in Azure to handle the container. Locally I get data from the site within 6-7 seconds, which is also fine. So the only thing I need to solve at the moment is keeping the crawler running as long as there are API requests, and running a crawl for each request.
I am quite confident with DevOps
I tried adding the following, but it stops the crawler after the first crawl, because the queue is apparently empty even though I add a new URL to it in my GET API endpoint:

JavaScript
autoscaledPoolOptions: {
  minConcurrency: 1,
},
What can I pass to await crawler.run() to keep the crawler running until I explicitly tell it to stop?
I currently have this code, which works and keeps the crawler running, but sometimes the crawler does not wait until the price appears for the domain on the site, the crawler does not shut down the browser once processing is done, and the API does not finish and return data to the client. So if you have any suggestions on how the code should look to get an API that can run multiple crawlers at the same time, each independent of the other API calls, that would be nice πŸ™‚
https://gist.github.com/Trubador/b67a6b78cafec99f191b7aa33f2ed654
I'm also solving a similar problem right now. I used to use puppeteer-cluster (example implementation: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js). Now I've decided to migrate to Crawlee and found that it doesn't seem to have equivalent functionality.
Yeah, a lot of customization is needed. I have the API working; however, it can only process one request right now before needing to be restarted.
I don't think that's the coolest solution 😦
No, definitely not. It needs to handle multiple concurrent requests.
It should be pretty evident from the code how I want to solve this.
You need to adjust the isFinishedFunction of the autoscaledPoolOptions (https://crawlee.dev/api/core/class/AutoscaledPool). This way you can keep the crawler running even if the queue is empty.
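A minimal sketch of that approach, assuming an Express wrapper around a long-lived PlaywrightCrawler; the endpoint paths, queue name, and port are illustrative:

JavaScript
import express from 'express';
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const app = express();
const queue = await RequestQueue.open('api-queue');

let shouldStop = false;

const crawler = new PlaywrightCrawler({
  requestQueue: queue,
  autoscaledPoolOptions: {
    // Never report "finished" until told to, so the crawler idles
    // instead of exiting when the queue runs dry.
    isFinishedFunction: async () => shouldStop,
  },
  requestHandler: async ({ request, page }) => {
    // Site-specific scraping goes here.
    console.log(`Processed ${request.url}: ${await page.title()}`);
  },
});

// Start the crawler without awaiting it, so Express can keep serving.
const crawlerPromise = crawler.run();

app.get('/crawl', async (req, res) => {
  // Each API call just enqueues a URL; the long-lived crawler picks it up.
  await queue.addRequest({ url: String(req.query.url) });
  res.json({ status: 'queued' });
});

app.post('/stop', async (_req, res) => {
  shouldStop = true; // isFinishedFunction now returns true
  await crawlerPromise;
  res.json({ status: 'stopped' });
});

app.listen(3000);

Note that returning the scraped data in the same HTTP response still requires correlating results with requests, for example via request.userData and a map of pending promises. Newer Crawlee versions also expose a keepAlive crawler option that keeps the pool running on an empty queue without a custom isFinishedFunction.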
It's site-specific; I did not check your target site. In the past I used a crawler with a permanently opened page and performed all the other logic on that page instance. This way you save the time of opening a page, and if you can figure out how the page's internal web app works, you can mimic the calls to its data via fetch() inside the browser. If anything else works faster for browser-based scraping, I will be surprised πŸ˜‰
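As a sketch of the permanently-open-page idea, here is plain Playwright rather than Crawlee, just to show the pattern; the site URL and the internal endpoint are hypothetical:

JavaScript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com'); // open once, keep the page warm

async function fetchPrice(domain) {
  // Runs inside the browser context, so the web app's cookies and
  // session are reused; '/api/price' is a hypothetical endpoint.
  return page.evaluate(async (d) => {
    const res = await fetch(`/api/price?domain=${encodeURIComponent(d)}`);
    return res.json();
  }, domain);
}

console.log(await fetchPrice('example.org'));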
Thanks, I will try that πŸ™‚
I don't completely understand your comment. Do you mean calling the site's APIs directly in case the client application is not hosted server-side?