Apify Discord Mirror

Updated 5 months ago

Has anyone found a solution to run Crawlee inside a REST API on demand?

At a glance

The community member has a Node.js API that starts a crawler, but they are having trouble managing the request queue to handle additional and concurrent API calls. They need to run the crawler in their own cloud due to constraints. The community members discuss various approaches using Apify, Crawlee, and puppeteer-cluster. Suggestions include running multiple crawlers concurrently, manipulating the crawler.autoscaledPool, and adjusting the autoscaledPoolOptions.isFinishedFunction to keep the crawler running even when the queue is empty. The community members also discuss hosting options and the challenge of meeting a 5-second response-time requirement for on-demand crawling.

I have managed to get parts of it working, such that I have a Node.js API that starts my crawler.
I have yet to manage the request queue to handle additional and concurrent API calls, so I would just like to know whether someone has had any luck implementing such a solution.
My particular use case for this API requires running in my own cloud instead of on Apify.
23 comments
It depends on the data case. I created several small Cheerio actors, runnable under 128 MB of RAM; for a single data request the run is done in 4-5 seconds. I still prefer to read data from the dataset, but it can actually be delivered as a regular API: Apify.main(async () => { ... return finalJSONData; })
Then you can POST to https://api.apify.com/v2/acts/your~actor/run-sync?token=... and get the data in the response.
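For reference, a minimal sketch of calling that run-sync endpoint from Node.js; the actor name, token, and input fields are placeholders:

JavaScript
// POST to the Apify run-sync endpoint and read the actor's output.
// 'your~actor' and YOUR_TOKEN are placeholders for your own values.
const res = await fetch(
  'https://api.apify.com/v2/acts/your~actor/run-sync?token=YOUR_TOKEN',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ startUrl: 'https://example.com' }), // actor input
  },
);
const data = await res.json(); // the actor's output
console.log(data);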
Interesting, however my use case requires running the Crawlee crawler in my own cloud because of some constraints, so I will try to work with it a bit more. To scale the API, I am not sure whether Crawlee supports running multiple crawlers at the same time or whether I should start a separate instance.
I need to do on-demand crawling with PlaywrightCrawler within 5 seconds and scale when more people hit the API, so it is a bit of an edge case.
You can run more crawlers at the same time; you just need to assign a new queue or list to each and clean up afterwards if needed (see the sketch below).

You can also keep a crawler running and just keep filling the queue. For that you need to manipulate the crawler.autoscaledPool.
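A minimal sketch of the multiple-crawlers approach, assuming Crawlee's CheerioCrawler; the queue names and URLs are illustrative:

JavaScript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Each crawl gets its own named queue so concurrent runs don't interfere.
async function runCrawl(queueName, urls) {
  const requestQueue = await RequestQueue.open(queueName);
  await requestQueue.addRequests(urls.map((url) => ({ url })));

  const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, $ }) => {
      console.log(`${queueName}: ${request.url} -> ${$('title').text()}`);
    },
  });

  await crawler.run();
  await requestQueue.drop(); // clean up the queue afterwards
}

// Two independent crawls running side by side.
await Promise.all([
  runCrawl('job-1', ['https://example.com']),
  runCrawl('job-2', ['https://example.org']),
]);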
Hmm, from past experience I would suggest creating the cheapest DigitalOcean instance for $5 a month and running the crawler endlessly in an ExpressJS wrapper; otherwise the 5-second challenge will become a pretty big issue.
If you feel confident about DevOps you can get a server twice as big from contabo.com for the same price, or check other hosting options.
Thanks, that is a good suggestion. I am using Azure Container Apps or Kubernetes in Azure to handle the container. Locally I get data from the site within 6-7 seconds, which is also fine. So the only thing I need to solve at the moment is keeping the crawler running as long as there are API requests, and running a crawl for each request.
I am quite confident with DevOps
I tried adding the following, but it stops the crawler after the first crawl, because the queue is apparently empty even though I add a new URL to it in my GET API endpoint:

JavaScript
autoscaledPoolOptions: {
  minConcurrency: 1,
},
What can I pass to await crawler.run() to keep the crawler running until I explicitly tell it to stop?
I currently have this code, which works and keeps the crawler running, but sometimes the crawler does not wait until the price appears for the domain on the site, the crawler does not shut down the browser once processing is done, and the API does not finish and return data to the client. So if you have any suggestions on how the code should look to get an API that can run multiple crawlers at the same time, each independent of the other API calls, that would be nice πŸ™‚
https://gist.github.com/Trubador/b67a6b78cafec99f191b7aa33f2ed654
I'm also solving a similar problem right now. I used to use puppeteer-cluster (example implementation: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js). Now I've decided to migrate to Crawlee and found that it doesn't seem to have equivalent functionality.
Yeah, a lot of customization is needed. I have the API working; however, it can only process one request right now before needing to be restarted.
I don't think that's the coolest solution 😦
No, definitely not. It needs to handle multiple concurrent requests.
It should be pretty evident from the code how I want to solve this.
You need to adjust the isFinishedFunction of the autoscaledPoolOptions (https://crawlee.dev/api/core/class/AutoscaledPool). This way you can keep the crawler running even if the queue is empty.
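A minimal sketch of that approach, assuming an Express wrapper around a long-lived PlaywrightCrawler; the endpoint paths, queue name, and port are illustrative:

JavaScript
import express from 'express';
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const app = express();
const queue = await RequestQueue.open('api-queue');

let shouldStop = false;

const crawler = new PlaywrightCrawler({
  requestQueue: queue,
  autoscaledPoolOptions: {
    // Never report "finished" until told to, so the crawler idles
    // instead of exiting when the queue runs dry.
    isFinishedFunction: async () => shouldStop,
  },
  requestHandler: async ({ request, page }) => {
    // Site-specific scraping goes here.
    console.log(`Processed ${request.url}: ${await page.title()}`);
  },
});

// Start the crawler without awaiting it, so Express can keep serving.
const crawlerPromise = crawler.run();

app.get('/crawl', async (req, res) => {
  // Each API call just enqueues a URL; the long-lived crawler picks it up.
  await queue.addRequest({ url: String(req.query.url) });
  res.json({ status: 'queued' });
});

app.post('/stop', async (_req, res) => {
  shouldStop = true; // isFinishedFunction now returns true
  await crawlerPromise;
  res.json({ status: 'stopped' });
});

app.listen(3000);

Note that returning the scraped data in the same HTTP response still requires correlating results with requests, for example via request.userData and a map of pending promises. Newer Crawlee versions also expose a keepAlive crawler option that keeps the pool running on an empty queue without a custom isFinishedFunction.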
It's site-specific; I did not check your target site. In the past I used a crawler with a permanently opened page and performed all the other logic on that page instance. This way you save the time of opening a page, and if you can figure out how the page's internal web app works, you can mimic the calls to its data via fetch() inside the browser. If anything else works faster for browser-based scraping, I will be surprised πŸ˜‰
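As a sketch of the permanently-open-page idea, here is plain Playwright rather than Crawlee, just to show the pattern; the site URL and the internal endpoint are hypothetical:

JavaScript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com'); // open once, keep the page warm

async function fetchPrice(domain) {
  // Runs inside the browser context, so the web app's cookies and
  // session are reused; '/api/price' is a hypothetical endpoint.
  return page.evaluate(async (d) => {
    const res = await fetch(`/api/price?domain=${encodeURIComponent(d)}`);
    return res.json();
  }, domain);
}

console.log(await fetchPrice('example.org'));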
Thanks, I will try that πŸ™‚
I don't completely understand your comment. Do you mean calling the site's APIs directly in case the client application is not hosted server-side?