Apify and Crawlee Official Forum

Crawler for SPAs (Single Page Application)

Hi all!

My target is to scrape a website composed of SPAs (Single Page Applications), and it looks like the existing browser crawlers (i.e. PlaywrightCrawler and PuppeteerCrawler) are not a good fit, as each request is processed in a new page, which is a waste of resources.

What I need is to open one browser page and execute multiple XHR / fetch requests to their unofficial API, until I get blocked and need to re-open a new browser page to continue until all requests have been processed.
Note that I need a browser to pass fingerprint checks and to use the website's internal library to digitally sign each request to their unofficial API.

I'm thinking of solving this by writing a SinglePageBrowserCrawler that extends BasicCrawler and works similarly to BrowserCrawler, but manages browser pages differently.

Is this a good idea? Is there a better way to do it?

Thanks in advance for your feedback!
13 comments
Overall, in my experience, it's better to reuse existing solutions as much as possible before implementing something by hand.
I'm not very proficient with Crawlee, but I'd like to draw attention to a couple of things that could help you. There is a function sendRequest (from Got Scraping) which can be used for inner requests (while Playwright loads the main request page). It's available in the CrawlingContext, same as crawler, log, pushData, etc.
Also, you could add these API requests to the Request Queue and in your parse-JSON handler retrieve the webpage's body and do something like await page.body.toJson() (not precise syntax, but that's the idea).
I also just found out about the skipNavigation parameter, so you could speed up the API crawling using a RequestQueue without opening the pages in a browser (docs).
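For illustration, a rough sketch of how sendRequest and skipNavigation could fit together; the extractApiUrls helper is hypothetical (how you discover the API URLs depends on the site), and the JSON handling is an assumption:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, sendRequest, pushData, crawler }) {
        if (request.label === 'API') {
            // skipNavigation requests never load a page; sendRequest (Got Scraping)
            // fetches the endpoint directly with the crawler's session/proxy.
            const { body } = await sendRequest({ responseType: 'json' });
            await pushData(body);
            return;
        }
        // Regular page: collect the internal API URLs and enqueue them as
        // lightweight, navigation-free requests.
        const apiUrls = await page.evaluate(() => window.extractApiUrls?.() ?? []);
        await crawler.addRequests(
            apiUrls.map((url) => ({ url, label: 'API', skipNavigation: true })),
        );
    },
});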
Thanks, I agree with you; I also don't like wasting time re-inventing the wheel. That's why I shared my problem here 🙂

I already tried using skipNavigation and sendRequest, but PlaywrightCrawler and PuppeteerCrawler still open and close a page for each request.

This is a problem for me, because each request I make to the website's internal API needs to be digitally signed via an algorithm contained in the website page. I tried to use sendRequest, but I was always blocked, even when I passed the headers and cookies; I suspect the digital signature expires quickly (I might also have missed something, since the fetch function is overridden by the website to add this digital signature).

In the meantime I made a class that extends BrowserCrawler and overrides _runRequestHandler (where the page is opened) and _cleanupContext (where the page is closed). I'm very grateful to Apify for open-sourcing and documenting Crawlee, as I was able to come up with this solution relatively quickly, but again there might be a better way to do it.

I also tried another solution: I gave only one request to the crawler, to open the homepage of the website I want to scrape, and saved all my API calls in the userData:
JavaScript
await crawler.addRequests([
    { url: 'https://website-to-scrap.com', userData: { apiUrls: ['internal API url 1', 'internal API url 2', /* ... */] } },
]);

The problem with this solution is that I have to manage the actual requests all by myself, so I can't leverage Crawlee features such as parallelization and auto-retry.
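For reference, a minimal sketch of that single-request variant, assuming the site's overridden window.fetch attaches the signature when called inside the page (the URLs are the placeholders from the snippet above); the for-loop below is exactly the part that bypasses Crawlee's parallelization and auto-retry:

JavaScript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, pushData }) {
        // All API calls ride on this one page, so its signed fetch can be reused.
        const { apiUrls = [] } = request.userData;
        for (const apiUrl of apiUrls) {
            const data = await page.evaluate(async (url) => {
                // The site overrides fetch to add its digital signature.
                const res = await fetch(url);
                return res.json();
            }, apiUrl);
            await pushData(data);
        }
    },
});

await crawler.addRequests([
    { url: 'https://website-to-scrap.com', userData: { apiUrls: ['internal API url 1', 'internal API url 2'] } },
]);
await crawler.run();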
Interesting topic. In my case I created two crawlers: one with Puppeteer for logging in and saving the necessary details, and those details I inject into a second BasicCrawler which finishes the job. Seems fine to me.
It's a smart strategy; unfortunately, in my case I need to keep the Puppeteer page open, as I have to constantly invoke functions inside this page to digitally sign my forged HTTP requests to their API.
If I don't use Crawlee, what I usually do with Puppeteer is fire the AJAX request in page.evaluate, and when it fails I just fix it with Puppeteer and continue.
Generally, we solve SPAs with CheerioCrawler: the data is either in the initial HTML in scripts and/or in XHRs that you need to compose. Sometimes that might be tricky and require some experience. https://docs.apify.com/academy/api-scraping

If that turns out to be too difficult, you can always fall back to browsers and clicking around.
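As a generic illustration of that approach (the script selector below is only an example of where SPAs often embed their initial state; it varies per site and is not specific to this one):

JavaScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, pushData }) {
        // Many SPAs ship their initial state as JSON inside a script tag.
        const raw = $('script#__NEXT_DATA__').html();
        if (raw) await pushData(JSON.parse(raw));
        // ...otherwise, compose the XHR endpoints the app calls and enqueue them.
    },
});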
A CheerioCrawler would be the best, but in my case the website's backend APIs require signed HTTP requests. So I would have to replicate their signature algorithm in order to send my HTTP requests from Node.js.

I tried this solution initially, by extracting the website's Webpack bundle responsible for signing requests and running it in a Node.js context. But I eventually abandoned this approach as the signature was rejected (I must have missed something).

I finally implemented a SinglePageBrowserCrawler that keeps pages open across requests (as long as no blocking is detected) and runs page.evaluate() to sign and execute fetch/XHR requests. It's an acceptable compromise for now: although it's more resource-intensive than a CheerioCrawler, it's much faster than clicking around, and also more flexible, as I can tweak the API parameters (e.g. more results per page).

If I have time in the future I will try again to convert my actor into a CheerioCrawler.
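Not the actual SinglePageBrowserCrawler code, but a rough sketch of the idea described above, using a plain BasicCrawler plus a manually managed Puppeteer page (the launch options, blocking detection, and target URL are placeholders):

JavaScript
import { BasicCrawler } from 'crawlee';
import puppeteer from 'puppeteer';

let browser;
let page;

// Reuse one page across requests; the page carries the site's signing library.
async function getPage() {
    if (!browser) browser = await puppeteer.launch();
    if (!page) {
        page = await browser.newPage();
        await page.goto('https://website-to-scrap.com');
    }
    return page;
}

const crawler = new BasicCrawler({
    async requestHandler({ request, pushData }) {
        const p = await getPage();
        const data = await p.evaluate(async (url) => {
            const res = await fetch(url); // fetch is overridden by the site, so the request is signed
            if (!res.ok) throw new Error(`Blocked with status ${res.status}`);
            return res.json();
        }, request.url);
        await pushData(data);
    },
    async errorHandler() {
        // Treat any failure as possible blocking: drop the page so the retried
        // request starts from a fresh browser page (and a fresh session).
        if (page) {
            await page.close().catch(() => {});
            page = null;
        }
    },
});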
Yeah, we also sometimes give up on reverse engineering if it takes a few hours with not much progress. Would you mind sharing the core of the SinglePageBrowserCrawler code? It seems like an elegant solution.
Sure! Not sure it's elegant, as I had to copy/paste code from BrowserCrawler and PuppeteerCrawler, but here you go!
What would be great is to have multiple "modes" in the existing BrowserCrawler in Crawlee:
  • open and close a page for each request
  • keep the page open across requests
My solution is a "hack" that will need to be updated every time a new release of Crawlee updates the BrowserCrawler.
Do you have a code sample you could possibly share showing how you sync cookies and other context from the PuppeteerCrawler to the BasicCrawler?
Sorry, I don't share cookies: I open the website, run fetch as much as I can, and when I get blocked I re-open a new browser page with a fresh session.