Thanks, I agree with you, I don't like wasting time re-inventing the wheel either. That's why I'm sharing my problem here 🙂
I already tried to use `skipNavigation` and `sendRequest`, but PlaywrightCrawler and PuppeteerCrawler open and close a page for each request. This is a problem for me, because each request I make to the website's internal API needs to be digitally signed by an algorithm contained in the website's page. I tried `sendRequest` but I was always blocked, even when I passed the headers and cookies; I suspect that the digital signature expires quickly (I might also have missed something, since the `fetch` function is overridden by the website to add this digital signature).
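For illustration, this is roughly what the signed call looks like when it is made from inside an open page via `page.evaluate()`, so the site's own (overridden) `fetch` does the signing; the endpoint below is just a placeholder:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        const data = await page.evaluate(async () => {
            // This fetch is the page's own, i.e. the one the website
            // overrides to attach the digital signature.
            const res = await fetch('/internal-api/endpoint');
            return res.json();
        });
        log.info('Signed API call result', { data });
    },
});
```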
In the meantime I made a class that extends `BrowserCrawler` and overrides `_runRequestHandler` (where the page is opened) and `_cleanupContext` (where the page is closed). I'm very grateful to Apify for open-sourcing and documenting Crawlee, as I was able to come up with this solution relatively quickly, but again there might be a better way to do that.
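Just to show the shape of that idea (this is not my actual code, and these are Crawlee internals that may change between versions), the `_cleanupContext` half looks roughly like this; the `_runRequestHandler` half, which reuses the already-open page, is omitted:

```js
import { PlaywrightCrawler } from 'crawlee';

// Rough sketch: BrowserCrawler normally closes the page in _cleanupContext,
// so doing nothing here keeps the page (and the site's signing code loaded
// in it) alive. On its own this only prevents the close; reusing a single
// page across requests also needs the _runRequestHandler side.
class PersistentPageCrawler extends PlaywrightCrawler {
    async _cleanupContext() {}
}
```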
I also tried another solution: I gave the crawler only one request, to open the homepage of the website I want to scrape, and saved all my API calls in the `userData`:
```js
await crawler.addRequests([
    { url: 'https://website-to-scrap.com', userData: { apiUrls: ['internal API url 1', 'internal API url 2', /* ... */] } },
]);
```
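The request handler then has to walk through those URLs itself, roughly like this (simplified sketch, error handling omitted):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // The API calls live in userData, so Crawlee's queue never sees
        // them: no per-URL retries or parallelization.
        for (const apiUrl of request.userData.apiUrls ?? []) {
            const data = await page.evaluate(async (url) => {
                // Again, use the page's overridden fetch so the request
                // gets signed.
                const res = await fetch(url);
                return res.json();
            }, apiUrl);
            log.info(`Fetched ${apiUrl}`, { data });
        }
    },
});
```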
The problem with this solution is that I have to manage the actual requests all by myself, so I can't leverage Crawlee features such as parallelization and auto-retry.