Apify and Crawlee Official Forum

How to make sure all external requests have been awaited and intercepted?

I'm scraping pages of a website as part of a content migration. Some of those pages make POST requests to Algolia (3-4 requests) on the client side, and I need to intercept those requests because I need some data that is sent in the request body. One important note: I don't know in advance which pages make the requests and which don't. Because of that, I need a way to wait for all the external requests FOR EACH PAGE and only start crawling the page HTML after that. That way, if I've waited for all the requests and still haven't intercepted an Algolia request, it means that specific page didn't make one.

I created a solution that seemed to work at first. However, after crawling the pages a few times, I noticed that sometimes the Algolia data wouldn't show up in the dataset for a few pages, even though I could confirm in the browser that those pages do make the Algolia request. So my guess is that it finishes crawling the page HTML before intercepting the Algolia request (?). Ideally, it would only start crawling the HTML AFTER all the external requests have finished. I used Puppeteer because I found addInterceptRequestHandler in the docs, but I could use Playwright if that's easier. Can someone here help me understand what I'm doing wrong? Here is a gist with the code I'm using: https://gist.github.com/lcnogueira/d1822287d718731a7f4a36f05d1292fc (I can't post it here, otherwise my message becomes too long)
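For context, here is a minimal sketch of the pattern described above in plain Puppeteer (the gist itself isn't reproduced here; the 'algolia' URL check and the 10-second timeout are illustrative assumptions, not values from the gist). The key point is that the listener is registered before navigation, so the request can't fire while nothing is listening:

Plain Text
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Start waiting for an Algolia request BEFORE navigating. A timeout is
// treated as "this page doesn't call Algolia" (assumed heuristic).
const algoliaBody = page
  .waitForRequest((req) => req.url().includes('algolia'), { timeout: 10_000 })
  .then((req) => req.postData())
  .catch(() => null); // timed out: assume no Algolia call on this page

await page.goto('https://example.com/some-page', { waitUntil: 'networkidle0' });

const body = await algoliaBody;
if (body !== null) {
  // The page did call Algolia; parse the POST body here.
}
// ...then scrape the page HTML as usual.
await browser.close();

This sketch waits for a single request; for the 3-4 requests mentioned above, one would collect bodies in an array from a page.on('request') listener instead.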
2 comments
Well, I found a simpler solution for this specific need... Instead of trying to intercept the requests, I just grabbed the script elements used to make the request and read the data from them. Something like:

Plain Text
const data = await page.$eval('html', (html) => {
  // Find the script tag that issues the request and read the data attribute off it.
  const script = Array.from(html.querySelectorAll('script')).find((s) => s.src.includes('my-script-name.js'));
  const field = script?.getAttribute('attribute-name');
  return field;
});


However, I'd still be interested in knowing how to pause my crawling until all the external requests have been made, so I can handle another need I have.
Like this: https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer#implementation-in-crawlee - set up the interception in preNavigationHooks and add a relevant "wait for" call in the request handler.
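A sketch of that suggestion, assuming a Crawlee PuppeteerCrawler: interception is wired up in preNavigationHooks (before the page navigates), and the handler waits for the network to go idle before reading the page, so client-side Algolia calls have had time to fire. The userData stash, the 'algolia' URL check, and the idle/timeout values are assumptions to be tuned for the target site:

Plain Text
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
  preNavigationHooks: [
    async ({ page, request }) => {
      request.userData.algoliaBodies = []; // hypothetical stash for intercepted bodies
      await page.setRequestInterception(true);
      page.on('request', (req) => {
        if (req.url().includes('algolia')) {
          request.userData.algoliaBodies.push(req.postData());
        }
        req.continue(); // always let the request through
      });
    },
  ],
  requestHandler: async ({ page, request }) => {
    // The "wait for" step: assume the Algolia calls have finished once the
    // network has been idle for a second.
    await page.waitForNetworkIdle({ idleTime: 1_000, timeout: 30_000 });
    await Dataset.pushData({
      url: request.url,
      algoliaBodies: request.userData.algoliaBodies, // empty array => no Algolia call
      html: await page.content(),
    });
  },
});

await crawler.run(['https://example.com/some-page']);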