Apify and Crawlee Official Forum

4unkur
Hi,
I have a Puppeteer scraper which worked just fine until this Monday. Nothing has changed, but the scraper stopped working.

The HTML markup of the page has not changed; the a[data-testid="search-listing-title"] element is still there. The Apify run log says it is failing to find this HTML element:
TimeoutError: waiting for selector a[data-testid="search-listing-title"] failed: timeout 30000ms exceeded
I have tried launching the scraper from my local machine and it worked, but it does not work on the Apify platform. I guess it has something to do with the proxy.
This is part of my code:
Plain Text
//...
const proxyConfiguration = await Apify.createProxyConfiguration();

const launchContext = {
  useChrome: true,
  stealth: true,
  launchOptions: {
    headless: true,
  },
};

const crawler = new Apify.PuppeteerCrawler({
  requestList,
  requestQueue,
  proxyConfiguration,
  launchContext: launchContext as any,
  maxRequestRetries: 5,
  handlePageTimeoutSecs: 180,
  navigationTimeoutSecs: 180,
  async handlePageFunction({ page, request }): Promise<void> {

    await utils.puppeteer.saveSnapshot(page, { key: 'beforescrap', saveHtml: false });
    const cheerio = load(await page.content());

    const html = cheerio.html();
    await Apify.setValue('htmlstring', html, { contentType: 'text/html' });

    await page.waitForSelector('a[data-testid="search-listing-title"]');
//...

I have tried taking a screenshot to see what the page looks like, and it gives a blank white page.
I have also tried changing the proxy settings to use residential servers and changing the country; that did not work either.
How can I debug this?
A screenshot of the logs is also attached.
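
One way to debug this (a sketch, not code from the post) is to catch the selector timeout and persist what the proxied browser actually rendered. This reuses the SDK v1 API from the snippet above, where utils is Apify.utils; the 'debug-snapshot' key name is just an example:
Plain Text
// Inside handlePageFunction:
try {
  await page.waitForSelector('a[data-testid="search-listing-title"]');
} catch (err) {
  // Persist a screenshot + HTML so the run's key-value store shows exactly
  // what the browser saw through the proxy ('debug-snapshot' is arbitrary).
  await utils.puppeteer.saveSnapshot(page, { key: 'debug-snapshot', saveHtml: true });
  utils.log.error(`Selector not found on ${request.url}`);
  throw err; // rethrow so maxRequestRetries retries with a fresh session
}

A blank screenshot plus a selector timeout usually means the site is serving a block page to the proxy's IP range, so comparing the saved HTML against a local run is often the quickest way to confirm the proxy theory.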
10 comments
Plain Text
for (let i = 1; i <= 5; i++) {
    const [response] = await Promise.all([
        page.waitForResponse((res) => res.url() !== locationURL && res.url().includes('/approved-used?')),
        page.click('.load-more button'),
    ]);

    const d = await response.json();
    // ...
}


Basically my code looks something like this: I click the "load more" button and want to grab the AJAX API call's response data. It works only on the first try; after that it fails because waitForResponse never receives the response.

Am I missing something?
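
One pattern that avoids racing every click against waitForResponse (a sketch, not the poster's code; locationURL is assumed to be defined as in the snippet above) is to register a single persistent response listener and collect matches into an array, so a paginated response cannot be missed even if it arrives between iterations:
Plain Text
const results: unknown[] = [];
page.on('response', async (res) => {
    // Collect every paginated response, including ones that arrive
    // while other work is still in flight.
    if (res.url() !== locationURL && res.url().includes('/approved-used?')) {
        try {
            results.push(await res.json());
        } catch {
            // body was not JSON or was already disposed; ignore
        }
    }
});

for (let i = 1; i <= 5; i++) {
    await page.click('.load-more button');
    // A fixed delay is crude, but illustrates giving the XHR time to land.
    await new Promise((resolve) => setTimeout(resolve, 2000));
}

The trade-off is that the listener decouples clicks from responses, so results must be correlated afterwards (e.g. by page number in the response payload).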
1 comment
Is it possible to fill in a login form, submit it, and get the cookies using CheerioCrawler?
The form has CSRF protection, so it's not just an API endpoint. Or should I stick to Puppeteer for such websites?
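
In principle yes, as long as the server embeds the CSRF token in the login page's HTML. A minimal hand-rolled sketch with got-scraping and tough-cookie (the same HTTP stack CheerioCrawler uses underneath); the URLs and the form field names _csrf, username, and password are placeholders, not taken from any real site:
Plain Text
import { load } from 'cheerio';
import { gotScraping } from 'got-scraping';
import { CookieJar } from 'tough-cookie';

const cookieJar = new CookieJar();

// 1. GET the login page; the jar records the session cookie.
const loginPage = await gotScraping({ url: 'https://example.com/login', cookieJar });
const $ = load(loginPage.body);

// 2. Read the CSRF token out of the form's hidden input (selector is a guess).
const csrfToken = String($('input[name="_csrf"]').val() ?? '');

// 3. POST the form with the token; the jar now holds the authenticated cookies.
await gotScraping({
    url: 'https://example.com/login',
    method: 'POST',
    form: { _csrf: csrfToken, username: 'user', password: 'pass' },
    cookieJar,
});

This only breaks down when the token is generated by JavaScript rather than rendered into the HTML; in that case a headless browser such as Puppeteer is the safer choice.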
2 comments
In the requestHandler I'm trying to click the pagination "next" button, and I cannot determine whether the content has changed or not.
How can I do it? waitForNetworkIdle does not seem to work here. Any ideas? See the GIF.

Plain Text
new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            page.on('response', async (res) => {
                if (res.url().includes('api/offersearches/filters')) {
                    try {
                        const json = await res.json();
                        const jsonString = JSON.stringify(json);
                        const filePath = 'data.json';
                        fs.appendFile(filePath, jsonString + '\n', () => {});
                    } catch (err) {
                        console.error('Response wasn\'t JSON or failed to parse response.');
                    }
                }
            });
        },
    ],
    async requestHandler({ request, page }) {
        for (let i = 0; i < maxNumberOfPages; i++) {
            const isDisabled = await page.evaluate(() => document.querySelector('[data-testid="mo-pagination-next"] button.mo-button--pagination').disabled);
            if (isDisabled) {
                break;
            }

            await Promise.all([
                page.waitForNetworkIdle(),
                page.click('[data-testid="mo-pagination-next"] button.mo-button--pagination'),
            ]);
            console.log('clicked'); // it never reaches
        }
    },
});


Here's my code so far. Currently the button is clicked OK and the data is fetched OK; it just hangs at the end. I guess waitForNetworkIdle never resolves.
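
One workaround (a sketch, not the poster's code) is to key the wait to the specific XHR the site fires on pagination instead of network idle, which long-polling or analytics traffic can keep from ever settling. This assumes the api/offersearches/filters endpoint already used in the hook above, and runs inside the same requestHandler loop:
Plain Text
const nextButton = '[data-testid="mo-pagination-next"] button.mo-button--pagination';

await Promise.all([
    // Resolve as soon as the pagination XHR answers, with a bounded timeout
    // so a missed response fails the request instead of hanging the crawler.
    page.waitForResponse(
        (res) => res.url().includes('api/offersearches/filters'),
        { timeout: 30_000 },
    ),
    page.click(nextButton),
]);

Waiting for the concrete response also doubles as the "did the content change" signal, since the resolved response carries the new page's data.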
4 comments
I'm trying to get the data from an AJAX POST call (GraphQL) on a webpage, but it does not seem to work.
I have tried running the crawler in headful mode and opening the network tab: the request is being made and the response is there, but waitForResponse does not seem to work.
Here's my code:
Plain Text
const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestQueue,
    maxRequestRetries: 5,
    navigationTimeoutSecs: 180,
    requestHandlerTimeoutSecs: 180,
    async requestHandler({ request, page }) {
        // ...
        log.warning('GraphQL starting to wait');

        await page.waitForNetworkIdle();

        log.warning('IDLE!!!');

        await page.waitForRequest(
            (req) => req.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
        );

        log.warning('GraphQL request is done');

        const response = await page.waitForResponse(
            (httpResponse) => httpResponse.status() === 200 && httpResponse.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
            { timeout: 180 * 1000 },
        );

        log.warning('GraphQL response arrived');

        const data = await response.json();
        // ...

As you can see, I also added waitForNetworkIdle for testing, and it finishes before waitForResponse, which is strange. See the logs:
Plain Text
INFO  Page opened. {"label":"vehicle","url":"https://www.autotrader.co.uk/car-details/202307270142806?sort=relevance&advertising-location=at_cars&make=Audi&model=A2&postcode=PO16%207GZ&fromsra"}
WARN  GraphQL starting to wait
WARN  IDLE!!!
WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Timed out after waiting 30000ms


Maybe I'm missing something?
By the way, the code was written for Apify SDK v1 and was working OK. I upgraded to v3 and it stopped working, or it works really slowly.
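
If the GraphQL call fires during the initial page load, then by the time requestHandler runs it has already completed, and both waitForRequest and waitForResponse sit waiting for a repeat that never comes; waitForNetworkIdle resolving first in the logs points the same way. A sketch of one way around it (not the poster's code): start the wait in a preNavigationHook so the listener is attached before navigation. Stashing the promise on request.userData is an assumption for illustration; it only works in-memory within the same navigation and is never serialized here:
Plain Text
import { PuppeteerCrawler } from 'crawlee';
import type { HTTPResponse } from 'puppeteer';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Attach the wait BEFORE goto() runs, so a GraphQL response that
            // arrives while the page is still loading is not missed.
            request.userData.graphqlResponse = page.waitForResponse(
                (res) => res.status() === 200
                    && res.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
                { timeout: 180 * 1000 },
            );
        },
    ],
    async requestHandler({ request }) {
        // By now the promise has either resolved with the captured response
        // or it will reject on timeout, failing the request for a retry.
        const response = await (request.userData.graphqlResponse as Promise<HTTPResponse>);
        const data = await response.json();
        // ...
    },
});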
9 comments