Apify and Crawlee Official Forum

4unkur
Hi,
I have a Puppeteer scraper which worked just fine until this Monday. Nothing has changed, but the scraper stopped working.

The HTML markup of the page has not changed; the a[data-testid="search-listing-title"] element is still there. The Apify run log says it is failing to find this HTML element:
TimeoutError: waiting for selector a[data-testid="search-listing-title"] failed: timeout 30000ms exceeded
I have tried launching the scraper from my local machine and it worked, but it does not work on the Apify platform. I guess it has something to do with the proxy.
This is part of my code:
Plain Text
//...
const proxyConfiguration = await Apify.createProxyConfiguration();

const launchContext = {
  useChrome: true,
  stealth: true,
  launchOptions: {
    headless: true,
  },
};

const crawler = new Apify.PuppeteerCrawler({
  requestList,
  requestQueue,
  proxyConfiguration,
  launchContext: launchContext as any,
  maxRequestRetries: 5,
  handlePageTimeoutSecs: 180,
  navigationTimeoutSecs: 180,
  async handlePageFunction({ page, request }): Promise<void> {

    await utils.puppeteer.saveSnapshot(page, { key: 'beforescrap', saveHtml: false });
    const cheerio = load(await page.content());

    const html = cheerio.html();
    await Apify.setValue('htmlstring', html, { contentType: 'text/html' });

    await page.waitForSelector('a[data-testid="search-listing-title"]');
//...

I have tried taking a screenshot to see what the page looks like, and it gives a blank white page.
I have also tried changing the proxy settings to use residential servers and changing the country; that did not work either.
How can I debug this?
A screenshot of the logs is also attached.
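
One way to debug this (a sketch, not code from the post) is to catch the selector timeout and persist what the proxied browser actually rendered. This reuses the SDK v1 API from the snippet above, where utils is Apify.utils; the 'debug-snapshot' key name is just an example:
Plain Text
// Inside handlePageFunction:
try {
  await page.waitForSelector('a[data-testid="search-listing-title"]');
} catch (err) {
  // Persist a screenshot + HTML so the run's key-value store shows exactly
  // what the browser saw through the proxy ('debug-snapshot' is arbitrary).
  await utils.puppeteer.saveSnapshot(page, { key: 'debug-snapshot', saveHtml: true });
  utils.log.error(`Selector not found on ${request.url}`);
  throw err; // rethrow so maxRequestRetries retries with a fresh session
}

A blank screenshot plus a selector timeout usually means the site is serving a block page to the proxy's IP range, so comparing the saved HTML against a local run is often the quickest way to confirm the proxy theory.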
10 comments
Plain Text
for (let i = 1; i <= 5; i++) {
    const [response] = await Promise.all([
        page.waitForResponse((res) => res.url() !== locationURL && res.url().includes('/approved-used?')),
        page.click('.load-more button'),
    ]);

    const d = await response.json();
    // ...
}


Basically my code looks something like this: I click the "load more" button and want to grab the AJAX API call's response data. It works only on the first try; after that it fails because waitForResponse never receives the response.

Am I missing something?
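
One pattern that avoids racing every click against waitForResponse (a sketch, not the poster's code; locationURL is assumed to be defined as in the snippet above) is to register a single persistent response listener and collect matches into an array, so a paginated response cannot be missed even if it arrives between iterations:
Plain Text
const results: unknown[] = [];
page.on('response', async (res) => {
    // Collect every paginated response, including ones that arrive
    // while other work is still in flight.
    if (res.url() !== locationURL && res.url().includes('/approved-used?')) {
        try {
            results.push(await res.json());
        } catch {
            // body was not JSON or was already disposed; ignore
        }
    }
});

for (let i = 1; i <= 5; i++) {
    await page.click('.load-more button');
    // A fixed delay is crude, but illustrates giving the XHR time to land.
    await new Promise((resolve) => setTimeout(resolve, 2000));
}

The trade-off is that the listener decouples clicks from responses, so results must be correlated afterwards (e.g. by page number in the response payload).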
1 comment
Is it possible to fill in a login form, submit it, and get the cookies using CheerioCrawler?
The form has CSRF protection, so it's not just an API endpoint. Or should I stick to Puppeteer for such websites?
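
In principle yes, as long as the server embeds the CSRF token in the login page's HTML. A minimal hand-rolled sketch with got-scraping and tough-cookie (the same HTTP stack CheerioCrawler uses underneath); the URLs and the form field names _csrf, username, and password are placeholders, not taken from any real site:
Plain Text
import { load } from 'cheerio';
import { gotScraping } from 'got-scraping';
import { CookieJar } from 'tough-cookie';

const cookieJar = new CookieJar();

// 1. GET the login page; the jar records the session cookie.
const loginPage = await gotScraping({ url: 'https://example.com/login', cookieJar });
const $ = load(loginPage.body);

// 2. Read the CSRF token out of the form's hidden input (selector is a guess).
const csrfToken = String($('input[name="_csrf"]').val() ?? '');

// 3. POST the form with the token; the jar now holds the authenticated cookies.
await gotScraping({
    url: 'https://example.com/login',
    method: 'POST',
    form: { _csrf: csrfToken, username: 'user', password: 'pass' },
    cookieJar,
});

This only breaks down when the token is generated by JavaScript rather than rendered into the HTML; in that case a headless browser such as Puppeteer is the safer choice.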
2 comments
In the requestHandler I'm trying to click the pagination "next" button, and I cannot determine whether the content has changed or not.
How can I do it? waitForNetworkIdle does not seem to work here. Any ideas? See the GIF.

Plain Text
new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            page.on('response', async (res) => {
                if (res.url().includes('api/offersearches/filters')) {
                    try {
                        const json = await res.json();
                        const jsonString = JSON.stringify(json);
                        const filePath = 'data.json';
                        fs.appendFile(filePath, jsonString + '\n', () => {});
                    } catch (err) {
                        console.error('Response wasn\'t JSON or failed to parse response.');
                    }
                }
            });
        },
    ],
    async requestHandler({ request, page }) {
        for (let i = 0; i < maxNumberOfPages; i++) {
            const isDisabled = await page.evaluate(() => document.querySelector('[data-testid="mo-pagination-next"] button.mo-button--pagination').disabled);
            if (isDisabled) {
                break;
            }

            await Promise.all([
                page.waitForNetworkIdle(),
                page.click('[data-testid="mo-pagination-next"] button.mo-button--pagination'),
            ]);
            console.log('clicked'); // it never reaches
        }
    },
});


Here's my code so far. Currently the button is clicked OK and the data is fetched OK; it just hangs at the end. I guess waitForNetworkIdle never resolves.
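
One workaround (a sketch, not the poster's code) is to key the wait to the specific XHR the site fires on pagination instead of network idle, which long-polling or analytics traffic can keep from ever settling. This assumes the api/offersearches/filters endpoint already used in the hook above, and runs inside the same requestHandler loop:
Plain Text
const nextButton = '[data-testid="mo-pagination-next"] button.mo-button--pagination';

await Promise.all([
    // Resolve as soon as the pagination XHR answers, with a bounded timeout
    // so a missed response fails the request instead of hanging the crawler.
    page.waitForResponse(
        (res) => res.url().includes('api/offersearches/filters'),
        { timeout: 30_000 },
    ),
    page.click(nextButton),
]);

Waiting for the concrete response also doubles as the "did the content change" signal, since the resolved response carries the new page's data.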
4 comments
I'm trying to get the data from an AJAX POST call (GraphQL) on a webpage, but it does not seem to work.
I have tried running the crawler in headful mode and opening the network tab: the request is being made and the response is there, but waitForResponse does not seem to work.
Here's my code:
Plain Text
const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestQueue,
    maxRequestRetries: 5,
    navigationTimeoutSecs: 180,
    requestHandlerTimeoutSecs: 180,
    async requestHandler({ request, page }) {
        // ...
        log.warning('GraphQL starting to wait');

        await page.waitForNetworkIdle();

        log.warning('IDLE!!!');

        await page.waitForRequest(
            (req) => req.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
        );

        log.warning('GraphQL request is done');

        const response = await page.waitForResponse(
            (httpResponse) => httpResponse.status() === 200 && httpResponse.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
            { timeout: 180 * 1000 },
        );

        log.warning('GraphQL response arrived');

        const data = await response.json();
        // ...

As you can see, I also added waitForNetworkIdle for testing, and it finishes before waitForResponse, which is strange. See the logs:
Plain Text
INFO  Page opened. {"label":"vehicle","url":"https://www.autotrader.co.uk/car-details/202307270142806?sort=relevance&advertising-location=at_cars&make=Audi&model=A2&postcode=PO16%207GZ&fromsra"}
WARN  GraphQL starting to wait
WARN  IDLE!!!
WARN  PuppeteerCrawler: Reclaiming failed request back to the list or queue. Timed out after waiting 30000ms


Maybe I'm missing something?
By the way, the code was written for Apify SDK v1 and was working OK. I upgraded to v3 and it stopped working, or it works really slowly.
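
If the GraphQL call fires during the initial page load, then by the time requestHandler runs it has already completed, and both waitForRequest and waitForResponse sit waiting for a repeat that never comes; waitForNetworkIdle resolving first in the logs points the same way. A sketch of one way around it (not the poster's code): start the wait in a preNavigationHook so the listener is attached before navigation. Stashing the promise on request.userData is an assumption for illustration; it only works in-memory within the same navigation and is never serialized here:
Plain Text
import { PuppeteerCrawler } from 'crawlee';
import type { HTTPResponse } from 'puppeteer';

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Attach the wait BEFORE goto() runs, so a GraphQL response that
            // arrives while the page is still loading is not missed.
            request.userData.graphqlResponse = page.waitForResponse(
                (res) => res.status() === 200
                    && res.url().includes(URL_PROPERTIES_DICTIONARY.GRAPHQL_PATH),
                { timeout: 180 * 1000 },
            );
        },
    ],
    async requestHandler({ request }) {
        // By now the promise has either resolved with the captured response
        // or it will reject on timeout, failing the request for a retry.
        const response = await (request.userData.graphqlResponse as Promise<HTTPResponse>);
        const data = await response.json();
        // ...
    },
});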
9 comments