error handling w/ playwright

At a glance

A community member reports that their scraper collects results-page and product URLs without trouble, but the Playwright crawler that visits the product pages scrapes about 30-40 URLs, then hits a crash error and re-scrapes a few old URLs at random before crashing. They shared the scraping logic, which uses Playwright to extract product details, and asked whether to change the timeout settings and how to handle missing data elements without crashing the crawler. They also asked how to keep the two crawlers independent and how to stop the Cheerio crawler once there are no more links to crawl. The suggested fix is to run the crawlers one after the other: run the Puppeteer crawler first to collect the product URLs, then pass them to the Cheerio crawler, giving each crawler its own named request queue. The member replied that they already use separate request queues, with requests scraped by the Cheerio crawler passed to the Playwright crawler, and that the issue persists. The final answer is that each crawler needs to be run separately.

hharish

i've been seeing this same error pattern across my scrapers for completely different sites. i thought it was an individual site problem, but now it's a repeating pattern
my scraper collects results page urls, then product urls, with no problem. but when it goes through those product urls with a playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly hits a crash error and randomly rescrapes a couple of old product urls before crashing

11 comments

hharish

here's the actual scraping logic - each site follows mostly the same pattern, with different tags and slightly different logic for the descriptions and shipping info:

hharish

Plain Text

publicGoodsPwRouter.addHandler('PUBLIC_GOODS_PRODUCT', async ({ page, request }) => {
    try {
        console.log('Scraping products');

        const site = 'Public Goods';

        const title = await page.$eval('h1.ProductMeta__Title.Heading.aos-init.aos-animate', (el) => el.textContent?.trim() || '');

        const descriptions = await page.$$eval('div.ProductMeta__Description--metafields.aos-init.aos-animate p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });
        let originalPrice = '';
        try {
            originalPrice = await page.$eval('span.ProductMeta__Price.Price.Price--compareAt.Text--subdued', (el) => el.textContent?.trim() || '');
        } catch (error) {
            console.log('Error retrieving original price:', error);
            // Handle the error or set a default value for originalPrice
            originalPrice = 'N/A';
        }
        const salePrice = await page.$eval('span.ProductMeta__Price.Price.Text--subdued', (el) => el.textContent?.trim() || '');


        const shippingInfo = await page.$$eval('div#tab-4 div.product-description.rte p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });


        const reviewScore = await page.$eval('span.sr-only', (el) => el.textContent?.trim() || '');
        const reviewNumber = await page.$eval('a.text-m', (el) => el.textContent?.trim() || '');

        const productData = {
            url: request.loadedUrl,
            site,
            title,
            descriptions,
            originalPrice,
            salePrice,
            shippingInfo,
            reviewScore,
            reviewNumber,
        };

        productList.push(productData);

        ....
    } catch (error) {
        console.log('Error scraping product:', error);
        publicGoodsPwQueue.reclaimRequest(request);
        return;
    }
});

hharish

do i need to change the timeout setting?
and also how do i deal with errors when one or more of the data elements aren't found, without crashing the crawler, while still scraping the other available info?
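
On the timeout question, a minimal sketch of where the relevant settings live, assuming Crawlee's `PlaywrightCrawler`. The option names `requestHandlerTimeoutSecs`, `navigationTimeoutSecs` and `maxRequestRetries` are Crawlee crawler options; the values and the `timeoutOptions` name are illustrative assumptions, not recommendations from the thread. The statistics log posted later in the thread reports an average of about 5.9 seconds per finished request and no failed requests, so timeouts may not be the cause here.

// Illustrative values only; tune them to the target site.
const timeoutOptions = {
    // Maximum time the request handler may run for one request.
    requestHandlerTimeoutSecs: 120,
    // Maximum time allowed for the page navigation itself.
    navigationTimeoutSecs: 60,
    // How many times a failed request is retried before it is marked as failed.
    maxRequestRetries: 3,
};

// Usage (assumes `PlaywrightCrawler` is imported from 'crawlee' and `router` exists):
// const crawler = new PlaywrightCrawler({ ...timeoutOptions, requestHandler: router });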

HHamza

To resolve the first issue, I would recommend running each crawler separately: first run the Puppeteer crawler and, once you have collected all of the product URLs, run the Cheerio crawler.

If you don't find data elements and you want to continue, you don't have to do anything; the request should automatically be marked as done.
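
Note that Playwright's `page.$eval` throws when no element matches the selector, which is why the handler above wraps the `originalPrice` lookup in its own `try/catch`. A minimal sketch of applying the same pattern to every field, so one missing element does not abort the rest of the product. `textOrDefault` and `scrapeFields` are hypothetical helper names, not part of Playwright or Crawlee, and the `'N/A'` fallback simply mirrors the value used in the posted code.

// Hypothetical helper: `page` is anything exposing Playwright's `$eval(selector, fn)`.
async function textOrDefault(page, selector, fallback = 'N/A') {
    try {
        const text = await page.$eval(selector, (el) => el.textContent?.trim() || '');
        // Treat an empty string the same as a missing element.
        return text || fallback;
    } catch (error) {
        // `$eval` throws when nothing matches; this also swallows other errors
        // (for example a closed page), so log it if you need to tell them apart.
        return fallback;
    }
}

// Hypothetical helper: resolves a { fieldName: selector } map into { fieldName: text }.
async function scrapeFields(page, selectors) {
    const result = {};
    for (const [field, selector] of Object.entries(selectors)) {
        result[field] = await textOrDefault(page, selector);
    }
    return result;
}

// Usage inside the request handler (selectors taken from the code above):
// const fields = await scrapeFields(page, {
//     reviewScore: 'span.sr-only',
//     reviewNumber: 'a.text-m',
// });

Because the helper never throws, the handler finishes normally even when a field is missing, so the request is not sent to the outer `catch` and reclaimed for another attempt.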

hharish

i've used an all-pw crawler and it doesn't run into these issues. it seems to occur when the AutoscaledPool scales up, after this message:
INFO Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5894,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":218083,"requestsTotal":37,"crawlerRuntimeMillis":60049,"retryHistogram":[37]}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.085},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

hharish

how can i make sure that the cheerio crawling is independent of the puppeteer crawling - they use different request queues, crawlers, and routers
and how can i make sure the cheerio crawler is stopped after there are no more links to crawl to not cause any of these errors

hharish

i also tested and saw that even if i use two playwright crawlers it doesn't work, so it's an issue with two crawlers colliding with each other when the autoscaled pool scales up

HHamza

What you can do is create a named queue for each crawler, or a requests array, and keep pushing requests into it. Once the Playwright crawler finishes, you can create the Cheerio crawler and pass the queue or requests array to it, and that should solve the issue.

Note that if you want to use queues, you need to create separate named queues for both crawlers.

Something like this:

Plain Text

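import { Actor } from 'apify';
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// `initialRequests` and `proxyConfiguration` are assumed to be defined elsewhere.
const playwrightQueue = await Actor.openRequestQueue('playwright_queue');
const cheerioQueue = await Actor.openRequestQueue('cheerio_queue');

// Add initial requests to the queue
await playwrightQueue.addRequests(initialRequests);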

// const cheerioRequests = [];

const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestQueue: playwrightQueue,
    requestHandler: async () => {
        // handle request...

        // Push the requests
        await cheerioQueue.addRequests(
            // ...
        );

        // OR
        // cheerioRequests.push(
            // ...
        // );
    },
});

// Run playwright crawler
await playwrightCrawler.run();

// Once it's done, run cheerio crawler
const cheerioCrawler = new CheerioCrawler({
    proxyConfiguration,
    // Pass the cheerio queue
    requestQueue: cheerioQueue,
    requestHandler: async () => { },
});

await cheerioCrawler.run();
// Or
// Pass the generated requests to cheerio
// await cheerioCrawler.run(cheerioRequests);

hharish

i already use separate request queues, and the scraped requests from the cheerio crawler are passed to the playwright crawler, which has a separate request queue

HHamza

Okay, then you need to run each crawler separately.


Apify Discord Mirror
