Apify and Crawlee Official Forum

Updated 3 months ago

error handling w/ playwright

I've been experiencing the same error pattern with my scrapers for completely different sites. I thought it was a problem with an individual site, but it's now a repeating pattern.
My scraper collects results-page URLs and then product URLs without any problem, but when it goes through those product URLs with a Playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly hits some crash error, randomly re-scrapes a couple of old product URLs, and crashes.
11 comments
Here's the actual scraping logic - each site follows mostly the same pattern, with different tags and slightly different logic for the descriptions and shipping info:
Plain Text
publicGoodsPwRouter.addHandler('PUBLIC_GOODS_PRODUCT', async ({ page, request }) => {
    try {
        console.log('Scraping products');

        const site = 'Public Goods';

        const title = await page.$eval('h1.ProductMeta__Title.Heading.aos-init.aos-animate', (el) => el.textContent?.trim() || '');

        const descriptions = await page.$$eval('div.ProductMeta__Description--metafields.aos-init.aos-animate p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });
        let originalPrice = '';
        try {
            originalPrice = await page.$eval('span.ProductMeta__Price.Price.Price--compareAt.Text--subdued', (el) => el.textContent?.trim() || '');
        } catch (error) {
            console.log('Error retrieving original price:', error);
            // Handle the error or set a default value for originalPrice
            originalPrice = 'N/A';
        }
        const salePrice = await page.$eval('span.ProductMeta__Price.Price.Text--subdued', (el) => el.textContent?.trim() || '');


        const shippingInfo = await page.$$eval('div#tab-4 div.product-description.rte p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });


        const reviewScore = await page.$eval('span.sr-only', (el) => el.textContent?.trim() || '');
        const reviewNumber = await page.$eval('a.text-m', (el) => el.textContent?.trim() || '');

        const productData = {
            url: request.loadedUrl,
            site,
            title,
            descriptions,
            originalPrice,
            salePrice,
            shippingInfo,
            reviewScore,
            reviewNumber,
        };

        productList.push(productData);

        ....
    } catch (error) {
        console.log('Error scraping product:', error);
        publicGoodsPwQueue.reclaimRequest(request);
        return
    }    
});
Do I need to change the timeout setting?
Also, how do I deal with errors when one or more of the data elements aren't found, without crashing the crawler and while still scraping the other available info?
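For reference, both of those timeouts are options on the crawler itself rather than the router. A minimal sketch, reusing the queue and router names from the snippet above (the values are placeholders, not recommendations):

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestQueue: publicGoodsPwQueue,
    requestHandler: publicGoodsPwRouter,
    // Maximum time a single requestHandler call may take before the request is marked as failed
    requestHandlerTimeoutSecs: 120,
    // Maximum time allowed for the page navigation itself
    navigationTimeoutSecs: 60,
});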
To resolve the first issue, I would recommend running each crawler separately: first run the Puppeteer crawler, and once you have collected all of the product URLs, run the Cheerio crawler.

If some data elements aren't found and you want to continue, you don't have to do anything; it should automatically mark the request as done.
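One way to keep a single missing field from failing the whole handler - a sketch, not code from this thread - is to wrap each optional lookup in a small helper that catches the "element not found" error and returns a fallback, much like the originalPrice block above but reusable:

Plain Text
import type { Page } from 'playwright';

// Returns the trimmed text of the first element matching the selector,
// or the fallback value when nothing matches, instead of throwing.
async function textOrDefault(page: Page, selector: string, fallback = 'N/A'): Promise<string> {
    try {
        return await page.$eval(selector, (el) => el.textContent?.trim() || '');
    } catch {
        return fallback;
    }
}

// Usage inside the handler, with selectors taken from the snippet above:
// const reviewScore = await textOrDefault(page, 'span.sr-only');
// const reviewNumber = await textOrDefault(page, 'a.text-m');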
I've used an all-Playwright crawler and it doesn't run into these issues. It seems to occur when the AutoscaledPool scales up, after this message:
INFO Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5894,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":218083,"requestsTotal":37,"crawlerRuntimeMillis":60049,"retryHistogram":[37]}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.085},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
How can I make sure that the Cheerio crawling is independent of the Puppeteer crawling? They use different request queues, crawlers, and routers.
And how can I make sure the Cheerio crawler is stopped after there are no more links to crawl, so it doesn't cause any of these errors?
I also tested and saw that even with two Playwright crawlers it doesn't work, so it's an issue with two crawlers colliding with each other when the AutoscaledPool scales up.
What you can do is create named queues (or request arrays) for each crawler and keep pushing the requests into them. Once the Playwright crawler finishes, you can create the Cheerio crawler and pass it the queue/requests array, and that should solve the issue.

Note that if you want to use queues, you need to create separate named queues for both crawlers.

Something like this:

Plain Text
const playwrightQueue = await Actor.openRequestQueue('playwright_queue');
const cheerioQueue = await Actor.openRequestQueue('cheerio_queue');

// Add initial requests to the queue
await playwrightQueue.addRequests(initialRequests);

// const cheerioRequests = [];

const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestQueue: playwrightQueue,
    requestHandler: async () => {
        // handle request...

        // Push the discovered requests into the Cheerio queue
        await cheerioQueue.addRequests(
            // ...
        );

        // OR
        // cheerioRequests.push(
            // ...
        // );
    },
});

// Run playwright crawler
await playwrightCrawler.run();

// Once it's done, run cheerio crawler
const cheerioCrawler = new CheerioCrawler({
    proxyConfiguration,
    // Pass the cheerio queue
    requestQueue: cheerioQueue,
    requestHandler: async () => { },
});

await cheerioCrawler.run();
// Or
// Pass the generated requests to cheerio
// await cheerioCrawler.run(cheerioRequests);
I already use separate request queues, and the scraped requests from the Cheerio crawler are passed to the Playwright crawler, which has its own request queue.
Okay, then you need to run each crawler separately.