here's the actual scraping logic - each site follows mostly the same pattern with different tags and slightly different logic for the descriptions and shipping info:
publicGoodsPwRouter.addHandler('PUBLIC_GOODS_PRODUCT', async ({ page, request }) => {
    try {
        console.log('Scraping products');
        const site = 'Public Goods';
        const title = await page.$eval('h1.ProductMeta__Title.Heading.aos-init.aos-animate', (el) => el.textContent?.trim() || '');
        const descriptions = await page.$$eval('div.ProductMeta__Description--metafields.aos-init.aos-animate p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });
        let originalPrice = '';
        try {
            originalPrice = await page.$eval('span.ProductMeta__Price.Price.Price--compareAt.Text--subdued', (el) => el.textContent?.trim() || '');
        } catch (error) {
            console.log('Error retrieving original price:', error);
            // Handle the error by setting a default value for originalPrice
            originalPrice = 'N/A';
        }
        const salePrice = await page.$eval('span.ProductMeta__Price.Price.Text--subdued', (el) => el.textContent?.trim() || '');
        const shippingInfo = await page.$$eval('div#tab-4 div.product-description.rte p', (paragraphs) => {
            return paragraphs.map((p) => p.textContent?.trim());
        });
        const reviewScore = await page.$eval('span.sr-only', (el) => el.textContent?.trim() || '');
        const reviewNumber = await page.$eval('a.text-m', (el) => el.textContent?.trim() || '');
        const productData = {
            url: request.loadedUrl,
            site,
            title,
            descriptions,
            originalPrice,
            salePrice,
            shippingInfo,
            reviewScore,
            reviewNumber,
        };
        productList.push(productData);
        // ...
    } catch (error) {
        console.log('Error scraping product:', error);
        await publicGoodsPwQueue.reclaimRequest(request);
        return;
    }
});
do i need to change the timeout setting?
and also, how do i deal with errors when one or more of the data elements aren't found, without crashing the crawler, while still scraping the other available info?
To resolve the first issue, I would recommend you run each crawler separately: first run the Puppeteer crawler, and once you collect all of the product URLs, run the Cheerio crawler.
If some data elements aren't found and you want to continue, you don't have to do anything; the request should automatically be marked as done.
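One caveat worth noting: in the handler above, `page.$eval` rejects when its selector matches nothing, which sends the whole request into the outer `catch`. A minimal sketch of one way around that (the helper `extractOrDefault` and its arguments are hypothetical names, not part of Crawlee or Playwright) is to wrap each extraction individually, so a missing element yields a fallback value instead of aborting the rest of the fields:

```typescript
// Hypothetical helper: run one extraction and fall back to a default
// value if it rejects (e.g. because the selector matched no element).
async function extractOrDefault<T>(
    extract: () => Promise<T>,
    fallback: T,
    label: string,
): Promise<T> {
    try {
        return await extract();
    } catch (error) {
        console.log(`Could not extract ${label}, using fallback:`, error);
        return fallback;
    }
}

// Inside the handler, each field then degrades gracefully on its own:
// const reviewScore = await extractOrDefault(
//     () => page.$eval('span.sr-only', (el) => el.textContent?.trim() ?? ''),
//     'N/A',
//     'review score',
// );
```

This keeps the outer try/catch for genuinely fatal errors while optional fields (original price, review score, etc.) simply fall back to `'N/A'`.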
i've used an all-Playwright crawler and it doesn't run into these issues - it seems to occur when the AutoscaledPool scales up, after this message:
INFO Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5894,"requestsFinishedPerMinute":37,"requestsFailedPerMinute":0,"requestTotalDurationMillis":218083,"requestsTotal":37,"crawlerRuntimeMillis":60049,"retryHistogram":[37]}
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.085},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
how can i make sure that the cheerio crawling is independent of the puppeteer crawling - they use different request queues, crawlers, and routers
and how can i make sure the cheerio crawler is stopped after there are no more links to crawl, so it doesn't cause any of these errors
i also tested and saw that even if i use two playwright crawlers it doesn't work, so it's an issue with two crawlers colliding with each other when the autoscaled pool scales up
What you can do is create a named queue (or a requests array) for each crawler and keep pushing requests into it; once the Playwright crawler finishes, you can create the Cheerio crawler and pass the queue/requests array to it, and that should solve the issue.
Note that if you want to use queues, you need to create separate named queues for both crawlers.
Something like this:
const playwrightQueue = await Actor.openRequestQueue('playwright_queue');
const cheerioQueue = await Actor.openRequestQueue('cheerio_queue');
// Add initial requests to the queue
await playwrightQueue.addRequests(initialRequests);
// const cheerioRequests = [];
const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestQueue: playwrightQueue,
    requestHandler: async () => {
        // handle request...
        // Push the requests
        await cheerioQueue.addRequests(
            // ...
        );
        // OR
        // cheerioRequests.push(
        //     ...
        // );
    },
});
// Run the playwright crawler
await playwrightCrawler.run();
// Once it's done, run the cheerio crawler
const cheerioCrawler = new CheerioCrawler({
    proxyConfiguration,
    // Pass the cheerio queue
    requestQueue: cheerioQueue,
    requestHandler: async () => { },
});
await cheerioCrawler.run();
// Or
// Pass the generated requests to cheerio
// await cheerioCrawler.run(cheerioRequests);
i already use separate request queues, and the scraped requests from the cheerio crawler are passed to the playwright crawler, which has a separate request queue
Okay, then you need to run each crawler separately.