Apify and Crawlee Official Forum

Updated 2 years ago

taking list of scraped urls and conducting multiple new scrapes

I have this code that scrapes product URLs from an Amazon results page. I am able to successfully scrape the product URLs, but I'm unable to take each link and scrape the needed info in another crawler.
Do I need another Cheerio router?
Also, how can I take each link once it is scraped, add it to a RequestList or RequestQueue, and then scrape the information from the URLs in that queue?
7 comments
Here is the code:
main.js:
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers'; // Replace with desired search keywords
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;
const startUrls = [searchUrl];

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there will always be 5 concurrent requests running at any time
    minConcurrency: 5,
    // Ensure the crawler doesn't exceed 15 concurrent requests running at any time
    maxConcurrency: 15,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
    // Define router to run crawl
    requestHandler: router
});

await crawler.run(startUrls);

routes.js:
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
import fs from 'fs';

export const router = createCheerioRouter();
const linkArray = [];

router.addHandler(async ({ $ }) => {
    // Scrape product links from search results page
    const productLinks = $('h2 a').map((_, el) => 'https://www.amazon.com' + $(el).attr('href')).get();
    console.log(`Found ${productLinks.length} product links`);

    // Add each product link to array (this is inside router[01])
    for (const link of productLinks) {
        const router02 = createCheerioRouter();
        router02.addDefaultHandler(async ({ $ }) => {
            const productInfo = {};
            productInfo.storeName = 'Amazon';
            productInfo.productTitle = $('span.a-size-large.product-title-word-break').text().trim();
            productInfo.productDescription = $('div.a-row.a-size-base.a-color-secondary').text().trim();
            productInfo.salePrice = $('span.a-offscreen').text().trim();
            productInfo.originalPrice = $('span.a-price.a-text-price').text().trim();
            productInfo.reviewScore = $('span.a-icon-alt').text().trim();
            productInfo.shippingInfo = $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim();

            // Write product info to JSON file
            if (productInfoList.length > 0) {
                const rawData = JSON.stringify(productInfo, null, 2);
                fs.appendFile('rawData.json', rawData, (err) => {
                    if (err) throw err;
                    console.log(`Product info written to rawData.json for ${link}`);
                });
            }
        });

        // router02.queue.addRequest({ url: link });
        const amazon = new CheerioCrawler({
            // Start the crawler right away and ensure there will always be at least 1 concurrent request running at any time
            minConcurrency: 1,
            // Ensure the crawler doesn't exceed 10 concurrent requests running at any time
            maxConcurrency: 10,
            // ...but also ensure the crawler never exceeds 400 requests per minute
            maxRequestsPerMinute: 400,
            // Define route for crawler to run on
            requestHandler: router02
        });
        await amazon.run(link);
        console.log('running link');
    }
});
Here is the console output I receive:
INFO  CheerioCrawler: Starting the crawl
Found 36 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected requests to be of type array but received type string {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":1}
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":1880,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":3,"requestTotalDurationMillis":1880,"requestsTotal":1,"crawlerRuntimeMillis":18054}
Here is the PDF as well with the code, in case you are confused about the different indents and what goes under each function.
That's a lot of code, but straight away I see that you're creating a second router. Why? You should use one router per crawler and differentiate the routes within it, e.g. with request.label. router.addHandler is not correct syntax here - you're not providing a label. It should be either the default handler, or router.addHandler('SEARCH_PAGE', async ...), while the first request, instead of just a URL, will be { url: searchUrl, label: 'SEARCH_PAGE' }. router02.queue.addRequest is also not correct either - it should be crawler.addRequests([...]), and the crawler instance is available as part of the handler context. A short sketch of this pattern is included after the links below. Some relevant links:
https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#router
https://crawlee.dev/api/cheerio-crawler/function/createCheerioRouter
https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext
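To illustrate the reply above, here is a minimal sketch of the single-router, label-based pattern (not the poster's exact code): the selectors and start URL are reused from the post, while the 'DETAIL' label, the Dataset.pushData call and the trimmed-down productInfo fields are assumptions.

routes.js (sketch):

import { createCheerioRouter, Dataset } from 'crawlee';

export const router = createCheerioRouter();

// Runs for the start request, which is labelled 'SEARCH_PAGE'
router.addHandler('SEARCH_PAGE', async ({ $, crawler, log }) => {
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    log.info(`Found ${productLinks.length} product links`);

    // Enqueue every product URL into the same crawler under a different label
    await crawler.addRequests(productLinks.map((url) => ({ url, label: 'DETAIL' })));
});

// Runs once per product detail page
router.addHandler('DETAIL', async ({ $, request }) => {
    const productInfo = {
        storeName: 'Amazon',
        productTitle: $('span.a-size-large.product-title-word-break').text().trim(),
        salePrice: $('span.a-offscreen').text().trim(),
        url: request.url,
    };
    // Crawlee's default dataset replaces the manual fs.appendFile bookkeeping
    await Dataset.pushData(productInfo);
});

main.js (sketch):

import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers';
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;

const crawler = new CheerioCrawler({
    minConcurrency: 5,
    maxConcurrency: 15,
    maxRequestsPerMinute: 250,
    requestHandler: router,
});

// The start request carries the label so the router knows which handler to run
await crawler.run([{ url: searchUrl, label: 'SEARCH_PAGE' }]);

With this shape there is only one crawler and one router, so the concurrency and rate limits apply to the search page and all product pages together, and no nested crawler is ever started inside a request handler.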