Apify and Crawlee Official Forum

Updated 2 years ago

accessing RequestQueue/RequestList for scraper

I have a CheerioCrawler that successfully crawls an Amazon search results page for product links.

I then want to add those links to a RequestQueue/RequestList (enqueueing each request from the RequestList into the RequestQueue), access the queue in a different route, and crawl that list of product links with the CheerioCrawler to extract the data I need. How can I do this?

This is what my code looks like:
routes.js
JavaScript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
import fs from 'fs';
export const router = createCheerioRouter();
router.addDefaultHandler(async ({ $ }) => {
  //Scrape product links from search results page
  const productLinks = $('h2 a').map((_, el) => 'https://www.amazon.com' + $(el).attr('href')).get();
  console.log(`Found ${productLinks.length} product links`);
  //Add each product link to array
  for (const link of productLinks) {
    router.addHandler(async ({ $ }) => {
    const productInfo = {};
    productInfo.storeName = 'Amazon';
    productInfo.productTitle = $('span.a-size-large.product-title-word-break').text().trim();
    productInfo.productDescription = $('div.a-row.a-size-base.a-color-secondary').text().trim();
    productInfo.salePrice = $('span.a-offscreen').text().trim();
    productInfo.originalPrice = $('span.a-price.a-text-price').text().trim();
    productInfo.reviewScore = $('span.a-icon-alt').text().trim();
    productInfo.shippingInfo = $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim();    
    // Write product info to JSON file
    if (productInfoList.length > 0) {
      const rawData = JSON.stringify(productInfo, null, 2);
      fs.appendFile('rawData.json', rawData, (err) => {
        if (err) throw err;
        console.log(`Product info written to rawData.json for ${link}`);
      });
    }
  });
    //router.queue.addRequest({ url: link });
    const amazon = new CheerioCrawler({
      // Keep at least 1 concurrent request running at any time
      minConcurrency: 1,
      // Ensure the crawler doesn't exceed 10 concurrent requests at any time
      maxConcurrency: 10,
      // ...but also ensure the crawler never exceeds 400 requests per minute
      maxRequestsPerMinute: 400,

      // Define route for crawler to run on
      requestHandler: router
    });
    await amazon.run(link);
    console.log('running link');
 }
});
This routes.js file is run from a main.js file:
JavaScript
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers'; // Replace with desired search keywords
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;

const startUrls = [searchUrl];

const crawler = new CheerioCrawler({
  // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
  minConcurrency: 5,
  // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
  maxConcurrency: 15,
  // ...but also ensure the crawler never exceeds 250 requests per minute
  maxRequestsPerMinute: 250,

  // Define router to run crawl
  requestHandler: router
});

await crawler.run(startUrls);
I want to change this to do what I described at the top. I am not using RequestList or RequestQueue yet in this code, and I am receiving this error:
Plain Text
INFO  CheerioCrawler: Starting the crawl
Found 31 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected `requests` to be of type `array` but received type `string`
 {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":1}
Found 35 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected `requests` to be of type `array` but received type `string`
 {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":2}
Found 35 product links
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. Expected `requests` to be of type `array` but received type `string`
 {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":3}
Found 27 product links
ERROR CheerioCrawler: Request failed and reached maximum retries. ArgumentError: Expected `requests` to be of type `array` but received type `string`
    at ow (C:\Users\haris\OneDrive\Documents\GitHub\crawleeScraper\my-crawler\node_modules\ow\dist\index.js:33:28)
    at CheerioCrawler.addRequests (C:\Users\haris\OneDrive\Documents\GitHub\crawleeScraper\my-crawler\node_modules\@crawlee\basic\internals\basic-crawler.js:493:26)
    at CheerioCrawler.run (C:\Users\haris\OneDrive\Documents\GitHub\crawleeScraper\my-crawler\node_modules\@crawlee\basic\internals\basic-crawler.js:421:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async file:///C:/Users/haris/OneDrive/Documents/GitHub/crawleeScraper/my-crawler/src/routes.js:45:5 {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","method":"GET","uniqueKey":"https://www.amazon.com/s?k=computers"}
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
Adding one more link from another reply in a different thread. You should be using crawler.addRequests(): https://crawlee.dev/api/next/cheerio-crawler/class/CheerioCrawler#addRequests
Did you find an elegant solution?