Apify and Crawlee Official Forum

harish
Offline, last seen 3 months ago
Joined August 30, 2024
In my code I want to send a JSON response to a file and save it.
How do I install the dependencies I need (mongoose, next, etc.) without errors, send the JSON using the Fetch API (or whatever method works with Apify), and save the data to a MongoDB database?
Here's the code:
6 comments
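A minimal sketch of one way to do this: push the scraped items to the run's default Apify dataset, then read them back and insert them into MongoDB with mongoose. The connection string, the placeholder item, and the schema-less model are assumptions, not taken from the original post.

TypeScript
import { Actor } from 'apify';
import mongoose from 'mongoose';

await Actor.init();

// Push scraped items to the run's default dataset (replaces writing rawData.json by hand).
await Actor.pushData([{ title: 'Example product', price: 19.99 }]); // illustrative item only

// Read everything back from the dataset.
const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

// Hypothetical connection string; a schema-less model just to show the insert.
await mongoose.connect(process.env.MONGODB_URI ?? 'mongodb://localhost:27017/scraper');
const Product = mongoose.model('Product', new mongoose.Schema({}, { strict: false }));
await Product.insertMany(items);

await mongoose.disconnect();
await Actor.exit();

Dependencies are installed the usual way (npm install apify crawlee mongoose); next is only needed if the project actually uses Next.js.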
I've tried saving the data I scrape from my actors to a rawData.json file, but I don't get a JSON output even though the scraping works.

How would I save the data in the Apify Console so that I can then take it into MongoDB and put it in my database?

I already have my MongoDB schema set up, so how do I save the data on the platform and access it afterwards?

Would I have to save it to an Apify dataset, and if so, how? And how would I also run it through a cleaning process in the same actor (or, if possible, a different actor) and THEN save it to a MongoDB database?

Would I have to install fs somehow in the Apify Console to make this work?

Here's what I have for saving the JSON file so far:
3 comments
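On the Apify platform there is no need for fs at all; a hedged sketch, assuming a Cheerio router handler with placeholder label and selectors, is to push each scraped item to the default dataset, which the Console then exposes under the run's Storage tab.

TypeScript
import { Dataset, createCheerioRouter } from 'crawlee';

export const router = createCheerioRouter();

// Hypothetical product handler; the label and selector are placeholders.
router.addHandler('PRODUCT', async ({ $, request, log }) => {
    log.info(`Scraping ${request.url}`);
    await Dataset.pushData({
        url: request.url,
        title: $('h1').first().text().trim(),
    });
});

Each pushData call becomes one dataset item, exportable from the Console as JSON or CSV, so there is no file handling to get wrong.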
I've tried saving the data I scrape from my actors to a rawData.json file, but I don't get a JSON output even though the scraping works.
How would I save the data in the Apify Console so that I can then take it into MongoDB and put it in my database? I already have my MongoDB schema set up, so how do I save the data on the platform and access it?
Here's what I have for saving the JSON file so far:
1 comment
I have multiple crawlers (primarily Playwright), one per site, and each works completely fine when I run only one crawler at a time.
I have tried running these crawlers concurrently through a scrape event emitted from the server, which emits an individual scrape event per site to start each crawler.
I'm hitting a lot of memory overloads, timed-out navigations, skipped products, and crawlers ending early.
Each crawler essentially takes base URLs (or scrapes those base URLs) to get product URLs, which are then scraped individually for the product page info.
3 comments
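A hedged sketch of one way to keep concurrent crawlers from starving each other: cap each crawler's concurrency, give it its own named request queue, and start the sites in small batches instead of all at once. The site list, limits, and handler are illustrative only.

TypeScript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Hypothetical per-site factory; real selectors and routing are omitted for brevity.
async function buildCrawler(site: string) {
    const requestQueue = await RequestQueue.open(site); // separate queue per site
    return new PlaywrightCrawler({
        requestQueue,
        maxConcurrency: 5,          // keeps per-crawler memory bounded
        navigationTimeoutSecs: 60,  // more headroom for slow pages
        maxRequestRetries: 3,
        requestHandler: async ({ page, log }) => {
            log.info(`Visited ${page.url()}`);
        },
    });
}

// Run two sites at a time rather than firing every scrape event at once.
const sites = ['siteA', 'siteB', 'siteC', 'siteD'];
for (let i = 0; i < sites.length; i += 2) {
    const batch = sites.slice(i, i + 2);
    await Promise.all(batch.map(async (site) => {
        const crawler = await buildCrawler(site);
        await crawler.run([`https://example.com/${site}`]); // placeholder start URL
    }));
}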
I've been experiencing the same error pattern with my scrapers across completely different sites. I thought it was a problem with an individual site, but it's now a repeating pattern.
My scraper collects results-page URLs and then product URLs, which it has no problem with, but when it goes through those product URLs with a Playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly hits some crash error, randomly re-scrapes a couple of old product URLs, and crashes.
11 comments
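To see what is actually failing, one small sketch (assuming Crawlee v3, where the error handlers receive the error as a second argument) is to log every retry and final failure explicitly; the real scraping logic would stay in the router, the handlers below only log.

TypeScript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 2,
    // Called before a failed request is retried.
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Retrying ${request.url}: ${error.message}`);
    },
    // Called once a request has exhausted all retries.
    failedRequestHandler: async ({ request, log }, error) => {
        log.error(`Gave up on ${request.url}: ${error.message}`);
    },
    requestHandler: async ({ page, log }) => {
        log.info(`Scraping ${page.url()}`);
    },
});

The re-scraping of a few old URLs just before the crash is typically the crawler retrying requests that failed earlier, so the logged error messages should show whether the underlying problem is memory pressure or navigation timeouts.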
I'm trying to get infinite scrolling to render in all the products while scraping them as the page is scrolled down.
I looked at the documentation but didn't understand how to do this:
Plain Text
kotnRouter.addHandler('KOTN_DETAIL', async ({ log, page, parseWithCheerio }) => {
    log.info(`Scraping product URLs`);
  
    const $ = await parseWithCheerio()

    const productUrls: string[] = [];
  
    $('a').each((_, el) => {
        let productUrl = $(el).attr('href');
        if (productUrl) {
          if (!productUrl.startsWith('https://')) {
            productUrl = 'https://www.kotn.com' + productUrl;
            if(productUrl.includes('/products')){
                productUrls.push(productUrl);

            }
          } 
        }
    });
  
    // Push unique URLs to the dataset
    const uniqueProductUrls = Array.from(new Set(productUrls));
  
    await Dataset.pushData({
      urls: uniqueProductUrls,
    });
  
    await Promise.all(uniqueProductUrls.map(link => kotnPw.addRequests([{ url: link, label: 'KOTN_PRODUCT' }])));
  
    linksCount += uniqueProductUrls.length;
  
    await infiniteScroll(page, {
        maxScrollHeight: 0,
    });

    console.log(uniqueProductUrls);
    console.log(`Total product links scraped so far: ${linksCount}`);
    // Run bronPuppet crawler once after pushing the first product requests
    if (linksCount === uniqueProductUrls.length) {
      await kotnPw.run();
    }
});
6 comments
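A hedged sketch of the usual order, assuming Crawlee's playwrightUtils helper is available: scroll first so all products are rendered, then parse the links. This is a reworked version of the handler above, not the original code.

TypeScript
import { playwrightUtils } from 'crawlee';

kotnRouter.addHandler('KOTN_DETAIL', async ({ page, parseWithCheerio, crawler, log }) => {
    // Scroll to the bottom first so lazy-loaded products are in the DOM.
    await playwrightUtils.infiniteScroll(page, { timeoutSecs: 60 });

    // Only parse the HTML after scrolling has finished.
    const $ = await parseWithCheerio();
    const productUrls = new Set<string>();
    $('a[href*="/products"]').each((_, el) => {
        const href = $(el).attr('href');
        if (href) productUrls.add(new URL(href, 'https://www.kotn.com').href);
    });

    log.info(`Found ${productUrls.size} product URLs after scrolling`);
    await crawler.addRequests([...productUrls].map((url) => ({ url, label: 'KOTN_PRODUCT' })));
});

Adding the labelled requests to the same crawler (or starting the second crawler once from the main script) also avoids calling kotnPw.run() from inside a handler.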
Hey, I was curious: when I'm scraping Amazon, what's a reasonable time frame for the scrape, considering I scrape each product link from the results page, then scrape each individual product page for its information, and also paginate through every results page until there are none left?
I previously scraped the product info straight off the product cards on the results page, but that sometimes gave dummy links leading to unrelated Amazon pages, and the product info was less accurate.
How can I increase the speed of my scrapes, especially since I want to add more and more scrapers in the future and run them all concurrently to save time? I'm aiming for quite a low scrape time of 10-15 seconds or less, and it's currently taking upwards of a minute.
This is a Cheerio crawler.
3 comments
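Throughput on a CheerioCrawler is mostly governed by its concurrency settings; a minimal sketch follows, with numbers that are only illustrative (on Amazon, anti-bot blocking and proxy quality will matter at least as much as raw concurrency).

TypeScript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    minConcurrency: 10,          // start with more parallel requests
    maxConcurrency: 50,          // let the autoscaled pool ramp up further
    maxRequestsPerMinute: 300,   // self-imposed rate limit
    maxRequestRetries: 2,
    requestHandler: async ({ $, request, log }) => {
        log.info(`Handled ${request.url} (${$('title').text()})`);
    },
});

await crawler.run(['https://www.amazon.com/s?k=example']); // placeholder start URL

Roughly, total time is (number of pages / effective concurrency) times per-page latency, so a hard 10-15 second budget for a full results-plus-product-pages crawl is ambitious; raising concurrency is what moves the number.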
I have my crawler set up to crawl and scrape a couple of sites, but I get an import problem when importing the router (which is the same for both sites, just with a different route per site) from both sites.
If I only import it from one site, it only runs that one site. How do I import it so it runs multiple sites, and so it can scale up to many more sites in the near future?

It can successfully scrape Amazon and eBay (the eBay tags are somewhat inaccurate), but only if I use the router from eBay or Amazon and remove the other URL from startUrls; otherwise it errors out because the AMAZON or EBAY label isn't defined anywhere.
12 comments
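One pattern that avoids the missing-label error, sketched under the assumption that both sites can share a single CheerioCrawler: register every label on one router and attach the label to each start URL. The URLs and handler bodies are placeholders.

TypeScript
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('AMAZON', async ({ request, log }) => {
    log.info(`Amazon results page: ${request.url}`);
});

router.addHandler('EBAY', async ({ request, log }) => {
    log.info(`eBay results page: ${request.url}`);
});

const crawler = new CheerioCrawler({ requestHandler: router });

// The label on each start request decides which handler runs.
await crawler.run([
    { url: 'https://www.amazon.com/s?k=example', label: 'AMAZON' },
    { url: 'https://www.ebay.com/sch/i.html?_nkw=example', label: 'EBAY' },
]);

A new site then becomes one more addHandler call plus one more labelled start URL, rather than a second router.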
When using Crawlee in a Node.js project (npm i crawlee), I keep getting this error with my code (it's a Cheerio crawler, by the way):
TypeError: Dataset is not a constructor

from this section of my scraper code:
const { CheerioCrawler } = require('crawlee');
const Dataset = require('crawlee').dataset;

I changed it from
import { CheerioCrawler, Dataset } from 'crawlee';
to
const { CheerioCrawler } = require('crawlee');

and tried putting the Dataset in its own separate require statement too, but I'm still getting this error.

This is not in a my-crawler folder and wasn't made with "npx crawlee create my-crawler"; it's a Node.js project that I added the crawlee package to with "npm i crawlee".
Is there something I need to change in package.json, or what's the problem?
2 comments
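The export is named Dataset (capital D) and it isn't meant to be constructed with new; a sketch of the CommonJS form that should work, assuming Crawlee v3:

JavaScript
// CommonJS; no "type": "module" needed in package.json for this form.
const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Dataset.pushData is a static method, so no `new Dataset()` is needed.
        await Dataset.pushData({ url: request.url, title: $('title').text() });
    },
});

crawler.run(['https://example.com']).then(() => console.log('done'));

The original import { CheerioCrawler, Dataset } from 'crawlee' form also works if package.json contains "type": "module".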
I don't know whether it works or not, but would it be possible to use other npm or built-in libraries like fs or readline inside a Crawlee crawler?
1 comment
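Yes; a Crawlee crawler is ordinary Node.js, so built-in and npm modules can be imported as usual. A tiny sketch that appends each scraped title to a local file (the file name is arbitrary):

TypeScript
import { appendFile } from 'node:fs/promises';
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        const line = JSON.stringify({ url: request.url, title: $('title').text() });
        await appendFile('titles.jsonl', line + '\n'); // plain fs call inside the handler
    },
});

await crawler.run(['https://example.com']);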
How should I structure my crawler in Crawlee when scraping possibly hundreds of different sites with different structures, while handling multiple requests at once?
3 comments
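One common structure, sketched with entirely hypothetical hostnames and selectors: keep a per-site config keyed by hostname and let a single handler look the config up, so adding a site means adding one config entry rather than a new crawler.

TypeScript
import { CheerioCrawler, Dataset } from 'crawlee';

// Hypothetical per-site extraction rules.
const siteConfigs: Record<string, { title: string; price: string }> = {
    'www.example-shop.com': { title: 'h1.product-name', price: 'span.price' },
    'www.another-store.com': { title: 'h1#title', price: 'div.cost' },
};

const crawler = new CheerioCrawler({
    maxConcurrency: 20,
    requestHandler: async ({ $, request, log }) => {
        const config = siteConfigs[new URL(request.url).hostname];
        if (!config) {
            log.warning(`No config for ${request.url}, skipping`);
            return;
        }
        await Dataset.pushData({
            url: request.url,
            title: $(config.title).text().trim(),
            price: $(config.price).text().trim(),
        });
    },
});

await crawler.run(Object.keys(siteConfigs).map((host) => `https://${host}/`));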
When I'm scraping product data from product URLs, I sometimes want to check whether a tag is available and use a different tag if it isn't. When a tag simply isn't found, I don't want the crawler to throw a full error over that one missing element and then skip scraping and saving the rest of the data.
How do I avoid this "skipping" by overriding or changing the crawler's default behaviour?

I have even tried try/catch statements and if/else statements, and nothing works.
3 comments
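With Cheerio the lookup itself never throws, so a sketch of the fallback pattern is just checking .length before reading the text; the label and selectors are placeholders.

TypeScript
import { createCheerioRouter, Dataset } from 'crawlee';

export const router = createCheerioRouter();

router.addHandler('PRODUCT', async ({ $, request }) => {
    // Prefer the sale-price tag; fall back to the regular price if it is absent.
    const saleEl = $('span.sale-price');
    const price = saleEl.length
        ? saleEl.first().text().trim()
        : $('span.regular-price').first().text().trim() || null;

    // A missing element becomes null instead of an exception, so the rest of the item still saves.
    await Dataset.pushData({ url: request.url, price });
});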
How do I delete a request queue once the crawl is finished, and how can I tell that the crawl is finished so I know when to delete the queue?
1 comment
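crawler.run() resolves only after the queue has been fully processed, so a sketch of the whole lifecycle looks like this (the queue name and start URL are placeholders):

TypeScript
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('my-run'); // named queue for this crawl
const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, log }) => log.info(request.url),
});

await crawler.run(['https://example.com']); // resolves once the queue is drained
await requestQueue.drop();                  // removes the queue and its storage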
I've tried saving the data I scrape from my actors to a rawData.json file, but I don't get a JSON output even though the scraping works.

How would I save the data in the Apify Console so that I can then take it into MongoDB and put it in my database?

I already have my MongoDB schema set up, so how do I save the data on the platform and access it?

Would I have to save it to an Apify dataset, and if so, how? And how would I also run it through a cleaning process in the same actor (or, if possible, a different actor) and THEN save it to a MongoDB database?

Here's what I have for saving the JSON file so far:
6 comments
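For the cleaning step specifically, a hedged sketch of the two-actor variant: the scraper pushes raw items to its default dataset, and a second actor receives that dataset's ID as input, cleans the items, and writes them to MongoDB. The input shape, the cleaning rule, and the environment variable are assumptions.

TypeScript
import { Actor } from 'apify';
import mongoose from 'mongoose';

await Actor.init();

// Hypothetical input: { "datasetId": "<id of the scraper run's dataset>" }
const { datasetId } = (await Actor.getInput<{ datasetId: string }>())!;
const rawDataset = await Actor.openDataset(datasetId);
const { items } = await rawDataset.getData();

// Example cleaning rule: drop items without a price and normalise the title.
const cleaned = items
    .filter((item: any) => item.price)
    .map((item: any) => ({ ...item, title: String(item.title ?? '').trim() }));

await mongoose.connect(process.env.MONGODB_URI!); // assumed to be set as an actor secret
const Product = mongoose.model('Product', new mongoose.Schema({}, { strict: false }));
await Product.insertMany(cleaned);
await mongoose.disconnect();

await Actor.exit();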
I've tried saving the data I scrape from my actors to a rawData.json file, but I don't get a JSON output even though the scraping works.
How would I save the data in the Apify Console so that I can then take it into MongoDB and put it in my database? I already have my MongoDB schema set up, so how do I save the data on the platform and access it?
Here's what I have for saving the JSON file so far:
5 comments
How do you delete actors that you accidentally made and don't want?
1 comment
I'm scraping entire sites and running multiple crawlers at once, one per site; I'm looking to scrape 50+ sites, and I run the site scrapes from a start file whose event emitter emits an event per site to run each crawler, specifically the
Plain Text
await crawler.run(startUrls)
line.
Should I run them all at once in one terminal, or run each one in a separate terminal with its own script per scraper?
Also, is this a maintainable approach to running multiple crawler instances at once?
One final problem I'm running into: when I run the start file with multiple crawlers, I get the request queue error below.
When I run it again it sometimes works, but the error pops up inconsistently.
request queue error:
Plain Text
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
[Error: ENOENT: no such file or directory, open 'C:\Users\haris\OneDrive\Documents\GitHub\periodicScraper01\pscrape\storage\request_queues\default\1Rk4szfVGlTLik4.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\Users\\haris\\OneDrive\\Documents\\GitHub\\periodicScraper01\\pscrape\\storage\\request_queues\\default\\1Rk4szfVGlTLik4.json'
}
21 comments
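The ENOENT usually means several crawler instances are sharing (and purging) the same default request queue on disk; a sketch of one fix is to give each crawler its own named queue, which also works fine from a single terminal and process. Site names and URLs below are placeholders.

TypeScript
import { CheerioCrawler, RequestQueue } from 'crawlee';

async function runSite(siteName: string, startUrls: string[]) {
    // A named queue per site stops the crawlers fighting over storage/request_queues/default.
    const requestQueue = await RequestQueue.open(siteName);
    const crawler = new CheerioCrawler({
        requestQueue,
        requestHandler: async ({ request, log }) => log.info(`[${siteName}] ${request.url}`),
    });
    await crawler.run(startUrls);
}

// All sites in one process; separate terminals only help once a single process runs out of memory.
await Promise.all([
    runSite('siteA', ['https://site-a.example.com']),
    runSite('siteB', ['https://site-b.example.com']),
]);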
In a Playwright crawler, how could I keep clicking a button that renders in more data, waiting a second between clicks, until the button is no longer present, and then end the loop?

And while the clicking is going on, the rendered product URLs are being scraped.
1 comment
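A small sketch of a click-until-gone loop inside a Playwright router handler; the button selector and label are placeholders, and the URLs are collected once at the end rather than during each click.

TypeScript
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('LISTING', async ({ page, crawler, log }) => {
    const loadMore = page.locator('button.load-more'); // hypothetical selector

    // Keep clicking while the button is still attached and visible.
    while (await loadMore.count() > 0 && await loadMore.first().isVisible()) {
        await loadMore.first().click();
        await page.waitForTimeout(1000); // give the new products a second to render
    }

    // All products are rendered now, so collect their links in one pass.
    const urls = await page.$$eval('a[href*="/products"]', (els) =>
        els.map((el) => (el as HTMLAnchorElement).href),
    );
    log.info(`Collected ${urls.length} product URLs`);
    await crawler.addRequests([...new Set(urls)].map((url) => ({ url, label: 'PRODUCT' })));
});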
When I'm scraping product data from product URLs, I sometimes want to check whether a tag is available and use a different tag if it isn't. When a tag simply isn't found, I don't want the crawler to throw a full error over that one missing element and then skip scraping and saving the rest of the data.
How do I avoid this "skipping" by overriding or changing the crawler's default behaviour?

I have even tried try/catch and if/else statements to handle a product not being found, and nothing works.
Plain Text
let salePrice = await page.$eval('span.price-value', (el) => el.textContent?.trim() || '');
let newTag = await page.$eval('span.price-ns', (el) => el.textContent?.trim() || '');
let originalPrice = salePrice;

if (newTag) {
    originalPrice = newTag;
} else {
    return;
}
3 comments
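The reason try/catch does not seem to help here is that page.$eval rejects as soon as 'span.price-ns' is missing. A hedged reworking of the snippet above (assuming the same Playwright handler, where `page` is already available) uses page.$ and only evaluates when the handle exists, so nothing throws:

TypeScript
// Same price logic as above, without $eval throwing on a missing element.
const salePrice = await page
    .$eval('span.price-value', (el) => el.textContent?.trim() || '')
    .catch(() => '');

const newTagHandle = await page.$('span.price-ns'); // null if not found, never throws
const newTag = newTagHandle
    ? (await newTagHandle.textContent())?.trim() ?? ''
    : '';

// Fall back to the sale price instead of returning early and losing the whole item.
const originalPrice = newTag || salePrice;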
I have a crawler that goes through the collection pages of stores, scrapes their product links, and then goes through those product page links to get the product data.

When getting the product links on the collection pages, many sites use infinite scrolling to render in all the products.

How do I implement infinite scrolling in the specific crawler route handler below, so that while scraping the product page URLs it renders in all the products and I can be sure I scraped every product on the page?

Plain Text
kotnRouter.addHandler('KOTN_DETAIL', async ({ page, log }) => {
    log.info('Scraping product URLs');
    
    await page.goto(page.url(), { waitUntil: 'domcontentloaded' })

  
    const productUrls: string[] = [];
  
    const links = await page.$$eval('a', (elements) =>
      elements.map((el) => el.getAttribute('href'))
    );
  
    for (const link of links) {
      if (link && !link.startsWith('https://')) {
        const productUrl = 'https://www.kotn.com' + link;
        if (productUrl.includes('/products')) {
          productUrls.push(productUrl);
        }
      }
    }
  
    // Push unique URLs to the dataset
    const uniqueProductUrls = Array.from(new Set(productUrls));
    console.log(uniqueProductUrls);
    await Dataset.pushData({
      urls: uniqueProductUrls,
    });
  
    await Promise.all(
      uniqueProductUrls.map((link) => kotnCrawler.addRequests([{ url: link, label: 'KOTN_PRODUCT' }]))
    );
  
    linksCount += uniqueProductUrls.length;
  
    console.log(uniqueProductUrls);
    console.log(`Total product links scraped so far: ${linksCount}`);
  
});
7 comments
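A sketch under the same assumptions as the handler above: scroll before collecting, and let enqueueLinks do the URL filtering and labelling. The page.goto call is not needed because the crawler has already navigated before the handler runs.

TypeScript
import { playwrightUtils } from 'crawlee';

kotnRouter.addHandler('KOTN_DETAIL', async ({ page, enqueueLinks, log }) => {
    log.info('Scraping product URLs');

    // Render everything in first; scrolling stops once the page height stops growing.
    await playwrightUtils.infiniteScroll(page, { timeoutSecs: 60 });

    // Enqueue every /products link straight into this crawler's queue under the product label.
    const { processedRequests } = await enqueueLinks({
        globs: ['https://www.kotn.com/products/**'],
        label: 'KOTN_PRODUCT',
    });
    log.info(`Enqueued ${processedRequests.length} product links`);
});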
In my Puppeteer crawler I'm searching for some elements that might not always be there; when one isn't, the crawler is awaiting that element, so it throws an error that can often crash the crawler.
I've tried wrapping the await-element statements in try/catch blocks to handle the error and return, but it still errors, because when it awaits the element it needs to find that element before it can move on.
I want it to be able to skip over unfound elements, scrape the OTHER elements on the page, and move on.
A small snippet of the code:
6 comments
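The Puppeteer equivalent of a non-throwing lookup, sketched with a placeholder selector: page.$ resolves to null for a missing element instead of rejecting, so there is nothing to crash on.

TypeScript
// Inside a PuppeteerCrawler requestHandler, where `page` is the Puppeteer page.
const descriptionHandle = await page.$('div.description'); // null when absent, never throws
const description = descriptionHandle
    ? await descriptionHandle.evaluate((el) => el.textContent?.trim() ?? '')
    : null;
// `description` is simply null for pages without that element; the other fields still get scraped.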
I already have a web scraper for Amazon, outputting to a rawData.json file, that can successfully scrape product links and then go through each of those links to get the data I need.

But I want to scale up to many, many scrapers, and I'm having trouble running multiple scrapers at once.

I essentially made a new router to handle the other site, and I want to make sure that only a URL with a given label runs the router handler with that same label, but it won't let me define both routers like
Plain Text
 requestHandler: [router, router2]




It didn't work, so I had to combine both routers in an awkward way to get it running. There were no errors, but I keep getting no scraped data from the second site (eBay), and it sometimes outputs objects that have the eBay site name instead of Amazon but still contain an Amazon link with an Amazon product in it.

I want to be able to run both scrapes at the same time, get rid of the combinedRouter, and define them as separate routes. I also want to make the scrapes faster and make it easy to add routes and scale up, since I'll keep adding new scrapers daily.

Here is my code:
10 comments
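requestHandler only accepts a single function, so a sketch of one way to keep the routers separate is to run one crawler per site, each with its own router and named queue, and start them together. The module paths, URLs, and router contents are hypothetical.

TypeScript
import { CheerioCrawler, RequestQueue } from 'crawlee';
import { amazonRouter } from './routes/amazon.js'; // hypothetical module paths
import { ebayRouter } from './routes/ebay.js';

const amazonCrawler = new CheerioCrawler({
    requestQueue: await RequestQueue.open('amazon'),
    requestHandler: amazonRouter,
});

const ebayCrawler = new CheerioCrawler({
    requestQueue: await RequestQueue.open('ebay'),
    requestHandler: ebayRouter,
});

// Both crawls run concurrently; each router only ever sees its own site's requests,
// so eBay items can no longer end up carrying Amazon links.
await Promise.all([
    amazonCrawler.run([{ url: 'https://www.amazon.com/s?k=example', label: 'AMAZON' }]),
    ebayCrawler.run([{ url: 'https://www.ebay.com/sch/i.html?_nkw=example', label: 'EBAY' }]),
]);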
I already have a web scraper for Amazon, outputting to a rawData.json file, that can successfully scrape product links and then go through each of those links to get the data I need.

But I want to scale up to many, many scrapers, and I'm having trouble running multiple scrapers at once.

I essentially made a new router to handle the other site, and I want to make sure that only a URL with a given label runs the router handler with that same label, but it won't let me define both routers like
Plain Text
 requestHandler: [router, router2]


It didn't work, so I had to combine both routers in an awkward way to get it running. There were no errors, but I keep getting no scraped data from the second site (eBay), and it sometimes outputs objects that have the eBay site name instead of Amazon but still contain an Amazon link with an Amazon product in it.

I want to be able to run both scrapes at the same time, get rid of the combinedRouter, and define them as separate routes. I also want to make the scrapes faster and make it easy to add routes and scale up, since I'll keep adding new scrapers daily.

Here is my code:
6 comments
I have a Cheerio crawler that can crawl an Amazon results page for product links, and it does so successfully.

But then I want to add those links to a RequestQueue/RequestList (by enqueueing each request from the RequestList into the RequestQueue), access it in a different route, and crawl that list of product links with the Cheerio crawler for the data I need. How can I do so?

This is what my code looks like:
7 comments
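A sketch of the usual two-route pattern, which skips RequestList entirely: the results handler adds the scraped product URLs back into the same crawler's queue under a different label, and a second handler scrapes them. The selectors and start URL are placeholders.

TypeScript
import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

const router = createCheerioRouter();

// Route 1: results page -> enqueue product links for route 2.
router.addHandler('RESULTS', async ({ $, crawler, log }) => {
    const links = $('a[href*="/dp/"]')
        .map((_, el) => $(el).attr('href'))
        .get()
        .filter((href): href is string => Boolean(href))
        .map((href) => new URL(href, 'https://www.amazon.com').href);
    log.info(`Enqueueing ${links.length} product links`);
    await crawler.addRequests([...new Set(links)].map((url) => ({ url, label: 'PRODUCT' })));
});

// Route 2: product page -> scrape and save.
router.addHandler('PRODUCT', async ({ $, request }) => {
    await Dataset.pushData({ url: request.url, title: $('#productTitle').text().trim() });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://www.amazon.com/s?k=example', label: 'RESULTS' }]);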
I have code that scrapes product URLs from an Amazon results page.
I can successfully scrape the product URLs, but I'm unable to take each link and scrape the needed info in another crawler.
Do I need another Cheerio router?
Also, how can I take each link once it's scraped, add it to a RequestList and RequestQueue, and then take the URLs in that request queue and scrape their information?
7 comments