Apify and Crawlee Official Forum

harish
Offline, last seen last month
Joined August 30, 2024
In my code I want to send a JSON response to a file and save it.
How do I install the dependencies I need (mongoose, next, etc.) without any errors, send the JSON using the Fetch API or whatever method works with Apify, and save the data to a MongoDB database?
Here's the code:
6 comments
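A minimal sketch of one way to get scraped JSON from an Apify actor into MongoDB with mongoose (assuming a MONGODB_URI environment variable and a hypothetical Product model; the Next.js/fetch side is omitted):
Plain Text
import { Actor } from 'apify';
import mongoose from 'mongoose';

// Hypothetical schema; replace with your own.
const Product = mongoose.model('Product', new mongoose.Schema({
    title: String,
    price: String,
    url: String,
}));

await Actor.init();

// Assumes MONGODB_URI is set as an actor environment variable or secret.
await mongoose.connect(process.env.MONGODB_URI!);

// Wherever you scrape an item, push it to the default dataset...
await Actor.pushData({ title: 'Example', price: '$10', url: 'https://example.com/p/1' });

// ...and at the end of the run, read the dataset back and write it to MongoDB.
const dataset = await Actor.openDataset();
const { items } = await dataset.getData();
await Product.insertMany(items);

await mongoose.disconnect();
await Actor.exit();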
I've tried saving the data I scrape from my actors to a rawdata.json file; however, I don't get a JSON output even though the scraping works.

How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it in my database? I have my MongoDB schema already set up, so how would I save the data to the Apify console and access it?

Would I have to save it to the Apify dataset, and if so, how? And how would I also put it through a cleaning process in the same actor or, if possible, a different actor and THEN save it to a MongoDB database?

Would I have to install fs somehow in the Apify console to make this work?

Here's what I have for saving the JSON file so far:
3 comments
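On the fs question above: inside an actor you don't need fs at all. A minimal sketch: pushing to the default dataset makes the records show up in the Apify console (under the run's Storage tab), and a single combined JSON blob can go into the key-value store instead of a local file (the item and record name here are placeholders):
Plain Text
import { Actor } from 'apify';

await Actor.init();

// Hypothetical scraped record; in practice this comes from your request handler.
const item = { title: 'Example product', price: '$10' };

// Each pushed object becomes one dataset record, visible in the Apify console.
await Actor.pushData(item);

// Alternatively, store one combined JSON file in the key-value store
// (shows up as a downloadable record named RAW_DATA).
await Actor.setValue('RAW_DATA', [item]);

await Actor.exit();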
When I am scraping product data from product URLs, and I'm either checking whether a tag is available (and if not, using a different tag) or a tag simply isn't found, I don't want the crawler to throw a full error for not finding that one element and then fail to scrape and save the rest of the data.
How do I avoid this "skipping" by overriding or changing the crawler's default behavior?

I have even tried try/catch statements and if/else statements, and nothing works.
3 comments
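A minimal sketch of a fallback-selector helper inside a Cheerio handler, so a missing tag yields null instead of an error that loses the rest of the record (the selectors and URL are placeholders):
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, pushData }) {
        // Return the text of the first selector that matches, or null instead of throwing.
        const pickText = (selectors: string[]): string | null => {
            for (const selector of selectors) {
                const el = $(selector).first();
                if (el.length) return el.text().trim();
            }
            return null;
        };

        await pushData({
            url: request.url,
            // Hypothetical selectors: try the preferred tag first, then fall back.
            title: pickText(['h1.product-title', 'h1']),
            price: pickText(['.price--sale', '.price']),
        });
    },
});

await crawler.run(['https://example.com/products/1']);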
How do I delete a request queue once the crawling is finished, and how can I tell when the crawling is finished so I can delete the request queue?
1 comment
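A minimal sketch: crawler.run() only resolves after the queue has been fully processed, so that is the "finished" signal and the queue can be dropped right after awaiting it:
Plain Text
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open(); // the default queue

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

// run() resolves once all requests have been handled,
// so everything after this line happens when crawling is finished.
await crawler.run(['https://example.com']);

// Delete the queue and its stored files.
await requestQueue.drop();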
I've tried saving the data I scrape from my actors to a rawdata.json file; however, I don't get a JSON output even though the scraping works.

How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it in my database? I have my MongoDB schema already set up, so how would I save the data to the Apify console and access it?

Would I have to save it to the Apify dataset, and if so, how? And how would I also put it through a cleaning process in the same actor or, if possible, a different actor and THEN save it to a MongoDB database?

Here's what I have for saving the JSON file so far:
6 comments
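For the cleaning question above, a minimal sketch of doing it inside the same actor run: read the raw dataset back once crawling is done, clean each item, and only insert the cleaned records into MongoDB (Product and the cleaning rules are hypothetical):
Plain Text
import { Actor } from 'apify';
import mongoose from 'mongoose';

const Product = mongoose.model('Product', new mongoose.Schema({
    title: String,
    price: Number,
}));

await Actor.init();
await mongoose.connect(process.env.MONGODB_URI!);

// Raw items pushed earlier in the run with Actor.pushData().
const { items } = await (await Actor.openDataset()).getData();

// Hypothetical cleaning step: trim strings, parse prices, drop incomplete rows.
const cleaned = items
    .map((item: Record<string, unknown>) => ({
        title: String(item.title ?? '').trim(),
        price: Number.parseFloat(String(item.price ?? '').replace(/[^0-9.]/g, '')),
    }))
    .filter((item) => item.title && Number.isFinite(item.price));

await Product.insertMany(cleaned);
await mongoose.disconnect();
await Actor.exit();

A separate cleaning actor would look much the same, except it would open the scraping run's dataset by the ID passed in its input instead of its own default dataset.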
I've tried saving the data I scrape from my actors to a rawdata.json file; however, I don't get a JSON output even though the scraping works.
How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it in my database? I have my MongoDB schema already set up, so how would I save the data to the Apify console and access it?
Here's what I have for saving the JSON file so far:
1 comment
I've tried saving the data I scrape from my actors to a rawdata.json file; however, I don't get a JSON output even though the scraping works.
How would I save the data to the Apify console so that I can then use MongoDB to take that data and put it in my database? I have my MongoDB schema already set up, so how would I save the data to the Apify console and access it?
Here's what I have for saving the JSON file so far:
5 comments
I have multiple crawlers - primarily Playwright, one per site - that work completely fine on their own when I use only one crawler per site.
I have tried running these crawlers concurrently through a scrape event emitted from the server, which emits individual scrape events for each site to run each crawler.
I face a lot of memory overloads, timed-out navigations, skipping of many products, and early ending of the crawlers.
Each crawler essentially takes base URLs, or scrapes these base URLs to get product URLs, which are then individually scraped to get the product page info.
3 comments
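A sketch of keeping concurrent Playwright crawlers inside a fixed budget: cap each crawler's concurrency and navigation timeout, and tell Crawlee how much memory it may assume via the memoryMbytes option (all numbers and URLs here are placeholders to tune):
Plain Text
import { Configuration, PlaywrightCrawler } from 'crawlee';

// Give the autoscaled pool an explicit memory ceiling
// (the CRAWLEE_MEMORY_MBYTES environment variable does the same thing).
Configuration.getGlobalConfig().set('memoryMbytes', 4096);

const buildCrawler = () => new PlaywrightCrawler({
    maxConcurrency: 5,          // per-crawler cap so N crawlers don't overload the machine
    navigationTimeoutSecs: 60,  // extra headroom for slow product pages
    maxRequestRetries: 3,
    async requestHandler({ request, log }) {
        log.info(`Scraping ${request.url}`);
    },
});

// Running the sites sequentially (or in small batches) is often more stable
// than firing every crawler from one event at the same time.
const sites = [['https://example-a.com'], ['https://example-b.com']];
for (const startUrls of sites) {
    await buildCrawler().run(startUrls);
}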
How do you delete actors that you don't want, that you accidentally made?
1 comment
I've been experiencing the same error pattern with my scrapers across completely different sites; I thought it was a problem with an individual site, but now it's a repeating pattern.
My scraper scrapes results page URLs, then product URLs, which it has no problem with, but when it goes through those product URLs with a Playwright crawler, it always scrapes around 30-40 URLs successfully, then suddenly hits some crash error and randomly rescrapes a couple of old product URLs before crashing.
11 comments
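A sketch of the kind of instrumentation that usually narrows this down: log the underlying error in errorHandler/failedRequestHandler and cap retries, so the "rescraping old URLs" (normally just Crawlee retrying requests that failed) becomes visible in the logs (the URL is a placeholder):
Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 2,             // retried requests are what look like re-scraped old URLs
    requestHandlerTimeoutSecs: 120,
    async requestHandler({ request, log }) {
        log.info(`Scraping ${request.url}`);
        // ... product extraction here ...
    },
    // Called on every failed attempt, before a retry is scheduled.
    errorHandler({ request, log }, error) {
        log.warning(`Attempt failed for ${request.url}: ${error.message}`);
    },
    // Called once a request has exhausted all retries.
    failedRequestHandler({ request, log }, error) {
        log.error(`Giving up on ${request.url}: ${error.message}`);
    },
});

await crawler.run(['https://example.com/product/1']);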
I'm trying to get infinite scrolling to render in all the products while scraping them as the page is being scrolled down.
I looked at the documentation but didn't understand how to do this:
Plain Text
kotnRouter.addHandler('KOTN_DETAIL', async ({ log, page, parseWithCheerio }) => {
    log.info(`Scraping product URLs`);

    const $ = await parseWithCheerio();

    const productUrls: string[] = [];

    $('a').each((_, el) => {
        let productUrl = $(el).attr('href');
        if (productUrl) {
            if (!productUrl.startsWith('https://')) {
                productUrl = 'https://www.kotn.com' + productUrl;
                if (productUrl.includes('/products')) {
                    productUrls.push(productUrl);
                }
            }
        }
    });

    // Push unique URLs to the dataset
    const uniqueProductUrls = Array.from(new Set(productUrls));

    await Dataset.pushData({
        urls: uniqueProductUrls,
    });

    await Promise.all(uniqueProductUrls.map(link => kotnPw.addRequests([{ url: link, label: 'KOTN_PRODUCT' }])));

    linksCount += uniqueProductUrls.length;

    await infiniteScroll(page, {
        maxScrollHeight: 0,
    });

    console.log(uniqueProductUrls);
    console.log(`Total product links scraped so far: ${linksCount}`);
    // Run bronPuppet crawler once after pushing the first product requests
    if (linksCount === uniqueProductUrls.length) {
        await kotnPw.run();
    }
});
6 comments
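A minimal sketch of the usual ordering for this: scroll first, so the lazily loaded product cards are actually in the DOM, and only then snapshot the page with parseWithCheerio() and collect the links. The scroll options and the enqueueLinks approach (instead of calling a second crawler's run() from inside a handler) are assumptions to adapt:
Plain Text
import { createPlaywrightRouter, playwrightUtils, Dataset } from 'crawlee';

const kotnRouter = createPlaywrightRouter();

kotnRouter.addHandler('KOTN_DETAIL', async ({ page, parseWithCheerio, enqueueLinks, log }) => {
    // 1. Scroll until no new content appears (or the timeout hits),
    //    so infinite-scroll products are rendered before parsing.
    await playwrightUtils.infiniteScroll(page, { timeoutSecs: 60, waitForSecs: 2 });

    // 2. Only now parse the fully rendered page.
    const $ = await parseWithCheerio();
    const productUrls = new Set<string>();
    $('a[href*="/products"]').each((_, el) => {
        const href = $(el).attr('href');
        if (href) productUrls.add(new URL(href, 'https://www.kotn.com').href);
    });

    log.info(`Found ${productUrls.size} product URLs`);
    await Dataset.pushData({ urls: [...productUrls] });

    // 3. Enqueue the product pages into the same crawler's queue.
    await enqueueLinks({ urls: [...productUrls], label: 'KOTN_PRODUCT' });
});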
I'm scraping entire sites and running multiple crawlers at once for each site - looking to scrape 50+ sites - and I'm running multiple site scrapes at once from a start file that emits an event for each site to run, specifically the
Plain Text
await crawler.run(startUrls)
line.
Should I run them all at once in one terminal, or run each one in a separate terminal with a different script per scraper?
Also, is this a maintainable approach to running multiple crawler instances at once?
One final problem I am running into is that when I run the start file with multiple crawlers, I get this request queue error.
When I run it again, it sometimes works, but it's inconsistent in how this error pops up.
request queue error:
Plain Text
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
[Error: ENOENT: no such file or directory, open 'C:\Users\haris\OneDrive\Documents\GitHub\periodicScraper01\pscrape\storage\request_queues\default\1Rk4szfVGlTLik4.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\Users\\haris\\OneDrive\\Documents\\GitHub\\periodicScraper01\\pscrape\\storage\\request_queues\\default\\1Rk4szfVGlTLik4.json'
}
21 comments
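That ENOENT points at every crawler sharing the default on-disk request queue. A minimal sketch of giving each crawler its own named queue so concurrent runs never touch storage/request_queues/default at the same time (queue names and URLs are placeholders):
Plain Text
import { CheerioCrawler, RequestQueue } from 'crawlee';

// One named queue per site.
const buildCrawler = async (site: string) => {
    const requestQueue = await RequestQueue.open(`queue-${site}`);
    return new CheerioCrawler({
        requestQueue,
        async requestHandler({ request, log }) {
            log.info(`[${site}] ${request.url}`);
        },
    });
};

const amazon = await buildCrawler('amazon');
const ebay = await buildCrawler('ebay');

await Promise.all([
    amazon.run(['https://www.amazon.com']),
    ebay.run(['https://www.ebay.com']),
]);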
Hey, I was curious: when I'm scraping Amazon, what's a reasonable time frame for the scraping duration, considering I scrape each product link from the results page, then scrape each individual product page for its information, and also paginate through each results page until there are no more pages left?
I did previously just scrape product info straight off the product cards on the results page, but it would sometimes give dummy links that led to an unrelated Amazon page, and the product info would be more inaccurate.
How can I increase the speed of my scrapes, especially considering I want to add more and more scrapers in the future that I want to all run concurrently to save time? I'm aiming for quite a low scrape time of 10-15 seconds or lower, and it's currently taking upwards of 1 minute.
This is a Cheerio crawler.
3 comments
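A sketch of the knobs that matter most for raw throughput on a CheerioCrawler; the numbers and selector are placeholders, and the real ceiling will come from the proxy pool and how hard the target rate-limits you:
Plain Text
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Let the autoscaled pool run many cheap HTTP requests in parallel.
    minConcurrency: 10,
    maxConcurrency: 50,
    // Fail fast instead of letting slow pages hold worker slots.
    requestHandlerTimeoutSecs: 30,
    maxRequestRetries: 2,
    async requestHandler({ $, request, pushData }) {
        await pushData({
            url: request.url,
            // Hypothetical selector; detail-page extraction goes here.
            title: $('#productTitle').text().trim(),
        });
    },
});

await crawler.run(['https://www.amazon.com/dp/EXAMPLE']);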
I have my crawler crawling a couple of sites and scraping them, but I get this import problem when importing the router (which is the same for both sites but uses a different route per site) from both of the sites.
If I only import it from one site, it only runs one site. How do I import it so it runs multiple sites, and so it can scale up to many more sites in the near future?

It can successfully scrape Amazon and eBay (the eBay tags are kind of inaccurate), but only if I use the router from eBay or Amazon and remove the other URL from startUrls; otherwise it gives an error for not having the AMAZON label or EBAY label anywhere.
12 comments
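A minimal sketch of one shared router with a handler per site label, so a single crawler can take both start URLs; each start URL carries its own label, and adding a site later is just another addHandler plus another start URL (the selectors and URLs are placeholders):
Plain Text
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('AMAZON', async ({ $, request, pushData }) => {
    await pushData({ site: 'amazon', url: request.url, title: $('#productTitle').text().trim() });
});

router.addHandler('EBAY', async ({ $, request, pushData }) => {
    await pushData({ site: 'ebay', url: request.url, title: $('h1.x-item-title__mainTitle').text().trim() });
});

const crawler = new CheerioCrawler({ requestHandler: router });

// The label travels with each request, so one crawler covers both sites.
await crawler.run([
    { url: 'https://www.amazon.com/dp/EXAMPLE', label: 'AMAZON' },
    { url: 'https://www.ebay.com/itm/1234567890', label: 'EBAY' },
]);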
When using Crawlee in a Node.js project (npm i crawlee), I keep getting this error with my code (Cheerio crawler, btw):
TypeError: Dataset is not a constructor

from this section in my scraper code:
const { CheerioCrawler } = require('crawlee');
const Dataset = require('crawlee').dataset;

I changed it from
import { CheerioCrawler, Dataset } from 'crawlee';
to
const { CheerioCrawler } = require('crawlee');

and also tried moving Dataset into its own require statement, but I'm still getting this error.

This is not in a my-crawler folder and was not made with "npx crawlee create my-crawler"; it's a Node.js project that I added the crawlee package to with "npm i crawlee".
Is there something I need to change in the package.json, or what's the problem?
2 comments
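For reference, a sketch of a working CommonJS version: the export is the capitalized Dataset, and it is used through static helpers like Dataset.pushData() rather than with new (the URL is a placeholder):
Plain Text
// CommonJS (a plain Node.js project without "type": "module" in package.json):
const { CheerioCrawler, Dataset } = require('crawlee');

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // No `new Dataset(...)` - push through the static helper.
        await Dataset.pushData({ url: request.url, title: $('title').text() });
    },
});

crawler.run(['https://example.com']).then(() => console.log('Crawl finished'));

// ESM alternative: add "type": "module" to package.json and use
//   import { CheerioCrawler, Dataset } from 'crawlee';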
Don't know if it works or not, but would it work to add other libraries from npm, or Node modules like fs or readline, in a Crawlee crawler?
1 comment
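A Crawlee crawler is ordinary Node.js, so Node built-ins and npm packages can be imported as usual. A tiny sketch appending each scraped page title to a local file with fs/promises (file name and URL are placeholders):
Plain Text
import { appendFile } from 'node:fs/promises';
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Any Node API works inside a handler; here we append to a local file.
        await appendFile('titles.txt', `${request.url}\t${$('title').text()}\n`);
    },
});

await crawler.run(['https://example.com']);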
How should I structure my crawler when scraping possibly hundreds of different sites with different structures, handling multiple requests at once in Crawlee?
3 comments
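One common structure (a sketch, not the only way): keep a per-site config map of start URLs and selectors and drive a single generic crawler from it, so adding a site is data rather than new code. The site names, URLs, and selectors below are made up:
Plain Text
import { CheerioCrawler } from 'crawlee';

// Hypothetical per-site config; in a real project this could live in JSON files.
interface SiteConfig {
    startUrls: string[];
    titleSelector: string;
}

const sites: Record<string, SiteConfig> = {
    'example-a': { startUrls: ['https://example-a.com/shop'], titleSelector: 'h1.product' },
    'example-b': { startUrls: ['https://example-b.com/catalog'], titleSelector: '.item-name' },
};

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, pushData }) {
        // The site key rides along on request.userData, so one handler serves every site.
        const site = request.userData.site as string;
        const config = sites[site];
        await pushData({ site, url: request.url, title: $(config.titleSelector).text().trim() });
    },
});

await crawler.run(
    Object.entries(sites).flatMap(([site, config]) =>
        config.startUrls.map((url) => ({ url, userData: { site } })),
    ),
);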