Apify and Crawlee Official Forum

LARGO
Joined August 30, 2024
For the past few days, I have been running the crawler with a high number of jobs, and I have run into a problem.

I have found that not all jobs are processed by the CheerioCrawler, despite these jobs being added to the queue through addRequests([job]).

I can't reliably reproduce it; it happens after roughly 5,000 to 6,000 jobs.

My code doesn't crash, it simply continues to the next jobs in the BullMQ job queue without scraping the link.

From the logs this looks like normal behavior, since the job reaches the requestHandler (CheerioCrawler INFO logger).
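
One thing worth checking here (an illustrative sketch, not a confirmed diagnosis of this issue): Crawlee's RequestQueue deduplicates requests by their uniqueKey, which by default is derived from the URL, so re-adding a URL that is already enqueued or was already handled is silently skipped and never reaches the requestHandler again. The operation info returned by addRequest() should show whether that is happening. The file name and the enqueue() helper below are made up for illustration.

TypeScript
// dedup-check.ts (illustrative sketch only)
import { RequestQueue } from 'crawlee';

// Open the default request queue (the same one the crawler uses
// unless it is given a different `requestQueue` option).
const requestQueue = await RequestQueue.open();

export async function enqueue(url: string) {
    const info = await requestQueue.addRequest({ url, label: 'PRODUCT' });
    if (info.wasAlreadyPresent || info.wasAlreadyHandled) {
        // The request was deduplicated and will NOT be processed again.
        console.warn(`Skipped duplicate request: ${url}`, info);
    }
    return info;
}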
7 comments
My scraper uses BullMQ, which retrieves jobs (URLs) from the job queue and runs them with CheerioCrawler.

Is there any way to initialize the crawler once and keep using it? I assume this would also consume fewer resources and increase performance.

If there are any best practices that I have not implemented, I would love to hear about them.

TypeScript
// worker.ts
import { Worker } from 'bullmq';
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
import Redis from 'ioredis';
import { router } from './router';
import dotenv from 'dotenv';
dotenv.config();

console.log("REDIS_URL_JOB_QUEUE", process.env.REDIS_URL_JOB_QUEUE)
const connection = new Redis(process.env.REDIS_URL_JOB_QUEUE || '', {
    maxRetriesPerRequest: null // required by BullMQ workers
}); // Connect to the Redis instance backing the BullMQ job queue

const proxy = process.env?.PROXY_URL || '';
console.log('proxy', proxy)

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [proxy],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
});

const scraperWorker = new Worker(
    'scraper',
    async (job) => {
        const url: string = job.data.url;
        
        try {
            // await crawler.addRequests([url]);
            await crawler.run([
                {
                    label: 'PRODUCT',
                    url
                },
            ]);

            // If everything went well, return a result
            return { result: 'success' };
        } catch (error) {
            // If something went wrong, log it and rethrow so BullMQ marks the job as failed
            const message = error instanceof Error ? error.message : String(error);
            console.error(`Scrape of ${url} failed with error ${message}`);
            throw error;
        }
    },
    {
        connection,
        limiter: {
            max: 2,        // Max number of jobs to handle
            duration: 5000 // per this many milliseconds (5,000 ms = 5 seconds)
        }
    }
);

scraperWorker.on('completed', (job, result) => {
    console.log(`Job ${job.id} completed with result ${result.result}`);
});

scraperWorker.on('failed', (job, err) => {
    if (!job) return console.log('Job not found');
    console.log(`Job ${job.id} failed with error ${err.message}`);
});
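
On initializing the crawler once: a minimal sketch of that idea is below. It assumes Crawlee's keepAlive crawler option (which keeps the crawler running even while its internal request queue is momentarily empty) together with crawler.addRequests(); the file name worker-keepalive.ts is made up. Note the trade-off: with this pattern a BullMQ job completes as soon as the URL is enqueued, not when it has actually been scraped.

TypeScript
// worker-keepalive.ts (illustrative sketch, not a drop-in replacement)
import { Worker } from 'bullmq';
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';
import Redis from 'ioredis';
import { router } from './router';
import dotenv from 'dotenv';
dotenv.config();

const connection = new Redis(process.env.REDIS_URL_JOB_QUEUE || '', {
    maxRetriesPerRequest: null // required by BullMQ workers
});

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [process.env.PROXY_URL || ''],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestHandler: router,
    keepAlive: true, // do not shut down when the queue runs dry
});

// Start the crawler once; with keepAlive the run() promise keeps going until teardown().
const crawlerRun = crawler.run();

const scraperWorker = new Worker(
    'scraper',
    async (job) => {
        // Feed the already-running crawler instead of calling run() per job.
        await crawler.addRequests([{ url: job.data.url, label: 'PRODUCT' }]);
        return { result: 'queued' };
    },
    { connection },
);

// On shutdown, stop accepting jobs, then stop the crawler and wait for it to finish.
process.on('SIGTERM', async () => {
    await scraperWorker.close();
    await crawler.teardown();
    await crawlerRun;
});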
2 comments
Hey,

I want to scrape multiple e-commerce web shops that have different HTML structures.

I was thinking about making a handler for each shop, letting each shop's HTML be scraped on its own. In the end, all sites should produce roughly the same data, such as price, title, in-stock sizes, etc. This is necessary because the data must then be processed further, which requires each product to match the schema.

Is this the best way to do it? I honestly don't know how to work this out in code yet; for now I am mostly thinking about a good approach. I would like to hear if there is a better one 🙂
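
One common way to structure this with Crawlee is a single router with one handler per shop, selected by the request label, where every handler maps its shop-specific selectors onto the same shared product schema. A minimal sketch is below; the Product interface, the shop labels, and all the selectors are made-up assumptions for illustration, not taken from any real shop.

TypeScript
// router.ts (illustrative sketch)
import { createCheerioRouter, Dataset } from 'crawlee';

// Shared schema that every shop handler must produce.
interface Product {
    shop: string;
    url: string;
    title: string;
    price: number;
    inStockSizes: string[];
}

export const router = createCheerioRouter();

// One handler per shop; the handler is picked via the request's label.
router.addHandler('SHOP_A_PRODUCT', async ({ request, $, log }) => {
    const product: Product = {
        shop: 'shop-a',
        url: request.loadedUrl ?? request.url,
        title: $('h1.product-title').text().trim(),
        price: Number($('span.price').first().text().replace(/[^\d.]/g, '')),
        inStockSizes: $('.size:not(.sold-out)').map((_, el) => $(el).text().trim()).get(),
    };
    log.info(`Scraped ${product.title}`);
    await Dataset.pushData(product);
});

router.addHandler('SHOP_B_PRODUCT', async ({ request, $ }) => {
    // Different selectors, same output shape.
    const product: Product = {
        shop: 'shop-b',
        url: request.loadedUrl ?? request.url,
        title: $('meta[property="og:title"]').attr('content') ?? '',
        price: Number($('[data-price]').first().attr('data-price') ?? 0),
        inStockSizes: $('select#sizes option:not([disabled])').map((_, el) => $(el).text().trim()).get(),
    };
    await Dataset.pushData(product);
});

Each enqueued request then only needs the right label, and everything downstream of the handlers can rely on a single Product shape.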
1 comment