Crawlee not working with Cloudflare

At a glance

The community member gets a 403 Forbidden response when crawling a Cloudflare-protected page with Crawlee's PlaywrightCrawler, even with a rotating proxy pool. The code provided shows the crawler setup, including its proxy configuration. One commenter points to the Apify Academy guide on handling CAPTCHAs, since the block is likely caused by anti-scraping measures. The original poster replies that this does not help, because the page still requires solving a challenge.
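To make the "403 because of a challenge" case concrete, here is a minimal sketch of a check that separates a Cloudflare challenge page from an ordinary 403. The function name `looksLikeCloudflareChallenge` is hypothetical and not part of Crawlee or Playwright. The signals it uses (a 403 or 503 status, a `cf-mitigated: challenge` response header, and a "Just a moment..." page title) are assumptions about how the challenge page usually presents itself, so treat the result as a heuristic.

```javascript
// Hypothetical helper, not part of Crawlee or Playwright.
// Heuristic: a blocked status (403 or 503) combined with either the
// `cf-mitigated: challenge` header or the "Just a moment..." interstitial title.
function looksLikeCloudflareChallenge({ status, headers = {}, title = '' }) {
    const blockedStatus = status === 403 || status === 503;
    // Header names are expected in lower case here.
    const mitigated = String(headers['cf-mitigated'] ?? '').toLowerCase() === 'challenge';
    const interstitialTitle = /just a moment/i.test(title);
    return blockedStatus && (mitigated || interstitialTitle);
}

// Example inputs (made up for illustration).
console.log(looksLikeCloudflareChallenge({ status: 403, headers: { 'cf-mitigated': 'challenge' } }));
console.log(looksLikeCloudflareChallenge({ status: 403, title: 'Forbidden' }));
```

With Playwright, the inputs could come from `response.status()`, `response.headers()`, and `await page.title()`. Detecting the challenge only tells you why the request was blocked. It does not solve the challenge.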

osama

It keeps returning 403, even with a rotating proxy pool.

Source code:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import proxy from './proxy_config.js';

// Route all browser traffic through the proxy defined in ./proxy_config.js.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [`http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`]
});
// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to `./storage/datasets/default` directory.
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the crawling queue.
        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,

    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://nopecha.com/demo/cloudflare']);

// Export the whole dataset to a single file in `./result.csv`.
await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);
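Note that the snippet above passes a single entry in `proxyUrls`. Crawlee's `ProxyConfiguration` rotates across the entries in that array, so unless the endpoint is itself a rotating gateway, every request uses the same proxy. The sketch below builds several URLs from proxy records shaped like the `proxy` import. The helper name `buildProxyUrls` and the example hosts and credentials are hypothetical placeholders. It also percent-encodes the credentials, which matters if a password contains characters such as `@` or `:`.

```javascript
// Hypothetical helper: turns proxy records ({ username, password, host, port })
// into URL strings, percent-encoding the credentials.
function buildProxyUrls(proxies) {
    return proxies.map(({ username, password, host, port }) => {
        const auth = `${encodeURIComponent(username)}:${encodeURIComponent(password)}`;
        return `http://${auth}@${host}:${port}`;
    });
}

// Placeholder values for illustration only.
const proxyUrls = buildProxyUrls([
    { username: 'user', password: 'p@ss:word', host: 'proxy-1.example.com', port: 8000 },
    { username: 'user', password: 'p@ss:word', host: 'proxy-2.example.com', port: 8000 },
]);
console.log(proxyUrls);
```

The resulting array could be passed as `new ProxyConfiguration({ proxyUrls })`. A larger pool changes the IP address seen by the site, but it does not by itself get past a Cloudflare challenge.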

3 comments

osama

@Helper @Apify Developer Community Manager

Hamza

Take a look at this:
https://docs.apify.com/academy/anti-scraping/techniques/captchas

osama

@Hamza That doesn't work. The page still requires solving the challenge.

Apify Discord Mirror
