
Apify and Crawlee Official Forum

Michal
Joined August 30, 2024
Hi, I found this simple example using Puppeteer that downloads images as you visit the page, and I'm wondering how I can incorporate it into my Crawlee scraper.

Plain Text
    // Requires `const fs = require('fs');` (or an fs import) and a `counter`
    // variable in the surrounding scope.
    this.page.on('response', async (response) => {
      // Capture the file extension of image responses.
      const matches = /\.(jpg|png|svg|gif)$/.exec(response.url());

      if (matches) {
        const extension = matches[1];
        const buffer = await response.buffer();

        // `buffer` is already binary data, so no encoding argument is needed.
        fs.writeFileSync(`downloads/${this.request.userData}.${extension}`, buffer);
        counter += 1;
      }
    });
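
For context, here's a minimal sketch of one way to wire this into a Crawlee PuppeteerCrawler via preNavigationHooks, so the response listener is attached before navigation. The downloads/ folder and the userData.name field are assumptions for illustration, not part of the original example.

Plain Text
import fs from 'node:fs';
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // preNavigationHooks run before each page.goto(), so the listener
  // is in place before any image responses arrive.
  preNavigationHooks: [
    async ({ page, request }) => {
      page.on('response', async (response) => {
        const matches = /\.(jpg|png|svg|gif)$/.exec(response.url());
        if (!matches) return;
        const buffer = await response.buffer();
        // Assumed naming scheme; adjust to your own userData shape.
        fs.writeFileSync(`downloads/${request.userData.name}.${matches[1]}`, buffer);
      });
    },
  ],
  requestHandler: async ({ page }) => {
    // Normal scraping logic goes here.
  },
});

await crawler.run(['https://example.com']);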
2 comments
While developing a scraper, I often face this issue:
1) I add the initial page to the queue
2) I run the scraper, which marks the URL as done
3) I want to re-run the scraper on the same page

I know I can keep changing the queue name, but is there a way to reset/clear the queue instead?

If I call drop() on it, it simply fails with:
Plain Text
Request queue with id: 7ae80a2d-3b06-4a8f-929d-4fbfc5947e81 does not exist.
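
If it helps, here's a minimal sketch of the drop-then-reopen pattern (the queue name is just an example); the error above is what you get if you keep using the old instance after drop():

Plain Text
import { RequestQueue } from 'crawlee';

// Drop the existing named queue...
const staleQueue = await RequestQueue.open('my-queue'); // example name
await staleQueue.drop();

// ...then open it again to get a fresh, empty queue with a new id.
const queue = await RequestQueue.open('my-queue');
await queue.addRequest({ url: 'https://example.com' });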
14 comments
I see there is a way to prevent this once the page loads, with something like this:
Plain Text
    await page.setRequestInterception(true);

    page.on('request', async (request) => {
      if (ok) { // `ok` is a placeholder condition
        await request.continue();
      } else {
        await request.abort();
      }
    });

But what about when I have a URL in the queue: is there a nice way to set up the request interception in the crawler options instead, so it's defined globally?
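
A minimal sketch of one way to do that with preNavigationHooks, so interception applies to every queued URL; filtering by resourceType() is just an example policy, not the only option:

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // Runs before every navigation, so the interception is set up
  // for all requests coming out of the queue.
  preNavigationHooks: [
    async ({ page }) => {
      await page.setRequestInterception(true);
      page.on('request', (request) => {
        // Example policy: skip images; adjust the condition to your needs.
        if (request.resourceType() === 'image') {
          request.abort();
        } else {
          request.continue();
        }
      });
    },
  ],
  requestHandler: async ({ page }) => {
    // Normal scraping logic goes here.
  },
});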
4 comments
I just depleted my proxy quota and all the remaining requests in the queue failed.

A similar thing happens often: how do I retry/re-enqueue the failed requests?

I've been googling it for a while now, and there's hardly any up-to-date info, only bits and pieces from older versions, closed GitHub issues, etc.

I'm sure it's somewhere in the API docs, but they're extremely hard to navigate and usually just point to other classes/interfaces.

Could such a basic thing be explained in the docs in one paragraph and a code snippet?
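
For what it's worth, a minimal sketch of one approach using failedRequestHandler to re-enqueue requests that exhausted their retries; the uniqueKey suffix is an assumption to get around the queue's deduplication of already-handled requests:

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  maxRequestRetries: 3,
  requestHandler: async ({ page, request }) => {
    // ...scraping logic...
  },
  // Called once a request has run out of retries.
  failedRequestHandler: async ({ request }) => {
    // Re-add under a new uniqueKey; the queue remembers the original
    // request as handled and would otherwise deduplicate it.
    await crawler.addRequests([{
      url: request.url,
      userData: request.userData,
      uniqueKey: `${request.uniqueKey}:retry`,
    }]);
  },
});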
43 comments
Hi, I'm using the proxy config with 100 proxies.

The goal is to let the scraper run with, say, 4 sessions concurrently, using 4 different proxies.

In each run, I see it picks one session ID (= one proxy) and runs through all requests with the same one
(it's a different one each time, but each time it's a single IP).

Plain Text
import { ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

import ApplicationRouter from './ApplicationRouter.js';


export default class TestProxies extends ApplicationRouter {

  async setup() {
    this.version = 1;
    this.prefix = 'TestProxies';
    this.datasetName = `${this.prefix}_dataset_V${this.version}`;
  }

  async getInitialPages() {
    return [
      { url: "https://ifconfig.co/?a=1", label: "page" },
      { url: "https://ifconfig.co/?a=2", label: "page" },
      { url: "https://ifconfig.co/?a=3", label: "page" },
      { url: "https://ifconfig.co/?a=4", label: "page" },
    ];
  }

  getRequestQueueName() {
    return `${this.prefix}_queue`;
  }

  getPageRoot() {
    return 'https://ifconfig.co';
  }

  // This is the entry
  async visitPage() {
    const ip = await this.text({ css: "#output" });
    this.debug("Proxy IP is", ip);

    await this.sleep(4000);
  }

  async getCrawlerOptions() {
    return {
      maxRequestRetries: 3,
      maxConcurrency: 2,

      useSessionPool: true,
      sessionPoolOptions: {
        maxPoolSize: 25,

        sessionOptions: {
          maxUsageCount: 150,
          maxAgeSecs: 23 * 60, // retire sessions before the proxy IPs rotate (every 30 minutes)
        },

        persistStateKeyValueStoreId: `${this.prefix}_V${this.version}_sessions`,
        persistStateKey: `${this.prefix}_V${this.version}_my-session-pool`,
      },

      proxyConfiguration: new ProxyConfiguration({
        proxyUrls: SMART_PROXY_DATACENTER_IPS
      })
    }
  }

}
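
For comparison, a minimal standalone sketch (outside the ApplicationRouter wrapper above) of the configuration I'd expect to give 4 concurrent sessions on 4 different proxies. maxPoolSize: 4 and useIncognitoPages are assumptions about the intended behaviour, not a confirmed fix:

Plain Text
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

const crawler = new PuppeteerCrawler({
  maxConcurrency: 4,
  useSessionPool: true,
  // Cap the pool at 4 so each concurrent slot gets its own session/proxy.
  sessionPoolOptions: { maxPoolSize: 4 },
  // Give each session its own incognito browser context instead of
  // funnelling everything through one shared browser.
  launchContext: { useIncognitoPages: true },
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: SMART_PROXY_DATACENTER_IPS,
  }),
  requestHandler: async ({ page, session, proxyInfo }) => {
    // Log which proxy this particular request actually went through.
    console.log(session?.id, proxyInfo?.url);
  },
});

await crawler.run([
  'https://ifconfig.co/?a=1',
  'https://ifconfig.co/?a=2',
  'https://ifconfig.co/?a=3',
  'https://ifconfig.co/?a=4',
]);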
25 comments
Hey, I'm trying to install packages in the Dockerfile and it won't let me.

1) Either I get "permission denied" when running apt-get install, or
2) using sudo says there is no sudo (so I should already be root?)
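
In case it's the usual cause: a minimal Dockerfile sketch, assuming one of the Apify base images that run as a non-root user (the image tag, the package name, and the `myuser` user name are assumptions; check the image you actually build from):

Plain Text
FROM apify/actor-node-puppeteer-chrome:20

# Switch to root only for the install step, then drop back to the
# image's non-root user.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends some-package \
    && rm -rf /var/lib/apt/lists/*
USER myuser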
2 comments
Hey, I've encountered a website using shadow DOM, where Crawlee isn't able to find elements (for a good reason).
https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM

Since there's no mention of shadow DOM, I was wondering if anyone knows what to look at to make it work.
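
A minimal sketch of reaching into an open shadow root from the Puppeteer page object; the host and inner selectors (#my-host, .price) are made-up placeholders:

Plain Text
// Traverse the shadow root manually inside the browser context.
const text = await page.evaluate(() => {
  const host = document.querySelector('#my-host');
  // shadowRoot is null when the shadow DOM was attached in "closed" mode.
  return host?.shadowRoot?.querySelector('.price')?.textContent ?? null;
});

// Recent Puppeteer versions can also pierce open shadow roots directly:
const handle = await page.$('pierce/.price');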
8 comments