
Apify and Crawlee Official Forum

Michal
Joined August 30, 2024
Hi, I found this simple example using Puppeteer that downloads images as you visit the page, and I'm wondering how I can incorporate it into my Crawlee scraper.

Plain Text
    // Requires `const fs = require('fs');` (or an fs import) and a `counter`
    // variable in the surrounding scope.
    this.page.on('response', async (response) => {
      // Capture the file extension of image responses.
      const matches = /\.(jpg|png|svg|gif)$/.exec(response.url());

      if (matches) {
        const extension = matches[1];
        const buffer = await response.buffer();

        // `buffer` is already binary data, so no encoding argument is needed.
        fs.writeFileSync(`downloads/${this.request.userData}.${extension}`, buffer);
        counter += 1;
      }
    });
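
For context, here's a minimal sketch of one way to wire this into a Crawlee PuppeteerCrawler via preNavigationHooks, so the response listener is attached before navigation. The downloads/ folder and the userData.name field are assumptions for illustration, not part of the original example.

Plain Text
import fs from 'node:fs';
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // preNavigationHooks run before each page.goto(), so the listener
  // is in place before any image responses arrive.
  preNavigationHooks: [
    async ({ page, request }) => {
      page.on('response', async (response) => {
        const matches = /\.(jpg|png|svg|gif)$/.exec(response.url());
        if (!matches) return;
        const buffer = await response.buffer();
        // Assumed naming scheme; adjust to your own userData shape.
        fs.writeFileSync(`downloads/${request.userData.name}.${matches[1]}`, buffer);
      });
    },
  ],
  requestHandler: async ({ page }) => {
    // Normal scraping logic goes here.
  },
});

await crawler.run(['https://example.com']);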
2 comments
While developing a scraper, I often face this issue:
1) I add the initial page to the queue
2) I run the scraper, which marks the URL as done
3) I want to re-run the scraper on the same page

I know I can keep changing the queue name, but is there a way to reset/clear the queue instead?

If I call drop() on it, it simply fails with:
Plain Text
Request queue with id: 7ae80a2d-3b06-4a8f-929d-4fbfc5947e81 does not exist.
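
If it helps, here's a minimal sketch of the drop-then-reopen pattern (the queue name is just an example); the error above is what you get if you keep using the old instance after drop():

Plain Text
import { RequestQueue } from 'crawlee';

// Drop the existing named queue...
const staleQueue = await RequestQueue.open('my-queue'); // example name
await staleQueue.drop();

// ...then open it again to get a fresh, empty queue with a new id.
const queue = await RequestQueue.open('my-queue');
await queue.addRequest({ url: 'https://example.com' });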
14 comments
I see there is a way to prevent this once the page loads, with something like this:
Plain Text
    await page.setRequestInterception(true);

    page.on('request', async (request) => {
      if (ok) { // `ok` is a placeholder condition
        await request.continue();
      } else {
        await request.abort();
      }
    });

But what about when I have a URL in the queue: is there a nice way to set up the request interception in the crawler options instead, so it's defined globally?
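
A minimal sketch of one way to do that with preNavigationHooks, so interception applies to every queued URL; filtering by resourceType() is just an example policy, not the only option:

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  // Runs before every navigation, so the interception is set up
  // for all requests coming out of the queue.
  preNavigationHooks: [
    async ({ page }) => {
      await page.setRequestInterception(true);
      page.on('request', (request) => {
        // Example policy: skip images; adjust the condition to your needs.
        if (request.resourceType() === 'image') {
          request.abort();
        } else {
          request.continue();
        }
      });
    },
  ],
  requestHandler: async ({ page }) => {
    // Normal scraping logic goes here.
  },
});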
4 comments
I just depleted my proxy quota and all the remaining requests in the queue failed.

A similar thing happens often: how do I retry/re-enqueue the failed requests?

I've been googling it for a while now, and there's hardly any up-to-date info, only bits and pieces from older versions, closed GitHub issues, etc.

I'm sure it's somewhere in the API docs, but they're extremely hard to navigate and usually just point to other classes/interfaces.

Could such a basic thing be explained in the docs in one paragraph and a code snippet?
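
For what it's worth, a minimal sketch of one approach using failedRequestHandler to re-enqueue requests that exhausted their retries; the uniqueKey suffix is an assumption to get around the queue's deduplication of already-handled requests:

Plain Text
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  maxRequestRetries: 3,
  requestHandler: async ({ page, request }) => {
    // ...scraping logic...
  },
  // Called once a request has run out of retries.
  failedRequestHandler: async ({ request }) => {
    // Re-add under a new uniqueKey; the queue remembers the original
    // request as handled and would otherwise deduplicate it.
    await crawler.addRequests([{
      url: request.url,
      userData: request.userData,
      uniqueKey: `${request.uniqueKey}:retry`,
    }]);
  },
});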
43 comments
Hi, I'm using the proxy config with 100 proxies.

The goal is to let the scraper run with, say, 4 sessions concurrently, using 4 different proxies.

In each run, I see it picks one session ID (= one proxy) and runs through all requests with the same one
(it's a different one each time, but each time it's a single IP).

Plain Text
import { ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

import ApplicationRouter from './ApplicationRouter.js';


export default class TestProxies extends ApplicationRouter {

  async setup() {
    this.version = 1;
    this.prefix = 'TestProxies';
    this.datasetName = `${this.prefix}_dataset_V${this.version}`;
  }

  async getInitialPages() {
    return [
      { url: "https://ifconfig.co/?a=1", label: "page" },
      { url: "https://ifconfig.co/?a=2", label: "page" },
      { url: "https://ifconfig.co/?a=3", label: "page" },
      { url: "https://ifconfig.co/?a=4", label: "page" },
    ];
  }

  getRequestQueueName() {
    return `${this.prefix}_queue`;
  }

  getPageRoot() {
    return 'https://ifconfig.co';
  }

  // This is the entry
  async visitPage() {
    const ip = await this.text({ css: "#output" });
    this.debug("Proxy IP is", ip);

    await this.sleep(4000);
  }

  async getCrawlerOptions() {
    return {
      maxRequestRetries: 3,
      maxConcurrency: 2,

      useSessionPool: true,
      sessionPoolOptions: {
        maxPoolSize: 25,

        sessionOptions: {
          maxUsageCount: 150,
          maxAgeSecs: 23 * 60, // retire sessions before the proxy IPs rotate (every 30 minutes)
        },

        persistStateKeyValueStoreId: `${this.prefix}_V${this.version}_sessions`,
        persistStateKey: `${this.prefix}_V${this.version}_my-session-pool`,
      },

      proxyConfiguration: new ProxyConfiguration({
        proxyUrls: SMART_PROXY_DATACENTER_IPS
      })
    }
  }

}
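
For comparison, a minimal standalone sketch (outside the ApplicationRouter wrapper above) of the configuration I'd expect to give 4 concurrent sessions on 4 different proxies. maxPoolSize: 4 and useIncognitoPages are assumptions about the intended behaviour, not a confirmed fix:

Plain Text
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';
import { SMART_PROXY_DATACENTER_IPS } from '../utils/proxies.js';

const crawler = new PuppeteerCrawler({
  maxConcurrency: 4,
  useSessionPool: true,
  // Cap the pool at 4 so each concurrent slot gets its own session/proxy.
  sessionPoolOptions: { maxPoolSize: 4 },
  // Give each session its own incognito browser context instead of
  // funnelling everything through one shared browser.
  launchContext: { useIncognitoPages: true },
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: SMART_PROXY_DATACENTER_IPS,
  }),
  requestHandler: async ({ page, session, proxyInfo }) => {
    // Log which proxy this particular request actually went through.
    console.log(session?.id, proxyInfo?.url);
  },
});

await crawler.run([
  'https://ifconfig.co/?a=1',
  'https://ifconfig.co/?a=2',
  'https://ifconfig.co/?a=3',
  'https://ifconfig.co/?a=4',
]);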
25 comments
Hey, I'm trying to install packages in the Dockerfile and it won't let me.

1) Either I get "permission denied" when running apt-get install, or
2) using sudo says there is no sudo (so I should already be root?)
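
In case it's the usual cause: a minimal Dockerfile sketch, assuming one of the Apify base images that run as a non-root user (the image tag, the package name, and the `myuser` user name are assumptions; check the image you actually build from):

Plain Text
FROM apify/actor-node-puppeteer-chrome:20

# Switch to root only for the install step, then drop back to the
# image's non-root user.
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends some-package \
    && rm -rf /var/lib/apt/lists/*
USER myuser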
2 comments
Hey, I've encountered a website using shadow DOM, where Crawlee isn't able to find elements (for a good reason).
https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM

Since there's no mention of shadow DOM, I was wondering if anyone knows what to look at to make it work.
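
A minimal sketch of reaching into an open shadow root from the Puppeteer page object; the host and inner selectors (#my-host, .price) are made-up placeholders:

Plain Text
// Traverse the shadow root manually inside the browser context.
const text = await page.evaluate(() => {
  const host = document.querySelector('#my-host');
  // shadowRoot is null when the shadow DOM was attached in "closed" mode.
  return host?.shadowRoot?.querySelector('.price')?.textContent ?? null;
});

// Recent Puppeteer versions can also pierce open shadow roots directly:
const handle = await page.$('pierce/.price');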
8 comments