Apify and Crawlee Official Forum

hunterleung.
Joined August 30, 2024
I am building a scraping app and have encountered an issue.

I set up two different crawler tasks in the same app. When the first crawler task is completed, the app uses the abort method to exit the first task and then starts the second task. However, the task object obtained in the route handler still contains the task configuration of the first crawler task.

Every time I run a crawler, I create the instance with new. The route handlers on the instance are also constructed with new each time, returning fresh instances rather than following a singleton pattern. The userData I pass in is also the task object for the current run.

Could you please help me identify what's wrong with my code and how I should modify it? Thank you.

Here is some of my code:

__crawlerRunner.ts file__

import { PlaywrightCrawler } from 'crawlee'
import { CrawlerTask, CrawlerType } from '../../types'

export async function runTaskCrawler(crawler: PlaywrightCrawler, task: CrawlerTask) {
  switch (task.taskType) {
    case CrawlerType.WEBSITE:
      return await runWebsiteTaskCrawler(crawler, task)

    default:
      throw new Error('Invalid crawler type')
  }
}

async function runWebsiteTaskCrawler(crawler: PlaywrightCrawler, task: CrawlerTask) {
  console.log(task.sourceUrl)
  await crawler.run([
    {
      url: task.sourceUrl,
      userData: {
        task,
        depth: 0
      }
    }
  ])
}
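
For reference, here is a minimal sketch of building a brand-new crawler and router for every run, so that nothing can leak over from a previous task (createTaskCrawler is a hypothetical helper; it assumes crawlee's createPlaywrightRouter and reads the task only from request.userData):

import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee'
import { CrawlerTask } from '../../types'

// Hypothetical helper: constructs a fresh crawler and router per task,
// so no handler state survives from a previous run.
export function createTaskCrawler() {
  const router = createPlaywrightRouter()

  router.addDefaultHandler(async ({ request, log }) => {
    // Read the task from the current request's userData, never from
    // variables captured when an earlier crawler was constructed.
    const { task, depth } = request.userData as { task: CrawlerTask; depth: number }
    log.info(`Handling ${request.url} at depth ${depth} for ${task.sourceUrl}`)
  })

  return new PlaywrightCrawler({ requestHandler: router })
}

If both runs share the default request queue, leftover requests from the aborted first task could also explain the stale task data, so it may be worth checking that the queue is actually purged between runs.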
9 comments
Hi. I am running a Playwright crawler on my Linux VPS. The VPS has an 8-core CPU and 15533 MB of memory.
But I get many warnings like:
WARN PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 12184 MB of 3883 MB (314%). Consider increasing available memory.

So how should I fix this?

Thanks for your help.
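
The 3883 MB in the warning is roughly 25% of your 15533 MB of RAM, which matches Crawlee's default availableMemoryRatio of 0.25, so one likely fix is to raise that limit. A minimal sketch, assuming the CRAWLEE_MEMORY_MBYTES / CRAWLEE_AVAILABLE_MEMORY_RATIO environment variables and the availableMemoryRatio configuration key behave as in current Crawlee versions:

import { PlaywrightCrawler, Configuration } from 'crawlee'

// Option A: set an environment variable before starting the process, e.g.
//   CRAWLEE_MEMORY_MBYTES=12000 node dist/main.js
// or
//   CRAWLEE_AVAILABLE_MEMORY_RATIO=0.8 node dist/main.js

// Option B: raise the ratio in code via the global configuration.
Configuration.getGlobalConfig().set('availableMemoryRatio', 0.8)

const crawler = new PlaywrightCrawler({
  // Capping concurrency also helps keep memory usage in check.
  maxConcurrency: 10,
  async requestHandler({ request, log }) {
    log.info(`Visited ${request.url}`)
  },
})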
30 comments
Hello everyone, I am using Crawlee and Electron to develop a desktop program, and I have met two problems:

  1. I want to start multiple crawler tasks at the same time in my software, and each crawler task should have its own independent parameter settings. I don't see anything about this in the documentation. How should I implement this requirement? (A sketch covering both points follows after this list.)

  2. I want to add pause and resume features to the crawler tasks, but I haven't seen the related functions in the documentation.

Can someone give me some tips? I would be very grateful.
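
A minimal sketch of both points, assuming crawlee accepts a separate Configuration per crawler instance and that a running crawler's autoscaledPool exposes pause()/resume() (worth verifying against your crawlee version; the TaskSettings shape is illustrative only):

import { PlaywrightCrawler, Configuration } from 'crawlee'

// Hypothetical per-task settings shape, for illustration only.
interface TaskSettings {
  startUrl: string
  maxConcurrency: number
}

// Give each crawler its own Configuration so their storages and
// autoscaling stay fully independent.
function buildCrawler(settings: TaskSettings, id: string) {
  const config = new Configuration({ persistStorage: false })
  return new PlaywrightCrawler(
    {
      maxConcurrency: settings.maxConcurrency,
      async requestHandler({ request, log }) {
        log.info(`[task ${id}] ${request.url}`)
      },
    },
    config,
  )
}

const taskA = { startUrl: 'https://example.com', maxConcurrency: 2 }
const taskB = { startUrl: 'https://example.org', maxConcurrency: 5 }
const crawlerA = buildCrawler(taskA, 'A')
const crawlerB = buildCrawler(taskB, 'B')

// 1. Run both tasks at the same time, each with its own settings.
await Promise.all([crawlerA.run([taskA.startUrl]), crawlerB.run([taskB.startUrl])])

// 2. Pause/resume a running task through its autoscaled pool:
//    await crawlerA.autoscaledPool?.pause()
//    crawlerA.autoscaledPool?.resume()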
4 comments
Hi, guys. I am a Python coder but not good at Node.js. I made a crawler with Crawlee to bulk-check information.

These are my options:
useSessionPool: true,
useIncognitoPages: true,

and I am using residential proxies.
But I found that the IP of some pages is the same as that of others.
I want to launch a different browser context for each target URL, but I don't know how to do that.
Could anybody help me?

Sorry for my poor English.
Thanks
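
With useSessionPool, the proxy is chosen per session, and sessions are reused by default, which would explain the repeated IPs. A minimal sketch of retiring every session after a single use, so each request gets a fresh session, incognito context, and proxy pick (assuming sessionPoolOptions.sessionOptions.maxUsageCount behaves as in current crawlee versions; the proxy URL is a placeholder):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee'

const proxyConfiguration = new ProxyConfiguration({
  // Placeholder: substitute your residential proxy URL(s).
  proxyUrls: ['http://user:pass@proxy.example.com:8000'],
})

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  useSessionPool: true,
  sessionPoolOptions: {
    // Retire each session after one use, so the session (and the
    // proxy tied to it) is never shared between target URLs.
    sessionOptions: { maxUsageCount: 1 },
  },
  launchContext: {
    // Every page gets its own incognito browser context.
    useIncognitoPages: true,
  },
  async requestHandler({ request, proxyInfo, log }) {
    log.info(`${request.url} via ${proxyInfo?.url}`)
  },
})

await crawler.run(['https://example.com'])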
2 comments