How should I fix the userData if I run 2 different craw...

At a glance

The community member is building a scrape app and encountered an issue where the task object obtained in the route handler still contains the task configuration of the first crawler task, even after the first task is completed and the second task is started. The community member has tried creating new instances of the crawler and route handlers, but the issue persists. The community members suggest the following:

1. Ensure that the userData being passed to the router is specific to the current task by logging and verifying the content of userData before starting the crawl.

2. Reset or clear the userData before starting the second task, and deep clone the task object before passing it to userData to avoid references to the previous task.

3. Before starting the second task, ensure that the crawler and its related state are completely reset by aborting the autoscaled pool, tearing down the crawler, and reinitializing a new crawler instance.

4. Try using the useState() method instead of userData to manage the crawler state.

There is no explicitly marked answer in the comments.

Useful resources

hhunterleung.

I am building a scrape app and encountered an issue.

I set up two different crawler tasks in the same app. When the first crawler task is completed, the app uses the abort method to exit the first task and then starts the second task. However, the task object obtained in the route handler still contains the task configuration of the first crawler task.

Every time I run a crawler instance, I create it using the new method. The route handlers on the instance are also created with new, returning new instances each time, not following a singleton pattern. The userData I pass in is also the task object for the current run.

Could you please help me identify what's wrong with my code and how I should modify it? Thank you.

Here is some of my code :

__crawlerRunner.ts file __

import { PlaywrightCrawler } from 'crawlee'
import { CrawlerTask, CrawlerType } from '../../types'

export async function runTaskCrawler(crawler: PlaywrightCrawler, task: CrawlerTask) {
switch (task.taskType) {
case CrawlerType.WEBSITE:
return await runWebsiteTaskCrawler(crawler, task)

default:
throw new Error('Invalid crawler type')
}
}

async function runWebsiteTaskCrawler(crawler: PlaywrightCrawler, task: CrawlerTask) {
console.log(task.sourceUrl)
await crawler.run([
{
url: task.sourceUrl,
userData: {
task,
depth: 0
}
}
])
}

9 comments

hhunterleung.

__routerFactory.ts file __
import { createPlaywrightRouter } from 'crawlee'
import { CrawlerTask, SkipOperator } from '../../types'
import { getTaskCrawler } from '.'

export function routerFactory() {
const router = createPlaywrightRouter()
router.addDefaultHandler(async ctx => {
const userData = ctx.request.userData
console.log(userData)
const task = userData.task as CrawlerTask
const depth = userData.depth + 1
const limitCrawlDepth = task.limitCrawlDepth

if (limitCrawlDepth !== -1 && depth > limitCrawlDepth) {
ctx.crawler.autoscaledPool?.abort()
// ctx.crawler.teardown()
return
}

const crawlPagesTotal = ++getTaskCrawler(task.taskId).crawlPagesTotal
const limitCrawlPagesTotal = task.limitCrawlPagesTotal as number

if (limitCrawlPagesTotal !== -1 && crawlPagesTotal > limitCrawlPagesTotal) {
ctx.crawler.autoscaledPool?.abort()
// ctx.crawler.teardown()
return
}

await collectData(ctx)

await ctx.enqueueLinks({
strategy: 'all',
userData: {
task,
depth
},
transformRequestFunction(req) {
const url = req.url
const skipOperator = task.skipOperator

return req
}
})
})
return router
}

async function collectData({ request, page, log }) {
log.info('URL：' + request.url)
const title = await page.title()
let links = await page.$$eval('a', anchors => anchors.map(anchor => anchor.href))
links = Array.from(new Set(links.filter(Boolean).map(l => new URL(l).hostname)))
if (!title || links.length === 0) return

log.info('get data success', {
url: request.url,
did: request.userData.did,
title,
links
})
}

VVistics

Always ensure that the userData being passed to the router is specific to the current task. You can enforce this by logging and verifying the content of userData right before starting the crawl

VVistics

I pray it helps

VVistics

Ensure that the userData is being properly reset or cleared before starting the second task. You might need to deep clone the task object before passing it to userData to avoid references to the previous task.
Before starting the second task, ensure that the crawler and its related state are completely reset. You might also want to destroy or reinitialize the crawler instance.check the code below
async function runTaskCrawler(crawler: PlaywrightCrawler, task: CrawlerTask) {

await crawler.autoscaledPool?.abort();
await crawler.teardown(); // Ensure full teardown of the previous task's state.

// Reinitialize the crawler or ensure it's fresh.
const newCrawler = new PlaywrightCrawler();

switch (task.taskType) {
case CrawlerType.WEBSITE:
return await runWebsiteTaskCrawler(newCrawler, task);
default:
throw new Error('Invalid crawler type');
}
}
Always ensure that the userData being passed to the router is specific to the current task. You can enforce this by logging and verifying the content of userData right before starting the crawl
I pray it helps

hhunterleung.

HI Vistics, Thanks for your help. I will try this way as you said . And I will tell you the test result.

VVistics

👍

hhunterleung.

HI Vistics, I have try your method,but the result as same as mine,It seems not work.

OOleg V.

Maybe You can try to use useState() method instead of userData ? :

https://crawlee.dev/api/next/core/function/useState

Example:

Plain Text

const crawler = new CheerioCrawler({
    async requestHandler({ crawler }) {
        const state = await crawler.useState({ foo: [] as number[] });
        // just change the value, no need to care about saving it
        state.foo.push(123);
    },
});

hhunterleung.

thanks Oleg V. , I will try the way as you said .Thanks

Add a reply

Apify Discord Mirror

How should I fix the userData if I run 2 different crawler in the same app?