Apify and Crawlee Official Forum

Updated 3 months ago

Is there a way to have custom variables accessible inside the crawler function?

Is there a way we can pass in some variables and have them accessible inside the crawler? The main use case is to have some internal variables which we can check/modify and use to execute some conditional logic.

For example, let's say we have a duplicate_count variable (I know we already have this functionality, but this is just an example). I'd update it whenever the data is already in my db and stop the crawler if the count exceeds some threshold. Is there a way I could implement this?
13 comments
This was essentially answered in my thread https://discord.com/channels/801163717915574323/1178416767991824415. You can pass data in "userData".

Example passing data:
Plain Text
crawler.run([
    { url: 'someUrl', userData: { thing: 'value' } }
]);


And you can access/modify it from within a crawl with the "request" property:
Plain Text
request.userData
thanks, this works. The only issue is you have to explicitly pass it in enqueueLinks to ensure it propagates to further calls
you could use a "preNavigationHook" to automatically set it as well
although at that point you might as well just import the variables where you need them. unless i'm understanding what you want incorrectly.
I don't think preNavigationHooks will work. Importing them is a good way too, but it's not that good DX-wise: I prefer explicit variables defined right in the file instead of having to import them unless necessary, since that makes for less cognitive load. Thanks for the info!
userData works perfectly
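For anyone landing here later, a minimal sketch of the explicit propagation mentioned above. It assumes the `userData` option of `enqueueLinks`; the handler name `defaultHandler` and the glob are only illustrative. The handler is written as a plain function so it can be registered with `router.addDefaultHandler(defaultHandler)`.

```javascript
// Sketch: forward the parent request's userData to every link enqueued from it.
// `defaultHandler` is a hypothetical name; register it on your own router.
async function defaultHandler({ request, enqueueLinks, log }) {
    log.info(`enqueueing new URLs from ${request.url}`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
        // Shallow copy, so the child requests do not share the parent's object.
        userData: { ...request.userData },
    });
}
```

Inside the 'detail' handler the forwarded values should then be readable from request.userData, the same way as on the start request.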
There are generally 2 ways to manage state:
  1. For sequential flow, use request.userData
  2. For non-sequential flow, you can have a global state object with useState
https://crawlee.dev/api/core/function/useState
hey, thanks for the info. Do we define it outside of the router/crawler like this, and then use the state variable?
Plain Text
import { createPlaywrightRouter, useState } from 'crawlee';

export const router = createPlaywrightRouter();

const state = await useState('test', { val: 12 });

router.addDefaultHandler(async ({ enqueueLinks, log }) => {

    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        globs: ['https://crawlee.dev/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log, pushData }) => {
    const title = await page.title();
    log.info(`${title}`, { url: request.loadedUrl });

    await pushData({
        url: request.loadedUrl,
        title,
    });
});
I also saw another snippet in the GitHub issues using it like crawler.useState, can you clarify a bit on this?
And what's the difference between passing a name to the useState function vs passing it in the config parameter? In the docs both options use it to define a custom key-value store
Both imports are equivalent. A name passed to useState applies just to that call, while the config would be global
gotcha, thanks. So call it outside of the crawler like const state = await useState() and then use it inside the crawler like a simple object? e.g. state.property = val
Yep. The reason to have this instead of just a plain object is that it is persisted to the key-value store in case you need to restart
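Putting the thread together, here is a rough sketch of the duplicate-count idea from the original question. It assumes `state` is the object returned by `await useState(...)`; `makeDetailHandler`, `isDuplicate`, `stop`, and the `duplicateCount` property are hypothetical names made up for illustration, not part of Crawlee.

```javascript
// Sketch: a 'detail' handler that counts duplicates in a shared state object
// and asks the crawler to stop once the count exceeds a threshold.
// `isDuplicate` and `stop` are hypothetical callbacks supplied by the caller.
function makeDetailHandler(state, { threshold, isDuplicate, stop }) {
    return async ({ request, pushData }) => {
        if (await isDuplicate(request.url)) {
            state.duplicateCount += 1;
            // "Exceeds" is read strictly here: stop only when count > threshold.
            if (state.duplicateCount > threshold) await stop();
            return; // skip storing data we already have
        }
        await pushData({ url: request.url });
    };
}

// Possible wiring (all names below are assumptions):
// const state = await useState('dedupe', { duplicateCount: 0 });
// router.addHandler('detail', makeDetailHandler(state, {
//     threshold: 10,
//     isDuplicate: (url) => db.has(url),           // hypothetical db lookup
//     stop: () => crawler.autoscaledPool?.abort(), // one way to halt the crawl
// }));
```

Passing the state and callbacks in keeps the handler easy to try out with fake objects. Since the state is persisted, the count should survive a restart as described above.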