Apify

Apify and Crawlee Official Forum


How to "store" and "retrieve" a browser on a per user basis?

I am working on a Crawlee-based crawler that performs actions on a per-user basis. Given this, I want to keep a separate browser configuration for each user. I already store cookies and push them into the browser before each page is loaded on behalf of the user. But this all happens in the same browser, and I think that is throwing things off.

I run into problems where the cookies “go bad”. This could have NOTHING to do with the architecture. But it seems to me that every browser is “different”, and I suspect that mismatch might be what causes the errors.

Does anyone have any thoughts on how to store browser state on a per-user basis? Where we can, we also configure a sticky proxy/IP for each user.

Thanks!
13 comments
Also, related to this: in our current/old system, we would open tabs for threaded execution. I'm not sure if I need to do something similar with Crawlee, and if so, I'm not sure how it works.
For user profile isolation with the browser-based crawlers, you can use the launchContext.userDataDir option - this is basically a passthrough option for the Playwright / Puppeteer option of the same name (https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context-option-user-data-dir).

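import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

// Router holding the request handlers (handlers omitted here).
const router = createPlaywrightRouter();

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    launchContext: {
        // Path to the folder where you want to store the per-user data.
        userDataDir: './user_data',
    },
});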
Regarding the "threaded" execution, Crawlee handles per-request concurrency automatically, so you don't really have to worry about it (it scales up and down based on the current system load).
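If you do want to bound the automatic scaling, Crawlee crawlers accept minConcurrency and maxConcurrency options. A minimal sketch, assuming a hypothetical helper named concurrencyLimits and an arbitrary cap of 5 parallel pages (neither comes from this thread):

```typescript
// Shape of the two concurrency options that Crawlee crawlers accept.
interface ConcurrencyLimits {
    minConcurrency: number;
    maxConcurrency: number;
}

// Hypothetical helper: builds the limits and validates the cap.
function concurrencyLimits(maxPages: number): ConcurrencyLimits {
    if (!Number.isInteger(maxPages) || maxPages < 1) {
        throw new RangeError('maxPages must be a positive integer');
    }
    return { minConcurrency: 1, maxConcurrency: maxPages };
}

// Assumed cap of 5; tune this for your own workload.
const limits = concurrencyLimits(5);

// Intended use (requires the crawlee package):
// const crawler = new PlaywrightCrawler({ requestHandler: router, ...limits });
```

Within these bounds Crawlee still decides how many requests to run at once, so this only narrows the range rather than replacing the autoscaling.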
Ok that's super helpful.
so, it basically works with playwright's options for this?
Yep, launchContext.userDataDir is just passed to Playwright afaik. You can pass more launch options to the browser (like CLI arguments) in launchContext.launchOptions (check out the TS type annotation in your IDE, it gives you all the options you can use)
So, we're running in Kubernetes with multiple worker processes, so I'd need to mount this data directory into all of my workers so that each one could get to the correct path.
We're already stuffing the browser with cookies before each retrieval, and dumping them back after the page is loaded.
This would be the data directory that stores the "other" stuff, I'd guess. That would help us keep things "clean" between users.
I don't think we ever tried anything like this, but yes - in theory, it should work like this 🙂

If you keep the mapping "one user = one userDataDir", you might even save yourself the hassle with injecting the cookies - the cookies are saved in the userDataDir (along with localStorage contents etc.) This also shows why you definitely shouldn't share the same userDataDir between multiple users 🙂 If you don't specify this option, Playwright generates a new ephemeral userDataDir for each script execution iirc.
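The "one user = one userDataDir" mapping could be sketched like this. The mount point /data/profiles, the helper names, and the allowed user-ID format are all assumptions for illustration, not something Crawlee or Playwright prescribes:

```typescript
import * as path from 'node:path';

// Hypothetical root directory, e.g. a volume mounted into every worker.
const PROFILE_ROOT = '/data/profiles';

// Maps a user ID to its own profile directory.
// Rejects IDs that could escape the root (slashes, dots, and so on).
function userDataDirFor(userId: string, root: string = PROFILE_ROOT): string {
    if (!/^[A-Za-z0-9_-]+$/.test(userId)) {
        throw new Error(`Unsafe user ID: ${userId}`);
    }
    return path.join(root, userId);
}

// Builds the launchContext fragment for one user.
function launchContextFor(userId: string): { userDataDir: string } {
    return { userDataDir: userDataDirFor(userId) };
}

// Intended use (requires the crawlee package):
// const crawler = new PlaywrightCrawler({
//     requestHandler: router,
//     launchContext: launchContextFor('alice'),
// });
```

Since userDataDir is set when the crawler is constructed, this sketch assumes one crawler instance per user.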
Thanks for that. Yeah, we have initial cookies we'd need to inject, but otherwise that seems logical.
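For the inject-then-dump cookie flow described in this thread, here is a rough sketch. The in-memory cookieStore, the StoredCookie shape, and both helper names are assumptions; the CookieContext interface only mirrors the addCookies and cookies methods that Playwright's BrowserContext provides, so the logic can be read without Playwright installed:

```typescript
// Simplified cookie shape; Playwright's own cookie type has more fields.
interface StoredCookie {
    name: string;
    value: string;
    domain: string;
    path: string;
}

// Subset of Playwright's BrowserContext that this sketch relies on.
interface CookieContext {
    addCookies(cookies: StoredCookie[]): Promise<void>;
    cookies(): Promise<StoredCookie[]>;
}

// Hypothetical per-user store; a real system would persist this somewhere.
const cookieStore = new Map<string, StoredCookie[]>();

// Pushes the user's saved cookies into the browser context before navigation.
async function injectCookies(ctx: CookieContext, userId: string): Promise<void> {
    const saved = cookieStore.get(userId) ?? [];
    if (saved.length > 0) {
        await ctx.addCookies(saved);
    }
}

// Reads the cookies back after the page has loaded and saves them for the user.
async function dumpCookies(ctx: CookieContext, userId: string): Promise<void> {
    cookieStore.set(userId, await ctx.cookies());
}
```

In a PlaywrightCrawler these helpers could plausibly be called from preNavigationHooks and postNavigationHooks with page.context(), though with a dedicated userDataDir per user only the initial injection may be needed.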