I am working on a Crawlee-based crawler that performs actions on a per-user basis, so I want to keep a separate browser configuration for each user. I already store cookies and push them into the browser before each page is loaded on behalf of the user. But this all happens in the same browser, and I think that is throwing things off.
I run into problems where the cookies “go bad”. This might have nothing to do with the architecture, but it seems to me that every browser is “different”, and I think that mismatch might be what causes the errors.
Does anyone have any thoughts on how to store browser state on a per-user basis? We also configure a sticky proxy/IP for each user where we can.
Also, related to this: in our current/old system, we would open tabs for threaded execution. I'm not sure whether I need to do something similar with Crawlee, or how that would work.
import { PlaywrightCrawler } from 'crawlee';

// `router` is your existing request router.
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    launchContext: {
        // Path to the folder where you want to store the per-user data.
        userDataDir: './user_data',
    },
});
Regarding the "threaded" execution, Crawlee handles per-request concurrency automatically, so you don't really have to worry about it (it scales up and down based on the current system load).
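If you do want to put bounds on that autoscaling, here is a minimal sketch. It assumes the crawler's minConcurrency and maxConcurrency options; the helper name concurrencyFor is made up for illustration and is not part of Crawlee.

```typescript
// Shape of the two concurrency-related crawler options used here.
interface ConcurrencyOptions {
    minConcurrency: number;
    maxConcurrency: number;
}

// Hypothetical helper (not a Crawlee API): turns "how many pages may run
// in parallel" into options that can be spread into the crawler constructor.
function concurrencyFor(maxParallelPages: number): ConcurrencyOptions {
    // Never go below 1, and drop any fractional part.
    const max = Math.max(1, Math.floor(maxParallelPages));
    return { minConcurrency: 1, maxConcurrency: max };
}

// Example: allow at most 4 requests in flight at once.
const limits = concurrencyFor(4);
```

You would then spread it into the constructor, e.g. new PlaywrightCrawler({ requestHandler: router, ...limits }). Crawlee still scales between those bounds on its own.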
Yep, launchContext.userDataDir is just passed to Playwright afaik. You can pass more launch options to the browser (like CLI arguments) in launchContext.launchOptions (check out the TS type annotation in your IDE, it gives you all the options you can use).
So, we're running in Kubernetes with multiple worker processes, so I'd need to mount the user data directories into all of my workers so each one can get to the correct path.
I don't think we ever tried anything like this, but yes - in theory, it should work like this 🙂
If you keep the mapping "one user = one userDataDir", you might even save yourself the hassle of injecting the cookies: they are saved in the userDataDir (along with localStorage contents etc.). This also shows why you definitely shouldn't share the same userDataDir between multiple users 🙂 If you don't specify this option, Playwright generates a new ephemeral userDataDir for each script execution iirc.
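A rough sketch of that "one user = one userDataDir" mapping, in case it helps. Everything named here is an assumption for illustration: the PROFILE_ROOT path (e.g. wherever the shared volume is mounted in your workers) and the helpers profileDirFor and launchContextFor are not Crawlee APIs.

```typescript
// Hypothetical base path; in Kubernetes this would be the mounted volume.
const PROFILE_ROOT = '/data/profiles';

// Only the part of launchContext this sketch cares about.
interface UserLaunchContext {
    userDataDir: string;
}

// Map a user ID to a filesystem-safe directory under the root, so an ID
// like "a/../b" cannot escape it. Note that different IDs can collapse to
// the same name after sanitizing (e.g. "a.b" and "a_b"), so use IDs that
// are already unique in the allowed character set.
function profileDirFor(userId: string, root: string = PROFILE_ROOT): string {
    if (userId.length === 0) {
        throw new Error('userId must not be empty');
    }
    const safe = userId.replace(/[^a-zA-Z0-9_-]/g, '_');
    return `${root}/${safe}`;
}

// Build the launchContext fragment for one user.
function launchContextFor(userId: string): UserLaunchContext {
    return { userDataDir: profileDirFor(userId) };
}
```

Since launchContext is set when the crawler is constructed, this would mean one crawler instance per user, e.g. new PlaywrightCrawler({ requestHandler: router, launchContext: launchContextFor('user-42') }). Also worth checking: a Chromium profile directory generally can't be used by two browser processes at the same time, so two workers shouldn't pick up the same user concurrently.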