Apify and Crawlee Official Forum

wflanagan
Joined August 30, 2024
I am working on a Crawlee-based crawler that performs actions on a "per user" basis. Given this, I want to keep a configuration of the browser on a per-user basis. I already store cookies and push them into the browser before each page is loaded on behalf of the user. But this is all done in the same browser, and I think that is throwing things off.

I run into problems where the cookies "go bad". This could have NOTHING to do with the architecture, but it seems to me that every browser is "different", and I think that might be what is causing the errors.

Does anyone have any thoughts on how to store browser state on a per-user basis? We also configure the proxy to use a sticky proxy/IP for each user where we can.

Thanks!
13 comments
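One pattern that keeps per-user state isolated is to open a fresh incognito context for every page and load/save that user's cookies around each navigation, keyed by a `userId` carried on `request.userData`. Below is a minimal sketch, assuming Crawlee v3 with PlaywrightCrawler; the in-memory `cookieStore` map is a stand-in for whatever persistence you already use, and the per-user sticky proxy is left out.

```typescript
import { PlaywrightCrawler } from 'crawlee';
import type { BrowserContext } from 'playwright';

// Hypothetical per-user cookie store; swap in your real persistence layer.
type CookieList = Awaited<ReturnType<BrowserContext['cookies']>>;
const cookieStore = new Map<string, CookieList>();

const crawler = new PlaywrightCrawler({
    // Each page gets its own incognito context, so one user's cookies are
    // never visible to another user's pages in the same browser process.
    launchContext: { useIncognitoPages: true },
    preNavigationHooks: [
        async ({ page, request }) => {
            const { userId } = request.userData as { userId: string };
            // Load this user's stored cookies into the fresh context.
            const cookies = cookieStore.get(userId) ?? [];
            if (cookies.length > 0) await page.context().addCookies(cookies);
        },
    ],
    requestHandler: async ({ page, request }) => {
        const { userId } = request.userData as { userId: string };
        // ... do the per-user work here ...
        // Persist the refreshed cookies so the next request for this user
        // starts from the latest session state.
        cookieStore.set(userId, await page.context().cookies());
    },
});

await crawler.run([
    { url: 'https://example.com/account', userData: { userId: 'user-123' } },
]);
```

Because `useIncognitoPages` gives every page its own browser context, cookies added for one user shouldn't leak into another user's pages even though they share the same browser process.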
So, I'm trying to understand what I'm doing wrong architecturally. I define a "crawler" using PlaywrightCrawler. Then I add some URLs to the requestQueue and it runs fine. I can then load more into the crawler, call run again, and it works. But in the case where my crawler has a queue that keeps sending it new requests on an ongoing basis, I'm not sure how to architect this.

If you make it so each crawler receives a URL, processes it, and shuts down, it effectively becomes impossible to run things in parallel.

When you try to add things to the queue while the crawler is running (using addRequests), that seems to fail as well.

So, how do I architect this?

This is my example code for reference.

(I attempted to add the example code I'm using, but it was too long. So, here's a gist: )
38 comments
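For the "queue that keeps sending new requests" case, one approach is to keep a single crawler instance running and push work into it while it runs, rather than starting and stopping a crawler per URL. Here is a minimal sketch, assuming a Crawlee v3 release that supports the `keepAlive` option; the interval is just a stand-in for whatever queue or message bus delivers new work in your system.

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Keep the crawler running even when the request queue drains, so
    // requests added later are still picked up by the same instance.
    keepAlive: true,
    requestHandler: async ({ request, page }) => {
        console.log(`Processed ${request.url}: ${await page.title()}`);
    },
});

// run() only resolves once the crawler stops, so start it without awaiting
// and keep the promise around.
const runPromise = crawler.run();

// Stand-in for an external source feeding the running crawler.
let batch = 0;
setInterval(async () => {
    // A unique query string keeps the queue's deduplication from dropping
    // repeated additions of the same page.
    await crawler.addRequests([{ url: `https://crawlee.dev/?batch=${batch++}` }]);
}, 10_000);

await runPromise; // resolves only after the crawler is stopped elsewhere
```

With `keepAlive: true` the crawler does not shut down when the queue is empty, so later `addRequests()` calls are still processed in parallel under the crawler's configured concurrency.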