Apify Discord Mirror

Updated 2 years ago

How to manually pass datasets, sessions, cookies, and proxies between Requests?

At a glance

The community member is trying to manually manage datasets and sessions in a web crawler: specifically, they want a Request to use a session they have created and to pass a dataset to the request handler. The comments suggest using named datasets, session pool options, and the BasicCrawler class to achieve this. Some community members recommend separating dataset management from request handling and reusing authentication cookies to maintain the same session and proxy. The community member is concerned about being blocked by the server when using different proxies with the same authentication cookies. The documentation is praised for explaining individual classes well, but criticized for lacking examples of how to use them together in a crawler.

It might be obvious, but I have not been able to figure this out, neither in the documentation nor in the forums.
I want to manually manage my datasets and sessions, but I want a Request to use a session I have created and to pass the dataset on to the request's handler.
I know I could pass it via userData, or I could create it in a different file and simply import it, but these seem like the wrong approaches.
7 comments
For datasets, you could open e.g. several named datasets and then save to one or another depending on some condition. For the session pool, you could provide e.g. a createSessionFunction in sessionPoolOptions: https://crawlee.dev/api/core/interface/SessionPoolOptions#createSessionFunction (see the sketch below)
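
A minimal sketch of both ideas, assuming Crawlee v3 and CheerioCrawler; the dataset names and the routing condition are made up for illustration:

```ts
import { CheerioCrawler, Dataset, Session } from 'crawlee';

// Named datasets: open as many as you need and pick one per result.
const products = await Dataset.open('products');
const errors = await Dataset.open('errors');

const crawler = new CheerioCrawler({
    sessionPoolOptions: {
        // Called whenever the pool needs a fresh session, so you can
        // pre-configure it (e.g. attach cookies) before it is ever used.
        createSessionFunction: async (sessionPool) => {
            const session = new Session({ sessionPool });
            // customize the session here
            return session;
        },
    },
    async requestHandler({ request, $ }) {
        const title = $('title').text().trim();
        // Save to one dataset or another depending on some condition.
        if (title) {
            await products.pushData({ url: request.url, title });
        } else {
            await errors.pushData({ url: request.url, reason: 'no title' });
        }
    },
});

await crawler.run(['https://example.com']);
```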

You could also use BasicCrawler (https://crawlee.dev/api/basic-crawler) and explicitly make the request, mark the session good/bad, etc., for example:
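
A hedged sketch of that approach; the status-code thresholds are only an assumption about what counts as blocked:

```ts
import { BasicCrawler } from 'crawlee';

const crawler = new BasicCrawler({
    useSessionPool: true,           // give each request a session from the pool
    persistCookiesPerSession: true, // keep cookies bound to that session
    async requestHandler({ session, sendRequest }) {
        // Explicitly perform the HTTP call yourself...
        const { statusCode } = await sendRequest();

        // ...and explicitly tell the pool how the session performed.
        if (statusCode === 200) {
            session?.markGood();
        } else if (statusCode === 403 || statusCode === 429) {
            session?.retire(); // likely blocked: throw the session away
        } else {
            session?.markBad();
        }
    },
});

await crawler.run(['https://example.com']);
```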

But I guess the main question is: what exactly are you trying to achieve?
Dataset management should be separate logic, imho, since it's not related to how you make requests. Cookies travel with every raw request in its headers, so if you want to keep making requests as a logged-in user, find the auth cookies and reuse them (see the sketch below).
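
For instance, one way to do that in Crawlee is to inject the auth cookie into the outgoing headers with a pre-navigation hook; the cookie value here is a placeholder you would copy from a logged-in browser session:

```ts
import { CheerioCrawler } from 'crawlee';

// Placeholder: grab the real value from your browser's DevTools after logging in.
const AUTH_COOKIE = 'sessionid=abc123';

const crawler = new CheerioCrawler({
    preNavigationHooks: [
        // For HTTP crawlers the hook also receives the got-scraping options,
        // so the cookie is attached to every raw request.
        async (_crawlingContext, gotOptions) => {
            gotOptions.headers = {
                ...gotOptions.headers,
                Cookie: AUTH_COOKIE,
            };
        },
    ],
    async requestHandler({ request, $ }) {
        // Requests now go out as the logged-in user.
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com/account']);
```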
Honestly, I just want to make sure I am using the authentication cookies with the same proxy in the same session, since I don't know how exactly Crawlee handles the session when a Request is called.
You are absolutely right, the dataset logic makes a lot more sense kept separate. I am worried about hitting the server from different proxies that carry the same auth cookies.
"Having our cookies and other identifiers used only with a specific IP will reduce the chance of being blocked." https://crawlee.dev/docs/guides/session-management i was trying to do this,
The documentation is very good at explaining how to use every class separately, but it does not provide examples of how to use them together in a crawler.
In crawlers, the session is part of the crawlingContext. You could also provide sessionPoolOptions, where you can specify the session options, a createSessionFunction, etc. You could even pin the pool to a single session to make sure you're going out through only one IP, or specify that a session should be retired after, say, 500 requests (see the sketch below). Do you need to use only one account, and is that why you want to use only one session all the time?
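
A minimal sketch of that configuration, again assuming Crawlee v3 and CheerioCrawler; the pool size and limits are only examples:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        // A single session means one identity (and, with a session-aware
        // proxy, one IP) for the whole run.
        maxPoolSize: 1,
        sessionOptions: {
            maxUsageCount: 500, // retire the session after 500 requests
            maxErrorScore: 3,   // ...or after a few failures
        },
    },
    async requestHandler({ request, session }) {
        // The same session (and its cookies) is reused on every request.
        console.log(`${request.url} handled by session ${session?.id}`);
    },
});

await crawler.run(['https://example.com']);
```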