Apify and Crawlee Official Forum

Updated 2 days ago

Python Session Tracking

At a glance
The community member is trying to ensure that successive requests made with the Crawlee Python API use the same session (with the same cookies) in order to scrape a fussy site that requires strict session continuity. The community members suggest two workarounds using the crawler's pre-navigation hook: 1) checking whether the session has the necessary cookies and setting them if not, and 2) passing cookies as user data and updating the session with them. As an alternative, they suggest using a single session for all requests, which may work if high parallelism is not a concern.
Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session continuity requirements, so I need to ensure that for main page A, all requests to sub-pages linked from there (A-1, A-2, A-3, etc., as well as A-1-1, A-1-2, etc.) are made within the same session as the original request.

Thanks as always.
Marked as solution
Unfortunately, I don't see a good way to do this at the moment, since the session is passed to the context at a pretty deep level - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L985

I think this is because of some boundary cases - for example, when the session gets blocked in the middle of a request chain.

I would consider two workarounds using https://crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook.

The first: check whether the session already has the necessary cookies, and if not, make a request to the page that generates them.

Plain Text
@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # If this session doesn't have the required cookie yet, hit the page that sets it first.
    if context.request.label and 'basic' not in context.session.cookies:
        await context.send_request('https://httpbin.org/cookies/set/basic/100')


The second is to pass the cookies as user_data and update the session that will make the request with them.

Plain Text
@crawler.router.default_handler
async def handler_one(context: HttpCrawlingContext) -> None:
    # Carry the current session's cookies along with the new request via user_data.
    session_cookie = context.session.cookies
    request = Request.from_url(
        url='https://httpbin.org/cookies/set/d/10',
        label='label_two',
        user_data={'session_cookie': session_cookie})
    await context.add_requests([request])

@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # Restore the cookies from user_data into whichever session was assigned to this request.
    if context.request.label:
        context.session.cookies.update(context.request.user_data['session_cookie'])

If you don't care about high parallelism, you can try to use one session for everything.

Plain Text
from crawlee.crawlers import HttpCrawler
from crawlee.sessions import SessionPool

crawler = HttpCrawler(
    session_pool=SessionPool(
        max_pool_size=1,
        create_session_settings={
            'max_usage_count': float('inf'),
        }))
Thanks! These are great solutions. I'm going with option 3 for now (it's working well enough for me), but I'll experiment with 1 and 2 as well.
Glad it's helpful for you
Hey Mantisus,

I was wondering: what is the trade-off between updating the session by passing the cookies in the pre_navigation_hook versus setting them at the request header level, like you said in this issue:

https://github.com/apify/crawlee-python/issues/710

Just to clarify my understanding: with these solutions the session cookies will persist within each session, so we wouldn't need to store them ourselves?

Thanks super much.
Hey @Doigus

The key difference between these approaches: when you pass cookies to a Request, they will overwrite any other cookies. So that approach works best when you want all requests to be made with the same cookies.
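
For illustration, here is a minimal sketch of that request-level approach, meant to run inside a request handler. It assumes cookies can be passed via the headers argument of Request.from_url; the cookie name, value, and label are made up.

Plain Text
# Sketch of the request-level approach: set the Cookie header directly on the
# Request. As noted above, cookies passed this way override whatever cookies
# the session would otherwise send, so every request carrying this header
# uses the same cookie.
request = Request.from_url(
    url='https://httpbin.org/cookies',
    label='with_fixed_cookie',
    headers={'Cookie': 'sessionid=abc123'},
)
await context.add_requests([request])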

With pre_navigation_hook you have more control over what happens.

For example, if your crawler performs authorization on a site and you know that the sessionid cookie is responsible for it, you can cache that cookie and set it inside pre_navigation_hook for all sessions that do not yet have a sessionid.
Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()
    _cache = {}

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())

Or, using use_state (available since version 0.5.0):

Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        _cache = await context.use_state()

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())
In this case, yes, the sessionid cookie will be in every session and it doesn't matter when it was created.
Note that this exact approach will not work for Playwright, where it is a bit more involved:
Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: PlaywrightCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' in _cache:
            await context.page.context.add_cookies([_cache['sessionid']])

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context_cookies = await context.page.context.cookies(context.request.url)

        _cache = await context.use_state()

        target_cookie = None
        for cookie in context_cookies:
            if cookie['name'] == 'sessionid':
                target_cookie = cookie

        if 'sessionid' not in _cache and target_cookie:
            _cache['sessionid'] = target_cookie

        print(await context.page.content())

        # Clear cookies so that our solution still works even if the same browser context is reused.
        await context.page.context.clear_cookies()

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())
I'm using Playwright with Camoufox.

I'll give this a go, thank you 🙂
Glad if this proves useful.
Oh, that's a pretty heavyweight choice. I've been testing Camoufox with PlaywrightCrawler for a while.

It's interesting, but very resource intensive, although I realize that in some cases it is the best approach 🙂
Would you suggest trying Chromium instead?

Am I right to assume that the sessions get automatically set after login?
I favor HTTP crawlers wherever possible. πŸ™‚

Yes, in any browser-based setup, cookies are set automatically in the context when you authorize. If you have a single context that won't be closed, you may not have to worry about cookies at all.
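
As a plain-Playwright illustration (not Crawlee-specific), here is a minimal sketch of cookies persisting across pages that share one browser context; the httpbin URLs are just examples.

Plain Text
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()  # a single shared context

        # Setting a cookie (e.g. by logging in) on one page...
        page_one = await context.new_page()
        await page_one.goto('https://httpbin.org/cookies/set/sessionid/1')

        # ...makes it available to every other page from the same context.
        page_two = await context.new_page()
        await page_two.goto('https://httpbin.org/cookies')
        print(await page_two.content())  # should include sessionid=1

        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())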
If the site uses a lot of anti-scraping technologies, plain Chromium probably won't work.

But if Chromium works for you, then yes, it is better than Camoufox, as it uses significantly fewer resources.
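
If you want to try plain Chromium, a minimal sketch (assuming the standard browser_type and headless options of PlaywrightCrawler):

Plain Text
from crawlee.crawlers import PlaywrightCrawler

# Sketch: plain Chromium instead of Camoufox. If the target site blocks this
# setup, a stealth browser like Camoufox may still be needed.
crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=True,
)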
This is a very promising PR - https://github.com/apify/crawlee-python/pull/829

It could cover many of the cases where plain Chromium doesn't work and Camoufox is excessive.