Apify and Crawlee Official Forum

Updated 2 days ago

Python Session Tracking

At a glance
The community member is trying to ensure that successive requests made with the Crawlee Python API use the same session (with the same cookies) in order to scrape a fussy site that requires strict session continuity. The community members suggest two workarounds using the crawler's pre-navigation hook: 1) checking whether the session has the necessary cookies and setting them if not, and 2) passing cookies as user data and updating the session with them. As an alternative, they suggest using a single session for all requests, which may work if high parallelism is not a concern.
Is there a way to ensure that successive requests are made using the same session (with the same cookies, etc.) in the Python API? I am scraping a very fussy site that seems to have strict session continuity requirements, so I need to ensure that for main page A, all requests to sub-pages linked from there (A-1, A-2, A-3, etc., as well as A-1-1, A-1-2, etc.) are made within the same session as the original request.

Thanks as always.
Marked as solution
Unfortunately, I don't see a good way to do this at the moment, since the session is passed to the context at a pretty deep level - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L985

I think this is because of some boundary cases - for example, when the session gets blocked in the middle of a request chain.

I would consider two workarounds using https://crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook.

The first: check whether the session already has the necessary cookies, and if not, make a request to the page that generates them.

Plain Text
@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # If this session doesn't have the required cookie yet, hit the page that sets it first.
    if context.request.label and 'basic' not in context.session.cookies:
        await context.send_request('https://httpbin.org/cookies/set/basic/100')


The second is to pass the cookies as user_data and update the session that will make the request with them.

Plain Text
@crawler.router.default_handler
async def handler_one(context: HttpCrawlingContext) -> None:
    # Carry the current session's cookies along with the new request via user_data.
    session_cookie = context.session.cookies
    request = Request.from_url(
        url='https://httpbin.org/cookies/set/d/10',
        label='label_two',
        user_data={'session_cookie': session_cookie})
    await context.add_requests([request])

@crawler.pre_navigation_hook
async def hook1(context: HttpCrawlingContext) -> None:
    # Restore the cookies from user_data into whichever session was assigned to this request.
    if context.request.label:
        context.session.cookies.update(context.request.user_data['session_cookie'])

If you don't care about high parallelism, you can try to use one session for everything.

Plain Text
from crawlee.crawlers import HttpCrawler
from crawlee.sessions import SessionPool

crawler = HttpCrawler(
    session_pool=SessionPool(
        max_pool_size=1,
        create_session_settings={
            'max_usage_count': float('inf'),
        }))
Thanks! These are great solutions. I'm going with option 3 for now (it's working well enough for me), but I'll experiment with 1 and 2 as well.
Glad it's helpful for you
Hey Mantisus,

I was wondering: what is the trade-off between updating the session by passing the cookies in the pre_navigation_hook versus setting them at the request header level, like you said in this issue:

https://github.com/apify/crawlee-python/issues/710

Just to clarify my understanding: with these solutions the session cookies will persist within each session, so we wouldn't need to store them ourselves?

Thanks super much.
Hey @Doigus

The key difference between these approaches: when you pass cookies to a Request, they will overwrite any other cookies. So that approach works best when you want all requests to be made with the same cookies.
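
For illustration, here is a minimal sketch of that request-level approach, meant to run inside a request handler. It assumes cookies can be passed via the headers argument of Request.from_url; the cookie name, value, and label are made up.

Plain Text
# Sketch of the request-level approach: set the Cookie header directly on the
# Request. As noted above, cookies passed this way override whatever cookies
# the session would otherwise send, so every request carrying this header
# uses the same cookie.
request = Request.from_url(
    url='https://httpbin.org/cookies',
    label='with_fixed_cookie',
    headers={'Cookie': 'sessionid=abc123'},
)
await context.add_requests([request])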

With pre_navigation_hook you have more control over what happens.

For example, if your crawler performs authorization on a site and you know that the sessionid cookie is responsible for it, you can cache that cookie and set it inside pre_navigation_hook for all sessions that do not yet have a sessionid.
Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()
    _cache = {}

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())

Or, using use_state (available since version 0.5.0):

Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: HttpCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' not in context.session.cookies and 'sessionid' in _cache:
            context.session.cookies['sessionid'] = _cache['sessionid']

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        _cache = await context.use_state()

        if 'sessionid' not in _cache and 'sessionid' in context.session.cookies:
            _cache['sessionid'] = context.session.cookies['sessionid']

        print(context.http_response.read())

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())
In this case, yes, the sessionid cookie will be in every session and it doesn't matter when it was created.
Note that this exact approach will not work for Playwright, where it is a bit more involved:
Plain Text
import asyncio

# Imports assuming crawlee >= 0.5
from crawlee import Request
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.pre_navigation_hook
    async def hook(context: PlaywrightCrawlingContext) -> None:
        _cache = await context.use_state()
        if 'sessionid' in _cache:
            await context.page.context.add_cookies([_cache['sessionid']])

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context_cookies = await context.page.context.cookies(context.request.url)

        _cache = await context.use_state()

        target_cookie = None
        for cookie in context_cookies:
            if cookie['name'] == 'sessionid':
                target_cookie = cookie

        if 'sessionid' not in _cache and target_cookie:
            _cache['sessionid'] = target_cookie

        print(await context.page.content())

        # Clear cookies so that our solution still works even if the same browser context is reused.
        await context.page.context.clear_cookies()

        await context.add_requests([Request.from_url('https://httpbin.org/get')])

    await crawler.run([Request.from_url('https://httpbin.org/cookies/set/sessionid/1')])

if __name__ == '__main__':
    asyncio.run(main())
I'm using Playwright with Camoufox.

I'll give this a go, thank you 🙂
Glad if this proves useful.
Oh, that's a pretty heavyweight choice. I've been testing Camoufox with PlaywrightCrawler for a while.

It's interesting, but very resource intensive, although I realize that in some cases it is the best approach 🙂
Would you suggest trying Chromium instead?

Am I right to assume that the sessions get automatically set after login?
I favor HTTP crawlers wherever possible. πŸ™‚

Yes, in any browser-based setup, cookies are set automatically in the context when you authorize. If you have a single context that won't be closed, you may not have to worry about cookies at all.
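
As a plain-Playwright illustration (not Crawlee-specific), here is a minimal sketch of cookies persisting across pages that share one browser context; the httpbin URLs are just examples.

Plain Text
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()  # a single shared context

        # Setting a cookie (e.g. by logging in) on one page...
        page_one = await context.new_page()
        await page_one.goto('https://httpbin.org/cookies/set/sessionid/1')

        # ...makes it available to every other page from the same context.
        page_two = await context.new_page()
        await page_two.goto('https://httpbin.org/cookies')
        print(await page_two.content())  # should include sessionid=1

        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())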
If the site uses a lot of anti-scraping technologies, plain Chromium probably won't work.

But if Chromium works for you, then yes, it is better than Camoufox, as it uses significantly fewer resources.
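
If you want to try plain Chromium, a minimal sketch (assuming the standard browser_type and headless options of PlaywrightCrawler):

Plain Text
from crawlee.crawlers import PlaywrightCrawler

# Sketch: plain Chromium instead of Camoufox. If the target site blocks this
# setup, a stealth browser like Camoufox may still be needed.
crawler = PlaywrightCrawler(
    browser_type='chromium',
    headless=True,
)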
This is a very promising PR - https://github.com/apify/crawlee-python/pull/829

It could cover many of the cases where plain Chromium doesn't work and Camoufox is excessive.