Apify and Crawlee Official Forum

Updated 3 weeks ago

Adding session-cookies

After following the Crawlee scraping tutorial, I cannot figure out how to add a specific cookie (key-value pair, e.g. sid=1234) to a request.

There is something like a session and a session pool, but how do I reach these objects?

Also, the session pool's max_pool_size defaults to 1000. Should one then iterate through all the sessions in the pool and set the session ID in each session.cookies dict?
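Something like this is what I have in mind (pure pseudo-code; I don't know whether the session pool is even reachable like this):

Plain Text
# Pseudo-code of my idea, not a working example.
from crawlee.sessions import SessionPool

session_pool = SessionPool(max_pool_size=1000)

# Would one really have to touch every session just to set one cookie?
for session in session_pool.sessions:  # assuming the pool exposes its sessions
    session.cookies['sid'] = '1234'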

Imagine the code below from the tutorial: the default handler handles the incoming request and wants to enqueue requests to the category pages. Let's say these category pages require the sid cookie to be set; how can this be achieved?

Any help is very much appreciated, as no examples can be found via Google / ChatGPT / Perplexity.

Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')
    
    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )
17 comments
I'm a little surprised that you need to set cookies for a PlaywrightCrawler.

For HTTP crawlers you could pass cookies inside headers. But for Playwright I can't think of a quick solution.
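For example, something like this with an HTTP-based crawler (a rough sketch; assuming Request.from_url accepts a headers argument in your version):

Plain Text
import asyncio

from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # The cookie travels as an ordinary request header.
    await crawler.run([
        Request.from_url('https://httpbin.org/get', headers={'cookie': 'sid=1234'})
    ])

if __name__ == '__main__':
    asyncio.run(main())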
Dear Mantisus, thanks for your follow-up. How would you then handle a login page? The sid cookie is not shared across all 1000 sessions in the session pool, right? So instead of logging in once, would a separate login (including 2FA resolution in the worst case) be needed for every request?
Understood your use case. I'm going to dig into the crawlee-python code a bit and see if I can come up with some ideas.
@crawleexl

I would use something like this

Plain Text
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

async def main() -> None:

    # Inject the cookie into every page the plugin creates.
    plugin = PlaywrightBrowserPlugin(
        page_options={"extra_http_headers": {"cookie": "auth=to_rule_over_everyone"}},
    )
    pool = BrowserPool(plugins=[plugin])
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_pool=pool
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)


    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
Or if you want to set a cookie after some action

Plain Text
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee import Request
from crawlee.browsers import BrowserPool, PlaywrightBrowserPlugin

async def main() -> None:
    # Shared, mutable headers dict; the plugin keeps a reference to it.
    user_headers = {}
    user_plugin = PlaywrightBrowserPlugin(page_options={"extra_http_headers": user_headers})
    pool = BrowserPool(plugins=[user_plugin])
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,
        browser_pool=pool
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)
        # Mutate the shared dict; pages created from now on get the cookie.
        user_headers["cookie"] = "auth=to_rule_over_everyone"

        await context.add_requests([Request.from_url(
            "https://httpbin.org/get?page=2"
        )])


    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
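The idea is that user_headers is handed to the plugin by reference, so mutating the dict inside the handler changes the extra_http_headers for every page created afterwards; pages that are already open keep the headers they started with.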
Thanks for your quick reply.

I'm on Crawlee v0.3.9.

Just checking: trying out the first code example, it fails with:
'PlaywrightBrowserPlugin' object is not iterable

Or when simply doing:

Plain Text
# Create a browser pool with a Playwright browser plugin
pool = BrowserPool(
    plugins=[
        PlaywrightBrowserPlugin(
            browser_type='chromium',
            browser_options={'headless': False},
            page_options={
                'extra_http_headers': {
                    'Custom-Header': 'Value'
                }
            }
        )
    ]
)


It says:
BrowserContext.new_page() got an unexpected keyword argument 'extra_http_headers'

Then, looking at the source on GitHub:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_plugin.py

It does not specify what the valid page options are; extra_http_headers should be part of the normal Playwright page options.
I tested it on Crawlee v0.3.5.

I see they've changed something.
Until the development team provides public methods for passing parameters to the PlaywrightBrowserController, the only solution I can see is patching the HeaderGenerator.

Example

Plain Text
import asyncio

from crawlee.fingerprint_suite import HeaderGenerator
from crawlee._types import HttpHeaders
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# Monkey-patch the header generator so every generated set of common
# headers also carries our cookie.
def get_common_headers(self):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'cookie': "auth=to_rule_over_everyone"
    }
    return HttpHeaders(headers)

HeaderGenerator.get_common_headers = get_common_headers

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        content = await context.page.content()
        print(content)


    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
They went from creating a one-page context to a full context.

But they don't provide any methods to pass custom parameters to it yet.

https://github.com/apify/crawlee-python/blob/master/src/crawlee/browsers/_playwright_browser_controller.py#L155
Apparently these updates came with version 0.3.9; if you are using an earlier version, my previous examples should work (at least on version 0.3.5).

You can see the allowed parameters for a single-page context in the Playwright documentation - https://playwright.dev/python/docs/api/class-browser#browser-new-page.
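To illustrate the difference in plain Playwright (illustration only, not crawlee code):

Plain Text
from playwright.async_api import async_playwright

async def demo() -> None:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()

        # One-page context (old behaviour): new_page() accepts context
        # options such as extra_http_headers directly.
        page = await browser.new_page(extra_http_headers={'cookie': 'sid=1234'})
        await page.close()

        # Full context (new behaviour): the options move to new_context(),
        # and context.new_page() takes no such arguments.
        context = await browser.new_context(extra_http_headers={'cookie': 'sid=1234'})
        page = await context.new_page()

        await browser.close()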
A cleaner solution for v0.3.9

Plain Text
import asyncio

from crawlee.browsers import BrowserPool
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

class CustomBrowserPool(BrowserPool):
    async def new_page(self, *args, **kwargs):
        # Grab each page right after creation and inject the cookie header.
        page = await super().new_page(*args, **kwargs)
        await page.page.set_extra_http_headers({'cookie': "auth=to_rule_over_everyone"})
        return page

async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=CustomBrowserPool(),
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        
        content = await context.page.content()
        print(content)


    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())
That works like a charm for now with the override.

Just for future reference, from v0.4.0 onwards.

Let's say one sets the session cookie like this:

Plain Text
pool = BrowserPool(
    plugins=[
        PlaywrightBrowserPlugin(
            browser_type='chromium',
            browser_options={'headless': False},
            page_options={
                'extra_http_headers': {
                    'Custom-Header': 'Value'
                }
            }
        )
    ]
)


Then on a certain request it needs to re-authenticate. Is there a way, within a request_handler, to retrieve the BrowserPool object and override the Custom-Header?

Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # pseudo-code
    pool = BrowserPool()
    pool.plugins[0].page_options['extra_http_headers'] = {'Custom-Header': 'New-Value'}
I don't know what the developers' plans are for the next releases. I don't think we'll get access to context management from the request_handler with the approaches that are being used now.

For rewriting headers now I would use this approach

Plain Text
import asyncio

from crawlee.browsers import BrowserPool
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee import Request

# Shared, mutable headers dict read by the custom pool below.
custom_headers = {}

class CustomBrowserPool(BrowserPool):
    async def new_page(self, *args, **kwargs):
        # Apply whatever is currently in custom_headers to each new page.
        page = await super().new_page(*args, **kwargs)
        await page.page.set_extra_http_headers(custom_headers)
        return page

async def main() -> None:
    crawler = PlaywrightCrawler(
        browser_pool=CustomBrowserPool(),
        max_requests_per_crawl=10
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        content = await context.page.content()
        # Update the shared headers; pages created afterwards get the cookie.
        custom_headers['cookie'] = "auth=to_rule_over_everyone"
        print(content)
        await context.add_requests([Request.from_url(
            "https://httpbin.org/get?page=2"
        )])

    await crawler.run(['https://httpbin.org/get'])


if __name__ == '__main__':
    asyncio.run(main())


To contact the development team, the best way is to use - https://github.com/apify/crawlee-python/discussions

Those who reply here are mostly developers like you and me who are just using the library.
Thanks for your help; it gives me a lot of clues.
@crawleexl

Pay attention to

https://github.com/apify/crawlee-python/blob/master/src/crawlee/playwright_crawler/_playwright_pre_navigation_context.py - which will be in the next release

and https://crawlee.dev/python/docs/examples/playwright-crawler (obviously published by mistake, as this functionality is not yet available in v0.3.9)

When this code is released, it should make it possible to do something like this:

Plain Text
@crawler.pre_navigation_hook
async def log_navigation_url(context: PlaywrightPreNavigationContext) -> None:
    await context.page.set_extra_http_headers(custom_headers)
    context.log.info(f'Navigating to {context.request.url} ...')
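Once that's released, a complete example might look like this (a sketch; assuming the hook is registered with @crawler.pre_navigation_hook as in the linked example):

Plain Text
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

custom_headers = {'cookie': 'auth=to_rule_over_everyone'}

async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.pre_navigation_hook
    async def set_headers(context) -> None:
        # Runs before navigation, so the cookie header is in place
        # before the page is requested.
        await context.page.set_extra_http_headers(custom_headers)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://httpbin.org/get'])

if __name__ == '__main__':
    asyncio.run(main())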