Apify and Crawlee Official Forum

crawleexl (joined October 30, 2024)
When there is an error in the request handler, it does not show up as an error: the handler fails unnoticed and Crawlee retries 9 more times.

To illustrate, take the following selector with a syntax error:
element = await context.page.query_selector('div[id="main"')

It should have a closing bracket after "main":
element = await context.page.query_selector('div[id="main"]')

However, instead of failing loudly, Crawlee proceeds, keeps retrying the request, and shows no message about where it failed.
Is there any way to troubleshoot these kinds of issues?
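One way to surface such failures (a sketch of my own, not an official Crawlee answer): wrap the handler body in a try/except that logs the exception before re-raising, and lower max_request_retries while debugging so a broken handler fails fast instead of retrying nine times. The crawler option name and the logger call are assumptions based on the current Crawlee Python API.

Plain Text
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# Retry at most once while debugging so a broken handler surfaces quickly.
crawler = PlaywrightCrawler(max_request_retries=1)


@crawler.router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    try:
        # The malformed selector from above: this call raises, and the
        # except block below makes the failure visible in the logs.
        element = await context.page.query_selector('div[id="main"')
        context.log.info(f'Found element: {element}')
    except Exception:
        # Log the full traceback so the failure is no longer silent,
        # then re-raise so Crawlee still records the request as failed.
        context.log.exception(f'Handler failed on {context.request.url}')
        raise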
After following the Crawlee scraping tutorial, I cannot figure out how to add a specific cookie (key-value pair) to the request, e.g. sid=1234.

There is something like a session and a session pool, but how do I reach these objects?

Also, max_pool_size of the session pool defaults to 1000; should one iterate through the sessions in the pool and set the sid cookie in each session.cookies (dict)?

Imagine the snippet below from the tutorial: the default handler handles the incoming request and wants to enqueue requests to the category pages. Let's say these category pages require the sid cookie to be set; how can this be achieved?

Any help is very much appreciated, as no examples can be found via Google / ChatGPT / Perplexity.

Plain Text
from crawlee.playwright_crawler import PlaywrightCrawlingContext
from crawlee.router import Router

# Router instance the tutorial registers handlers on.
router = Router[PlaywrightCrawlingContext]()


@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # This is a fallback route which will handle the start URL.
    context.log.info(f'default_handler is processing {context.request.url}')
    
    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )
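Not a definitive answer, but one approach worth sketching: set the cookie through Playwright itself inside the handler, before enqueueing, via the page's browser context. The domain value below is a placeholder, and this assumes the enqueued category pages are served from the same browser context; whether Crawlee reuses that context across requests is something to verify for your setup.

Plain Text
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    context.log.info(f'default_handler is processing {context.request.url}')

    # Add the sid cookie via Playwright: page.context is the BrowserContext.
    # 'example.com' is a placeholder; use the domain of the category pages.
    await context.page.context.add_cookies([
        {'name': 'sid', 'value': '1234', 'domain': 'example.com', 'path': '/'},
    ])

    await context.page.wait_for_selector('.collection-block-item')

    await context.enqueue_links(
        selector='.collection-block-item',
        label='CATEGORY',
    )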
There is a flaw in the tutorial on basic POST functionality:
https://crawlee.dev/python/docs/examples/fill-and-submit-web-form

It makes an actual POST request, but the data does not reach the server; I tried various endpoints.

Two questions:
1) What is broken here, and how can it be fixed?
2) My biggest concern with Crawlee is that I have no clue how to troubleshoot these kinds of bugs.

Where can one check what goes wrong? For example, how can I verify under the hood whether curl (?) or whatever library makes the actual request is populating the payload correctly?
The framework has many benefits, but with all its abstractions it is very hard to troubleshoot. That is probably down to my own inexperience with the framework, but any guidance on how to troubleshoot would be great; simple things not working, with no way to investigate them, makes using Crawlee quite cumbersome.


Plain Text
import asyncio
import json

from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        response = context.http_response.read().decode('utf-8')
        context.log.info(f'Response: {response}')  # To see the response in the logs.

    # Prepare a POST request to the form endpoint.
    request = Request.from_url(
        url='https://httpbin.org/post',
        method='POST',
        payload=json.dumps(
            {
                'custname': 'John Doe',
            }
        ).encode(),
    )

    # Run the crawler with the initial list of requests.
    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())
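Not an authoritative answer, but two things I would try first, both assumptions on my part rather than confirmed fixes: turn on debug logging for the underlying HTTP stack (Crawlee's default client is httpx-based, and 'httpx'/'httpcore' are the logger names those libraries use) to see the low-level request events, and send an explicit Content-Type header in case the endpoint ignores a body whose type is not declared (this also assumes Request.from_url accepts a headers mapping in your version).

Plain Text
import logging

# Surface the low-level request/response events of the HTTP client.
logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.DEBUG)
logging.getLogger('httpcore').setLevel(logging.DEBUG)

# Declare the payload type explicitly; whether this is the actual fix
# is an assumption, but it is cheap to rule out.
request = Request.from_url(
    url='https://httpbin.org/post',
    method='POST',
    headers={'content-type': 'application/json'},
    payload=json.dumps({'custname': 'John Doe'}).encode(),
)

Since httpbin.org/post echoes the request back, the response the handler already logs also shows whether 'custname' arrived in the json, data, or form field.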