Apify Discord Mirror

Updated 2 weeks ago

Error on cleanup PlaywrightCrawler

At a glance

The community member is using PlaywrightCrawler with headless=True and the crawlee[playwright]==0.6.1 package. The crawler sometimes raises an error while waiting for remaining tasks to finish. This is related to another problem they are facing: memory usage slowly increases with each batch, and they see many headless_shell processes marked <defunct> (zombie processes).

The community member has shared the code for their batching system and believes the problem is related to the cleanup failing after each crawl. They found a PR (pull request) that they believe will fix the initial issue and hope it will also resolve the zombie processes.

The community member found a PR that fixes the initial issue, and it was released in a beta version of crawlee. After rerunning with the beta, the cleanup error was resolved, but the zombie processes persisted on each batch and are to be reported as an Issue in the repository.
I use PlaywrightCrawler with headless=True
The package that I use is: crawlee[playwright]==0.6.1

When running the crawler, I noticed that while waiting for remaining tasks to finish it sometimes fails with an error like the one in the screenshot. Is this something that can be resolved easily?

I think this error is also related to another issue I have.
My code uses its own batching system, but I noticed that memory usage slowly increases with each batch.
After some investigation I saw that ps -fC headless_shell listed a lot of headless_shell processes marked <defunct> (zombie processes), so I assume this is related to the cleanup failing on each crawl.
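
For reference, one way to watch the defunct processes accumulate between batches from Python is a small psutil check. This is only a rough sketch; psutil and the helper name are illustrative and not part of the original code.
Plain Text
    import psutil

    def count_zombie_shells() -> int:
        # Count defunct headless_shell processes, mirroring `ps -fC headless_shell`.
        zombies = 0
        for proc in psutil.process_iter(['name', 'status']):
            if (proc.info['name'] == 'headless_shell'
                    and proc.info['status'] == psutil.STATUS_ZOMBIE):
                zombies += 1
        return zombies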

Below you can see my code for the batching system:
Plain Text
    # Open the key-value stores for scheduled and processed batches
    scheduled_batches = await prepare_requests_from_mongo(crawler_name)
    processed_batches = await KeyValueStore.open(
        name=f'{crawler_name}-processed_batches'
    )

    # Create crawler
    crawler = await create_playwright_crawler(crawler_name)

    # Iterate over the batches
    async for key_info in scheduled_batches.iterate_keys():
        urls: List[str] = await scheduled_batches.get_value(key_info.key)
        requests = [
            Request.from_url(
                url,
                user_data={
                    'page_tags': [PageTag.HOME.value],
                    'chosen_page_tag': PageTag.HOME.value,
                    'label': PageTag.HOME.value,
                },
            )
            for url in urls
        ]
        LOGGER.info(f'Processing batch {key_info.key}')
        await crawler.run(requests)
        await scheduled_batches.set_value(key_info.key, None)
        await processed_batches.set_value(key_info.key, urls)
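
The create_playwright_crawler helper is not shown in the snippet; a minimal version matching the setup described above might look roughly like this. This is a sketch only, assuming the crawlee 0.6.x API, and the handler body is purely illustrative.
Plain Text
    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

    async def create_playwright_crawler(crawler_name: str) -> PlaywrightCrawler:
        # Headless browser, as described in the post.
        crawler = PlaywrightCrawler(headless=True)

        @crawler.router.default_handler
        async def default_handler(context: PlaywrightCrawlingContext) -> None:
            context.log.info(f'{crawler_name}: processing {context.request.url}')
            # ... extract data, enqueue links, push results, etc.

        return crawler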
Marked as solution
UPDATE:
Noticed this PR: https://github.com/apify/crawlee-python/pull/1046
This will fix my initial issue. Hopefully this will also fix the zombie processes on each batch πŸ™
8 comments
@ROYOSTI just advanced to level 1! Thanks for your contributions! πŸŽ‰
πŸ€¦β€β™‚οΈforgot to upload the screenshot
Attachment
Screenshot_2025-03-05_at_12.46.17.png
UPDATE:
Noticed this PR: https://github.com/apify/crawlee-python/pull/1046
This will fix my initial issue. Hopefully this will also fix the zombie processes on each batch πŸ™
Yes, unfortunately this bug did not show up in tests during development. And I only discovered it while testing the release on one of my projects 😒

I think this should help with the zombie processes, as the error during file closing prevents the browser shutdown from completing correctly. But if it persists after the PR is released, feel free to create an Issue in the repository.
@ROYOSTI This should already be available in the beta release crawlee==0.6.3b3.

If you decide to try this, please let me know if you observe any problems
@Mantisus, I did a small rerun and used crawlee==0.6.3b4. The issue with removing the tmp folder for PlaywrightCrawler is solved.
But each batch still leaves a lot of zombie processes behind.
Could I fix something in my code to prevent this, or is this something I should report as an Issue in the repository?
Got it. Yes, please report it as an Issue in the repository.
You can try using use_incognito_pages=True; it may improve the situation with zombie processes (but it will reduce the speed of your crawler, as there will be no browser cache sharing between different requests).

But I am not sure, because if it is not related to the crash caused by the file closing error, we will need to study the situation in detail.
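
For reference, that option is passed when constructing the crawler; roughly like this (a sketch assuming the crawlee 0.6.x PlaywrightCrawler signature):
Plain Text
    from crawlee.crawlers import PlaywrightCrawler

    # Each request gets its own incognito browser context, torn down after the
    # request -- no shared browser cache, so crawling is slower.
    crawler = PlaywrightCrawler(
        headless=True,
        use_incognito_pages=True,
    )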