I use `PlaywrightCrawler` with `headless=True`. The package I use is `crawlee[playwright]==0.6.1`.
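For context, my `create_playwright_crawler` helper boils down to something like this (simplified for this post; the real request-handler logic and options are omitted):

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def create_playwright_crawler(crawler_name: str) -> PlaywrightCrawler:
    crawler = PlaywrightCrawler(
        headless=True,            # same setting as described above
        browser_type='chromium',  # headless_shell is the chromium headless binary
    )

    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        # The real scraping logic lives here.
        context.log.info(f'Processing {context.request.url}')

    return crawler
```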
When running the crawler, I noticed that while it is waiting for remaining tasks to finish, it sometimes hits an error like the one in the screenshot. Is this something that can be resolved easily? I ask because I think this error is related to another issue I have.
In my code I have my own batching system in place, but I noticed that memory usage slowly increases with each batch.
After some investigation I saw that `ps -fC headless_shell` listed a lot of `headless_shell` processes marked `<defunct>` (zombie processes), so I assume the memory growth is related to browser cleanup failing after each crawl.
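To quantify this, I count the zombies between batches with a small helper (this sketch assumes `psutil` is installed; it is not part of crawlee):

```python
import psutil


def count_zombie_browsers() -> int:
    """Count defunct headless_shell processes left behind by previous crawls."""
    zombies = 0
    for proc in psutil.process_iter(['name', 'status']):
        try:
            if (proc.info['name'] == 'headless_shell'
                    and proc.info['status'] == psutil.STATUS_ZOMBIE):
                zombies += 1
        except psutil.NoSuchProcess:
            continue  # process exited while we were iterating
    return zombies
```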
Below is the code for my batching system:
```python
from typing import List

from crawlee import Request
from crawlee.storages import KeyValueStore

# prepare_requests_from_mongo, create_playwright_crawler, PageTag and LOGGER
# are helpers defined elsewhere in my project.

# Create key-value stores for batches
scheduled_batches = await prepare_requests_from_mongo(crawler_name)
processed_batches = await KeyValueStore.open(
    name=f'{crawler_name}-processed_batches'
)

# Create crawler
crawler = await create_playwright_crawler(crawler_name)

# Iterate over the batches
async for key_info in scheduled_batches.iterate_keys():
    urls: List[str] = await scheduled_batches.get_value(key_info.key)
    requests = [
        Request.from_url(
            url,
            user_data={
                'page_tags': [PageTag.HOME.value],
                'chosen_page_tag': PageTag.HOME.value,
                'label': PageTag.HOME.value,
            },
        )
        for url in urls
    ]
    LOGGER.info(f'Processing batch {key_info.key}')
    await crawler.run(requests)
    # Mark the batch as processed: setting a value to None deletes the key
    # from the scheduled store, then record the batch in the processed store.
    await scheduled_batches.set_value(key_info.key, None)
    await processed_batches.set_value(key_info.key, urls)
```
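As a workaround until the cleanup issue is fixed, I'm considering isolating each batch in its own child process, so that any `<defunct>` `headless_shell` processes are reaped by the OS when the child exits. A minimal sketch (here `process_batch` is a hypothetical coroutine wrapping the loop body above, not an existing crawlee API):

```python
import asyncio
import multiprocessing


def run_batch_in_child(crawler_name: str, batch_key: str) -> None:
    """Entry point of the child process: run one batch to completion."""
    # process_batch is a hypothetical coroutine wrapping the loop body above:
    # open the stores, build the requests for batch_key and run the crawler.
    asyncio.run(process_batch(crawler_name, batch_key))


def process_batches(crawler_name: str, batch_keys: list[str]) -> None:
    for batch_key in batch_keys:
        # A fresh process per batch: when it exits, the kernel reaps any
        # <defunct> headless_shell children it left behind.
        child = multiprocessing.Process(
            target=run_batch_in_child,
            args=(crawler_name, batch_key),
        )
        child.start()
        child.join()
```

The trade-off is the overhead of re-creating the crawler per batch, but it would at least stop the zombie accumulation from growing without bound.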