Apify Discord Mirror

Updated 6 months ago

download xml.gz sitemaps.

At a glance

A community member is trying to parse sitemaps from a website that uses .xml.gz files, and they are looking for a way to decompress these files in the Crawlee library, which only has the "downloadListOfUrls" method. Other community members have shared that they have parsed these files using tools from Node.js, and one community member has offered to share their solution. However, there is no explicitly marked answer in the comments.

NeoNomade | Scraping hellhound

I'm trying to parse the sitemaps from a website that has .xml.gz sitemaps. In Python I could use gunzip to decompress them and then use them.
In Crawlee we only have the "downloadListOfUrls" method. How can I decompress those files before using them?
sitemap: https://www.zoro.com/sitemaps/usa/sitemap-product-10.xml.gz

6 comments

NeoNomade | Scraping hellhound

I parsed them, but using tools from Node.
It would be nice to have those built into Crawlee.
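
For readers looking for a starting point, here is a minimal sketch of the "tools from Node" approach, assuming Node 18+ (for the global fetch) and the built-in zlib module. This is not the solution shared later in the thread; the function names extractSitemapUrls and downloadGzippedSitemap are hypothetical, and the regex-based extraction assumes a flat sitemap with plain <loc> entries.

```typescript
import { gunzipSync } from "node:zlib";

// Hypothetical helper: decompress a gzipped sitemap and collect every <loc> URL.
function extractSitemapUrls(gzipped: Buffer): string[] {
  const xml = gunzipSync(gzipped).toString("utf8");
  const urls: string[] = [];
  // A regex is enough for flat <loc> entries; a real XML parser would be more robust.
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps escape "&" as "&amp;", so unescape it before using the URL.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

// Hypothetical helper: download a .xml.gz sitemap and return the URLs it lists.
// Assumption: the server sends the raw gzip bytes (no Content-Encoding: gzip),
// otherwise fetch would already have decompressed the body.
async function downloadGzippedSitemap(url: string): Promise<string[]> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: ${response.status}`);
  }
  return extractSitemapUrls(Buffer.from(await response.arrayBuffer()));
}
```

The returned array is a plain list of URL strings, so it should be usable wherever the output of "downloadListOfUrls" would be, for example as the start requests of a crawler.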

Andrey Bykov

Replied in a different thread. Also passed the question/suggestion to the team 👍

NeoNomade | Scraping hellhound

I can share my solution if needed

Andrey Bykov

If you don't mind, I could definitely pass it to the team 👍 thanks

NeoNomade | Scraping hellhound

https://pastebin.com/Ym7JaSRd

Andrey Bykov

Thanks 👍