Simple sitemap parser for Python
Sitemap Parser
This is a Python library for parsing XML sitemaps and sitemap index files, either fetched from a URL or supplied as raw XML. It supports both standard XML sitemaps (which contain URLs) and sitemap index files (which contain links to other sitemaps). It is useful for extracting data such as URLs and modification dates from website sitemaps.
Acknowledgments
This is a fork of Dave O'Connor's site-map-parser. I couldn't have done this without his original work.
Installation
uv add py-sitemap-parser
Usage
The library provides a SiteMapParser class that can be used to parse sitemaps and sitemap indexes. You can pass a URL or raw XML data to the parser to extract the URLs or links to other sitemaps.
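For orientation, the two document types differ in their root element: a sitemap index uses sitemapindex, while a regular sitemap uses urlset. The sketch below shows that distinction using only the standard library. It is not how SiteMapParser is implemented, and the helper names classify_sitemap and extract_locs are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_sitemap(xml_text: str) -> str:
    """Return "sitemapindex", "urlset", or "unknown" based on the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in {"sitemapindex", "urlset"}:
        return local_name
    return "unknown"


def extract_locs(xml_text: str) -> list[str]:
    """Collect the text of every <loc> element in the sitemap namespace."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
        if loc.text
    ]


# Hypothetical sitemap index used only for this demonstration.
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

print(classify_sitemap(index_xml))  # sitemapindex
print(extract_locs(index_xml))      # ['https://example.com/sitemap-pages.xml']
```

In practice, has_sitemaps() and has_urls() on SiteMapParser answer the same question for you, as the examples that follow show.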
Parsing a Sitemap from a URL
# Needed on Python versions that evaluate module-level annotations eagerly,
# because SitemapIndex and UrlSet are imported only for type checking.
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

from sitemap_parser import SiteMapParser

if TYPE_CHECKING:
    from sitemap_parser import SitemapIndex
    from sitemap_parser import UrlSet

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger: logging.Logger = logging.getLogger(__name__)

# url = "https://ttvdrops.lovinator.space/sitemap.xml"  # Sitemap index
url = "https://ttvdrops.lovinator.space/sitemap-static.xml"  # Sitemap with URLs
parser = SiteMapParser(source=url)

if parser.has_sitemaps():
    # The source is a sitemap index: iterate over the linked sitemaps.
    sitemaps: SitemapIndex = parser.get_sitemaps()
    for sitemap in sitemaps:
        logger.info(sitemap)
elif parser.has_urls():
    # The source is a regular sitemap: iterate over its URL entries.
    urls: UrlSet = parser.get_urls()
    for page_url in urls:
        logger.info(page_url)
Parsing a Raw XML String
from sitemap_parser import SiteMapParser, UrlSet

xml_data = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>https://example.com/about</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>
"""

# is_data_string=True tells the parser that source is XML text, not a URL.
parser = SiteMapParser(source=xml_data, is_data_string=True)
urls: UrlSet = parser.get_urls()
for url in urls:
    print(url)

# Output:
# - https://example.com/
# - https://example.com/about
Exporting Sitemap Data to JSON
You can export the parsed sitemap data to a JSON file using the JSONExporter class.
import json
import logging

from sitemap_parser import JSONExporter
from sitemap_parser import SiteMapParser

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger: logging.Logger = logging.getLogger(__name__)

# Sitemap with URLs to other sitemaps
parser = SiteMapParser(source="https://ttvdrops.lovinator.space/sitemap.xml")
if parser.has_sitemaps():
    sitemaps_json: str = JSONExporter(data=parser).export_sitemaps()
    # Decode the JSON string so the log shows Python objects.
    sitemaps_data = json.loads(sitemaps_json)
    logger.info("Exported sitemaps: %s", sitemaps_data)

logger.info("----" * 10)

# Sitemap with "real" URLs
parser2 = SiteMapParser(
    source="https://ttvdrops.lovinator.space/sitemap-static.xml",
)
if parser2.has_urls():
    urls_json: str = JSONExporter(data=parser2).export_urls()
    urls_data = json.loads(urls_json)
    logger.info("Exported URLs: %s", urls_data)
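The example above logs the exported data rather than saving it. Since the export methods return a JSON string, writing it to disk is a standard-library task. The following is a minimal sketch under that assumption; save_json_text is a hypothetical helper, and sample_json is a made-up stand-in whose field names may not match the exporter's real output.

```python
import json
import tempfile
from pathlib import Path


def save_json_text(json_text: str, path: Path) -> None:
    """Validate a JSON string and write it to path, pretty-printed as UTF-8."""
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    data = json.loads(json_text)
    path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )


# Hypothetical stand-in for the string returned by export_urls().
sample_json = '[{"loc": "https://example.com/"}]'

# A temporary directory keeps the demonstration free of leftover files.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "urls.json"
    save_json_text(sample_json, out_path)
    print(out_path.read_text(encoding="utf-8"))
```

With the real library, you would pass the result of export_urls() or export_sitemaps() in place of sample_json and choose a permanent output path.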
Converting Sitemap XML to a Python dict
If you'd like to work with the parsed sitemap as a plain Python dictionary, you can use SiteMapParser.to_dict().
from sitemap_parser import SiteMapParser

xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
    </url>
</urlset>
"""

parser = SiteMapParser(source=xml, is_data_string=True)
parsed = parser.to_dict()

# xmltodict represents repeated elements as lists
print(parsed["urlset"]["url"][0]["loc"])
You can also enable namespace processing for expanded namespace keys:
parsed = parser.to_dict(process_namespaces=True)
Disabling Logging
To silence the library's log output, raise the level of the sitemap_parser logger to logging.CRITICAL. Messages below the CRITICAL level are then suppressed; only CRITICAL messages, if any are emitted, still get through.
Here's an example of how to do this:
import logging
# Set the logging level to CRITICAL to disable logging
logging.getLogger("sitemap_parser").setLevel(logging.CRITICAL)
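To confirm the setting took effect, you can query the logger with the standard logging API. This sketch assumes only that the library logs under the name "sitemap_parser", as in the snippet above; the child logger name is hypothetical and serves to show level inheritance.

```python
import logging

# "sitemap_parser" is the logger name used in the snippet above.
library_logger = logging.getLogger("sitemap_parser")
library_logger.setLevel(logging.CRITICAL)

# INFO records are now dropped; CRITICAL records still pass.
print(library_logger.isEnabledFor(logging.INFO))      # False
print(library_logger.isEnabledFor(logging.CRITICAL))  # True

# Child loggers without their own level inherit it from the parent.
# The child name here is hypothetical.
child = logging.getLogger("sitemap_parser.some_module")
print(child.getEffectiveLevel() == logging.CRITICAL)  # True

# logging.NOTSET hands control back to the parent (root) logger's level.
library_logger.setLevel(logging.NOTSET)
```

Setting the level on the named logger rather than on the root logger leaves your own application's log output unaffected.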