Simple sitemap parser for Python

Sitemap Parser

This Python library parses XML sitemaps and sitemap index files, either fetched from a URL or supplied as a raw XML string. It handles both standard XML sitemaps (which list page URLs) and sitemap index files (which link to other sitemaps), and is useful for extracting data such as URLs and modification dates from website sitemaps.
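
The two document types differ in their root element: a standard sitemap uses urlset, while a sitemap index uses sitemapindex. As a minimal sketch of that distinction, the snippet below relies only on the standard library rather than on this package. The function name classify_sitemap is hypothetical and is not part of the library's API.

```python
import xml.etree.ElementTree as ET


def classify_sitemap(xml_text: str) -> str:
    """Return "urlset", "sitemapindex", or "unknown" based on the root element."""
    root = ET.fromstring(xml_text.strip())
    # ElementTree prefixes tags with "{namespace}"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in {"urlset", "sitemapindex"}:
        return local_name
    return "unknown"


URLSET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
</urlset>
"""

INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-static.xml</loc></sitemap>
</sitemapindex>
"""

print(classify_sitemap(URLSET))
print(classify_sitemap(INDEX))
```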

Acknowledgments

This is a fork of Dave O'Connor's site-map-parser. I couldn't have done this without his original work.

Installation

uv add py-sitemap-parser

Usage

The library provides a SiteMapParser class that can be used to parse sitemaps and sitemap indexes. You can pass a URL or raw XML data to the parser to extract the URLs or links to other sitemaps.

Parsing a Sitemap from a URL

import logging
from typing import TYPE_CHECKING

from sitemap_parser import SiteMapParser

if TYPE_CHECKING:
    from sitemap_parser import SitemapIndex
    from sitemap_parser import UrlSet

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger: logging.Logger = logging.getLogger(__name__)


# url = "https://ttvdrops.lovinator.space/sitemap.xml"  # Sitemap index
url = "https://ttvdrops.lovinator.space/sitemap-static.xml"  # Sitemap with URLs
parser = SiteMapParser(source=url)

if parser.has_sitemaps():
    sitemaps: SitemapIndex = parser.get_sitemaps()
    for sitemap in sitemaps:
        logger.info(sitemap)

elif parser.has_urls():
    urls: UrlSet = parser.get_urls()
    for entry in urls:  # Avoid shadowing the `url` string defined above
        logger.info(entry)
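
A sitemap index only points at other sitemaps, so collecting every page URL means following those links. The sketch below shows one possible way to do that recursively. It rests on assumptions: collect_urls and make_parser are hypothetical names, and the code assumes that each entry returned by get_sitemaps() and get_urls() exposes its address as a loc attribute, which you should verify against the version you have installed.

```python
from collections.abc import Callable
from typing import Any


def collect_urls(
    source: str,
    make_parser: Callable[[str], Any],
    max_depth: int = 3,
) -> list[str]:
    """Follow sitemap indexes up to max_depth levels and return page URLs."""
    parser = make_parser(source)
    if parser.has_sitemaps():
        if max_depth <= 0:
            return []  # Stop rather than recurse forever on circular indexes.
        found: list[str] = []
        for sitemap in parser.get_sitemaps():
            # Assumption: each sitemap entry has a .loc attribute.
            found.extend(collect_urls(sitemap.loc, make_parser, max_depth - 1))
        return found
    if parser.has_urls():
        # Assumption: each URL entry has a .loc attribute.
        return [entry.loc for entry in parser.get_urls()]
    return []
```

With the real library, make_parser could be something like lambda s: SiteMapParser(source=s). Passing the factory in as a parameter also makes the function easy to exercise with a stub that performs no network requests.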

Parsing a Raw XML String

from sitemap_parser import SiteMapParser, UrlSet

xml_data = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
    </url>
    <url>
        <loc>https://example.com/about</loc>
        <lastmod>2023-09-27</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>
"""
parser = SiteMapParser(source=xml_data, is_data_string=True)
urls: UrlSet = parser.get_urls()
for url in urls:
    print(url)

# Output:
# - https://example.com/
# - https://example.com/about

Exporting Sitemap Data to JSON

You can export the parsed sitemap data to a JSON file using the JSONExporter class.

import json
import logging

from sitemap_parser import JSONExporter
from sitemap_parser import SiteMapParser

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger: logging.Logger = logging.getLogger(__name__)

# Sitemap with URLs to other sitemaps
parser = SiteMapParser(source="https://ttvdrops.lovinator.space/sitemap.xml")

if parser.has_sitemaps():
    # Keep the JSON string and the decoded data in separate, correctly typed names
    sitemaps_json: str = JSONExporter(data=parser).export_sitemaps()
    sitemaps_data = json.loads(sitemaps_json)
    logger.info("Exported sitemaps: %s", sitemaps_data)

logger.info("----" * 10)

# Sitemap with "real" URLs
parser2 = SiteMapParser(
    source="https://ttvdrops.lovinator.space/sitemap-static.xml",
)

if parser2.has_urls():
    # Keep the JSON string and the decoded data in separate, correctly typed names
    urls_json: str = JSONExporter(data=parser2).export_urls()
    urls_data = json.loads(urls_json)
    logger.info("Exported URLs: %s", urls_data)
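
The export methods in the example above return JSON as a string, so writing it to disk is left to the caller. A minimal sketch using only the standard library follows. The helper save_json is hypothetical and is not part of this package.

```python
import json
from pathlib import Path


def save_json(json_text: str, path: Path) -> None:
    """Validate a JSON string and write it to path, pretty-printed."""
    data = json.loads(json_text)  # Raises json.JSONDecodeError on invalid input.
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```

For instance, the string returned by export_urls() could be passed as json_text together with Path("urls.json"). Decoding before writing catches malformed output early at the cost of one extra parse.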

Converting Sitemap XML to a Python dict

If you'd like to work with the parsed sitemap as a plain Python dictionary, you can use SiteMapParser.to_dict().

from sitemap_parser import SiteMapParser

xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
    </url>
</urlset>
"""

parser = SiteMapParser(source=xml, is_data_string=True)
parsed = parser.to_dict()

# xmltodict represents repeated elements as lists
print(parsed["urlset"]["url"][0]["loc"])

You can also enable namespace processing for expanded namespace keys:

parsed = parser.to_dict(process_namespaces=True)
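
When indexing into the resulting dictionary, keep in mind that xmltodict by default returns a single dict for an element that appears once and a list only when the element repeats. Whether to_dict() changes that default is not stated here, so a defensive helper such as the hypothetical as_list below handles both shapes. The sample dictionaries are hand-written stand-ins, not actual library output.

```python
from typing import Any


def as_list(value: Any) -> list[Any]:
    """Normalize a value that may be missing, a single item, or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Both shapes that a parsed <urlset> might take.
single = {"urlset": {"url": {"loc": "https://example.com/"}}}
repeated = {
    "urlset": {
        "url": [
            {"loc": "https://example.com/"},
            {"loc": "https://example.com/about"},
        ],
    },
}

for parsed_doc in (single, repeated):
    locs = [entry["loc"] for entry in as_list(parsed_doc["urlset"].get("url"))]
    print(locs)
```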

Disabling Logging

To silence the library's log output, set the level of the sitemap_parser logger to logging.CRITICAL. This suppresses every message below the CRITICAL level; messages logged at CRITICAL itself are still emitted.

Here's an example of how to do this:

import logging

# Set the logging level to CRITICAL to disable logging
logging.getLogger("sitemap_parser").setLevel(logging.CRITICAL)
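
To see what this setting does in isolation, the sketch below attaches a handler to a logger named sitemap_parser and emits messages at several levels. It uses only the standard library and does not import the package; the in-memory stream and the sample messages are illustrative and exist purely to make the effect visible.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

lib_logger = logging.getLogger("sitemap_parser")
lib_logger.addHandler(handler)
lib_logger.propagate = False  # Keep the demo output out of the root logger.
lib_logger.setLevel(logging.CRITICAL)

lib_logger.info("fetching sitemap")   # Suppressed: below CRITICAL.
lib_logger.error("request failed")    # Suppressed: below CRITICAL.
lib_logger.critical("unrecoverable")  # Still emitted.

print(stream.getvalue())
```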
