Validate sitemap XML files and inspect their discovered URLs.

sitemap-verify

A Python 3.10+ tool/library to validate sitemap protocol compliance and check discovered URL reachability.

Chinese documentation: README.zh-CN.md

Features

  • Async library API: validate_target(...)
  • CLI command: sitemap-verify check <target>
  • SQLite-backed runtime persistence for long-running validations
  • Resume interrupted validations with --resume-from <sqlite-file>
  • Supports sitemap inputs: XML urlset, sitemapindex, text sitemap, RSS, Atom
  • Uses XSD validation (xmlschema) plus protocol semantic validation
  • Recursively traverses sitemap indexes with depth/count safeguards
  • URL reachability checks with SEO-oriented severity:
    • 2xx => pass
    • 3xx / 429 => warn
    • other 4xx / 5xx / network errors => error
  • Unified error / warn diagnostics report with JSON output support
  • Search-engine-specific validation profiles for google and bing
  • Google media sitemap extension checks for image, video, and news namespaces
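
The severity mapping in the feature list can be illustrated with a small sketch. The function name `classify_status` and the use of `None` to stand for a network error are assumptions made for this example; they are not taken from the package's actual API.

```python
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code to a severity label.

    Hypothetical helper: `None` represents a network error
    (no HTTP response was received at all).
    """
    if status is None:
        return "error"  # network error
    if 200 <= status < 300:
        return "pass"
    if 300 <= status < 400 or status == 429:
        return "warn"  # redirects and rate limiting
    # Remaining 4xx / 5xx codes; anything unexpected is also treated as an error here.
    return "error"


if __name__ == "__main__":
    for code in (200, 301, 429, 404, 503, None):
        print(code, classify_status(code))
```

Checking 429 before the general 4xx range is what keeps rate-limited responses at warn rather than error.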

Requirements

  • Python 3.10+
  • uv for environment and dependency management

Quick Start

uv sync --dev
uv run sitemap-verify check path/to/sitemap.xml

Install from PyPI:

pip install sitemap-verify

Validate a remote sitemap URL:

uv run sitemap-verify check https://example.com/sitemap.xml --mode url --format json

Validate against Google's sitemap rules and media extension requirements:

uv run sitemap-verify check https://example.com/sitemap.xml --mode url --engine google

Validate against Bing-specific sitemap guidance and best practices:

uv run sitemap-verify check https://example.com/sitemap.xml --mode url --engine bing

Validate a domain (discovers the sitemap from robots.txt, falling back to /sitemap.xml):

uv run sitemap-verify check example.com --mode domain
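
As a rough illustration of what domain-mode discovery involves, the sketch below extracts `Sitemap:` entries from robots.txt content and falls back to `/sitemap.xml` when none are declared. The function name `discover_sitemaps` and the choice of `https` for the fallback URL are assumptions for this example; the tool's own discovery logic may differ.

```python
from typing import List


def discover_sitemaps(domain: str, robots_txt: str) -> List[str]:
    """Return sitemap URLs declared in robots.txt, or a /sitemap.xml fallback.

    Hypothetical helper; the https scheme for the fallback is an assumption.
    """
    found: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    if found:
        return found
    return [f"https://{domain}/sitemap.xml"]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/a.xml\n"
    print(discover_sitemaps("example.com", robots))
    print(discover_sitemaps("example.com", "User-agent: *\n"))
```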

Enable runtime logs and progress display, and write the output to a file:

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --probe-method get \
  --format json \
  --output reports/result.json \
  --log-file logs/run.log \
  --verbose \
  --show-progress

Persist validation state to SQLite and resume after interruption:

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --store reports/example-run.sqlite3

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --resume-from reports/example-run.sqlite3

If --store is not provided, the CLI creates a timestamped SQLite file under reports/. During resume, sitemap files are parsed again, but URL reachability checks are skipped when a cached result already exists in the SQLite store.
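
The resume behavior described above amounts to a cache lookup before each probe. The following is a minimal sketch of that idea using the standard-library `sqlite3` module. The table name `url_results`, its columns, and the `probe` callback are all hypothetical; the actual schema used by sitemap-verify is not documented here.

```python
import sqlite3
from typing import Callable, Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a store with a hypothetical url_results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_results ("
        "url TEXT PRIMARY KEY, status INTEGER)"
    )
    return conn


def check_with_cache(
    conn: sqlite3.Connection,
    url: str,
    probe: Callable[[str], Optional[int]],
) -> Optional[int]:
    """Return the cached status for url, probing only on a cache miss."""
    row = conn.execute(
        "SELECT status FROM url_results WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]  # cached result: skip the network check
    status = probe(url)
    conn.execute(
        "INSERT INTO url_results (url, status) VALUES (?, ?)", (url, status)
    )
    conn.commit()  # persist immediately so an interruption loses little work
    return status


if __name__ == "__main__":
    store = open_store(":memory:")
    print(check_with_cache(store, "https://example.com/", lambda u: 200))
    # The second call is served from the store; the probe is not invoked.
    print(check_with_cache(store, "https://example.com/", lambda u: 500))
```

Committing after every insert trades some throughput for durability, which suits long-running validations that may be interrupted at any point.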

Reachability probe modes:

  • --probe-method get (default): always use GET (recommended for sites that block or mis-handle HEAD)
  • --probe-method head: HEAD only
  • --probe-method auto: HEAD first, then fall back to GET when HEAD returns any 4xx/5xx status other than 429 (this includes 405 and 501)
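
The auto mode's fallback rule can be sketched as follows. Both function names and the `send(method, url)` callback are hypothetical stand-ins for whatever HTTP client the tool uses internally; only the decision rule comes from the description above.

```python
from typing import Callable


def should_fall_back_to_get(head_status: int) -> bool:
    """Return True when a HEAD response should be retried with GET."""
    if head_status == 429:
        return False  # rate limiting is excluded from the fallback
    # Any other 4xx/5xx status, which covers 405 and 501.
    return 400 <= head_status < 600


def probe_auto(url: str, send: Callable[[str, str], int]) -> int:
    """Probe with HEAD first and retry with GET when the rule above applies.

    `send(method, url)` is a hypothetical callback returning a status code.
    """
    status = send("HEAD", url)
    if should_fall_back_to_get(status):
        status = send("GET", url)
    return status


if __name__ == "__main__":
    def fake_send(method: str, url: str) -> int:
        # Simulates a server that rejects HEAD but serves GET.
        return 405 if method == "HEAD" else 200

    print(probe_auto("https://example.com/", fake_send))
```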

Engine Profiles

  • --engine google: applies Google-specific sitemap guidance, including checks for image, video, and news sitemap extensions
  • --engine bing: applies Bing-specific sitemap guidance, including lastmod and IndexNow recommendations
  • Engine-specific findings are reported as warn unless the sitemap structure is clearly invalid for the given Google extension
  • Google media support currently covers these namespaces:
    • http://www.google.com/schemas/sitemap-image/1.1
    • http://www.google.com/schemas/sitemap-video/1.1
    • http://www.google.com/schemas/sitemap-news/0.9
  • Bing media-extension field rules are not enforced yet, because the official Bing documentation we could access does not provide field-level schema guidance comparable to Google's
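
To show how the three Google extension namespaces listed above can be recognized in a sitemap, here is a small sketch built on the standard-library `xml.etree.ElementTree`. The function `used_google_extensions` and the short labels are illustrative assumptions, not part of the sitemap-verify API, and the sketch only reports which extensions appear; it does not validate their fields.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Namespace URIs from the list above, mapped to short labels (labels are illustrative).
GOOGLE_EXTENSION_NAMESPACES = {
    "http://www.google.com/schemas/sitemap-image/1.1": "image",
    "http://www.google.com/schemas/sitemap-video/1.1": "video",
    "http://www.google.com/schemas/sitemap-news/0.9": "news",
}


def used_google_extensions(xml_text: str) -> Set[str]:
    """Return labels of the Google extensions whose elements occur in the XML."""
    root = ET.fromstring(xml_text)
    found: Set[str] = set()
    for element in root.iter():
        tag = element.tag
        # ElementTree encodes namespaced tags as "{namespace}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            namespace = tag[1:].split("}", 1)[0]
            label = GOOGLE_EXTENSION_NAMESPACES.get(namespace)
            if label is not None:
                found.add(label)
    return found


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
        'xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">'
        "<url><loc>https://example.com/</loc>"
        "<image:image><image:loc>https://example.com/a.png</image:loc></image:image>"
        "</url></urlset>"
    )
    print(used_google_extensions(sample))
```

Note that this detects namespaces by the elements actually present, so a namespace that is declared but never used is not reported.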

Library Usage

import asyncio

from sitemap_verify import validate_target


async def main() -> None:
    report = await validate_target(
        "https://example.com/sitemap.xml",
        mode="url",
        engine="google",
        recursive=True,
        check_reachability=True,
        store_path="reports/example-run.sqlite3",
    )
    print(report.model_dump())


asyncio.run(main())

Optional persistence arguments:

  • store_path: write validation state to a specific SQLite file
  • resume_from: reopen an interrupted SQLite file and reuse existing URL reachability results

When store_path is omitted, validate_target(...) creates a timestamped SQLite file under reports/.

Development

Run the test suite:

uv run pytest

Run lint checks:

uv run ruff check .

Project Structure

  • src/sitemap_verify/: application package and CLI entrypoint
  • src/sitemap_verify/schemas/: bundled XSD files used by the validator
  • tests/: automated tests
  • docs/feat/: feature planning notes
  • docs/agent-lessons/: lessons learned from previously fixed agent mistakes
  • .github/: GitHub workflows and collaboration templates

GitHub Collaboration

  • Bug reports and feature requests use issue templates under .github/ISSUE_TEMPLATE/
  • Pull requests follow .github/pull_request_template.md
  • CI runs lint and tests on pushes and pull requests

Release Process

  • Update project.version in pyproject.toml
  • Commit the release changes to main
  • Create and push a matching tag such as v0.1.0
  • GitHub Actions builds the package with uv, runs tests, validates distributions, smoke-tests pip install, and publishes via PyPI Trusted Publishing

Example:

git tag v0.1.0
git push origin v0.1.0

License

This project is licensed under the MIT License. See LICENSE for details.
