Validate sitemap XML files and inspect their discovered URLs.

sitemap-verify

A Python 3.10+ command-line tool and library that validates sitemap protocol compliance and checks the reachability of discovered URLs.

Chinese documentation: README.zh-CN.md

Features

  • Async library API: validate_target(...)
  • CLI command: sitemap-verify check <target>
  • SQLite-backed runtime persistence for long-running validations
  • Resume interrupted validations with --resume-from <sqlite-file>
  • Supports sitemap inputs: XML urlset, sitemapindex, text sitemap, RSS, Atom
  • Uses XSD validation (xmlschema) plus protocol semantic validation
  • Recursively traverses sitemap indexes with depth/count safeguards
  • URL reachability checks with SEO-oriented severity:
    • 2xx => pass
    • 3xx / 429 => warn
    • 4xx / 5xx / network errors => error
  • Unified error / warn diagnostics report with JSON output support
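The severity mapping listed above can be sketched as a small pure function. This is only an illustration of the documented rules, not the package's own code: the name classify_status is hypothetical, a network error is modeled as None, and the handling of codes outside 2xx-5xx is an assumption because the README does not cover them.

```python
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code (None = network error) to a severity label."""
    if status is None:
        return "error"  # no response at all, e.g. DNS failure or timeout
    if 200 <= status < 300:
        return "pass"
    if 300 <= status < 400 or status == 429:
        return "warn"  # redirects and rate limiting are warnings, not failures
    if 400 <= status < 600:
        return "error"
    # Assumption: anything outside 2xx-5xx (e.g. 1xx) is treated as an error.
    return "error"
```

Note that 429 is checked before the general 4xx range, so rate limiting is reported as a warning rather than an error.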

Requirements

  • Python 3.10+
  • uv for environment and dependency management

Quick Start

uv sync --dev
uv run sitemap-verify check path/to/sitemap.xml

Install from PyPI:

pip install sitemap-verify

Validate a remote sitemap URL:

uv run sitemap-verify check https://example.com/sitemap.xml --mode url --format json

Validate a domain (discover the sitemap from robots.txt, falling back to /sitemap.xml):

uv run sitemap-verify check example.com --mode domain
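Domain-mode discovery can be pictured roughly as follows. This is a minimal sketch, assuming robots.txt has already been fetched as text. The function name discover_sitemaps is hypothetical and the package's actual discovery logic may differ.

```python
from typing import Optional
from urllib.parse import urljoin


def discover_sitemaps(base_url: str, robots_txt: Optional[str]) -> list[str]:
    """Collect `Sitemap:` entries from robots.txt, else fall back to /sitemap.xml."""
    found: list[str] = []
    for raw_line in (robots_txt or "").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        # The directive name is case-insensitive; the value is the sitemap URL.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Fallback when robots.txt is missing or declares no sitemap.
    return found or [urljoin(base_url, "/sitemap.xml")]
```

Splitting on the first colon only keeps the scheme separator in the URL value intact.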

Enable runtime logs, progress, and write output to a file:

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --probe-method get \
  --format json \
  --output reports/result.json \
  --log-file logs/run.log \
  --verbose \
  --show-progress

Persist validation state to SQLite and resume after interruption:

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --store reports/example-run.sqlite3

uv run sitemap-verify check https://example.com/sitemap.xml \
  --mode url \
  --resume-from reports/example-run.sqlite3

If --store is not provided, the CLI creates a timestamped SQLite file under reports/. During resume, sitemap files are parsed again, but URL reachability checks are skipped when a cached result already exists in the SQLite store.
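The resume behavior described above (re-parse sitemaps, skip URLs that already have a stored result) can be illustrated with a small cache-lookup sketch. The table name url_results, its columns, and the helper names are assumptions made for this example. The real SQLite schema used by sitemap-verify is not documented here.

```python
import sqlite3
from typing import Callable, Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a store with a hypothetical results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_results (url TEXT PRIMARY KEY, status INTEGER)"
    )
    return conn


def check_url(
    conn: sqlite3.Connection,
    url: str,
    probe: Callable[[str], Optional[int]],
) -> tuple[Optional[int], bool]:
    """Return (status, from_cache); `probe` runs only on a cache miss."""
    row = conn.execute(
        "SELECT status FROM url_results WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0], True  # cached result: skip the network check
    status = probe(url)
    conn.execute(
        "INSERT INTO url_results (url, status) VALUES (?, ?)", (url, status)
    )
    conn.commit()  # persist immediately so an interruption loses little work
    return status, False
```

Committing after each probe is the simplest way to make an interrupted run resumable. A real implementation might batch writes for throughput.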

Reachability probe modes:

  • --probe-method get (default): always use GET (recommended for sites that block or mis-handle HEAD)
  • --probe-method head: HEAD only
  • --probe-method auto: HEAD first, falling back to GET when HEAD returns a 4xx/5xx status other than 429 (this includes 405 and 501)
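The fallback rule for the auto mode can be written as a one-line predicate. This is a sketch of the rule as stated in the list above, with a hypothetical function name. It does not reflect the package's internal code.

```python
def should_fall_back_to_get(head_status: int) -> bool:
    """Decide whether a HEAD response warrants retrying the URL with GET."""
    if head_status == 429:
        return False  # rate limited: a second request would not help
    # Any other 4xx/5xx (including 405 Method Not Allowed and
    # 501 Not Implemented) suggests HEAD may be mishandled by the server.
    return 400 <= head_status < 600
```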

Library Usage

import asyncio

from sitemap_verify import validate_target


async def main() -> None:
    report = await validate_target(
        "https://example.com/sitemap.xml",
        mode="url",
        recursive=True,
        check_reachability=True,
        store_path="reports/example-run.sqlite3",
    )
    print(report.model_dump())


asyncio.run(main())

Optional persistence arguments:

  • store_path: write validation state to a specific SQLite file
  • resume_from: reopen an interrupted SQLite file and reuse existing URL reachability results

When store_path is omitted, validate_target(...) creates a timestamped SQLite file under reports/.

Development

Run the test suite:

uv run pytest

Run lint checks:

uv run ruff check .

Project Structure

  • src/sitemap_verify/: application package and CLI entrypoint
  • src/sitemap_verify/schemas/: bundled XSD files used by the validator
  • tests/: automated tests
  • docs/feat/: feature planning notes
  • docs/agent-lessons/: lessons learned from previously fixed agent mistakes
  • .github/: GitHub workflows and collaboration templates

GitHub Collaboration

  • Bug reports and feature requests use issue templates under .github/ISSUE_TEMPLATE/
  • Pull requests follow .github/pull_request_template.md
  • CI runs lint and tests on pushes and pull requests

Release Process

  • Update project.version in pyproject.toml
  • Commit the release changes to main
  • Create and push a matching tag such as v0.1.0
  • GitHub Actions builds the package with uv, runs tests, validates distributions, smoke-tests pip install, and publishes via PyPI Trusted Publishing

Example:

git tag v0.1.0
git push origin v0.1.0

License

This project is licensed under the MIT License. See LICENSE for details.

Download files

Source Distribution

sitemap_verify-0.1.0.tar.gz (68.1 kB)

Uploaded Source

Built Distribution

sitemap_verify-0.1.0-py3-none-any.whl (21.3 kB)

Uploaded Python 3

File details

Details for the file sitemap_verify-0.1.0.tar.gz.

File metadata

  • Download URL: sitemap_verify-0.1.0.tar.gz
  • Upload date:
  • Size: 68.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sitemap_verify-0.1.0.tar.gz
Algorithm      Hash digest
SHA256         4ad7730a84df60ddcc4139ac58cddd3ec786b8094d7fc051b4d8827f141f84a0
MD5            141a2c973bae7d384e7d1c16db480d2c
BLAKE2b-256    4ac4b760a24ead20a223e8de2a81a435b7b8bf57848e02f1d558fac07ee38871

Provenance

The following attestation bundles were made for sitemap_verify-0.1.0.tar.gz:

Publisher: publish.yml on zegging/sitemap-verify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sitemap_verify-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sitemap_verify-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sitemap_verify-0.1.0-py3-none-any.whl
Algorithm      Hash digest
SHA256         95233f3e8dbfd2f702af0ecc99c9a45526282b9ed326a7d7ffbdbbc83ea7b06b
MD5            c63a9f62b5975f5ee1026fc042f0fa07
BLAKE2b-256    6cdfa0734f5bb99c99a7f5395962cebdbacb0ecd370ad2891f842ffea434e015

Provenance

The following attestation bundles were made for sitemap_verify-0.1.0-py3-none-any.whl:

Publisher: publish.yml on zegging/sitemap-verify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
