Validate sitemap XML files and inspect their discovered URLs.
sitemap-verify
A Python 3.10+ tool/library to validate sitemap protocol compliance and check discovered URL reachability.
Chinese documentation: README.zh-CN.md
Features
- Async library API: validate_target(...)
- CLI command: sitemap-verify check <target>
- SQLite-backed runtime persistence for long-running validations
- Resume interrupted validations with --resume-from <sqlite-file>
- Supports sitemap inputs: XML urlset, sitemapindex, text sitemap, RSS, Atom
- Uses XSD validation (xmlschema) plus protocol semantic validation
- Recursively traverses sitemap indexes with depth/count safeguards
- URL reachability checks with SEO-oriented severity:
  - 2xx => pass
  - 3xx / 429 => warn
  - 4xx / 5xx / network errors => error
- Unified error/warn diagnostics report with JSON output support
- Search-engine-specific validation profiles for google and bing
- Google media sitemap extension checks for image, video, and news namespaces
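The severity rule listed in the features above (2xx passes, 3xx and 429 warn, 4xx/5xx and network errors fail) can be illustrated with a small sketch. The function name `classify_status` and the use of `None` to represent a network error are assumptions made for this example; they are not taken from the sitemap-verify source.

```python
from typing import Optional


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code to a severity label.

    Hypothetical helper: `None` stands for a network error where no
    response was received.
    """
    if status is None:
        return "error"  # network error
    if status == 429:
        return "warn"  # rate limited: listed as a warning, not an error
    if 200 <= status < 300:
        return "pass"
    if 300 <= status < 400:
        return "warn"
    # 4xx/5xx; any other code is also treated as an error (assumption).
    return "error"
```

Checking 429 before the generic 4xx range keeps rate limiting from being reported as a hard failure.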
Requirements
- Python 3.10+
- uv for environment and dependency management
Quick Start
uv sync --dev
uv run sitemap-verify check path/to/sitemap.xml
Install from PyPI:
pip install sitemap-verify
Validate a remote sitemap URL:
uv run sitemap-verify check https://example.com/sitemap.xml --mode url --format json
Validate against Google's sitemap rules and media extension requirements:
uv run sitemap-verify check https://example.com/sitemap.xml --mode url --engine google
Validate against Bing-specific sitemap guidance and best practices:
uv run sitemap-verify check https://example.com/sitemap.xml --mode url --engine bing
Validate a domain (discover sitemap from robots.txt, fallback /sitemap.xml):
uv run sitemap-verify check example.com --mode domain
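As a rough illustration of domain mode, the sketch below collects `Sitemap:` directives from a robots.txt body and falls back to `/sitemap.xml` when none are present. The function name `discover_sitemaps` and the choice of `https` as the scheme are assumptions for this example; the tool's actual discovery logic may differ.

```python
from urllib.parse import urljoin


def discover_sitemaps(domain: str, robots_txt: str) -> list[str]:
    """Return sitemap URLs named in robots.txt, or the /sitemap.xml fallback.

    Hypothetical helper: assumes the site is served over https.
    """
    base = f"https://{domain}/"
    found = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        # The directive name is case-insensitive; the value is the URL.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(urljoin(base, value.strip()))
    return found or [urljoin(base, "/sitemap.xml")]
```

The robots.txt body is passed in as a string, so fetching it stays separate from parsing it.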
Enable runtime logs, progress, and write output to a file:
uv run sitemap-verify check https://example.com/sitemap.xml \
--mode url \
--probe-method get \
--format json \
--output reports/result.json \
--log-file logs/run.log \
--verbose \
--show-progress
Persist validation state to SQLite and resume after interruption:
uv run sitemap-verify check https://example.com/sitemap.xml \
--mode url \
--store reports/example-run.sqlite3
uv run sitemap-verify check https://example.com/sitemap.xml \
--mode url \
--resume-from reports/example-run.sqlite3
If --store is not provided, the CLI creates a timestamped SQLite file under reports/.
During resume, sitemap files are parsed again, but URL reachability checks are skipped when a
cached result already exists in the SQLite store.
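The resume behaviour described above (reuse a cached reachability result, probe only URLs that have none) can be sketched with the standard `sqlite3` module. The table name `reachability`, its columns, and the `probe` callable are all hypothetical; the real store schema is not documented here.

```python
import sqlite3
from typing import Callable, Iterable


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a store with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reachability "
        "(url TEXT PRIMARY KEY, status INTEGER)"
    )
    return conn


def check_urls(
    conn: sqlite3.Connection,
    urls: Iterable[str],
    probe: Callable[[str], int],
) -> dict[str, int]:
    """Probe each URL unless a cached status already exists in the store."""
    results: dict[str, int] = {}
    for url in urls:
        row = conn.execute(
            "SELECT status FROM reachability WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            results[url] = row[0]  # cached result: skip the network probe
            continue
        status = probe(url)
        conn.execute(
            "INSERT OR REPLACE INTO reachability (url, status) VALUES (?, ?)",
            (url, status),
        )
        conn.commit()  # commit per URL so an interruption loses little work
        results[url] = status
    return results
```

Committing after each probe is what makes an interrupted run resumable in this sketch: every completed check is already on disk when the process stops.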
Reachability probe modes:
- --probe-method get (default): always use GET (recommended for sites that block or mis-handle HEAD)
- --probe-method head: HEAD only
- --probe-method auto: HEAD first, fallback to GET when HEAD returns 4xx/5xx (except 429) or 405/501
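The fallback rule for the auto mode can be sketched as a small decision function. Both `needs_get_fallback` and `probe_auto` are hypothetical names, and `send` is an assumed callable taking an HTTP method and a URL and returning a status code; none of these come from the sitemap-verify source.

```python
from typing import Callable


def needs_get_fallback(head_status: int) -> bool:
    """Decide whether a HEAD response should be retried with GET.

    Any 4xx/5xx triggers a retry except 429; 405 and 501 fall inside
    that range, so they need no separate check.
    """
    if head_status == 429:
        return False  # rate limited: retrying with GET would not help
    return 400 <= head_status < 600


def probe_auto(url: str, send: Callable[[str, str], int]) -> int:
    """Send HEAD first, then GET only when the HEAD result is untrustworthy."""
    status = send("HEAD", url)
    if needs_get_fallback(status):
        status = send("GET", url)
    return status
```

For example, a server that answers HEAD with 405 but GET with 200 ends up reported as 200, while a 429 on HEAD is returned as-is.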
Engine Profiles
- --engine google: applies Google-specific sitemap guidance, including checks for image, video, and news sitemap extensions
- --engine bing: applies Bing-specific sitemap guidance, including lastmod and IndexNow recommendations
- Engine-specific findings are reported as warn unless the sitemap structure is clearly invalid for the given Google extension
- Google media support currently covers these namespaces:
  - http://www.google.com/schemas/sitemap-image/1.1
  - http://www.google.com/schemas/sitemap-video/1.1
  - http://www.google.com/schemas/sitemap-news/0.9
- Bing media-extension field rules are not enforced yet because the accessible official Bing documentation we found does not provide the same field-level schema guidance as Google
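To make the namespace list above concrete, the following sketch reports which of the three Google extension namespaces are actually used by elements in a sitemap document. It relies only on the standard library; the function name `used_google_extensions` is an assumption, and this is not how sitemap-verify itself is known to detect extensions.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as listed in the Engine Profiles section.
GOOGLE_EXTENSION_NAMESPACES = {
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def used_google_extensions(xml_text: str) -> set[str]:
    """Return the names of Google extensions whose elements appear in the XML."""
    root = ET.fromstring(xml_text)
    used: set[str] = set()
    for element in root.iter():
        tag = element.tag
        # ElementTree writes namespaced tags as "{namespace-uri}localname".
        if isinstance(tag, str) and tag.startswith("{"):
            namespace = tag[1:].split("}", 1)[0]
            for name, uri in GOOGLE_EXTENSION_NAMESPACES.items():
                if namespace == uri:
                    used.add(name)
    return used
```

Note that this looks at elements in use, so a namespace that is declared on the root but never referenced is not reported.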
Library Usage
import asyncio

from sitemap_verify import validate_target


async def main() -> None:
    report = await validate_target(
        "https://example.com/sitemap.xml",
        mode="url",
        engine="google",
        recursive=True,
        check_reachability=True,
        store_path="reports/example-run.sqlite3",
    )
    print(report.model_dump())


asyncio.run(main())
Optional persistence arguments:
- store_path: write validation state to a specific SQLite file
- resume_from: reopen an interrupted SQLite file and reuse existing URL reachability results
When store_path is omitted, validate_target(...) creates a timestamped SQLite file under
reports/.
Development
Run the test suite:
uv run pytest
Run lint checks:
uv run ruff check .
Project Structure
- src/sitemap_verify/: application package and CLI entrypoint
- src/sitemap_verify/schemas/: bundled XSD files used by the validator
- tests/: automated tests
- docs/feat/: feature planning notes
- docs/agent-lessons/: lessons learned from previously fixed agent mistakes
- .github/: GitHub workflows and collaboration templates
GitHub Collaboration
- Bug reports and feature requests use issue templates under .github/ISSUE_TEMPLATE/
- Pull requests follow .github/pull_request_template.md
- CI runs lint and tests on pushes and pull requests
Release Process
- Update project.version in pyproject.toml
- Commit the release changes to main
- Create and push a matching tag such as v0.1.0
- GitHub Actions builds the package with uv, runs tests, validates distributions, smoke-tests pip install, and publishes via PyPI Trusted Publishing
Example:
git tag v0.1.0
git push origin v0.1.0
License
This project is licensed under the MIT License. See LICENSE for details.