Generate sitemap.xml (sitemaps.org 0.9) by scanning a site's content directories
hugesitemap
hugesitemap scans a site's content directories and writes a valid
sitemap.xml (sitemaps.org 0.9 protocol), reproducing the behaviour of an
earlier directory-walking sitemap generator but built to modern clean-architecture standards.
Built for huge sites. That is the whole point of the name: entries are streamed and written one 50,000-URL chunk at a time, so peak memory stays flat (measured ~150 MB) whether a site has 5,000 or 5,000,000 URLs - the full sitemap is never held in memory. A nightly run over a multi-million-URL site stays in the low hundreds of MB on any server; only the output file on disk grows. See Memory footprint for the measured numbers.
- Recursive directory walk per configured [[directory]], emitting both directory URLs (trailing slash, the directory's own mtime) and file URLs.
- <lastmod> from each entry's real mtime in ISO8601 UTC (...Z), 4-decimal <priority>, and explicit [[url]] entries with their own changefreq/priority.
- Ordered drop filters supporting shell wildcards and re:-prefixed regexps.
- 50,000-URL split into a sitemap index plus numbered child sitemaps.
- Optional gzip output (libdeflate at maximum ratio - smallest standard-gzip .gz for a write-once, serve-many file); atomic write with lxml re-parse validation before the live file is replaced.
- Constant, small memory footprint: entries are streamed and written one 50,000-URL chunk at a time, so peak RAM stays flat (~150 MB) whether a site has 5,000 or 5,000,000 URLs - it never loads the whole sitemap into memory.
- CLI entry point styled with rich-click; layered configuration with lib_layered_config; structured logging with lib_log_rich.
Python 3.10+ Baseline
- The project targets Python 3.10 and newer.
- Runtime dependencies require current stable releases (rich-click>=1.9.6 and lib_cli_exit_tools>=2.2.4). Dev dependencies (pytest, ruff, pyright, bandit, etc.) specify minimum version constraints to ensure compatibility.
- CI workflows exercise GitHub's rolling runner images (ubuntu-latest, macos-latest, windows-latest) and cover CPython 3.10 through 3.14 alongside the latest available 3.x release provided by Actions.
Install - recommended via uv
uv is an ultrafast Python package manager written in Rust (10-20x faster than pip/poetry).
Install uv (if not already installed)
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Optional: make uv available system-wide (the paths below assume uv was installed as root)
cp /root/.local/bin/uv /usr/local/bin/uv
cp /root/.local/bin/uvx /usr/local/bin/uvx
# Ensure the copies are world-executable
chmod 755 /usr/local/bin/uv /usr/local/bin/uvx
# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
One-shot run (no install needed)
uvx hugesitemap@latest --help
Persistent install as CLI tool
# Install latest python
install_latest_python_gcc.sh
# pin uv to the latest python
uv python pin /opt/python-latest/bin/python3
# One-time install, persists from the git repo
uv tool install --python /opt/python-latest/bin/python3 --from "git+https://github.com/bitranox/hugesitemap.git" hugesitemap
# or one-time install, persists from PyPI
uv tool install --python /opt/python-latest/bin/python3 hugesitemap
# Update (requires network)
uv tool upgrade hugesitemap
# Run
hugesitemap --help
Persistent install as CLI tool
# install the CLI tool (isolated environment, added to PATH)
uv tool install hugesitemap
# upgrade to latest
uv tool upgrade hugesitemap
Install as project dependency
uv venv && source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1
uv pip install hugesitemap
For alternative install paths (pip, pipx, source builds, etc.), see
INSTALL.md. All supported methods register the hugesitemap
command on your PATH.
Configuration
See CONFIG.md for detailed documentation on the layered configuration system, including precedence rules, profile support, and customization best practices.
Quick Start
# Install
uv tool install hugesitemap
# Verify
hugesitemap --version
# Deploy config files
hugesitemap config-deploy --target app
# Try it out
hugesitemap generate --dry-run
hugesitemap info
hugesitemap config
Usage
The CLI leverages rich-click so help output, validation errors, and prompts render with Rich styling while keeping the familiar click ergonomics.
Available Commands
# Display package information
hugesitemap info
# Generate sitemaps for the configured sites
hugesitemap generate # all configured sites (default)
hugesitemap generate --site media,www # only the named sites
hugesitemap generate --dry-run # walk + validate, write nothing
hugesitemap generate --site media --gzip # write sitemap.xml.gz
# Error-handling demo
hugesitemap fail
hugesitemap --traceback fail
# Configuration management
hugesitemap config # Show current configuration
hugesitemap config --format json # Show as JSON
hugesitemap config --section lib_log_rich # Show specific section
hugesitemap config --profile production # Use a named profile
# Deploy configuration templates to target directories
# Without profile:
hugesitemap config-deploy --target app # → /etc/xdg/{slug}/config.toml
hugesitemap config-deploy --target host # → /etc/xdg/{slug}/hosts/{hostname}.toml
hugesitemap config-deploy --target user # → ~/.config/{slug}/config.toml
# With profile:
hugesitemap config-deploy --target app --profile production # → /etc/xdg/{slug}/profile/production/config.toml
hugesitemap config-deploy --target host --profile production # → /etc/xdg/{slug}/profile/production/hosts/{hostname}.toml
hugesitemap config-deploy --target user --profile production # → ~/.config/{slug}/profile/production/config.toml
# With custom permissions (POSIX only):
hugesitemap config-deploy --target user --file-mode 640 # Files with rw-r----- (640)
hugesitemap config-deploy --target user --dir-mode 750 # Directories with rwxr-x--- (750)
hugesitemap config-deploy --target app --no-permissions # Skip permission setting (use umask)
# Profile names: alphanumeric, hyphens, underscores; max 64 chars; must start with letter/digit
# See CONFIG.md for full validation rules
# Generate configuration examples
hugesitemap config-generate-examples --destination ./examples
# Load configuration from an explicit .env file (skips upward directory search)
hugesitemap --env-file /path/to/.env config
hugesitemap --env-file ./environments/production.env generate --dry-run
# Override configuration at runtime (repeatable --set)
hugesitemap --set lib_log_rich.console_level=DEBUG config
# Logging demo
hugesitemap logdemo
hugesitemap --set lib_log_rich.console_level=DEBUG logdemo
# All commands work with any entry point
python -m hugesitemap info
uvx hugesitemap info
Generating a Sitemap
Sites are defined in the layered configuration as an array of tables (one
[[site]] per site), so all sites live in one place and are discovered through
lib_layered_config (no separate config file to pass, no profiles). The
generate command processes every configured site by default; --site selects
specific ones. A ready-to-edit example lives in examples/sites.toml.
hugesitemap generate # all configured sites (default)
hugesitemap generate --site media,www # only the named sites
hugesitemap generate --site all # explicit "all"
hugesitemap generate --dry-run # walk + validate, write nothing
hugesitemap generate --site media --gzip
For each selected site, the walk emits a directory URL (trailing slash, the
directory's own mtime) for every surviving directory and a file URL for every
surviving file. <lastmod> is each entry's real mtime in ISO8601 UTC (...Z);
<priority> is the 4-decimal default_priority. When a site exceeds 50,000
URLs the output is split into numbered child sitemaps plus a <sitemapindex> at
output_path. Each file is validated by re-parsing it with lxml and written via
an atomic rename, so the live file is only ever replaced by well-formed XML.
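The formatting and write steps described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the function names are hypothetical, xml.etree stands in for lxml, and second-resolution timestamps are an assumption (the text only states ISO8601 UTC with a trailing Z).

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def format_lastmod(mtime: float) -> str:
    """Render a POSIX mtime as ISO 8601 UTC with a trailing Z (assumed resolution: seconds)."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def format_priority(priority: float) -> str:
    """Render a priority with four decimals."""
    return f"{priority:.4f}"


def render_urlset(entries: list[tuple[str, float, float]]) -> bytes:
    """Build a <urlset> document from (loc, mtime, priority) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, mtime, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = format_lastmod(mtime)
        ET.SubElement(url, "priority").text = format_priority(priority)
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def write_atomically(target: Path, payload: bytes) -> None:
    """Write beside the target, re-parse the temp file, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        ET.parse(tmp_name)  # raises ParseError if the file is not well-formed
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # never leave a stray temp file
        raise
```

Because the temporary file is validated before os.replace runs, a failed parse leaves the existing sitemap.xml untouched.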
Memory footprint
The generator streams end to end: the directory walk yields entries one at a time, and they are written one 50,000-URL chunk at a time, so the whole sitemap is never held in memory. Peak RAM is therefore roughly constant regardless of how large the site is - dominated by a single chunk, not the total URL count.
Measured peak RSS on a typical machine (realistic ~50-character URLs):
| URLs | Peak RSS | Output on disk |
|---|---|---|
| 50,000 | ~135 MB | 8 MB |
| 1,000,000 | ~160 MB | 154 MB |
| 5,000,000 | ~160 MB | 770 MB |
So even on a big server generating millions of URLs, the process stays in the low hundreds of MB; only the output file grows. (Naively buffering every entry would instead cost on the order of 1 GB+ at 5,000,000 URLs.)
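The chunked streaming behind these numbers can be illustrated with a generator that never holds more than one chunk. This is a sketch of the general technique, not the project's implementation; iter_chunks and fake_walk are hypothetical names.

```python
from collections.abc import Iterable, Iterator
from itertools import islice

CHUNK_SIZE = 50_000  # per-file URL limit of the sitemaps.org protocol


def iter_chunks(entries: Iterable[str], size: int = CHUNK_SIZE) -> Iterator[list[str]]:
    """Yield lists of at most `size` entries without materialising the whole input."""
    iterator = iter(entries)
    while chunk := list(islice(iterator, size)):
        yield chunk


def fake_walk(count: int) -> Iterator[str]:
    """Hypothetical stand-in for the directory walk: yields URLs lazily."""
    for number in range(count):
        yield f"https://media.example.com/a000/{number}.html"


if __name__ == "__main__":
    for index, chunk in enumerate(iter_chunks(fake_walk(120_001)), start=1):
        # A real writer would serialise the chunk here and then drop it.
        print(index, len(chunk))
```

With 120,001 entries the loop sees chunks of 50,000, 50,000, and 20,001 URLs; each list becomes garbage as soon as the next one is requested, which is why peak memory tracks the chunk size rather than the site size.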
Configuration Format
Deploy this to the application layer (for example
/etc/xdg/hugesitemap/config.toml) or a config.d drop-in:
# Optional global defaults shared by every site.
[sitemap]
gzip = false
default_priority = 0.5
[sitemap.filters]
drop = ["*~", "re:/\\.[^/]*", "*.txt*", "*.log*", "*.py*"] # prepended to each site's filters
[[site]]
name = "media" # unique; used by --site
base_url = "https://media.example.com/"
output_path = "/srv/www/media/sitemap.xml"
[[site.directory]] # repeatable: one on-disk path -> URL prefix
path = "/srv/www/media/a000"
url = "https://media.example.com/a000/"
[[site.url]] # repeatable: explicit extra URLs
loc = "https://media.example.com/index.html"
changefreq = "yearly"
priority = 0.1
[site.filters] # appended after the global drops
drop = ["*/zsvc/z_content/*"]
[[site]]
name = "www"
base_url = "https://www.example.com/"
output_path = "/srv/www/www/sitemap.xml"
# ... directories / urls / filters ...
The optional [sitemap] section holds global defaults. Scalars (gzip,
default_priority) are inherited unless a site overrides them. Filters
extend rather than replace: the global drop patterns are prepended to each
site's own filters, so common drop patterns are written once and each site lists
only its extras. (A site therefore cannot remove an individual global pattern;
this is intentional for shared junk filters.)
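The inheritance rules above (scalars overridden, drop patterns prepended) might look like this in code. It is a sketch under stated assumptions: Defaults and resolve_site are hypothetical names, and the site is modelled as a plain dict as it would come out of a TOML parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defaults:
    """Hypothetical model of the optional [sitemap] section."""

    gzip: bool = False
    default_priority: float = 0.5
    drop: tuple[str, ...] = ()


def resolve_site(defaults: Defaults, site: dict) -> dict:
    """Apply global defaults to one [[site]] table."""
    site_drop = tuple(site.get("filters", {}).get("drop", ()))
    return {
        # Scalars: the site's own value wins when present.
        "gzip": site.get("gzip", defaults.gzip),
        "default_priority": site.get("default_priority", defaults.default_priority),
        # Filters: global patterns first, then the site's extras.
        "drop": defaults.drop + site_drop,
    }


if __name__ == "__main__":
    defaults = Defaults(drop=("*~", "*.log*"))
    media = {"name": "media", "gzip": True, "filters": {"drop": ["*/zsvc/z_content/*"]}}
    print(resolve_site(defaults, media))
```

Since the global tuple is always concatenated in front, there is no code path by which a site could subtract a global pattern, which mirrors the intentional limitation noted above.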
| Key | Type | Default | Description |
|---|---|---|---|
| [sitemap] | table | absent | Global defaults shared by all sites (optional). |
| [sitemap].gzip | bool | false | Default gzip; a site's own value overrides it. |
| [sitemap].default_priority | float | 0.5 | Default priority; a site's own value overrides it. |
| [sitemap.filters].drop | array | [] | Drop patterns prepended to every site's own filters. |
| [[site]] | table array | [] | One entry per site; generate processes all by default. |
| name | string | required | Unique site identifier used by --site. |
| base_url | string | required | Site base URL; used to build child sitemap URLs on split. |
| output_path | string | required | Destination path for sitemap.xml. |
| gzip | bool | inherits | Write gzip-compressed output (sitemap.xml.gz). |
| default_priority | float | inherits | Priority assigned to every walked entry. |
| [[site.directory]] | table array | [] | path (on disk) mapped to url (prefix). |
| [[site.url]] | table array | [] | Explicit loc with optional changefreq/priority. |
| [site.filters].drop | array | [] | Site drop patterns; appended after the global ones. |
A path is dropped when any drop pattern matches it (patterns are matched
against the path relative to its directory root, with a leading /). Dropped
directories are pruned, so their entire subtree is skipped.
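A rough sketch of how such patterns could be evaluated follows. The exact matching semantics are assumptions, not confirmed by the source: wildcards are checked here with fnmatch.fnmatchcase (where * also crosses /), and re: patterns with re.search. The helper names are hypothetical.

```python
import fnmatch
import re
from collections.abc import Sequence


def matches(pattern: str, rel_path: str) -> bool:
    """Test one drop pattern against a root-relative path with a leading slash."""
    if pattern.startswith("re:"):
        # Assumption: regexps may match anywhere in the path.
        return re.search(pattern[3:], rel_path) is not None
    # Assumption: shell wildcards are case-sensitive and '*' spans '/'.
    return fnmatch.fnmatchcase(rel_path, pattern)


def is_dropped(rel_path: str, patterns: Sequence[str]) -> bool:
    """A path is dropped as soon as any pattern matches it."""
    return any(matches(pattern, rel_path) for pattern in patterns)


if __name__ == "__main__":
    patterns = ["*~", r"re:/\.[^/]*", "*.txt*", "*/zsvc/z_content/*"]
    for path in ["/notes.txt", "/.git", "/img/photo.jpg", "/a/zsvc/z_content/x.html"]:
        print(path, is_dropped(path, patterns))
```

A walker built on this check would skip descending into any directory for which is_dropped returns True, which is what prunes the subtree.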
Keep all sites in one layer. lib_layered_config deep-merges nested tables but replaces lists wholesale (last writer wins), so a higher layer carrying a site array replaces a lower one rather than appending to it.
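The effect is easiest to see with a small merge function. The sketch below reproduces the described semantics (tables merged recursively, lists replaced) in plain Python; it is not lib_layered_config's code, and deep_merge is a hypothetical name.

```python
def deep_merge(lower: dict, higher: dict) -> dict:
    """Merge two config layers: nested tables recurse, everything else is replaced."""
    merged = dict(lower)
    for key, value in higher.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Lists (such as the site array) land here: last writer wins.
            merged[key] = value
    return merged


if __name__ == "__main__":
    app_layer = {
        "sitemap": {"gzip": False, "default_priority": 0.5},
        "site": [{"name": "media"}, {"name": "www"}],
    }
    user_layer = {"sitemap": {"gzip": True}, "site": [{"name": "staging"}]}
    print(deep_merge(app_layer, user_layer))
```

In the merged result the sitemap table keeps default_priority from the lower layer and takes gzip from the higher one, while the site list contains only staging: media and www are gone.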
Programmatic Usage
from hugesitemap.adapters.config.loader import get_config
from hugesitemap.adapters.config.site_loader import load_sites
from hugesitemap.adapters.filesystem import walk_directory
from hugesitemap.adapters.sitemap_lxml import write_sitemap
from hugesitemap.application.generate import (
    DirectoryRequest,
    GenerateRequest,
    generate_sitemap,
)
from hugesitemap.domain.model import SitemapEntry

# Build one request per configured site and generate its sitemap.
for site in load_sites(get_config()):
    request = GenerateRequest(
        base_url=site.base_url,
        output_path=site.output_path,
        gzip=site.gzip,
        default_priority=site.default_priority,
        directories=tuple(DirectoryRequest(root=d.path, url_prefix=d.url) for d in site.directories),
        explicit_entries=tuple(
            SitemapEntry(loc=u.loc, lastmod=None, priority=u.priority, changefreq=u.changefreq)
            for u in site.explicit_urls
        ),
        drop_patterns=tuple(site.filters.drop),
    )
    # The directory walker and the lxml writer are passed in as adapters.
    result = generate_sitemap(request, content_source=walk_directory, write_sitemap=write_sitemap)
    print(site.name, result.url_count, result.paths_written)
Further Documentation
AI transparency
This project is built with AI-assisted tooling under the maintainer's direction and review. For the general position, see ai-stance.md; for an honest account of how AI was used in this specific repository, see ai-disclosure.md.