Generate sitemap.xml (sitemaps.org 0.9) by scanning a site's content directories

hugesitemap

hugesitemap scans a site's content directories and writes a valid sitemap.xml (sitemaps.org 0.9 protocol), reproducing the behaviour of an earlier directory-walking sitemap generator but built to modern clean-architecture standards.

Built for huge sites. That is the whole point of the name: entries are streamed and written one 50,000-URL chunk at a time, so peak memory stays flat (measured ~150 MB) whether a site has 5,000 or 5,000,000 URLs - the full sitemap is never held in memory. A nightly run over a multi-million-URL site stays in the low hundreds of MB on any server; only the output file on disk grows. See Memory footprint for the measured numbers.

- Recursive directory walk per configured [[site.directory]], emitting both directory URLs (trailing slash, the directory's own mtime) and file URLs.
- <lastmod> from each entry's real mtime in ISO 8601 UTC (...Z), 4-decimal <priority>, and explicit [[site.url]] entries with their own changefreq/priority.
- Git .gitignore-style filters (via igittigitt): anchored patterns, subtree pruning, ! allowlists, optional per-directory .sitemapignore files.
- 50,000-URL split into a sitemap index plus numbered child sitemaps.
- Optional gzip output (libdeflate at maximum ratio, giving the smallest standard-gzip .gz for a write-once, serve-many file); atomic write with lxml re-parse validation before the live file is replaced.
- Constant, small memory footprint: entries are streamed and written one 50,000-URL chunk at a time, so peak RAM stays flat (~150 MB) whether a site has 5,000 or 5,000,000 URLs. The whole sitemap is never loaded into memory.
- CLI entry point styled with rich-click; layered configuration with lib_layered_config; structured logging with lib_log_rich.
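The "validate, then atomically replace" step from the feature list can be sketched roughly as follows. This is a minimal illustration, not the project's code: `write_atomically` is a hypothetical helper, the standard-library `gzip` and `xml.etree.ElementTree` modules stand in for libdeflate and lxml, and the temp-file naming is an assumption.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_atomically(xml_bytes: bytes, target: Path, *, compress: bool = False) -> Path:
    """Write xml_bytes to a temp file, re-parse it, then rename it over the target.

    Hypothetical helper: stdlib gzip/ElementTree stand in for libdeflate/lxml.
    """
    final_path = target.with_name(target.name + ".gz") if compress else target
    payload = gzip.compress(xml_bytes, compresslevel=9) if compress else xml_bytes

    # The temp file is created in the destination directory so that the final
    # rename never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # Validation: read the file back and parse it; malformed XML raises ParseError.
        written = Path(tmp_name).read_bytes()
        ET.fromstring(gzip.decompress(written) if compress else written)
        # Atomic rename: readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

If parsing fails, the temp file is removed and the previously published file stays untouched, which is the property the feature list describes.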

Python 3.10+ Baseline

The project targets Python 3.10 and newer.
Runtime dependencies require current stable releases (rich-click>=1.9.6 and lib_cli_exit_tools>=2.2.4). Dev dependencies (pytest, ruff, pyright, bandit, etc.) specify minimum version constraints to ensure compatibility.
CI workflows exercise GitHub's rolling runner images (ubuntu-latest, macos-latest, windows-latest) and cover CPython 3.10 through 3.14 alongside the latest available 3.x release provided by Actions.

Install - recommended via uv

uv is an ultrafast Python package manager written in Rust (10-20x faster than pip/poetry).

Install uv (if not already installed)

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Optional (Linux, installed as root): copy the binaries to /usr/local/bin
# so that every user can run uv
cp /root/.local/bin/uv /usr/local/bin/uv
cp /root/.local/bin/uvx /usr/local/bin/uvx

# Ensure the copies are world-executable
chmod 755 /usr/local/bin/uv /usr/local/bin/uvx

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

One-shot run (no install needed)

uvx hugesitemap@latest --help

Persistent install as CLI tool

# Install the latest Python
install_latest_python_gcc.sh
# Pin uv to that interpreter
uv python pin /opt/python-latest/bin/python3
# One-time install from the Git repository (persists across sessions)
uv tool install --python /opt/python-latest/bin/python3 --from "git+https://github.com/bitranox/hugesitemap.git" hugesitemap
# or: one-time install from PyPI (persists across sessions)
uv tool install --python /opt/python-latest/bin/python3 hugesitemap
# Update (requires network)
uv tool upgrade hugesitemap
# Run
hugesitemap --help

Persistent install as CLI tool

# install the CLI tool (isolated environment, added to PATH)
uv tool install hugesitemap

# upgrade to latest
uv tool upgrade hugesitemap

Install as project dependency

uv venv && source .venv/bin/activate   # Windows: .venv\Scripts\Activate.ps1
uv pip install hugesitemap

For alternative install paths (pip, pipx, source builds, etc.), see INSTALL.md. All supported methods register the hugesitemap command on your PATH.

Configuration

See CONFIG.md for detailed documentation on the layered configuration system, including precedence rules, profile support, and customization best practices.

Quick Start

# Install
uv tool install hugesitemap

# Verify
hugesitemap --version

# Deploy config files
hugesitemap config-deploy --target app

# Try it out
hugesitemap generate --dry-run
hugesitemap info
hugesitemap config

Usage

The CLI leverages rich-click so help output, validation errors, and prompts render with Rich styling while keeping the familiar click ergonomics.

Available Commands

# Display package information
hugesitemap info

# Generate sitemaps for the configured sites
hugesitemap generate                       # all configured sites (default)
hugesitemap generate --site media,www      # only the named sites
hugesitemap generate --dry-run             # walk + validate, write nothing
hugesitemap generate --site media --gzip   # write sitemap.xml.gz

# Error-handling demo
hugesitemap fail
hugesitemap --traceback fail

# Configuration management
hugesitemap config                         # Show current configuration
hugesitemap config --format json           # Show as JSON
hugesitemap config --section lib_log_rich  # Show specific section
hugesitemap config --profile production    # Use a named profile

# Deploy configuration templates to target directories
# Without profile:
hugesitemap config-deploy --target app    # → /etc/xdg/{slug}/config.toml
hugesitemap config-deploy --target host   # → /etc/xdg/{slug}/hosts/{hostname}.toml
hugesitemap config-deploy --target user   # → ~/.config/{slug}/config.toml

# With profile:
hugesitemap config-deploy --target app --profile production   # → /etc/xdg/{slug}/profile/production/config.toml
hugesitemap config-deploy --target host --profile production  # → /etc/xdg/{slug}/profile/production/hosts/{hostname}.toml
hugesitemap config-deploy --target user --profile production  # → ~/.config/{slug}/profile/production/config.toml

# With custom permissions (POSIX only):
hugesitemap config-deploy --target user --file-mode 640       # Files with rw-r----- (640)
hugesitemap config-deploy --target user --dir-mode 750        # Directories with rwxr-x--- (750)
hugesitemap config-deploy --target app --no-permissions       # Skip permission setting (use umask)

# Profile names: alphanumeric, hyphens, underscores; max 64 chars; must start with letter/digit
# See CONFIG.md for full validation rules

# Generate configuration examples
hugesitemap config-generate-examples --destination ./examples

# Load configuration from an explicit .env file (skips upward directory search)
hugesitemap --env-file /path/to/.env config
hugesitemap --env-file ./environments/production.env generate --dry-run

# Override configuration at runtime (repeatable --set)
hugesitemap --set lib_log_rich.console_level=DEBUG config

# Logging demo
hugesitemap logdemo
hugesitemap --set lib_log_rich.console_level=DEBUG logdemo

# All commands work with any entry point
python -m hugesitemap info
uvx hugesitemap info

Generating a Sitemap

Sites are defined in the layered configuration as an array of tables (one [[site]] per site), so all sites live in one place and are discovered through lib_layered_config (no separate config file to pass, no profiles). The generate command processes every configured site by default; --site selects specific ones. A ready-to-edit example lives in examples/sites.toml.

hugesitemap generate                    # all configured sites (default)
hugesitemap generate --site media,www   # only the named sites
hugesitemap generate --site all         # explicit "all"
hugesitemap generate --dry-run          # walk + validate, write nothing
hugesitemap generate --site media --gzip

For each selected site, the walk emits a directory URL (trailing slash, the directory's own mtime) for every surviving directory and a file URL for every surviving file. <lastmod> is each entry's real mtime in ISO 8601 UTC (...Z); <priority> is the 4-decimal default_priority. When a site exceeds 50,000 URLs the output is split into numbered child sitemaps plus a <sitemapindex> at output_path. Each file is validated by re-parsing it with lxml and written via an atomic rename, so the live file is only ever replaced by well-formed XML.
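As a rough illustration of the two formatting rules above, the sketch below turns an mtime into a UTC `<lastmod>` string and a priority into its 4-decimal form. The helper names and the second-level timestamp precision are assumptions (the text only states ISO 8601 UTC with a trailing Z), and `xml.etree.ElementTree` is used here instead of lxml to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def format_lastmod(mtime: float) -> str:
    """Render a POSIX mtime as ISO 8601 UTC with a trailing Z (second precision assumed)."""
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def format_priority(priority: float) -> str:
    """Render a priority with exactly four decimals, e.g. 0.5 -> '0.5000'."""
    return f"{priority:.4f}"


def url_element(loc: str, mtime: float, priority: float) -> ET.Element:
    """Build one <url> entry (hypothetical helper; namespace handling omitted)."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = format_lastmod(mtime)
    ET.SubElement(url, "priority").text = format_priority(priority)
    return url


if __name__ == "__main__":
    element = url_element("https://media.example.com/a000/", 1_700_000_000, 0.5)
    print(ET.tostring(element, encoding="unicode"))
```

Passing `tz=timezone.utc` matters: without it, `fromtimestamp` would use the server's local time zone and the trailing Z would be wrong.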

Memory footprint

The generator streams end to end: the directory walk yields entries one at a time, and they are written one 50,000-URL chunk at a time, so the whole sitemap is never held in memory. Peak RAM is therefore roughly constant regardless of how large the site is - dominated by a single chunk, not the total URL count.
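The streaming idea can be sketched with a small chunking generator. This is only an illustration of the technique, not the project's implementation; `chunked` and `MAX_URLS_PER_SITEMAP` are hypothetical names.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit of the sitemaps.org 0.9 protocol


def chunked(entries: Iterable[T], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[list[T]]:
    """Yield lists of at most `size` entries, consuming the input lazily."""
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        # Only this one chunk is alive at a time; the caller writes it out
        # before the next one is pulled from the walk.
        yield chunk


if __name__ == "__main__":
    urls = (f"https://example.com/page/{i}" for i in range(120_001))
    print([len(chunk) for chunk in chunked(urls)])
```

Because the input is a generator and each chunk is dropped once written, memory use tracks the chunk size rather than the total number of URLs.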

Measured peak RSS on a typical machine (realistic ~50-character URLs):

| URLs | Peak RSS | Output on disk |
|---|---|---|
| 50,000 | ~135 MB | 8 MB |
| 1,000,000 | ~160 MB | 154 MB |
| 5,000,000 | ~160 MB | 770 MB |

So even on a big server generating millions of URLs, the process stays in the low hundreds of MB; only the output file grows. (Naively buffering every entry would instead cost on the order of 1 GB+ at 5,000,000 URLs.)
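A quick back-of-the-envelope check of the table: the on-disk size grows linearly with the URL count while peak RSS does not. The sketch below derives bytes per URL and the number of 50,000-URL files from the measured figures, assuming 1 MB = 1,000,000 bytes (the unit convention is not stated in the table).

```python
import math

# (URL count, output size in MB) taken from the measurements above.
MEASUREMENTS = [(50_000, 8), (1_000_000, 154), (5_000_000, 770)]

CHUNK_SIZE = 50_000  # URLs per sitemap file


def bytes_per_url(urls: int, output_mb: float) -> float:
    """Average on-disk bytes per URL, assuming 1 MB = 1,000,000 bytes."""
    return output_mb * 1_000_000 / urls


def sitemap_file_count(urls: int) -> int:
    """Number of 50,000-URL sitemap files needed for `urls` entries."""
    return math.ceil(urls / CHUNK_SIZE)


if __name__ == "__main__":
    for urls, output_mb in MEASUREMENTS:
        print(f"{urls:>9,} URLs: {bytes_per_url(urls, output_mb):.0f} bytes/URL, "
              f"{sitemap_file_count(urls)} sitemap file(s)")
```

The per-URL cost stays at roughly 154 to 160 bytes across all three rows, which is consistent with output that scales with the site while the process itself holds only one chunk.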

Configuration Format

Deploy this to the application layer (for example /etc/xdg/hugesitemap/config.toml) or a config.d drop-in:

# Optional global defaults shared by every site.
[sitemap]
gzip             = false
default_priority = 0.5

  [sitemap.filters]
  ignore = ["*~", ".*", "*.txt", "*.log", "*.py"]   # .gitignore patterns, prepended to each site's filters

[[site]]
name        = "media"                     # unique; used by --site
base_url    = "https://media.example.com/"
output_path = "/srv/www/media/sitemap.xml"

  [[site.directory]]                      # repeatable: one on-disk path -> URL prefix
  path = "/srv/www/media/a000"
  url  = "https://media.example.com/a000/"

  [[site.url]]                            # repeatable: explicit extra URLs
  loc        = "https://media.example.com/index.html"
  changefreq = "yearly"
  priority   = 0.1

  [site.filters]                          # appended after the global patterns
  ignore = ["zsvc/"]                      # trailing slash prunes the whole subtree

[[site]]
name        = "www"
base_url    = "https://www.example.com/"
output_path = "/srv/www/www/sitemap.xml"
# ... directories / urls / filters ...

The optional [sitemap] section holds global defaults. Scalars (gzip, default_priority) are inherited unless a site overrides them. Filters extend rather than replace: the global ignore patterns are prepended to each site's own, so common patterns are written once and each site lists only its extras. Because matching is last-match-wins, a site can re-include a globally ignored path with a ! negation.
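The prepend-then-last-match-wins behaviour can be illustrated with a deliberately simplified matcher. The real tool uses igittigitt with full .gitignore semantics; the sketch below only matches bare file names with `fnmatch`, and `merge_patterns` and `is_ignored` are hypothetical names.

```python
from fnmatch import fnmatch


def merge_patterns(global_patterns: list[str], site_patterns: list[str]) -> list[str]:
    """Global patterns come first, the site's own patterns are appended after them."""
    return [*global_patterns, *site_patterns]


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Simplified last-match-wins check on a bare name (not full .gitignore rules)."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatch(name, glob):
            # A later matching pattern overrides any earlier decision.
            ignored = not negated
    return ignored


if __name__ == "__main__":
    patterns = merge_patterns(["*~", ".*", "*.txt", "*.log", "*.py"], ["!robots.txt"])
    for name in ("notes.txt", "robots.txt", "index.html"):
        print(name, "ignored" if is_ignored(name, patterns) else "kept")
```

Because the site's patterns are evaluated after the global ones, the `!robots.txt` negation wins over the global `*.txt`; placing it before `*.txt` would have no effect.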

| Key | Type | Default | Description |
|---|---|---|---|
| `[sitemap]` | table | absent | Global defaults shared by all sites (optional). |
| `[sitemap].gzip` | bool | `false` | Default `gzip`; a site's own value overrides it. |
| `[sitemap].default_priority` | float | `0.5` | Default priority; a site's own value overrides it. |
| `[sitemap.filters].ignore` | array | `[]` | `.gitignore` patterns prepended to every site's own. |
| `[[site]]` | table array | `[]` | One entry per site; `generate` processes all by default. |
| `name` | string | required | Unique site identifier used by `--site`. |
| `base_url` | string | required | Site base URL; used to build child sitemap URLs on split. |
| `output_path` | string | required | Destination path for `sitemap.xml`. |
| `gzip` | bool | inherits | Write gzip-compressed output (`sitemap.xml.gz`). |
| `default_priority` | float | inherits | Priority assigned to every walked entry. |
| `[[site.directory]]` | table array | `[]` | `path` (on disk) mapped to `url` (prefix). |
| `[[site.url]]` | table array | `[]` | Explicit `loc` with optional `changefreq`/`priority`. |
| `[site.filters].ignore` | array | `[]` | Site `.gitignore` patterns; appended after the global ones. |
| `[site.filters].ignore_file` | string | absent | Path to a `.gitignore`-format rule file for this site. |
| `[site.filters].nested_ignore_filename` | string | absent | Per-directory ignore filename to discover (e.g. `.sitemapignore`). |

Filtering uses git .gitignore semantics (via igittigitt): patterns are anchored at each directory root, a trailing-slash pattern (zsvc/) prunes a whole subtree, and !pattern re-includes. Ignored directories are pruned, so their entire subtree is skipped. To index only one kind of file, invert with an allowlist: ignore = ["*", "!*/", "!*.html"] keeps just .html (the !*/ keeps directories so the walk can descend). Like git, a file under an ignored directory cannot be re-included.
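Subtree pruning can be sketched with `os.walk`, which skips any directory removed from its `dirnames` list in place. This is an assumption-laden illustration rather than the project's walker: `walk_urls` is a hypothetical name, the ignore decision is delegated to a caller-supplied predicate instead of igittigitt, URL escaping is omitted, and emitting a URL for the root directory itself is assumed.

```python
import os
from pathlib import Path
from typing import Callable, Iterator


def walk_urls(
    root: Path,
    url_prefix: str,
    is_ignored: Callable[[str, bool], bool],
) -> Iterator[str]:
    """Yield directory URLs (trailing slash) and file URLs below `root`.

    `is_ignored(name, is_dir)` is a stand-in for the real .gitignore matcher.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning: os.walk never descends into directories removed from
        # `dirnames`, so an ignored directory hides its whole subtree.
        dirnames[:] = sorted(d for d in dirnames if not is_ignored(d, True))
        relative = Path(dirpath).relative_to(root).as_posix()
        base = url_prefix if relative == "." else f"{url_prefix}{relative}/"
        yield base
        for name in sorted(filenames):
            if not is_ignored(name, False):
                yield f"{base}{name}"
```

This also shows why a file under an ignored directory cannot be re-included: once the directory is pruned, the walk never sees the file, so no later pattern gets a chance to match it.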

Keep all sites in one layer. lib_layered_config deep-merges nested tables but replaces lists wholesale (last writer wins), so a higher layer carrying a site array replaces a lower one rather than appending to it.
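The merge rule described above (nested tables merged, lists replaced) can be modelled in a few lines. This is a simplified stand-in for what lib_layered_config does, written only to show the consequence for `[[site]]`; `merge_layers` is a hypothetical name.

```python
from typing import Any


def merge_layers(lower: dict[str, Any], higher: dict[str, Any]) -> dict[str, Any]:
    """Deep-merge nested dicts; any other value (including lists) is replaced wholesale."""
    merged = dict(lower)
    for key, value in higher.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            # Lists land here: the higher layer's list wins, nothing is appended.
            merged[key] = value
    return merged


if __name__ == "__main__":
    app_layer = {
        "sitemap": {"gzip": False, "default_priority": 0.5},
        "site": [{"name": "media"}, {"name": "www"}],
    }
    user_layer = {"sitemap": {"gzip": True}, "site": [{"name": "blog"}]}
    print(merge_layers(app_layer, user_layer))
```

In this model the `[sitemap]` table keeps `default_priority` from the lower layer while picking up the new `gzip` value, but the `site` array ends up containing only `blog`; the `media` and `www` entries are gone.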

Programmatic Usage

from hugesitemap.adapters.config.loader import get_config
from hugesitemap.adapters.config.site_loader import load_sites
from hugesitemap.adapters.filesystem import walk_directory
from hugesitemap.adapters.sitemap_lxml import write_sitemap
from hugesitemap.application.generate import (
    DirectoryRequest,
    GenerateRequest,
    generate_sitemap,
)
from hugesitemap.domain.filters import FilterSpec
from hugesitemap.domain.model import SitemapEntry

# Load every [[site]] from the layered configuration and generate each one.
for site in load_sites(get_config()):
    # Translate the site's configuration into a request for the use case.
    request = GenerateRequest(
        base_url=site.base_url,
        output_path=site.output_path,
        gzip=site.gzip,
        default_priority=site.default_priority,
        directories=tuple(DirectoryRequest(root=d.path, url_prefix=d.url) for d in site.directories),
        # Explicit [[site.url]] entries carry no mtime, so lastmod is None.
        explicit_entries=tuple(
            SitemapEntry(loc=u.loc, lastmod=None, priority=u.priority, changefreq=u.changefreq)
            for u in site.explicit_urls
        ),
        filter_spec=FilterSpec(
            patterns=tuple(site.filters.ignore),
            ignore_file=site.filters.ignore_file,
            nested_filename=site.filters.nested_ignore_filename,
        ),
    )
    # The filesystem walker and the lxml writer are passed in as adapters.
    result = generate_sitemap(request, content_source=walk_directory, write_sitemap=write_sitemap)
    print(site.name, result.url_count, result.paths_written)

Further Documentation

AI transparency

This project is built with AI-assisted tooling under the maintainer's direction and review. For the general position, see ai-stance.md; for an honest account of how AI was used in this specific repository, see ai-disclosure.md.

Maintainers

bitranox


Release history

2.0.1

Jun 26, 2026

2.0.0

Jun 26, 2026

1.0.0

Jun 25, 2026

Download files

Source Distribution

hugesitemap-2.0.1.tar.gz (133.7 kB)

Uploaded Jun 26, 2026 Source

Built Distribution

hugesitemap-2.0.1-py3-none-any.whl (77.7 kB)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file hugesitemap-2.0.1.tar.gz.

File metadata

Download URL: hugesitemap-2.0.1.tar.gz
Upload date: Jun 26, 2026
Size: 133.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hugesitemap-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`ea2a01f25597bd5210d5db55476cc8cf9117e2f550fdb97f0c4881bf3aa5c3a0`
MD5	`313d93af3c5eb509e0d8808503f2bd34`
BLAKE2b-256	`01766e93e4414e3a6ac865626ad2f5bd4628e3aedf22e08dce71ca3ac5ce08cf`

Provenance

The following attestation bundles were made for hugesitemap-2.0.1.tar.gz:

Publisher: default_release_public.yml on bitranox/hugesitemap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hugesitemap-2.0.1.tar.gz
- Subject digest: ea2a01f25597bd5210d5db55476cc8cf9117e2f550fdb97f0c4881bf3aa5c3a0
- Sigstore transparency entry: 1972863961
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: bitranox/hugesitemap@db9d10131286e725475596badd85b7c5168d8bba
- Branch / Tag: refs/tags/v2.0.1
- Owner: https://github.com/bitranox
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: default_release_public.yml@db9d10131286e725475596badd85b7c5168d8bba
- Trigger Event: push

File details

Details for the file hugesitemap-2.0.1-py3-none-any.whl.

File metadata

Download URL: hugesitemap-2.0.1-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 77.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hugesitemap-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36099f340700f99f77e3d194d964866fdc865713b5125020833fd4700b5c5e5d`
MD5	`c03d2fe563b150c8f70a03cf12379f22`
BLAKE2b-256	`804f110eb853d82801f24644d4564a31eefc0d79bfa8eb784b2c40b373a72370`

Provenance

The following attestation bundles were made for hugesitemap-2.0.1-py3-none-any.whl:

Publisher: default_release_public.yml on bitranox/hugesitemap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hugesitemap-2.0.1-py3-none-any.whl
- Subject digest: 36099f340700f99f77e3d194d964866fdc865713b5125020833fd4700b5c5e5d
- Sigstore transparency entry: 1972864069
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: bitranox/hugesitemap@db9d10131286e725475596badd85b7c5168d8bba
- Branch / Tag: refs/tags/v2.0.1
- Owner: https://github.com/bitranox
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: default_release_public.yml@db9d10131286e725475596badd85b7c5168d8bba
- Trigger Event: push
