Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, humans.txt.

django-icv-sitemaps

Django's built-in django.contrib.sitemaps loads every URL into memory at request time. On a site with tens of thousands of pages that means slow responses, high memory pressure, and no incremental updates when content changes. At a million URLs it simply does not work.

django-icv-sitemaps replaces that approach entirely. Sitemaps are built in the background by Celery tasks, written atomically to any Django storage backend (local, S3, GCS), and served as static files. Only sections whose content has changed are ever rebuilt. The full protocol is covered: standard, image, video, and news sitemaps, automatic file splitting, gzip compression, and search engine pinging — plus a complete set of web discovery files (robots.txt, llms.txt, ads.txt, security.txt, humans.txt) managed from the database.

Part of the ICV-Django ecosystem, but fully standalone — no other ICV packages required.


Features

  • Background generation — sitemaps are generated by Celery tasks (optional), written to Django storage backends (local, S3, GCS), and served statically
  • Incremental updates: post_save/post_delete signals mark affected sections as stale; only changed sections are regenerated
  • All four sitemap types — standard, image, video, and news sitemaps with correct XML namespaces per the sitemap protocol
  • Automatic splitting — files are split at 50,000 URLs or 50 MB per the protocol limits
  • SitemapMixin — declare any Django model as sitemap-includable with a small set of class attributes
  • Auto-sections: ICV_SITEMAPS_AUTO_SECTIONS wires signal handlers automatically (the same pattern as the ICV_SEARCH_AUTO_INDEX setting)
  • robots.txt — dynamic, database-driven rules merged with settings; includes Sitemap: directive automatically
  • llms.txt — AI crawler guidance served at /llms.txt
  • ads.txt / app-ads.txt — IAB-format authorised seller declarations
  • security.txt — RFC 9116 compliant, served at /.well-known/security.txt
  • humans.txt — team credits
  • URL redirects — database-driven redirect rules (301/302/307/308/410) with exact, prefix, and regex matching, priority ordering, expiry, hit tracking, and CSV import/export
  • 404 tracking — automatic detection of recurring 404s with hit counts and referrer tracking; create redirect rules directly from admin
  • RedirectMiddleware — opt-in middleware evaluates redirect rules before Django's URL resolver; fail-open design never breaks the request cycle
  • Search engine ping — Google, Bing, Yandex notified on content changes (conditional on checksum comparison)
  • Multi-tenancy — all discovery files are tenant-scoped; sitemap paths include tenant prefix to prevent collisions; tenant IDs are sanitised to prevent path-traversal attacks
  • Gzip support — compressed .xml.gz output with correct headers
  • Atomic writes — temp file then rename; no partially-written files served
  • 6 management commands: setup, generate, ping, validate, stats, redirects
  • Django admin — all 8 models registered with actions, list filters, and read-only views
  • Celery graceful degradation — tasks work synchronously when Celery is not installed
  • Testing utilities — 8 factory-boy factories, pytest fixtures, and helpers in icv_sitemaps.testing
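
The "temp file then rename" pattern behind atomic writes can be illustrated with the standard library alone. This is a minimal sketch for a local filesystem, not the package's implementation; the function name atomic_write is hypothetical, and remote storage backends such as S3 handle overwrites differently.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens the target path sees either the previous complete file or the new complete file, never a half-written one.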

Installation

pip install django-icv-sitemaps

Add to INSTALLED_APPS:

INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]

Run migrations:

python manage.py migrate icv_sitemaps

Include the URL configuration:

# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]

This registers all discovery file endpoints at the root (/sitemap.xml, /robots.txt, /llms.txt, /ads.txt, /app-ads.txt, /.well-known/security.txt, /humans.txt).


Quick Start

1. Make your model sitemap-includable

# myapp/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)

2. Configure auto-sections

# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        "model": "myapp.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}

3. Set up and generate

# Create SitemapSection records from config
python manage.py icv_sitemaps_setup

# Generate all sitemaps
python manage.py icv_sitemaps_generate --all

# Validate output
python manage.py icv_sitemaps_validate

# Check stats
python manage.py icv_sitemaps_stats

4. Automatic regeneration

When an Article is saved or deleted, its section is marked stale. The regenerate_stale_sitemaps task picks it up on the next run.

# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
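
The stale-marking flow described above can be pictured with a small in-memory model. This is an illustrative sketch only: SectionRegistry and its methods are hypothetical names, and the package itself tracks staleness on SitemapSection records in the database.

```python
from dataclasses import dataclass, field


@dataclass
class SectionRegistry:
    """Toy in-memory stand-in for per-section staleness tracking."""

    stale: set[str] = field(default_factory=set)
    generated: list[str] = field(default_factory=list)

    def mark_stale(self, name: str) -> None:
        # Called from a save/delete hook; repeated marks collapse into one entry.
        self.stale.add(name)

    def regenerate_stale(self) -> list[str]:
        # Rebuild only the sections that changed, then clear their flags.
        rebuilt = sorted(self.stale)
        self.generated.extend(rebuilt)
        self.stale.clear()
        return rebuilt
```

Saving the same model many times between two scheduled runs therefore costs one regeneration of that section, not one per save.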

Sitemap Types

Standard

Standard XML sitemaps with <loc>, <lastmod>, <changefreq>, and <priority> per the sitemaps.org protocol.
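
For reference, the shape of a standard sitemap file can be reproduced with xml.etree.ElementTree. This is a sketch of the sitemaps.org format rather than the package's generator; build_urlset and the entry dict keys are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries: list[dict]) -> bytes:
    """Serialise entries into a <urlset> document in the sitemaps.org namespace."""
    # Register the namespace as the default so tags are written without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Only <loc> is mandatory in the protocol; the other tags are optional.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```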

Image

Uses the http://www.google.com/schemas/sitemap-image/1.1 namespace. Configure image fields on your mixin:

class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"

Video

Uses the http://www.google.com/schemas/sitemap-video/1.1 namespace:

class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"

News

Uses the http://www.google.com/schemas/sitemap-news/0.9 namespace. Entries older than ICV_SITEMAPS_NEWS_MAX_AGE_DAYS (default 2) are automatically excluded:

class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"

Discovery Files

robots.txt

Database-driven rules managed via Django admin or the service layer:

from icv_sitemaps.services import add_robots_rule

# Block GPTBot from /private/ and CCBot from the entire site
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")

# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")

Extra directives from settings are appended after database rules:

ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
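
To make the merge order concrete, here is a rough sketch of how database rules, extra directives, and the Sitemap: line could be assembled into one file. The function render_robots and its tuple-based input are assumptions for illustration; the package's render_robots_txt may group and order lines differently.

```python
def render_robots(rules, extra_directives=(), sitemap_url=""):
    """Render (user_agent, directive, path) tuples into robots.txt text."""
    groups: dict[str, list[str]] = {}
    for user_agent, directive, path in rules:
        # Collect all directives for the same user agent into one group.
        groups.setdefault(user_agent, []).append(f"{directive.capitalize()}: {path}")

    lines = []
    for user_agent, directives in groups.items():
        lines.append(f"User-agent: {user_agent}")
        lines.extend(directives)
        lines.append("")  # blank line separates groups
    # Extra directives from settings follow the database rules.
    lines.extend(extra_directives)
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"
```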

ads.txt / app-ads.txt

IAB-format authorised seller declarations:

from icv_sitemaps.services import add_ads_entry

add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")

# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
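
Each entry corresponds to one comma-separated line in the served file. The sketch below shows that line format as defined by the IAB ads.txt specification; format_ads_line is a hypothetical helper and not a function exported by the package.

```python
def format_ads_line(domain, publisher_id, relationship, certification_id=""):
    """Format one ads.txt record: domain, account ID, relationship, optional cert ID."""
    relationship = relationship.upper()
    if relationship not in {"DIRECT", "RESELLER"}:
        raise ValueError(f"unknown relationship: {relationship!r}")
    fields = [domain, publisher_id, relationship]
    if certification_id:
        # The certification authority ID is the optional fourth field.
        fields.append(certification_id)
    return ", ".join(fields)
```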

llms.txt, security.txt, humans.txt

Free-form text content managed via DiscoveryFileConfig:

from icv_sitemaps.services import set_discovery_file_content

set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com

Allow: /blog/
Disallow: /private/
""")

set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")

set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")

URL Redirects & 404 Tracking

Redirect Rules

Database-driven redirect rules evaluated by RedirectMiddleware before Django's URL resolver:

from icv_sitemaps.services import add_redirect

# Permanent redirect
add_redirect("/old-page/", "/new-page/", 301)

# Temporary redirect
add_redirect("/promo/", "/summer-sale/", 302)

# 410 Gone — page permanently removed
add_redirect("/deleted-product/", "", 410)

# Prefix match — all paths under /blog/2023/ redirect
add_redirect("/blog/2023/", "/archive/2023/", 301, match_type="prefix")

# Regex match
add_redirect(r"/product/\d+/", "/products/", 301, match_type="regex")

# Bulk import from CSV
import csv

from icv_sitemaps.services import bulk_import_redirects

# newline="" is what the csv module expects when reading from a file
with open("redirects.csv", newline="") as f:
    rows = list(csv.DictReader(f))

result = bulk_import_redirects(rows)
# {"created": 150, "updated": 3, "errors": []}
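
The three match types can be sketched as a small pure-Python matcher. This is illustrative only: the Rule dataclass, the "higher priority wins" ordering, and the use of re.fullmatch for regex rules are assumptions, and the package's check_redirect may resolve conflicts differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    source: str
    destination: str
    status: int = 301
    match_type: str = "exact"  # "exact", "prefix", or "regex"
    priority: int = 0


def match_redirect(path, rules):
    """Return the first rule matching path, or None if nothing matches."""
    # Assumption: larger priority values are evaluated first; ties keep list order.
    for rule in sorted(rules, key=lambda r: -r.priority):
        if rule.match_type == "exact" and path == rule.source:
            return rule
        if rule.match_type == "prefix" and path.startswith(rule.source):
            return rule
        if rule.match_type == "regex" and re.fullmatch(rule.source, path):
            return rule
    return None
```

A rule with status 410 and an empty destination simply signals that the caller should answer "Gone" instead of redirecting.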

Enable the Middleware

# settings.py
MIDDLEWARE = [
    # ... security/WAF middleware first ...
    "icv_sitemaps.middleware.RedirectMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ...
]

ICV_SITEMAPS_REDIRECT_ENABLED = True

404 Tracking

Enable automatic 404 tracking to identify broken URLs:

# settings.py
ICV_SITEMAPS_404_TRACKING_ENABLED = True
ICV_SITEMAPS_404_TRACKING_SAMPLE_RATE = 1.0   # Track all 404s (reduce for high traffic)
ICV_SITEMAPS_404_IGNORE_PATTERNS = [
    r"\.(?:css|js|ico|png|jpg|jpeg|gif|svg|woff2?|ttf|eot|map)$",
]
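
The interaction between the sample rate and the ignore patterns can be sketched as follows. The function should_track_404 and its injectable rng parameter are hypothetical and exist only to make the behaviour easy to test; the middleware's real logic may differ.

```python
import random
import re

IGNORE_PATTERNS = [
    re.compile(r"\.(?:css|js|ico|png|jpg|jpeg|gif|svg|woff2?|ttf|eot|map)$"),
]


def should_track_404(path, sample_rate=1.0, patterns=IGNORE_PATTERNS, rng=random.random):
    """Decide whether a 404 for path should be recorded."""
    # Static-asset misses are noise, so drop them before sampling.
    if any(pattern.search(path) for pattern in patterns):
        return False
    # rng() returns a float in [0.0, 1.0): a rate of 1.0 tracks everything, 0.0 nothing.
    return rng() < sample_rate
```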

Review top 404s and create redirects:

from icv_sitemaps.services import get_top_404s

# Top 50 unresolved 404s with at least 5 hits
for entry in get_top_404s(min_hits=5):
    print(f"{entry.path}: {entry.hit_count} hits, referrers: {entry.referrers}")

Or from the command line:

python manage.py icv_sitemaps_redirects --top-404s
python manage.py icv_sitemaps_redirects --list
python manage.py icv_sitemaps_redirects --import redirects.csv
python manage.py icv_sitemaps_redirects --export redirects.csv
python manage.py icv_sitemaps_redirects --prune   # Remove expired rules

Configuration

Settings Reference

All settings are namespaced under ICV_SITEMAPS_*. Every setting except ICV_SITEMAPS_BASE_URL has a sensible default, so the package needs very little configuration for local development.

| Setting | Type | Default | Description |
|---|---|---|---|
| ICV_SITEMAPS_BASE_URL | str | "" | Base URL for absolute sitemap URLs (e.g. "https://example.com"). Required; raises ImproperlyConfigured at generation time if empty |
| ICV_SITEMAPS_STORAGE_BACKEND | str | "django.core.files.storage.default_storage" | Dotted path to Django storage backend for generated files |
| ICV_SITEMAPS_STORAGE_PATH | str | "sitemaps/" | Base path within the storage backend |
| ICV_SITEMAPS_MAX_URLS_PER_FILE | int | 50000 | Maximum URLs per file (protocol limit: 50,000) |
| ICV_SITEMAPS_MAX_FILE_SIZE_BYTES | int | 52428800 | Maximum file size in bytes (protocol limit: 50 MB) |
| ICV_SITEMAPS_BATCH_SIZE | int | 5000 | Queryset iteration batch size |
| ICV_SITEMAPS_GZIP | bool | True | Compress files with gzip |
| ICV_SITEMAPS_PING_ENGINES | list | ["google", "bing"] | Engines to ping after regeneration |
| ICV_SITEMAPS_PING_ENABLED | bool | True | Enable/disable pinging |
| ICV_SITEMAPS_AUTO_SECTIONS | dict | {} | Auto-register model sections (see Quick Start) |
| ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES | list | [] | Extra lines appended to robots.txt |
| ICV_SITEMAPS_ROBOTS_SITEMAP_URL | str | "" | Override sitemap URL in robots.txt (auto-detected if empty) |
| ICV_SITEMAPS_CACHE_TIMEOUT | int | 3600 | Cache TTL for discovery files (seconds) |
| ICV_SITEMAPS_TENANT_PREFIX_FUNC | str | "" | Dotted path to tenant prefix callable |
| ICV_SITEMAPS_ASYNC_GENERATION | bool | True | Use Celery for background generation |
| ICV_SITEMAPS_STREAMING_THRESHOLD | int | 100000 | URL count above which streaming generation is used |
| ICV_SITEMAPS_NEWS_MAX_AGE_DAYS | int | 2 | Maximum age for news entries in days (Google News expects articles published within the last 2 days) |
| ICV_SITEMAPS_REDIRECT_ENABLED | bool | False | Enable redirect middleware evaluation (opt-in) |
| ICV_SITEMAPS_REDIRECT_CACHE_TIMEOUT | int | 300 | Cache TTL for redirect rule lookups (seconds) |
| ICV_SITEMAPS_404_TRACKING_ENABLED | bool | False | Enable 404 tracking in the redirect middleware |
| ICV_SITEMAPS_404_TRACKING_SAMPLE_RATE | float | 1.0 | Fraction of 404s to track (0.0 to 1.0) |
| ICV_SITEMAPS_404_IGNORE_PATTERNS | list | [r"\.(?:css\|js\|...)$"] | Regex patterns for paths to ignore when tracking 404s |
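
To illustrate what ICV_SITEMAPS_MAX_URLS_PER_FILE controls, the sketch below splits a URL stream into chunks of at most 50,000 items without loading the whole stream into memory. The helper split_urls is hypothetical, and it covers only the URL-count limit; the byte-size limit would need a separate check while writing.

```python
from itertools import islice


def split_urls(urls, max_per_file=50_000):
    """Yield successive lists of at most max_per_file URLs from any iterable."""
    iterator = iter(urls)
    # islice pulls the next chunk lazily; an empty chunk means the input is exhausted.
    while chunk := list(islice(iterator, max_per_file)):
        yield chunk
```

Each yielded chunk would become one file in the section, and the sitemap index would list them all.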

Auto-Sections Configuration

Each key in ICV_SITEMAPS_AUTO_SECTIONS is the section name. The value is a configuration dict:

| Key | Type | Default | Description |
|---|---|---|---|
| model | str | required | "app_label.ModelName" |
| sitemap_type | str | "standard" | standard, image, video, or news |
| changefreq | str | "daily" | Default change frequency |
| priority | float | 0.5 | Default priority (0.0 to 1.0) |
| on_save | bool | True | Mark section stale on model save |
| on_delete | bool | True | Mark section stale on model delete |

Service Functions

All functions are importable from icv_sitemaps.services:

from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
    # Redirects
    check_redirect,
    add_redirect,
    bulk_import_redirects,
    record_404,
    get_top_404s,
)

Management Commands

| Command | Purpose |
|---|---|
| icv_sitemaps_setup [--dry-run] | Create SitemapSection records from ICV_SITEMAPS_AUTO_SECTIONS and verify storage |
| icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID] | Generate sitemaps; defaults to stale sections only |
| icv_sitemaps_ping [--url URL] [--tenant ID] | Ping search engines |
| icv_sitemaps_validate [--section NAME] | Validate generated sitemaps against protocol |
| icv_sitemaps_stats [--tenant ID] | Show generation statistics |
| icv_sitemaps_redirects [--list] [--import FILE] [--export FILE] [--prune] [--top-404s] | Manage redirect rules |

Signals

All signals are defined in icv_sitemaps.signals:

| Signal | When |
|---|---|
| sitemap_section_generated | After a section is successfully generated |
| sitemap_generation_complete | After all sections are generated |
| sitemap_section_deleted | After a section and its files are deleted |
| sitemap_pinged | After search engines are pinged |
| sitemap_section_stale | After a section is marked stale |
| redirect_rule_saved | After a redirect rule is saved |
| redirect_rule_deleted | After a redirect rule is deleted |
| redirect_matched | When a redirect rule matches a request |

Celery Tasks

| Task | Purpose | Schedule |
|---|---|---|
| regenerate_stale_sitemaps | Regenerate stale sections | Every 15 minutes |
| regenerate_all_sitemaps | Full regeneration | Daily at 03:00 |
| ping_engines_task | Ping search engines | After generation |
| cleanup_generation_logs | Delete old logs (30-day default) | Daily at 04:00 |
| cleanup_orphan_files | Remove unreferenced storage files | Weekly |
| cleanup_expired_redirects | Delete expired redirect rules | Daily |
| cleanup_redirect_logs | Delete old resolved 404 logs (90-day default) | Weekly |

Multi-Tenancy

Enable tenant-scoped discovery files by setting ICV_SITEMAPS_TENANT_PREFIX_FUNC to a dotted path to a callable that returns the tenant identifier:

# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")

# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"

Each tenant gets isolated robots.txt, ads.txt, sitemaps, and all other discovery files. Sitemap files are stored with tenant-prefixed paths (e.g. sitemaps/acme/products-0.xml).
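
A tenant-prefixed path with a sanitised tenant ID might be built along these lines. This is a sketch under stated assumptions: tenant_path is a hypothetical helper and the allowed character set (letters, digits, underscore, hyphen) is chosen for the example, not taken from the package.

```python
import posixpath
import re

# Anything outside this set is dropped, which removes "/", "\" and "." entirely.
_UNSAFE_TENANT_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def tenant_path(base, tenant_id, filename):
    """Build a storage path such as sitemaps/acme/products-0.xml."""
    safe_tenant = _UNSAFE_TENANT_CHARS.sub("", tenant_id)
    if not safe_tenant:
        # No tenant (or nothing left after sanitising): fall back to the shared path.
        return posixpath.join(base, filename)
    return posixpath.join(base, safe_tenant, filename)
```

Because path separators and dots never survive sanitisation, a tenant ID like "../../etc" cannot climb out of the base directory.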


Production Configuration

# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True

Testing

The package provides testing utilities for consuming projects:

from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
    RedirectRuleFactory,
    RedirectLogFactory,
)

To run the package's own tests:

cd packages/icv-sitemaps
pytest tests/ -v

Models

| Model | Purpose |
|---|---|
| SitemapSection | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| SitemapFile | Individual generated XML file with URL count and checksum |
| SitemapGenerationLog | Audit trail for generation runs |
| RobotsRule | Database-driven robots.txt directives |
| AdsEntry | ads.txt / app-ads.txt authorised seller entries |
| DiscoveryFileConfig | Content store for llms.txt, security.txt, humans.txt |
| RedirectRule | HTTP redirect and 410 Gone rules with pattern matching |
| RedirectLog | Aggregated 404 tracking with hit counts and referrers |

URL Endpoints

| URL | Content-Type | Description |
|---|---|---|
| /sitemap.xml | application/xml | Sitemap index |
| /sitemaps/<filename> | application/xml | Individual sitemap files |
| /robots.txt | text/plain | Robots exclusion protocol |
| /llms.txt | text/plain | AI crawler guidance |
| /ads.txt | text/plain | Authorised digital sellers |
| /app-ads.txt | text/plain | Authorised app sellers |
| /.well-known/security.txt | text/plain | Security contact (RFC 9116) |
| /security.txt | 301 redirect | Redirects to /.well-known/security.txt |
| /humans.txt | text/plain | Team credits |

Requirements

  • Python 3.11+
  • Django 5.1+
  • httpx 0.27+ (for search engine pings)
  • Celery 5.3+ (optional, for background generation)

Licence

MIT
