django-icv-sitemaps

Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, and humans.txt.

Designed for sites with millions of URLs where Django's built-in django.contrib.sitemaps is impractical (memory-hungry querysets, blocking request-time generation, no incremental updates).

Part of the ICV-Django ecosystem, but fully standalone — no other ICV packages required.


Features

  • Background generation — sitemaps are generated by Celery tasks (optional), written to Django storage backends (local, S3, GCS), and served statically
  • Incremental updates — post_save/post_delete signals mark affected sections as stale; only changed sections are regenerated
  • All four sitemap types — standard, image, video, and news sitemaps with correct XML namespaces per the sitemap protocol
  • Automatic splitting — files are split at 50,000 URLs or 50 MB per the protocol limits
  • SitemapMixin — declare any Django model as sitemap-includable with a small set of class attributes
  • Auto-sections — ICV_SITEMAPS_AUTO_SECTIONS wires signal handlers automatically, in the same way as ICV_SEARCH_AUTO_INDEX elsewhere in the ICV-Django ecosystem
  • robots.txt — dynamic, database-driven rules merged with settings; includes Sitemap: directive automatically
  • llms.txt — AI crawler guidance served at /llms.txt
  • ads.txt / app-ads.txt — IAB-format authorised seller declarations
  • security.txt — RFC 9116 compliant, served at /.well-known/security.txt
  • humans.txt — team credits
  • Search engine ping — Google, Bing, Yandex notified on content changes (conditional on checksum comparison)
  • Multi-tenancy — all discovery files are tenant-scoped; sitemap paths include tenant prefix to prevent collisions; tenant IDs are sanitised to prevent path-traversal attacks
  • Gzip support — compressed .xml.gz output with correct headers
  • Atomic writes — temp file then rename; no partially-written files served
  • 5 management commands — setup, generate, ping, validate, stats
  • Django admin — all 6 models registered with actions, list filters, and read-only views
  • Celery graceful degradation — tasks work synchronously when Celery is not installed
  • Testing utilities — 6 factory-boy factories, pytest fixtures, and helpers in icv_sitemaps.testing
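
The automatic splitting rule in the list above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `split_entries` and its parameters are hypothetical names, and the size check counts only the rendered entries, ignoring the XML header and closing tag.

```python
from typing import Iterable, Iterator

MAX_URLS = 50_000             # protocol limit on URLs per file
MAX_BYTES = 50 * 1024 * 1024  # protocol limit on uncompressed file size


def split_entries(
    entries: Iterable[str],
    max_urls: int = MAX_URLS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[list[str]]:
    """Yield lists of pre-rendered <url> entries that respect both limits."""
    chunk: list[str] = []
    size = 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        # Start a new file when adding this entry would break either limit.
        if chunk and (len(chunk) >= max_urls or size + entry_size > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(entry)
        size += entry_size
    if chunk:
        yield chunk


entries = [f"<url><loc>https://example.com/p/{i}/</loc></url>" for i in range(120_000)]
print([len(c) for c in split_entries(entries)])  # [50000, 50000, 20000]
```

Because the function is a generator, it never holds more than one file's worth of entries in memory, which is the property that matters for very large sections.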

Installation

pip install django-icv-sitemaps

Add to INSTALLED_APPS:

INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]

Run migrations:

python manage.py migrate icv_sitemaps

Include the URL configuration:

# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]

This registers all discovery file endpoints at the root (/sitemap.xml, /robots.txt, /llms.txt, /ads.txt, /app-ads.txt, /.well-known/security.txt, /humans.txt).


Quick Start

1. Make your model sitemap-includable

# blog/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)

2. Configure auto-sections

# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        "model": "blog.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}

3. Set up and generate

# Create SitemapSection records from config
python manage.py icv_sitemaps_setup

# Generate all sitemaps
python manage.py icv_sitemaps_generate --all

# Validate output
python manage.py icv_sitemaps_validate

# Check stats
python manage.py icv_sitemaps_stats

4. Automatic regeneration

When an Article is saved or deleted, its section is marked stale. The regenerate_stale_sitemaps task picks it up on the next run.

# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
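
The stale-then-regenerate flow can be modelled without Django. The sketch below is an in-memory stand-in: `Section` and `SectionRegistry` are hypothetical names for illustration and do not reflect the package's models or task code.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    is_stale: bool = False
    generation_count: int = 0


class SectionRegistry:
    """In-memory stand-in for section records that carry a stale flag."""

    def __init__(self) -> None:
        self.sections: dict[str, Section] = {}

    def add(self, name: str) -> Section:
        section = Section(name)
        self.sections[name] = section
        return section

    def mark_stale(self, name: str) -> None:
        # What a post_save/post_delete handler would do for the model's section.
        self.sections[name].is_stale = True

    def regenerate_stale(self) -> list[str]:
        # What a periodic task would do: touch only the stale sections.
        regenerated = []
        for section in self.sections.values():
            if section.is_stale:
                section.generation_count += 1
                section.is_stale = False
                regenerated.append(section.name)
        return regenerated


registry = SectionRegistry()
registry.add("articles")
registry.add("videos")
registry.mark_stale("articles")
print(registry.regenerate_stale())  # ['articles']
print(registry.regenerate_stale())  # []
```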

Sitemap Types

Standard

Standard XML sitemaps with <loc>, <lastmod>, <changefreq>, and <priority> per the sitemaps.org protocol.
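
For reference, the four elements map onto XML as in the following sketch. It uses only the standard library; `build_urlset` and the entry dictionary keys are illustrative assumptions, not the package's generator.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Render a list of entry dicts as a sitemaps.org <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        ET.SubElement(url, "lastmod").text = entry["lastmod"].isoformat()
        ET.SubElement(url, "changefreq").text = entry["changefreq"]
        ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    return ET.tostring(urlset, encoding="unicode")


print(build_urlset([
    {
        "loc": "https://example.com/articles/hello/",
        "lastmod": date(2024, 5, 1),
        "changefreq": "weekly",
        "priority": 0.7,
    }
]))
```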

Image

Uses the http://www.google.com/schemas/sitemap-image/1.1 namespace. Configure image fields on your mixin:

class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"

Video

Uses the http://www.google.com/schemas/sitemap-video/1.1 namespace:

class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"

News

Uses the http://www.google.com/schemas/sitemap-news/0.9 namespace. Entries older than ICV_SITEMAPS_NEWS_MAX_AGE_DAYS (default 2) are automatically excluded:

class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"
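
The age cut-off amounts to a simple date comparison. A minimal sketch, assuming each item exposes a timezone-aware `published_at` value; `recent_news` is a hypothetical helper rather than a function exported by the package:

```python
from datetime import datetime, timedelta, timezone


def recent_news(items, max_age_days=2, now=None):
    """Keep only items published within the last max_age_days days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [item for item in items if item["published_at"] >= cutoff]


now = datetime(2024, 5, 10, 12, 0, tzinfo=timezone.utc)
items = [
    {"headline": "Fresh", "published_at": datetime(2024, 5, 9, 12, 0, tzinfo=timezone.utc)},
    {"headline": "Old", "published_at": datetime(2024, 5, 5, 12, 0, tzinfo=timezone.utc)},
]
print([item["headline"] for item in recent_news(items, now=now)])  # ['Fresh']
```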

Discovery Files

robots.txt

Database-driven rules managed via Django admin or the service layer:

from icv_sitemaps.services import add_robots_rule

# Block GPTBot from /private/ and CCBot from the entire site
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")

# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")

Extra directives from settings are appended after database rules:

ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
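
To make the ordering concrete, here is a rough sketch of how rules, extra directives, and the Sitemap: line could be combined into one file. `render_robots` is a hypothetical stand-in, not the package's `render_robots_txt`, and the grouping by user agent is an assumption about the output layout.

```python
def render_robots(rules, extra_directives=(), sitemap_url=""):
    """Render (user_agent, directive, path) tuples as robots.txt text."""
    groups = {}
    for user_agent, directive, path in rules:
        groups.setdefault(user_agent, []).append((directive, path))

    lines = []
    for user_agent, group in groups.items():
        lines.append(f"User-agent: {user_agent}")
        for directive, path in group:
            lines.append(f"{directive.capitalize()}: {path}")
        lines.append("")  # blank line between groups

    lines.extend(extra_directives)  # settings-provided lines follow the rules
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(render_robots(
    [("GPTBot", "disallow", "/private/"), ("*", "disallow", "/admin/")],
    extra_directives=["Crawl-delay: 10"],
    sitemap_url="https://example.com/sitemap.xml",
))
```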

ads.txt / app-ads.txt

IAB-format authorised seller declarations:

from icv_sitemaps.services import add_ads_entry

add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")

# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
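
Each entry corresponds to one comma-separated line in the served file: domain, publisher ID, relationship, and an optional certification authority ID. The helper below is a hypothetical sketch of that formatting, not the package's `render_ads_txt`.

```python
def format_ads_line(domain, publisher_id, relationship, certification_id=""):
    """Format one ads.txt record; the fourth field is optional."""
    fields = [domain.lower(), publisher_id, relationship.upper()]
    if certification_id:
        fields.append(certification_id)
    return ", ".join(fields)


print(format_ads_line("google.com", "pub-1234567890", "DIRECT", "f08c47fec0942fa0"))
# google.com, pub-1234567890, DIRECT, f08c47fec0942fa0
print(format_ads_line("adnetwork.com", "pub-9876543210", "reseller"))
# adnetwork.com, pub-9876543210, RESELLER
```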

llms.txt, security.txt, humans.txt

Free-form text content managed via DiscoveryFileConfig:

from icv_sitemaps.services import set_discovery_file_content

set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com

Allow: /blog/
Disallow: /private/
""")

set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")

set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")

Configuration

Settings Reference

All settings are namespaced under ICV_SITEMAPS_*. Every setting has a default, so the package works out of the box for local development; the one exception is ICV_SITEMAPS_BASE_URL, which must be set before sitemaps can be generated.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| ICV_SITEMAPS_BASE_URL | str | "" | Base URL for absolute sitemap URLs (e.g. "https://example.com"). Required: raises ImproperlyConfigured at generation time if empty |
| ICV_SITEMAPS_STORAGE_BACKEND | str | "django.core.files.storage.default_storage" | Dotted path to Django storage backend for generated files |
| ICV_SITEMAPS_STORAGE_PATH | str | "sitemaps/" | Base path within the storage backend |
| ICV_SITEMAPS_MAX_URLS_PER_FILE | int | 50000 | Maximum URLs per file (protocol limit: 50,000) |
| ICV_SITEMAPS_MAX_FILE_SIZE_BYTES | int | 52428800 | Maximum file size in bytes (protocol limit: 50 MB) |
| ICV_SITEMAPS_BATCH_SIZE | int | 5000 | Queryset iteration batch size |
| ICV_SITEMAPS_GZIP | bool | True | Compress files with gzip |
| ICV_SITEMAPS_PING_ENGINES | list | ["google", "bing"] | Engines to ping after regeneration |
| ICV_SITEMAPS_PING_ENABLED | bool | True | Enable/disable pinging |
| ICV_SITEMAPS_AUTO_SECTIONS | dict | {} | Auto-register model sections (see Quick Start) |
| ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES | list | [] | Extra lines appended to robots.txt |
| ICV_SITEMAPS_ROBOTS_SITEMAP_URL | str | "" | Override sitemap URL in robots.txt (auto-detected if empty) |
| ICV_SITEMAPS_CACHE_TIMEOUT | int | 3600 | Cache TTL for discovery files (seconds) |
| ICV_SITEMAPS_TENANT_PREFIX_FUNC | str | "" | Dotted path to tenant prefix callable |
| ICV_SITEMAPS_ASYNC_GENERATION | bool | True | Use Celery for background generation |
| ICV_SITEMAPS_STREAMING_THRESHOLD | int | 100000 | URL count above which streaming generation is used |
| ICV_SITEMAPS_NEWS_MAX_AGE_DAYS | int | 2 | Maximum age for news entries (Google requires < 2 days) |

Auto-Sections Configuration

Each key in ICV_SITEMAPS_AUTO_SECTIONS is the section name. The value is a configuration dict:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | required | "app_label.ModelName" |
| sitemap_type | str | "standard" | standard, image, video, or news |
| changefreq | str | "daily" | Default change frequency |
| priority | float | 0.5 | Default priority (0.0 to 1.0) |
| on_save | bool | True | Mark section stale on model save |
| on_delete | bool | True | Mark section stale on model delete |
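
As an illustration of how these defaults combine with a partial section dict, the sketch below merges and validates one entry. `resolve_section` and the validation rules are assumptions made for this example; the package's own handling may differ.

```python
SECTION_DEFAULTS = {
    "sitemap_type": "standard",
    "changefreq": "daily",
    "priority": 0.5,
    "on_save": True,
    "on_delete": True,
}
VALID_TYPES = {"standard", "image", "video", "news"}


def resolve_section(name, config):
    """Apply documented defaults to one auto-section entry and validate it."""
    if "model" not in config:
        raise ValueError(f"section {name!r} is missing the required 'model' key")
    resolved = {**SECTION_DEFAULTS, **config}
    if resolved["sitemap_type"] not in VALID_TYPES:
        raise ValueError(f"section {name!r} has an unknown sitemap_type")
    if not 0.0 <= resolved["priority"] <= 1.0:
        raise ValueError(f"section {name!r} priority must be between 0.0 and 1.0")
    return resolved


print(resolve_section("videos", {"model": "media.Video", "sitemap_type": "video"}))
```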

Service Functions

All functions are importable from icv_sitemaps.services:

from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
)

Management Commands

| Command | Purpose |
| --- | --- |
| icv_sitemaps_setup [--dry-run] | Create SitemapSection records from ICV_SITEMAPS_AUTO_SECTIONS and verify storage |
| icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID] | Generate sitemaps; defaults to stale sections only |
| icv_sitemaps_ping [--url URL] [--tenant ID] | Ping search engines |
| icv_sitemaps_validate [--section NAME] | Validate generated sitemaps against protocol |
| icv_sitemaps_stats [--tenant ID] | Show generation statistics |

Signals

All signals are defined in icv_sitemaps.signals:

| Signal | When |
| --- | --- |
| sitemap_section_generated | After a section is successfully generated |
| sitemap_generation_complete | After all sections are generated |
| sitemap_section_deleted | After a section and its files are deleted |
| sitemap_pinged | After search engines are pinged |
| sitemap_section_stale | After a section is marked stale |

Celery Tasks

| Task | Purpose | Schedule |
| --- | --- | --- |
| regenerate_stale_sitemaps | Regenerate stale sections | Every 15 minutes |
| regenerate_all_sitemaps | Full regeneration | Daily at 03:00 |
| ping_engines_task | Ping search engines | After generation |
| cleanup_generation_logs | Delete old logs (30-day default) | Daily at 04:00 |
| cleanup_orphan_files | Remove unreferenced storage files | Weekly |

Multi-Tenancy

Enable tenant-scoped discovery files by setting ICV_SITEMAPS_TENANT_PREFIX_FUNC to a dotted path to a callable that returns the tenant identifier:

# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")

# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"

Each tenant gets isolated robots.txt, ads.txt, sitemaps, and all other discovery files. Sitemap files are stored with tenant-prefixed paths (e.g. sitemaps/acme/products-0.xml).
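
The tenant-prefixed layout and the sanitisation mentioned under Features can be sketched as follows. The allow-list of characters and the `tenant_path` helper are assumptions for illustration; the package's actual sanitisation rules are not documented here. Only the tenant ID is cleaned in this sketch, not the filename.

```python
import re

# Assumed allow-list: anything outside letters, digits, "_" and "-" is dropped,
# so "/" and "." can never reach the storage path via the tenant ID.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def tenant_path(base_path, tenant_id, filename):
    """Build a storage path such as sitemaps/<tenant>/<filename>."""
    safe_tenant = _UNSAFE.sub("", tenant_id)
    parts = [base_path.rstrip("/")]
    if safe_tenant:
        parts.append(safe_tenant)
    parts.append(filename)
    return "/".join(parts)


print(tenant_path("sitemaps/", "acme", "products-0.xml"))       # sitemaps/acme/products-0.xml
print(tenant_path("sitemaps/", "../../etc", "products-0.xml"))  # sitemaps/etc/products-0.xml
print(tenant_path("sitemaps/", "", "products-0.xml"))           # sitemaps/products-0.xml
```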


Production Configuration

# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True
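
With gzip enabled, the "temp file then rename" approach listed under Features looks roughly like the sketch below on a local filesystem. `write_atomic_gzip` is a hypothetical helper; remote storage backends such as S3 do not expose a rename in this way, so the package necessarily handles them differently.

```python
import gzip
import os
import tempfile


def write_atomic_gzip(path, xml_text):
    """Write gzip-compressed text to path without exposing a partial file."""
    directory = os.path.dirname(path) or "."
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(xml_text.encode("utf-8"))
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "articles-0.xml.gz")
    write_atomic_gzip(target, "<urlset/>")
    with gzip.open(target, "rt", encoding="utf-8") as fh:
        print(fh.read())  # <urlset/>
```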

Testing

The package provides testing utilities for consuming projects:

from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
)

To run the package's own tests:

cd packages/icv-sitemaps
pytest tests/ -v

Models

| Model | Purpose |
| --- | --- |
| SitemapSection | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| SitemapFile | Individual generated XML file with URL count and checksum |
| SitemapGenerationLog | Audit trail for generation runs |
| RobotsRule | Database-driven robots.txt directives |
| AdsEntry | ads.txt / app-ads.txt authorised seller entries |
| DiscoveryFileConfig | Content store for llms.txt, security.txt, humans.txt |
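
The checksum stored on SitemapFile is what makes the conditional ping described under Features possible: if the regenerated content hashes to the same value, nothing changed and no ping is needed. A minimal sketch, assuming SHA-256 (the algorithm actually used is not stated) and hypothetical helper names:

```python
import hashlib


def checksum(content: bytes) -> str:
    """Hex digest of the generated file content (SHA-256 assumed here)."""
    return hashlib.sha256(content).hexdigest()


def should_ping(previous_checksum: str, new_content: bytes) -> bool:
    """Ping only when the regenerated content differs from the stored checksum."""
    return checksum(new_content) != previous_checksum


stored = checksum(b"<urlset>v1</urlset>")
print(should_ping(stored, b"<urlset>v1</urlset>"))  # False
print(should_ping(stored, b"<urlset>v2</urlset>"))  # True
```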

URL Endpoints

| URL | Content-Type | Description |
| --- | --- | --- |
| /sitemap.xml | application/xml | Sitemap index |
| /sitemaps/<filename> | application/xml | Individual sitemap files |
| /robots.txt | text/plain | Robots exclusion protocol |
| /llms.txt | text/plain | AI crawler guidance |
| /ads.txt | text/plain | Authorised digital sellers |
| /app-ads.txt | text/plain | Authorised app sellers |
| /.well-known/security.txt | text/plain | Security contact (RFC 9116) |
| /security.txt | 301 redirect | Redirects to /.well-known/security.txt |
| /humans.txt | text/plain | Team credits |
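
The sitemap index served at /sitemap.xml is a list of the individual files under /sitemaps/. The sketch below shows the shape of such an index using the standard library; `build_index` and its `path` parameter are hypothetical and not the package's `generate_index`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_index(base_url, filenames, path="sitemaps/"):
    """Render a <sitemapindex> that points at each generated sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for filename in filenames:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url.rstrip('/')}/{path}{filename}"
    return ET.tostring(index, encoding="unicode")


print(build_index("https://example.com", ["articles-0.xml.gz", "videos-0.xml.gz"]))
```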

Requirements

  • Python 3.11+
  • Django 4.2+
  • httpx 0.27+ (for search engine pings)
  • Celery 5.3+ (optional, for background generation)

Licence

MIT
