Scalable sitemap generation and web discovery infrastructure for Django — background XML sitemaps (standard/image/video/news), robots.txt, llms.txt, ads.txt, security.txt, humans.txt.
django-icv-sitemaps
Django's built-in django.contrib.sitemaps loads every URL into memory at
request time. On a site with tens of thousands of pages that means slow
responses, high memory pressure, and no incremental updates when content
changes. At a million URLs it simply does not work.
django-icv-sitemaps replaces that approach entirely. Sitemaps are built in
the background by Celery tasks, written atomically to any Django storage
backend (local, S3, GCS), and served as static files. Only sections whose
content has changed are ever rebuilt. The full protocol is covered: standard,
image, video, and news sitemaps, automatic file splitting, gzip compression,
and search engine pinging — plus a complete set of web discovery files
(robots.txt, llms.txt, ads.txt, security.txt, humans.txt) managed
from the database.
Part of the ICV-Django ecosystem, but fully standalone — no other ICV packages required.
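The "written atomically" step can be pictured with a small standard-library sketch. This is not the package's implementation: `write_atomically` is a hypothetical helper, and it assumes a local filesystem where `os.replace` swaps the file in one step. Object stores such as S3 would need a different strategy.

```python
import gzip
import os
import tempfile


def write_atomically(path: str, data: bytes, compress: bool = False) -> None:
    """Write data to path so readers never see a partial file.

    Hypothetical helper for illustration; assumes a local filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    payload = gzip.compress(data) if compress else data
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem, which is what makes the swap atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A caller would pass the final filename, for example `write_atomically("sitemaps/articles-0.xml.gz", xml_bytes, compress=True)`; the target name only ever points at a complete file.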
Features
- Background generation: sitemaps are generated by Celery tasks (optional), written to Django storage backends (local, S3, GCS), and served statically
- Incremental updates: post_save/post_delete signals mark affected sections as stale; only changed sections are regenerated
- All four sitemap types: standard, image, video, and news sitemaps with correct XML namespaces per the sitemap protocol
- Automatic splitting: files are split at 50,000 URLs or 50 MB per the protocol limits
- SitemapMixin: declare any Django model as sitemap-includable with a small set of class attributes
- Auto-sections: ICV_SITEMAPS_AUTO_SECTIONS wires signal handlers automatically, similar to ICV_SEARCH_AUTO_INDEX
- robots.txt: dynamic, database-driven rules merged with settings; includes the Sitemap: directive automatically
- llms.txt: AI crawler guidance served at /llms.txt
- ads.txt / app-ads.txt: IAB-format authorised seller declarations
- security.txt: RFC 9116 compliant, served at /.well-known/security.txt
- humans.txt: team credits
- Search engine ping: Google, Bing, Yandex notified on content changes (conditional on checksum comparison)
- Multi-tenancy: all discovery files are tenant-scoped; sitemap paths include a tenant prefix to prevent collisions; tenant IDs are sanitised to prevent path-traversal attacks
- Gzip support: compressed .xml.gz output with correct headers
- Atomic writes: temp file then rename; no partially-written files served
- 5 management commands: setup, generate, ping, validate, stats
- Django admin: all 6 models registered with actions, list filters, and read-only views
- Celery graceful degradation: tasks work synchronously when Celery is not installed
- Testing utilities: 6 factory-boy factories, pytest fixtures, and helpers in icv_sitemaps.testing
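To make the automatic splitting rule concrete, here is a minimal sketch of chunking serialised entries by both protocol limits. `split_entries` is a hypothetical name, not part of the package, and the size accounting is simplified.

```python
from typing import Iterable, Iterator, List

MAX_URLS = 50_000             # protocol limit on URLs per file
MAX_BYTES = 50 * 1024 * 1024  # protocol limit on uncompressed size


def split_entries(
    entries: Iterable[str],
    max_urls: int = MAX_URLS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[List[str]]:
    """Yield lists of serialised <url> entries that respect both limits.

    Sizes count only the entries themselves; a real generator would also
    budget for the XML header and the <urlset> wrapper.
    """
    chunk: List[str] = []
    size = 0
    for entry in entries:
        entry_size = len(entry.encode("utf-8"))
        # Start a new file when adding this entry would break either limit.
        if chunk and (len(chunk) >= max_urls or size + entry_size > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(entry)
        size += entry_size
    if chunk:
        yield chunk
```

Each yielded list would become one file (for example articles-0.xml, articles-1.xml), with the sitemap index pointing at all of them.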
Installation
pip install django-icv-sitemaps
Add to INSTALLED_APPS:
INSTALLED_APPS = [
    # ...
    "icv_sitemaps",
]
Run migrations:
python manage.py migrate icv_sitemaps
Include the URL configuration:
# urls.py
from django.urls import include, path

urlpatterns = [
    path("", include("icv_sitemaps.urls")),
    # ...
]
This registers all discovery file endpoints at the root (/sitemap.xml,
/robots.txt, /llms.txt, /ads.txt, /app-ads.txt,
/.well-known/security.txt, /humans.txt).
Quick Start
1. Make your model sitemap-includable
# myapp/models.py
from django.db import models
from icv_sitemaps.mixins import SitemapMixin


class Article(SitemapMixin, models.Model):
    sitemap_section_name = "articles"
    sitemap_changefreq = "weekly"
    sitemap_priority = 0.7

    title = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    is_published = models.BooleanField(default=True)
    updated_at = models.DateTimeField(auto_now=True)

    def get_absolute_url(self):
        return f"/articles/{self.slug}/"

    @classmethod
    def get_sitemap_queryset(cls):
        return cls.objects.filter(is_published=True)
2. Configure auto-sections
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"

ICV_SITEMAPS_AUTO_SECTIONS = {
    "articles": {
        # The app label must match the app that defines the model
        # (the Article model above lives in myapp/models.py).
        "model": "myapp.Article",
        "sitemap_type": "standard",
        "changefreq": "weekly",
        "priority": 0.7,
    },
    "product_images": {
        "model": "catalogue.ProductImage",
        "sitemap_type": "image",
    },
    "videos": {
        "model": "media.Video",
        "sitemap_type": "video",
    },
    "breaking_news": {
        "model": "news.BreakingStory",
        "sitemap_type": "news",
    },
}
3. Set up and generate
# Create SitemapSection records from config
python manage.py icv_sitemaps_setup
# Generate all sitemaps
python manage.py icv_sitemaps_generate --all
# Validate output
python manage.py icv_sitemaps_validate
# Check stats
python manage.py icv_sitemaps_stats
4. Automatic regeneration
When an Article is saved or deleted, its section is marked stale. The
regenerate_stale_sitemaps task picks it up on the next run.
# Celery beat schedule (optional)
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "icv-sitemaps-regenerate-stale": {
        "task": "icv_sitemaps.tasks.regenerate_stale_sitemaps",
        "schedule": crontab(minute="*/15"),
    },
    "icv-sitemaps-regenerate-all": {
        "task": "icv_sitemaps.tasks.regenerate_all_sitemaps",
        "schedule": crontab(hour=3, minute=0),
    },
    "icv-sitemaps-cleanup-logs": {
        "task": "icv_sitemaps.tasks.cleanup_generation_logs",
        "schedule": crontab(hour=4, minute=0),
    },
    "icv-sitemaps-cleanup-orphans": {
        "task": "icv_sitemaps.tasks.cleanup_orphan_files",
        "schedule": crontab(day_of_week=0, hour=5, minute=0),
    },
}
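The stale-then-regenerate flow can be modelled in plain Python without Django. The names below (`Section`, `mark_stale`, `regenerate_stale`) are hypothetical stand-ins for the model, the signal handler, and the periodic task; they only illustrate the idea that a save or delete flips a flag and the next run rebuilds flagged sections alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    name: str
    is_stale: bool = False


def mark_stale(sections: Dict[str, Section], name: str) -> None:
    """What a post_save/post_delete handler would do for its section."""
    sections[name].is_stale = True


def regenerate_stale(
    sections: Dict[str, Section], build: Callable[[str], None]
) -> List[str]:
    """Rebuild only stale sections and return the names that were rebuilt."""
    rebuilt = []
    for section in sections.values():
        if section.is_stale:
            build(section.name)
            section.is_stale = False
            rebuilt.append(section.name)
    return rebuilt
```

With a 15-minute schedule, many saves inside one window collapse into a single rebuild of the affected section.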
Sitemap Types
Standard
Standard XML sitemaps with <loc>, <lastmod>, <changefreq>, and
<priority> per the sitemaps.org protocol.
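For reference, a standard sitemap file with those four elements can be produced with the standard library alone. This is a sketch of the output format, not the package's generator; `build_urlset` and the entry dict shape are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Return a <urlset> document as bytes.

    Each entry is a dict with "loc" and optional "lastmod",
    "changefreq", and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if entry.get(tag) is not None:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```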
Image
Uses the http://www.google.com/schemas/sitemap-image/1.1 namespace.
Configure image fields on your mixin:
class ProductImage(SitemapMixin, models.Model):
    sitemap_section_name = "product_images"
    sitemap_type = "image"
    sitemap_image_field = "image_url"
    sitemap_image_caption_field = "caption"
    sitemap_image_title_field = "title"
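The image namespace shows up in the output as an extra `xmlns:image` declaration and `image:`-prefixed child elements. The sketch below shows that shape using only the standard library; `build_image_urlset` is a hypothetical function, and it emits just the `image:loc` element.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Make the serialiser use the conventional prefixes.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def build_image_urlset(pages):
    """pages: iterable of (page_url, [image_url, ...]) tuples."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")
```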
Video
Uses the http://www.google.com/schemas/sitemap-video/1.1 namespace:
class Video(SitemapMixin, models.Model):
    sitemap_section_name = "videos"
    sitemap_type = "video"
    sitemap_video_url_field = "video_url"
    sitemap_video_thumbnail_field = "thumbnail_url"
    sitemap_video_title_field = "title"
    sitemap_video_description_field = "description"
    sitemap_video_duration_field = "duration_seconds"
News
Uses the http://www.google.com/schemas/sitemap-news/0.9 namespace. Entries
older than ICV_SITEMAPS_NEWS_MAX_AGE_DAYS (default 2) are automatically
excluded:
class BreakingStory(SitemapMixin, models.Model):
    sitemap_section_name = "breaking_news"
    sitemap_type = "news"
    sitemap_news_publication_name = "Example News"
    sitemap_news_language = "en"
    sitemap_news_title_field = "headline"
    sitemap_news_date_field = "published_at"
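The age cut-off amounts to a date comparison against "now minus N days". A minimal sketch, assuming timezone-aware publication dates; `recent_entries` is a hypothetical helper and the package may apply the filter at the queryset level instead.

```python
from datetime import datetime, timedelta, timezone


def recent_entries(entries, max_age_days=2, now=None):
    """Keep (title, published_at) pairs no older than max_age_days.

    published_at values must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [(title, published) for title, published in entries if published >= cutoff]
```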
Discovery Files
robots.txt
Database-driven rules managed via Django admin or the service layer:
from icv_sitemaps.services import add_robots_rule
# Block AI crawlers from /private/
add_robots_rule("GPTBot", "disallow", "/private/")
add_robots_rule("CCBot", "disallow", "/")
# Block all bots from /admin/
add_robots_rule("*", "disallow", "/admin/")
Extra directives from settings are appended after database rules:
ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES = [
    "Crawl-delay: 10",
]
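As a rough picture of how rules, extra directives, and the Sitemap: line could combine into one file, here is a standalone sketch. `render_robots` is hypothetical and is not the package's `render_robots_txt`; in particular, where the package places extra directives relative to the user-agent groups is an assumption here.

```python
def render_robots(rules, extra_directives=(), sitemap_url=None):
    """Render robots.txt text.

    rules: iterable of (user_agent, directive, path) tuples, e.g.
    ("GPTBot", "disallow", "/private/").
    """
    # Group directives by user agent, keeping first-seen order.
    groups = {}
    for user_agent, directive, path in rules:
        groups.setdefault(user_agent, []).append((directive, path))

    lines = []
    for user_agent, directives in groups.items():
        lines.append(f"User-agent: {user_agent}")
        for directive, path in directives:
            lines.append(f"{directive.capitalize()}: {path}")
        lines.append("")  # blank line after each group
    # Assumption: extras are appended verbatim after all database rules.
    lines.extend(extra_directives)
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip("\n") + "\n"
```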
ads.txt / app-ads.txt
IAB-format authorised seller declarations:
from icv_sitemaps.services import add_ads_entry
add_ads_entry("google.com", "pub-1234567890", "DIRECT", certification_id="f08c47fec0942fa0")
add_ads_entry("adnetwork.com", "pub-9876543210", "RESELLER")
# For app-ads.txt
add_ads_entry("google.com", "pub-1234567890", "DIRECT", is_app_ads=True)
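Each entry corresponds to one comma-separated record in the served file: domain, publisher account ID, relationship, and an optional certification authority ID. The sketch below formats such a record; `format_ads_line` is a hypothetical helper, not the package's `render_ads_txt`.

```python
def format_ads_line(domain, account_id, relationship, certification_id=None):
    """Format one ads.txt record: domain, account ID, relationship[, cert ID]."""
    relationship = relationship.upper()
    if relationship not in {"DIRECT", "RESELLER"}:
        raise ValueError(f"unknown relationship: {relationship!r}")
    fields = [domain.lower(), account_id, relationship]
    if certification_id:
        fields.append(certification_id)
    return ", ".join(fields)
```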
llms.txt, security.txt, humans.txt
Free-form text content managed via DiscoveryFileConfig:
from icv_sitemaps.services import set_discovery_file_content
set_discovery_file_content("llms_txt", """# llms.txt
# AI training and crawl guidance for example.com
Allow: /blog/
Disallow: /private/
""")
set_discovery_file_content("security_txt", """Contact: mailto:security@example.com
Expires: 2027-01-01T00:00:00.000Z
Preferred-Languages: en
""")
set_discovery_file_content("humans_txt", """/* TEAM */
Lead: Nigel Copley
Site: example.com
""")
Configuration
Settings Reference
All settings are namespaced under ICV_SITEMAPS_*. Every setting has a
sensible default so the package works out of the box for local development.
| Setting | Type | Default | Description |
|---|---|---|---|
| ICV_SITEMAPS_BASE_URL | str | "" | Base URL for absolute sitemap URLs (e.g. "https://example.com"). Required: raises ImproperlyConfigured at generation time if empty |
| ICV_SITEMAPS_STORAGE_BACKEND | str | "django.core.files.storage.default_storage" | Dotted path to Django storage backend for generated files |
| ICV_SITEMAPS_STORAGE_PATH | str | "sitemaps/" | Base path within the storage backend |
| ICV_SITEMAPS_MAX_URLS_PER_FILE | int | 50000 | Maximum URLs per file (protocol limit: 50,000) |
| ICV_SITEMAPS_MAX_FILE_SIZE_BYTES | int | 52428800 | Maximum file size in bytes (protocol limit: 50 MB) |
| ICV_SITEMAPS_BATCH_SIZE | int | 5000 | Queryset iteration batch size |
| ICV_SITEMAPS_GZIP | bool | True | Compress files with gzip |
| ICV_SITEMAPS_PING_ENGINES | list | ["google", "bing"] | Engines to ping after regeneration |
| ICV_SITEMAPS_PING_ENABLED | bool | True | Enable/disable pinging |
| ICV_SITEMAPS_AUTO_SECTIONS | dict | {} | Auto-register model sections (see Quick Start) |
| ICV_SITEMAPS_ROBOTS_EXTRA_DIRECTIVES | list | [] | Extra lines appended to robots.txt |
| ICV_SITEMAPS_ROBOTS_SITEMAP_URL | str | "" | Override sitemap URL in robots.txt (auto-detected if empty) |
| ICV_SITEMAPS_CACHE_TIMEOUT | int | 3600 | Cache TTL for discovery files (seconds) |
| ICV_SITEMAPS_TENANT_PREFIX_FUNC | str | "" | Dotted path to tenant prefix callable |
| ICV_SITEMAPS_ASYNC_GENERATION | bool | True | Use Celery for background generation |
| ICV_SITEMAPS_STREAMING_THRESHOLD | int | 100000 | URL count above which streaming generation is used |
| ICV_SITEMAPS_NEWS_MAX_AGE_DAYS | int | 2 | Maximum age for news entries (Google requires < 2 days) |
Auto-Sections Configuration
Each key in ICV_SITEMAPS_AUTO_SECTIONS is the section name. The value is a
configuration dict:
| Key | Type | Default | Description |
|---|---|---|---|
| model | str | required | "app_label.ModelName" |
| sitemap_type | str | "standard" | standard, image, video, or news |
| changefreq | str | "daily" | Default change frequency |
| priority | float | 0.5 | Default priority (0.0 to 1.0) |
| on_save | bool | True | Mark section stale on model save |
| on_delete | bool | True | Mark section stale on model delete |
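The defaults in this table can be read as a merge of a defaults dict with each section's own config. The sketch below shows that reading; `normalise_section` and its validation messages are hypothetical, and the package's own setup command may validate differently.

```python
SECTION_DEFAULTS = {
    "sitemap_type": "standard",
    "changefreq": "daily",
    "priority": 0.5,
    "on_save": True,
    "on_delete": True,
}
VALID_TYPES = {"standard", "image", "video", "news"}


def normalise_section(name, config):
    """Return a full section config with the documented defaults applied."""
    if "model" not in config:
        raise ValueError(f"section {name!r} is missing the required 'model' key")
    # Values given in config win over the defaults.
    merged = {**SECTION_DEFAULTS, **config}
    if merged["sitemap_type"] not in VALID_TYPES:
        raise ValueError(f"section {name!r} has unknown sitemap_type")
    if not 0.0 <= merged["priority"] <= 1.0:
        raise ValueError(f"section {name!r} priority must be between 0.0 and 1.0")
    return merged
```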
Service Functions
All functions are importable from icv_sitemaps.services:
from icv_sitemaps.services import (
    # Sitemap generation
    generate_section,
    generate_all_sections,
    generate_index,
    mark_section_stale,
    get_generation_stats,
    # Section management
    create_section,
    delete_section,
    # Search engine ping
    ping_search_engines,
    # robots.txt
    render_robots_txt,
    add_robots_rule,
    get_robots_rules,
    # ads.txt
    render_ads_txt,
    add_ads_entry,
    # Discovery files
    get_discovery_file_content,
    set_discovery_file_content,
)
Management Commands
| Command | Purpose |
|---|---|
| icv_sitemaps_setup [--dry-run] | Create SitemapSection records from ICV_SITEMAPS_AUTO_SECTIONS and verify storage |
| icv_sitemaps_generate [--section NAME] [--all] [--index-only] [--force] [--tenant ID] | Generate sitemaps; defaults to stale sections only |
| icv_sitemaps_ping [--url URL] [--tenant ID] | Ping search engines |
| icv_sitemaps_validate [--section NAME] | Validate generated sitemaps against protocol |
| icv_sitemaps_stats [--tenant ID] | Show generation statistics |
Signals
All signals are defined in icv_sitemaps.signals:
| Signal | When |
|---|---|
| sitemap_section_generated | After a section is successfully generated |
| sitemap_generation_complete | After all sections are generated |
| sitemap_section_deleted | After a section and its files are deleted |
| sitemap_pinged | After search engines are pinged |
| sitemap_section_stale | After a section is marked stale |
Celery Tasks
| Task | Purpose | Schedule |
|---|---|---|
| regenerate_stale_sitemaps | Regenerate stale sections | Every 15 minutes |
| regenerate_all_sitemaps | Full regeneration | Daily at 03:00 |
| ping_engines_task | Ping search engines | After generation |
| cleanup_generation_logs | Delete old logs (30-day default) | Daily at 04:00 |
| cleanup_orphan_files | Remove unreferenced storage files | Weekly |
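Pinging after generation is described in the feature list as conditional on a checksum comparison. One way to picture that check is below; `ping_if_changed` is a hypothetical helper, SHA-256 is an assumed choice of hash, and the `ping` callable stands in for whatever notifies the search engines.

```python
import hashlib
from typing import Callable, Optional


def checksum(content: bytes) -> str:
    """Hex digest used to detect whether generated output changed."""
    return hashlib.sha256(content).hexdigest()


def ping_if_changed(
    content: bytes,
    previous_checksum: Optional[str],
    ping: Callable[[], None],
) -> str:
    """Call ping() only when content differs from the last known checksum.

    Returns the new checksum so the caller can store it for next time.
    """
    current = checksum(content)
    if current != previous_checksum:
        ping()
    return current
```

A nightly full regeneration that produces byte-identical files would then skip the ping entirely.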
Multi-Tenancy
Enable tenant-scoped discovery files by setting ICV_SITEMAPS_TENANT_PREFIX_FUNC
to a dotted path to a callable that returns the tenant identifier:
# myapp/tenancy.py
def get_tenant_id(request):
    return getattr(request, "tenant_id", "")


# settings.py
ICV_SITEMAPS_TENANT_PREFIX_FUNC = "myapp.tenancy.get_tenant_id"
Each tenant gets isolated robots.txt, ads.txt, sitemaps, and all other
discovery files. Sitemap files are stored with tenant-prefixed paths
(e.g. sitemaps/acme/products-0.xml).
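Because the tenant identifier becomes part of a storage path, it has to be cleaned before use. The sketch below shows one conservative approach, an allow-list of characters; the function names and the exact character set are assumptions, not the package's rules.

```python
import posixpath
import re

# Anything outside this allow-list is removed, including "." and "/".
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def sanitise_tenant_id(tenant_id: str) -> str:
    """Drop every character that is not a letter, digit, "_" or "-"."""
    return _UNSAFE.sub("", tenant_id)


def tenant_path(base: str, tenant_id: str, filename: str) -> str:
    """Build a storage path such as "sitemaps/acme/products-0.xml"."""
    safe = sanitise_tenant_id(tenant_id)
    if not safe:
        # No usable tenant ID: fall back to the shared base path.
        return posixpath.join(base, filename)
    return posixpath.join(base, safe, filename)
```

With this allow-list, an input like `"../secret"` can never climb out of the base directory because the dots and slash are stripped before the path is joined.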
Production Configuration
# settings.py
ICV_SITEMAPS_BASE_URL = "https://example.com"
ICV_SITEMAPS_STORAGE_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"
ICV_SITEMAPS_STORAGE_PATH = "sitemaps/"
ICV_SITEMAPS_GZIP = True
ICV_SITEMAPS_PING_ENGINES = ["google", "bing"]
ICV_SITEMAPS_BATCH_SIZE = 10000
ICV_SITEMAPS_ASYNC_GENERATION = True
Testing
The package provides testing utilities for consuming projects:
from icv_sitemaps.testing import (
    SitemapSectionFactory,
    SitemapFileFactory,
    SitemapGenerationLogFactory,
    RobotsRuleFactory,
    AdsEntryFactory,
    DiscoveryFileConfigFactory,
)
To run the package's own tests:
cd packages/icv-sitemaps
pytest tests/ -v
Models
| Model | Purpose |
|---|---|
| SitemapSection | Logical sitemap section (e.g. "products", "articles") with staleness tracking |
| SitemapFile | Individual generated XML file with URL count and checksum |
| SitemapGenerationLog | Audit trail for generation runs |
| RobotsRule | Database-driven robots.txt directives |
| AdsEntry | ads.txt / app-ads.txt authorised seller entries |
| DiscoveryFileConfig | Content store for llms.txt, security.txt, humans.txt |
URL Endpoints
| URL | Content-Type | Description |
|---|---|---|
| /sitemap.xml | application/xml | Sitemap index |
| /sitemaps/<filename> | application/xml | Individual sitemap files |
| /robots.txt | text/plain | Robots exclusion protocol |
| /llms.txt | text/plain | AI crawler guidance |
| /ads.txt | text/plain | Authorised digital sellers |
| /app-ads.txt | text/plain | Authorised app sellers |
| /.well-known/security.txt | text/plain | Security contact (RFC 9116) |
| /security.txt | 301 redirect | Redirects to /.well-known/security.txt |
| /humans.txt | text/plain | Team credits |
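Since security.txt content is free-form text, it can be worth checking it against the two fields RFC 9116 makes mandatory, Contact and Expires, before saving it. The checker below is a hypothetical sketch and covers only those basics, not the full RFC.

```python
from datetime import datetime, timezone


def check_security_txt(text, now=None):
    """Return a list of problems; an empty list means the basics are present."""
    now = now or datetime.now(timezone.utc)
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        # Field names are case-insensitive.
        fields.setdefault(name.strip().lower(), []).append(value.strip())

    problems = []
    if not fields.get("contact"):
        problems.append("missing Contact field")
    expires = fields.get("expires", [])
    if len(expires) != 1:
        problems.append("Expires must appear exactly once")
    else:
        # Normalise a trailing "Z" so older Python versions can parse it.
        value = expires[0]
        if value.endswith(("Z", "z")):
            value = value[:-1] + "+00:00"
        try:
            if datetime.fromisoformat(value) <= now:
                problems.append("Expires is in the past")
        except (ValueError, TypeError):
            # Unparseable, or missing the required UTC offset.
            problems.append("Expires is not a valid timestamp")
    return problems
```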
Requirements
- Python 3.11+
- Django 5.1+
- httpx 0.27+ (for search engine pings)
- Celery 5.3+ (optional, for background generation)
Licence
MIT