Policy-aware web retrieval for AI: robots.txt, llms.txt, sitemap.xml, fetching, extraction, and provenance in one layer.
WebCanon
Policy-aware web retrieval for AI.
A Japanese version of the README is also available. Documentation site: https://bon2016.github.io/webcanon/
WebCanon is an open-source retrieval layer that turns a URL into trustworthy, policy-checked, citation-ready context for LLMs.
It evaluates robots.txt (RFC 9309), resolves LLM-friendly alternatives via
llms.txt (optionally with your own AI), fetches content behind an SSRF guard,
converts HTML into structured Markdown, and returns full provenance for
every retrieved document.
Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope (finding candidate URLs is a separate concern). Scraping and AI reasoning are injectable.
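Summary: WebCanon is open-source software that converts a given URL into high-quality context that can be handed to an AI.
It checks robots.txt, llms.txt, and sitemap.xml, then handles resolution to an LLM-oriented URL (optionally with your own AI), content fetching, HTML→Markdown conversion, and generation of provenance records in one consistent flow. Web search engines are out of scope. The scraping step and the AI step are both replaceable.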
Why
Most AI pipelines mix concerns: they pass raw search snippets to the model,
clone URLs blindly, never check robots.txt, ignore sitemap.xml, and lose
all provenance. WebCanon separates these into a single quality contract:
| Concept | Role |
|---|---|
| Search | Find candidate URLs |
| Fetch | Retrieve URL content |
| Respect | Evaluate robots.txt policy before fetching |
| Resolve | Re-route to LLM-friendly URLs via llms.txt / canonical |
| Extract | Convert HTML/PDF into LLM-ready Markdown |
| Ground | Keep source, retrieval path, and transform evidence |
The retrieval constitution
- Search results are leads, not sources.
- robots.txt is evaluated before fetch.
- llms.txt can guide retrieval, not override policy.
- Every transformed document must retain provenance.
- Web content is untrusted input.
- Markdown is an interface, not the source of truth.
- Extraction quality must be measurable.
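The rule that robots.txt is evaluated before any fetch can be illustrated with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not WebCanon's own engine: `is_fetch_allowed` and the sample rules are hypothetical, and the stdlib parser is not a complete RFC 9309 implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice it is downloaded from the origin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_fetch_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)

# The policy check runs first; only an allowed URL would be fetched afterwards.
allowed = is_fetch_allowed(ROBOTS_TXT, "WebCanon", "https://example.com/docs/api")
blocked = is_fetch_allowed(ROBOTS_TXT, "WebCanon", "https://example.com/private/x")
```

Here `allowed` is true and `blocked` is false, so a caller following the rule would skip the second URL without ever opening a connection to it.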
Install
pip install webcanon
From source:
pip install -e ".[dev]"
Quick start
from webcanon import WebCanon
client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
print(result.document.markdown) # extracted Markdown
print(result.policy.robots.verdict) # e.g. "allowed_implicit"
print(result.provenance.source_hash) # sha256 of the source body
result is a RetrievalResult — the Retrieval Bill of Materials. Call
result.to_dict() for a JSON-serialisable audit record (why this URL was
chosen, whether robots allowed it, whether llms.txt rerouted it, extraction
quality, and reproducibility hashes).
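To make the idea of reproducibility hashes concrete, the sketch below builds a small audit record with `hashlib` and `json`. The function name and the field names are assumptions chosen for illustration; they are not the actual `RetrievalResult` schema.

```python
import hashlib
import json

def build_provenance(url: str, body: bytes, markdown: str) -> dict:
    """Hypothetical audit record: hashes tie the Markdown back to the bytes fetched."""
    return {
        "source_url": url,
        "source_hash": "sha256:" + hashlib.sha256(body).hexdigest(),
        "markdown_hash": "sha256:" + hashlib.sha256(markdown.encode("utf-8")).hexdigest(),
    }

record = build_provenance("https://example.com/docs/api", b"<h1>API</h1>", "# API")

# Sorting keys keeps the serialised record stable across runs.
audit_json = json.dumps(record, sort_keys=True)
```

Hashing both the source body and the derived Markdown lets a later reader confirm that neither has changed since retrieval.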
The default User-Agent product token is WebCanon.
Customization hooks
The scraping transport, the HTML→Markdown converter, and the AI that reasons
over llms.txt are all injectable callables — pass them on
RetrievalConfig:
from webcanon import WebCanon, AiHint
from webcanon.config import RetrievalConfig
def my_ai(ctx):
    # ctx has the requested URL, the parsed llms.txt, and the robots verdict.
    # Decide a URL read-through and/or special request headers.
    return AiHint(url=ctx.requested_url + ".md", headers={"Accept": "text/markdown"},
                  reason="prefer markdown variant")

client = WebCanon(RetrievalConfig(
    ai_resolver=my_ai,           # AI reasoning over llms.txt + URL
    # fetcher=my_fetcher,        # custom scraping transport
    # extractor=my_extractor,    # custom HTML -> Markdown
))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)
robots.txt always wins: an AiHint that points at a disallowed URL is ignored.
See docs/customization.md.
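The precedence rule (llms.txt may suggest a URL, robots.txt decides whether it is used) can be sketched independently of WebCanon. Everything here is hypothetical: `extract_llms_links`, `choose_url`, and the sample files are illustrations built on the standard library, and the link regex assumes llms.txt entries written as ordinary Markdown links.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches Markdown links such as [API](https://example.com/docs/api.md)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def extract_llms_links(llms_txt: str) -> dict[str, str]:
    """Map link title -> URL for every Markdown link in an llms.txt body."""
    return {title: url for title, url in LINK_RE.findall(llms_txt)}

def choose_url(requested: str, hinted: str, robots: RobotFileParser, agent: str) -> str:
    """Use the hinted URL only if robots.txt allows it; otherwise keep the request."""
    return hinted if robots.can_fetch(agent, hinted) else requested

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /raw/"])

llms_txt = """# Example
- [API](https://example.com/docs/api.md): Markdown variant
- [Dump](https://example.com/raw/api.md): disallowed by robots.txt
"""
links = extract_llms_links(llms_txt)
ok = choose_url("https://example.com/docs/api", links["API"], robots, "WebCanon")
ignored = choose_url("https://example.com/docs/api", links["Dump"], robots, "WebCanon")
```

The first hint is accepted because its path is allowed; the second falls back to the requested URL because the hinted path is disallowed.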
CLI
webcanon fetch https://example.com/docs/api --ai --llms prefer --robots respect
webcanon fetch https://example.com/docs/api --json --report report.json
webcanon inspect https://example.com/docs/api
Status
This is v0.1 — the URL retrieval quality baseline:
- URL normalization & origin extraction
- robots.txt fetch + RFC 9309 evaluation engine
- llms.txt parsing + LLM-friendly URL resolution
- sitemap.xml parsing (URL discovery)
- SSRF-guarded HTTP fetch with per-redirect re-checks
- HTML → Markdown extraction (stdlib) with hidden-text warnings
- Provenance-bearing JSON output
- CLI (fetch, inspect)
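As an illustration of what an SSRF guard with per-redirect re-checks might involve, the sketch below validates every hop of a redirect chain against the address it resolves to. It is an assumption-laden outline rather than WebCanon's guard: the function names are hypothetical, the resolver is injectable so the logic can be exercised offline, and `socket.gethostbyname` covers IPv4 only.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def is_public_address(ip: str) -> bool:
    """True only for globally routable addresses (no loopback, private, link-local)."""
    return ipaddress.ip_address(ip).is_global

def check_url(url: str, resolve=socket.gethostbyname) -> None:
    """Raise ValueError if the URL's scheme or resolved address looks unsafe."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported URL: {url}")
    ip = resolve(parts.hostname)
    if not is_public_address(ip):
        raise ValueError(f"blocked non-public address {ip} for {url}")

def check_redirect_chain(urls, resolve=socket.gethostbyname) -> None:
    """Re-run the guard on every hop, since a redirect can point inward."""
    for url in urls:
        check_url(url, resolve)
```

A production guard would likely also connect to the exact address it validated, so that a second DNS lookup cannot return a different, internal address.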
See docs/ for the architecture, policy model, robots compliance,
llms.txt resolution, extraction quality, security model, and the roadmap.
License
Apache-2.0. See LICENSE.
File details
Details for the file webcanon-0.2.0.tar.gz.
File metadata
- Download URL: webcanon-0.2.0.tar.gz
- Upload date:
- Size: 57.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4b416da0e071702f96c63d89f20849d13c5d838a8fea4231314667083acaf82 |
| MD5 | 937c7ae25cfcb7fb0adc7e65f6bf4d27 |
| BLAKE2b-256 | e415df156d623ff3a7a8007e433097c2165a6c37f3ca261fd00f65f2538e9e45 |
Provenance
The following attestation bundles were made for webcanon-0.2.0.tar.gz:
Publisher: publish.yml on bon2016/webcanon
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: webcanon-0.2.0.tar.gz
  - Subject digest: c4b416da0e071702f96c63d89f20849d13c5d838a8fea4231314667083acaf82
- Sigstore transparency entry: 1831303916
- Sigstore integration time:
- Permalink: bon2016/webcanon@2b3886755be7e1eb2b41ff3a4e8c03e8bb413d78
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/bon2016
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2b3886755be7e1eb2b41ff3a4e8c03e8bb413d78
- Trigger Event: push
File details
Details for the file webcanon-0.2.0-py3-none-any.whl.
File metadata
- Download URL: webcanon-0.2.0-py3-none-any.whl
- Upload date:
- Size: 32.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 63a7c149b0e0a5eaeda6b789f552f150519fa27c534dc000e397a9925053c350 |
| MD5 | 793a2cb6ac748667360b95b53ddb0a00 |
| BLAKE2b-256 | 6a1a89abb604ba771bb2aadfddb0ba3cbba3a06a8ef2f36852d39eac4a5e630b |
Provenance
The following attestation bundles were made for webcanon-0.2.0-py3-none-any.whl:
Publisher: publish.yml on bon2016/webcanon
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: webcanon-0.2.0-py3-none-any.whl
  - Subject digest: 63a7c149b0e0a5eaeda6b789f552f150519fa27c534dc000e397a9925053c350
- Sigstore transparency entry: 1831304019
- Sigstore integration time:
- Permalink: bon2016/webcanon@2b3886755be7e1eb2b41ff3a4e8c03e8bb413d78
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/bon2016
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2b3886755be7e1eb2b41ff3a4e8c03e8bb413d78
- Trigger Event: push