Policy-aware web retrieval for AI: robots.txt, llms.txt, sitemap.xml, fetching, extraction, and provenance in one layer.

WebCanon

Policy-aware web retrieval for AI.

A Japanese version of this README is also available. Documentation site: https://bon2016.github.io/webcanon/

WebCanon is an open-source retrieval layer that turns a URL into trustworthy, policy-checked, citation-ready context for LLMs.

It evaluates robots.txt (RFC 9309), resolves LLM-friendly alternatives via llms.txt (optionally with your own AI), fetches content behind an SSRF guard, converts HTML into structured Markdown, and returns full provenance for every retrieved document.
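
As a rough illustration of what an SSRF guard checks, the sketch below rejects IP addresses that are not globally routable, using only Python's `ipaddress` module. The helper name `is_public_ip` is hypothetical and is not part of WebCanon's API; a real guard would also resolve hostnames and repeat the check on every redirect hop.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """Return True only for globally routable IP addresses.

    Hypothetical helper for illustration; not WebCanon's actual guard.
    """
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not a literal IP address (for example a hostname): refuse rather than guess.
        return False
    # is_global is False for loopback, private, and link-local ranges.
    return ip.is_global


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8"):
        print(candidate, is_public_ip(candidate))
```

Refusing anything that fails to parse keeps the helper fail-closed, which is usually the safer default when the input comes from untrusted web content.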

Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope (finding candidate URLs is a separate concern). Scraping and AI reasoning are injectable.

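Summary: WebCanon is open-source software that turns a given URL into high-quality context that can be handed to an AI. It checks robots.txt, llms.txt, and sitemap.xml, then handles resolution to an LLM-friendly URL (optionally with your own AI), content retrieval, HTML-to-Markdown conversion, and provenance generation in one pass. Web search engines are out of scope. The scraping and AI steps are replaceable.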

Why

Most AI pipelines mix concerns: they pass raw search snippets to the model, fetch URLs blindly, never check robots.txt, ignore sitemap.xml, and lose all provenance. WebCanon separates these concerns and ties them together under a single quality contract:

Concept	Role
Search	Find candidate URLs
Fetch	Retrieve URL content
Respect	Evaluate `robots.txt` policy before fetching
Resolve	Re-route to LLM-friendly URLs via `llms.txt` / canonical
Extract	Convert HTML/PDF into LLM-ready Markdown
Ground	Keep source, retrieval path, and transform evidence

The retrieval constitution

Search results are leads, not sources.
robots.txt is evaluated before fetch.
llms.txt can guide retrieval, not override policy.
Every transformed document must retain provenance.
Web content is untrusted input.
Markdown is an interface, not the source of truth.
Extraction quality must be measurable.
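
The second principle, evaluating robots.txt before any fetch, can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch rather than WebCanon's own engine: the stdlib parser predates RFC 9309 and may resolve overlapping rules differently, and the `ROBOTS_TXT` body and the `is_allowed` helper are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-downloaded robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/docs/api",
                "https://example.com/private/report"):
        # Only a URL that passes this check would go on to be fetched.
        print(url, is_allowed(ROBOTS_TXT, "WebCanon", url))
```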

Install

pip install webcanon

For JavaScript-rendered pages (headless browser, optional):

pip install "webcanon[headless]"
python -m playwright install chromium

From source:

pip install -e ".[dev]"

Quick start

from webcanon import WebCanon

client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

print(result.document.markdown)        # extracted Markdown
print(result.policy.robots.verdict)    # e.g. "allowed_implicit"
print(result.provenance.source_hash)   # sha256 of the source body

result is a RetrievalResult — the Retrieval Bill of Materials. Call result.to_dict() for a JSON-serialisable audit record (why this URL was chosen, whether robots allowed it, whether llms.txt rerouted it, extraction quality, and reproducibility hashes).
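
To show what a reproducibility hash makes possible, the sketch below recomputes a SHA-256 digest over a response body with `hashlib` and compares it with a recorded value. It assumes the recorded hash is a plain hex digest; the exact format of `source_hash` is not specified here, and `body_sha256` and `matches_recorded_hash` are hypothetical helper names, not WebCanon functions.

```python
import hashlib


def body_sha256(body: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw response body."""
    return hashlib.sha256(body).hexdigest()


def matches_recorded_hash(body: bytes, recorded_hex: str) -> bool:
    """True when a re-fetched body still matches the hash in an audit record."""
    # Assumption: the record stores a bare hex digest, compared case-insensitively.
    return body_sha256(body) == recorded_hex.strip().lower()


if __name__ == "__main__":
    original = b"<html><body>Hello</body></html>"
    recorded = body_sha256(original)
    print(matches_recorded_hash(original, recorded))         # unchanged body
    print(matches_recorded_hash(original + b" ", recorded))  # modified body
```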

The default User-Agent product token is WebCanon.

Customization hooks

The scraping transport, the HTML→Markdown converter, and the AI that reasons over llms.txt are all injectable callables — pass them on RetrievalConfig:

from webcanon import WebCanon, AiHint
from webcanon.config import RetrievalConfig

def my_ai(ctx):
    # ctx has the requested URL, the parsed llms.txt, and the robots verdict.
    # Decide on a replacement URL and/or special request headers.
    return AiHint(url=ctx.requested_url + ".md", headers={"Accept": "text/markdown"},
                  reason="prefer markdown variant")

client = WebCanon(RetrievalConfig(
    ai_resolver=my_ai,        # AI reasoning over llms.txt + URL
    # fetcher=my_fetcher,     # custom scraping transport
    # extractor=my_extractor, # custom HTML -> Markdown
))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

robots.txt always wins: an AiHint that points at a disallowed URL is ignored. See docs/customization.md.
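
The precedence rule can be pictured as a small guard around the resolver's suggestion. The sketch below is not WebCanon's implementation: `choose_url` and the `is_allowed` callback are hypothetical names, and the callback stands in for whatever robots.txt evaluation is in use.

```python
from typing import Callable, Optional


def choose_url(requested_url: str,
               hinted_url: Optional[str],
               is_allowed: Callable[[str], bool]) -> str:
    """Use the hinted URL only when the robots policy allows it."""
    if hinted_url and is_allowed(hinted_url):
        return hinted_url
    # The hint is missing or disallowed: ignore it and keep the original URL.
    return requested_url


def toy_policy(url: str) -> bool:
    """Toy stand-in policy: everything under /private/ is disallowed."""
    return "/private/" not in url


if __name__ == "__main__":
    print(choose_url("https://example.com/docs/api",
                     "https://example.com/docs/api.md", toy_policy))
    print(choose_url("https://example.com/docs/api",
                     "https://example.com/private/api.md", toy_policy))
```

The sketch only covers the hint; whether the originally requested URL may be fetched is a separate check.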

CLI

webcanon fetch https://example.com/docs/api --ai --llms prefer --robots respect
webcanon fetch https://example.com/docs/api --json --report report.json
webcanon inspect https://example.com/docs/api

Status

This is v0.1 — the URL retrieval quality baseline:

URL normalization & origin extraction
robots.txt fetch + RFC 9309 evaluation engine
llms.txt parsing + LLM-friendly URL resolution
sitemap.xml parsing (URL discovery)
SSRF-guarded HTTP fetch with per-redirect re-checks
HTML → Markdown extraction (stdlib) with hidden-text warnings
Provenance-bearing JSON output
CLI (fetch, inspect)
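
As one concrete example from the list above, URL discovery from a sitemap.xml can be sketched with the standard library's `xml.etree.ElementTree`. The `SITEMAP` document and the `sitemap_urls` helper are invented for illustration and do not reflect WebCanon's parser; in practice, untrusted XML deserves extra care, such as size limits.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, inlined so the example needs no network access.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/api</loc></url>
  <url><loc> https://example.com/docs/guide </loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> List[str]:
    """Return the <loc> values of a <urlset> sitemap, whitespace-trimmed."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text and loc.text.strip()]


if __name__ == "__main__":
    for url in sitemap_urls(SITEMAP):
        print(url)
```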

See docs/ for the architecture, policy model, robots compliance, llms.txt resolution, extraction quality, security model, and the roadmap.

License

Apache-2.0. See LICENSE.

Release history

0.3.0 (this version): Jun 16, 2026
0.2.0: Jun 16, 2026

Download files

Source Distribution

webcanon-0.3.0.tar.gz (61.5 kB)

Uploaded Jun 16, 2026 Source

Built Distribution

webcanon-0.3.0-py3-none-any.whl (35.4 kB)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file webcanon-0.3.0.tar.gz.

File metadata

Download URL: webcanon-0.3.0.tar.gz
Upload date: Jun 16, 2026
Size: 61.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for webcanon-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a85fc5f006434268e0af047cbb7fba8a255c29b73656b6b6f371ad0d6469044d`
MD5	`526a4033d89b6c71bae5e3defe6860c2`
BLAKE2b-256	`52e8463d709a35904357cb0dd88b03cad1249c848be362421199c2eac0b78a5b`

Provenance

The following attestation bundles were made for webcanon-0.3.0.tar.gz:

Publisher: publish.yml on bon2016/webcanon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: webcanon-0.3.0.tar.gz
- Subject digest: a85fc5f006434268e0af047cbb7fba8a255c29b73656b6b6f371ad0d6469044d
- Sigstore transparency entry: 1833499806
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: bon2016/webcanon@ca52f3831f2b8fecef1436a4d76f7a2af356fbea
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bon2016
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ca52f3831f2b8fecef1436a4d76f7a2af356fbea
- Trigger Event: push

File details

Details for the file webcanon-0.3.0-py3-none-any.whl.

File metadata

Download URL: webcanon-0.3.0-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 35.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for webcanon-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`baea7d03cd4cd1c09b07d606aac50f59b4725df98ee78f3b4e3bcb3f49280a1d`
MD5	`91eca0479041bd5a3189c0fcef3b40d2`
BLAKE2b-256	`3cad7dbe15a415df6950679f996f67ac62a44029fc5b0f7e2934b0eeebb807e8`

Provenance

The following attestation bundles were made for webcanon-0.3.0-py3-none-any.whl:

Publisher: publish.yml on bon2016/webcanon

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: webcanon-0.3.0-py3-none-any.whl
- Subject digest: baea7d03cd4cd1c09b07d606aac50f59b4725df98ee78f3b4e3bcb3f49280a1d
- Sigstore transparency entry: 1833500260
- Sigstore integration time: Jun 16, 2026
Source repository:
- Permalink: bon2016/webcanon@ca52f3831f2b8fecef1436a4d76f7a2af356fbea
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/bon2016
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ca52f3831f2b8fecef1436a4d76f7a2af356fbea
- Trigger Event: push
