Policy-aware web retrieval for AI: robots.txt, llms.txt, sitemap.xml, fetching, extraction, and provenance in one layer.

WebCanon

Policy-aware web retrieval for AI.

A Japanese version of this README is also available. Documentation site: https://bon2016.github.io/webcanon/

WebCanon is an open-source retrieval layer that turns a URL into trustworthy, policy-checked, citation-ready context for LLMs.

It evaluates robots.txt (RFC 9309), resolves LLM-friendly alternatives via llms.txt (optionally with your own AI), fetches content behind an SSRF guard, converts HTML into structured Markdown, and returns full provenance for every retrieved document.

Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope (finding candidate URLs is a separate concern). Scraping and AI reasoning are injectable.

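Summary (translated from Japanese): WebCanon is an OSS tool that converts a given URL into high-quality context that can be handed to an AI. It checks robots.txt, llms.txt, and sitemap.xml, then handles resolution to an LLM-friendly URL (optionally with your own AI), content fetching, HTML-to-Markdown conversion, and provenance generation end to end. Web search engines are out of scope. The scraping step and the AI step are both replaceable.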

Why

Most AI pipelines mix concerns: they pass raw search snippets to the model, clone URLs blindly, never check robots.txt, ignore sitemap.xml, and lose all provenance. WebCanon keeps these concerns separate and binds them under a single quality contract:

| Concept | Role |
| --- | --- |
| Search | Find candidate URLs |
| Fetch | Retrieve URL content |
| Respect | Evaluate robots.txt policy before fetching |
| Resolve | Re-route to LLM-friendly URLs via llms.txt / canonical |
| Extract | Convert HTML/PDF into LLM-ready Markdown |
| Ground | Keep source, retrieval path, and transform evidence |

The retrieval constitution

  1. Search results are leads, not sources.
  2. robots.txt is evaluated before fetch.
  3. llms.txt can guide retrieval, not override policy.
  4. Every transformed document must retain provenance.
  5. Web content is untrusted input.
  6. Markdown is an interface, not the source of truth.
  7. Extraction quality must be measurable.
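
Rule 2 can be illustrated with a minimal sketch that uses only the Python standard library. This is not WebCanon's own evaluation engine: `urllib.robotparser` applies the first matching rule, whereas RFC 9309 selects the most specific (longest) match, so the sample rules are ordered so that both approaches agree. The `is_allowed` helper and the `ROBOTS_TXT` sample are hypothetical names used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; in practice this is fetched from the origin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/docs/api",
        "https://example.com/private/report",
    ):
        # The policy decision happens before any content request is made.
        verdict = "fetch" if is_allowed(ROBOTS_TXT, "WebCanon", candidate) else "skip"
        print(candidate, "->", verdict)
```

The point of the sketch is the ordering: the verdict is computed first, and the fetch only happens if the verdict permits it.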

Install

pip install webcanon

From source:

pip install -e ".[dev]"

Quick start

from webcanon import WebCanon

client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

print(result.document.markdown)        # extracted Markdown
print(result.policy.robots.verdict)    # e.g. "allowed_implicit"
print(result.provenance.source_hash)   # sha256 of the source body

result is a RetrievalResult — the Retrieval Bill of Materials. Call result.to_dict() for a JSON-serialisable audit record (why this URL was chosen, whether robots allowed it, whether llms.txt rerouted it, extraction quality, and reproducibility hashes).
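
To show why a source hash helps reproducibility, here is a minimal sketch that uses `hashlib` from the standard library. It assumes the hash is a hex-encoded SHA-256 over the raw response bytes; the exact bytes WebCanon hashes and the encoding it uses are not specified here, and `source_hash` and `matches_record` are hypothetical helper names.

```python
import hashlib


def source_hash(body: bytes) -> str:
    """Hex-encoded SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def matches_record(body: bytes, recorded_hash: str) -> bool:
    """Check whether a re-fetched body is byte-identical to the recorded one."""
    return source_hash(body) == recorded_hash


if __name__ == "__main__":
    original = b"<html><body>API docs</body></html>"
    recorded = source_hash(original)
    print(matches_record(original, recorded))            # same bytes
    print(matches_record(original + b"\n", recorded))    # body changed
```

Storing the digest alongside the extracted Markdown lets a later audit detect that the upstream page changed since it was retrieved.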

The default User-Agent product token is WebCanon.

Customization hooks

The scraping transport, the HTML→Markdown converter, and the AI that reasons over llms.txt are all injectable callables — pass them on RetrievalConfig:

from webcanon import WebCanon, AiHint
from webcanon.config import RetrievalConfig

def my_ai(ctx):
    # ctx has the requested URL, the parsed llms.txt, and the robots verdict.
    # Decide on a URL rewrite and/or special request headers.
    return AiHint(url=ctx.requested_url + ".md", headers={"Accept": "text/markdown"},
                  reason="prefer markdown variant")

client = WebCanon(RetrievalConfig(
    ai_resolver=my_ai,        # AI reasoning over llms.txt + URL
    # fetcher=my_fetcher,     # custom scraping transport
    # extractor=my_extractor, # custom HTML -> Markdown
))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

robots.txt always wins: an AiHint that points at a disallowed URL is ignored. See docs/customization.md.

CLI

webcanon fetch https://example.com/docs/api --ai --llms prefer --robots respect
webcanon fetch https://example.com/docs/api --json --report report.json
webcanon inspect https://example.com/docs/api

Status

This is v0.1 — the URL retrieval quality baseline:

  • URL normalization & origin extraction
  • robots.txt fetch + RFC 9309 evaluation engine
  • llms.txt parsing + LLM-friendly URL resolution
  • sitemap.xml parsing (URL discovery)
  • SSRF-guarded HTTP fetch with per-redirect re-checks
  • HTML → Markdown extraction (stdlib) with hidden-text warnings
  • Provenance-bearing JSON output
  • CLI (fetch, inspect)
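
The SSRF guard in the list above can be pictured with a minimal standard-library sketch. It is not WebCanon's implementation: `is_public_address` and `check_url` are hypothetical helpers, and the policy shown (http/https only, every resolved address must be globally routable) is an assumption. A guard with per-redirect re-checks would run the same test again on each redirect target.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(address: str) -> bool:
    """True only for globally routable IP addresses."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return ip.is_global


def _resolve(host: str) -> list[str]:
    """Resolve host to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_url(url: str, resolve=None) -> bool:
    """Reject non-HTTP schemes and hosts that resolve to non-public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    resolver = resolve or _resolve
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False
    return bool(addresses) and all(is_public_address(a) for a in addresses)


if __name__ == "__main__":
    # A stub resolver keeps the demo offline; omit it to use real DNS.
    print(check_url("http://metadata.internal/", resolve=lambda h: ["169.254.169.254"]))
    print(check_url("https://example.com/", resolve=lambda h: ["93.184.216.34"]))
```

Checking the address and then connecting by hostname leaves a window for DNS rebinding, so a production guard would typically connect to the exact address it validated.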

See docs/ for the architecture, policy model, robots compliance, llms.txt resolution, extraction quality, security model, and the roadmap.

License

Apache-2.0. See LICENSE.
