
Policy-aware web retrieval for AI: robots.txt, llms.txt, sitemap.xml, fetching, extraction, and provenance in one layer.


WebCanon

Policy-aware web retrieval for AI.

A Japanese version of the README is also available. Documentation site: https://bon2016.github.io/webcanon/

WebCanon is an open-source retrieval layer that turns a URL into trustworthy, policy-checked, citation-ready context for LLMs.

It evaluates robots.txt (RFC 9309), resolves LLM-friendly alternatives via llms.txt (optionally with your own AI), fetches content behind an SSRF guard, converts HTML into structured Markdown, and returns full provenance for every retrieved document.

Scope: WebCanon focuses on correct, policy-aware scraping of a given URL. Web search engines are out of scope (finding candidate URLs is a separate concern). Scraping and AI reasoning are injectable.

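Summary (translated from Japanese): WebCanon is an open-source tool that converts a given URL into high-quality context that can be handed to an AI. It checks robots.txt, llms.txt, and sitemap.xml, then handles resolution to LLM-friendly URLs (optionally using your own AI), content retrieval, HTML-to-Markdown conversion, and generation of provenance evidence as one consistent flow. Web search engines are out of scope. The scraping step and the AI step are both replaceable.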

Why

Most AI pipelines mix concerns: they pass raw search snippets to the model, fetch URLs blindly, never check robots.txt, ignore sitemap.xml, and lose all provenance. WebCanon separates these concerns into distinct stages governed by a single quality contract:

| Concept | Role |
| --- | --- |
| Search | Find candidate URLs |
| Fetch | Retrieve URL content |
| Respect | Evaluate robots.txt policy before fetching |
| Resolve | Re-route to LLM-friendly URLs via llms.txt / canonical |
| Extract | Convert HTML/PDF into LLM-ready Markdown |
| Ground | Keep source, retrieval path, and transform evidence |

The retrieval constitution

  1. Search results are leads, not sources.
  2. robots.txt is evaluated before fetch.
  3. llms.txt can guide retrieval, not override policy.
  4. Every transformed document must retain provenance.
  5. Web content is untrusted input.
  6. Markdown is an interface, not the source of truth.
  7. Extraction quality must be measurable.
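
Rule 2 can be illustrated with a minimal sketch that gates a fetch on a robots.txt check. This is not WebCanon's own engine: it uses Python's standard-library `urllib.robotparser`, which does not claim full RFC 9309 conformance, and the names `build_policy`, `fetch_if_allowed`, and `fake_fetch` are hypothetical helpers introduced only for this example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body (an assumption for this sketch, not a real site's policy).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def fetch_if_allowed(url, policy, fetch, user_agent="WebCanon"):
    """Call fetch(url) only when the policy allows it; otherwise return None."""
    if not policy.can_fetch(user_agent, url):
        return None
    return fetch(url)


policy = build_policy(ROBOTS_TXT)


def fake_fetch(url):
    # Stand-in transport so the sketch runs without network access.
    return f"<html>body of {url}</html>"


allowed = fetch_if_allowed("https://example.com/docs/api", policy, fake_fetch)
blocked = fetch_if_allowed("https://example.com/private/keys", policy, fake_fetch)
print(allowed is not None, blocked is None)
```

The point of the design is ordering: the policy decision is made before any request for the target URL is issued, so a disallowed URL never reaches the transport.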

Install

pip install webcanon

For JavaScript-rendered pages (headless browser, optional):

pip install "webcanon[headless]"
python -m playwright install chromium

From source:

pip install -e ".[dev]"

Quick start

from webcanon import WebCanon

client = WebCanon()
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

print(result.document.markdown)        # extracted Markdown
print(result.policy.robots.verdict)    # e.g. "allowed_implicit"
print(result.provenance.source_hash)   # sha256 of the source body

result is a RetrievalResult, which serves as the Retrieval Bill of Materials. Call result.to_dict() for a JSON-serialisable audit record covering why this URL was chosen, whether robots.txt allowed it, whether llms.txt rerouted it, extraction quality, and reproducibility hashes.
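
As an illustration of how a reproducibility hash might be used, the sketch below recomputes a SHA-256 digest over a response body and compares it with a stored value. It assumes the hash is a hex-encoded SHA-256 of the raw body bytes (the quick start describes it only as "sha256 of the source body"). The `record` dictionary is a hypothetical stand-in, not the actual schema returned by `to_dict()`.

```python
import hashlib
import json


def sha256_hex(body: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw body bytes."""
    return hashlib.sha256(body).hexdigest()


def matches_source_hash(body: bytes, recorded_hash: str) -> bool:
    """Check a body against a recorded digest.

    Tolerates an optional "sha256:" prefix; that prefix is an assumption
    of this sketch, not a documented WebCanon format.
    """
    digest = recorded_hash.lower().removeprefix("sha256:")
    return sha256_hex(body) == digest


body = b"<html><body>Hello</body></html>"

# Hypothetical audit record; the real to_dict() output may be shaped differently.
record = {
    "url": "https://example.com/docs/api",
    "source_hash": sha256_hex(body),
}
audit_json = json.dumps(record, indent=2, sort_keys=True)

print(matches_source_hash(body, record["source_hash"]))
```

Storing the digest next to the URL lets a later reader detect that the source has changed since the document was extracted.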

The default User-Agent product token is WebCanon.

Customization hooks

The scraping transport, the HTML→Markdown converter, and the AI that reasons over llms.txt are all injectable callables — pass them on RetrievalConfig:

from webcanon import WebCanon, AiHint
from webcanon.config import RetrievalConfig

def my_ai(ctx):
    # ctx has the requested URL, the parsed llms.txt, and the robots verdict.
    # Return a replacement URL to fetch instead and/or extra request headers.
    return AiHint(url=ctx.requested_url + ".md", headers={"Accept": "text/markdown"},
                  reason="prefer markdown variant")

client = WebCanon(RetrievalConfig(
    ai_resolver=my_ai,        # AI reasoning over llms.txt + URL
    # fetcher=my_fetcher,     # custom scraping transport
    # extractor=my_extractor, # custom HTML -> Markdown
))
result = client.retrieve_url("https://example.com/docs/api", ai_reasoning=True)

robots.txt always wins: an AiHint that points at a disallowed URL is ignored. See docs/customization.md.
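
The "robots.txt always wins" rule could be sketched as follows. This is an assumption-laden illustration, not WebCanon's implementation: `Hint` is a hypothetical stand-in for `AiHint`, `choose_target` is an invented helper, and the policy check uses the standard-library `urllib.robotparser` rather than the project's RFC 9309 engine. The sketch also assumes the hinted URL is on the same origin as the robots.txt being consulted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple
from urllib.robotparser import RobotFileParser


@dataclass
class Hint:
    """Hypothetical stand-in for AiHint, used only in this sketch."""
    url: str
    headers: Dict[str, str] = field(default_factory=dict)
    reason: str = ""


def choose_target(
    requested_url: str,
    hint: Optional[Hint],
    is_allowed: Callable[[str], bool],
) -> Tuple[str, Dict[str, str]]:
    """Use the hinted URL only if policy allows it; otherwise ignore the hint."""
    if hint is not None and is_allowed(hint.url):
        return hint.url, dict(hint.headers)
    return requested_url, {}


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /raw/"])


def is_allowed(url: str) -> bool:
    return robots.can_fetch("WebCanon", url)


requested = "https://example.com/docs/api"
good = choose_target(
    requested,
    Hint("https://example.com/docs/api.md", {"Accept": "text/markdown"}),
    is_allowed,
)
bad = choose_target(requested, Hint("https://example.com/raw/api.md"), is_allowed)
print(good[0], bad[0])
```

The sketch covers only the hint. The originally requested URL would still need its own policy check before being fetched.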

CLI

webcanon fetch https://example.com/docs/api --ai --llms prefer --robots respect
webcanon fetch https://example.com/docs/api --json --report report.json
webcanon inspect https://example.com/docs/api

Status

This is v0.1 — the URL retrieval quality baseline:

  • URL normalization & origin extraction
  • robots.txt fetch + RFC 9309 evaluation engine
  • llms.txt parsing + LLM-friendly URL resolution
  • sitemap.xml parsing (URL discovery)
  • SSRF-guarded HTTP fetch with per-redirect re-checks
  • HTML → Markdown extraction (stdlib) with hidden-text warnings
  • Provenance-bearing JSON output
  • CLI (fetch, inspect)
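
To make "SSRF-guarded HTTP fetch with per-redirect re-checks" concrete, here is a minimal sketch of the general technique using only the standard library. It is not WebCanon's guard: `check_url`, `check_redirect_chain`, and the `FAKE_DNS` table are hypothetical, and DNS resolution is injected as a callable so the example runs offline.

```python
import ipaddress
from typing import Callable, Iterable
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def check_url(url: str, resolve: Callable[[str], Iterable[str]]) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        # The host may already be an IP literal such as 127.0.0.1 or ::1.
        addresses = [str(ipaddress.ip_address(parts.hostname))]
    except ValueError:
        addresses = list(resolve(parts.hostname))
    return bool(addresses) and all(is_public_ip(a) for a in addresses)


def check_redirect_chain(urls: Iterable[str], resolve) -> bool:
    """Re-run the same check on every hop, not just the first URL."""
    return all(check_url(u, resolve) for u in urls)


# Fake resolver data for the sketch; a real guard would query DNS.
FAKE_DNS = {
    "example.com": ["93.184.216.34"],
    "internal.example": ["10.0.0.5"],
    "metadata.example": ["169.254.169.254"],
}


def resolve(host: str):
    return FAKE_DNS.get(host, [])


print(check_url("https://example.com/docs", resolve))
print(check_redirect_chain(
    ["https://example.com/a", "http://metadata.example/latest"], resolve
))
```

Re-checking each hop matters because an allowed public URL can redirect to an internal address. A production guard would also need to connect to the exact address it validated, so that a second DNS lookup cannot return a different answer.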

See docs/ for the architecture, policy model, robots compliance, llms.txt resolution, extraction quality, security model, and the roadmap.

License

Apache-2.0. See LICENSE.

