Sift — deterministic website indexing for grep-first LLM agents

sift

Deterministic, content-hashed website indexing for grep-first AI agents — served over MCP.

sift turns any website you can reach by URL into a complete, always-current, verifiable corpus that an AI agent reads over MCP as files on disk, not vectors. Every page is content-hashed and dated, so any answer can be traced back to the exact source, hash, and snapshot. Self-hosted: your data and your proof stay yours.

  • Provable — same input → same content_hash → same Merkle root; a hash-chained changelog; optional GPG-signed snapshots; per-read verify=true.
  • Any site, self-hosted — point it at any http(s) site (static HTML, or JS-rendered SPAs via the optional browser path). A pluggable SiteProfile handles per-site logic with no core changes.
  • Complete & grep-native — the full crawled corpus as markdown + structured facts that agents read / grep / glob / query — not a few browsed pages, not opaque vector similarity.
  • Incremental & low-ops — conditional GETs re-extract only what changed; bump a transformer version and re-derive from cached raw with no refetch.

Open core. This repository is the open-source engine (pipeline + MCP server), Apache-2.0, and runs fully on its own. A hosted platform built on it is in development.

Quickstart · Architecture · CLI · MCP server · Integrity · Develop · Contributing


Scope — what sift indexes

Today: any http(s) URL — HTML pages and PDFs. Discover URLs from a sitemap.xml, whole-domain sitemap auto-discovery, a Firecrawl map, or a plain URL list. JS-rendered SPAs go through the optional Playwright path; bot-blocked or rate-limited hosts through the optional Firecrawl fallback. Works on public sites and on internal ones your machine can reach (add the host to the allow-list).

Not yet (roadmap — and good first contributions): non-URL sources — local files and folders, git repos, API-only knowledge bases (Notion, Confluence, Slack, Google Drive), and databases. The pipeline is source-agnostic once content is in, so these land as ingestion connectors.


Quickstart

Requires Python 3.11+.

git clone https://github.com/dvlshah/sift.git && cd sift
pip install -e .

# 1. create an index root
sift init --root ./index

# 2. seed URLs — ships with an ATO reference profile that needs no config
sift seed --root ./index --from-sitemap https://www.ato.gov.au/sitemap.xml

# 3. build a small index first — cap the crawl with --limit; --coverage-base
#    planned tells the coverage gate the cap was intentional
sift run --root ./index --limit 25 --coverage-base planned

# 4. verify end-to-end integrity
sift verify --root ./index --skip-signature

# 5. serve it to an agent over MCP (read-only)
sift-mcp --root ./index

Indexing a different site? Drop a sift.toml next to your index with the generic profile + host allow-list:

[site]
profile = "sift.sites.generic:GenericProfile"

[seed]
host_allow = ["docs.example.com"]

Then seed and run against that config:

sift seed --root ./index --config sift.toml --from-domain https://docs.example.com
sift run  --root ./index --config sift.toml --limit 25 --coverage-base planned

Indexing JS-rendered SPAs needs the optional browser stack:

pip install -e ".[browser]" && python -m playwright install chromium

What you get

After a run, the index root contains:

<root>/
├── manifest.db                  SQLite — single source of truth for URL state
├── raw/<aa>/<sha256>.html.gz    Content-addressed raw HTML/PDF blobs
├── changelog.jsonl              Append-only, hash-chained per-content-change log
├── current/                     Symlink → the most-recent passing snapshot
├── runs/<run_id>/
│   ├── INDEX.md                 Always-loaded pointer table for agents
│   ├── routes.tsv               url → md_path map (grep/awk friendly)
│   ├── sections/<top>/INDEX.md  Per-section drill-down indexes
│   ├── md/<url-path>.md         Markdown mirror of the URL tree
│   ├── facts/<schema>/*.json    Atomic structured records (rate tables, etc.)
│   ├── artifacts/by_guide/*.md  Multi-page guide rollups
│   └── snapshot.json            Gate results, version pins, Merkle root, gpg sig (opt)
└── backups/manifest-*.db        Online SQLite backups (run on cron)

Every markdown file leads with YAML frontmatter: URL, fetch timestamp, raw and content hashes, tier, audience, financial years (FY), anchors, and four version pins (crawler, extractor, normalizer, classifier). Re-verify any file on its own, without reading the rest of the corpus, by re-normalizing the body and comparing its SHA-256 to the stored content_hash.
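The per-file check described above might look roughly like the following. This is a minimal sketch, not sift's implementation: it assumes the frontmatter is fenced by `---` lines and holds simple `key: value` pairs, and `normalize_for_hash` here is a stand-in that only tidies whitespace, whereas the real normalizer is versioned and site-aware.

```python
import hashlib


def normalize_for_hash(body: str) -> str:
    # Stand-in normalizer (assumption): unify line endings, strip trailing
    # whitespace per line, and drop leading/trailing blank lines.
    lines = [line.rstrip() for line in body.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip("\n")


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    # Assumes a flat "key: value" frontmatter block fenced by '---' lines.
    if not text.startswith("---\n"):
        raise ValueError("no frontmatter found")
    header, _, body = text[4:].partition("\n---\n")
    meta: dict[str, str] = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, body


def verify_md(text: str) -> bool:
    # Re-hash the normalized body and compare with the stored content_hash.
    meta, body = split_frontmatter(text)
    digest = hashlib.sha256(normalize_for_hash(body).encode("utf-8")).hexdigest()
    return digest == meta.get("content_hash")
```

Only the one file is read, so the cost depends on that file's size and not on the size of the corpus.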


Architecture

Seeding plus five pipeline phases, run in sequence; each is idempotent and resumable from a checkpoint:

 seed    ──►  Add URLs to the manifest (tier + parent_guide assigned per site profile)
 plan    ──►  Per-URL decision: FETCH / FETCH_CONDITIONAL / SKIP / TOMBSTONE_PURGE
              (pure function of manifest state, sitemap lastmod, clock, versions)
 fetch   ──►  HTTP (async httpx + per-host token bucket + conditional GETs) or,
              per profile, the Playwright browser path. Raw stored by SHA-256.
 extract ──►  HTML→markdown (trafilatura) / PDF→text (pypdf); deterministic
              anchor injection + hash normalization → content_hash
 commit  ──►  One SQLite transaction applies all outcomes; appends chained
              entries to changelog.jsonl per content change
 publish ──►  5 verification gates → atomic symlink swap to current/;
              Merkle root over all content_hashes written to snapshot.json

Each transformation is versioned independently (CRAWLER_VERSION, EXTRACTOR_VERSION, NORMALIZER_VERSION, CLASSIFIER_VERSION, INTEGRITY_VERSION) — bump one and sift re-extract re-derives from cached raw with no network. Failures are contained per-URL: one bad page never breaks a snapshot, and the coverage gate blocks publish if too many URLs are non-terminal.


CLI reference

--root is required on every command; --config PATH (default ./sift.toml / ./sift.local.toml) is accepted on the pipeline commands. CLI flags override config.

Pipeline

| Command | Purpose |
| --- | --- |
| sift init | Create manifest.db; surface changelog state |
| sift seed | Add URLs via --from-sitemap / --from-domain / --from-firecrawl-map / --from-json |
| sift plan / fetch / extract / commit | Run a single phase (--run-id for fetch/extract/commit) |
| sift run | plan → fetch → extract → commit → publish, with per-phase timings (--limit, --tier, --rate, --coverage-base, --firecrawl-fallback, --only-urls) |
| sift publish --run-id ID | 5 verification gates + atomic symlink swap |
| sift status | Counts by state + tier, version pins, recent runs |
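The atomic symlink swap performed at publish time is a common POSIX pattern: create the new link under a temporary name, then rename it over the old one. The following is a sketch of that pattern only; the temporary file name and the use of a relative link target are assumptions, not details confirmed by sift.

```python
import os
from pathlib import Path


def swap_current(root: Path, run_id: str) -> None:
    # Build the new link under a temporary name, then rename it over
    # 'current'. rename(2) is atomic on POSIX, so readers see either the
    # old snapshot or the new one, never a missing link.
    target = Path("runs") / run_id  # relative target (assumption)
    tmp = root / ".current.tmp"     # hypothetical temporary name
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp, target_is_directory=True)
    os.replace(tmp, root / "current")
```

Deleting `current` and recreating it would leave a brief window with no snapshot at all; the rename avoids that window.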

Operational

| Command | Purpose |
| --- | --- |
| sift re-extract | Re-derive content_hashes from cached raw (no network); preserves the changelog. Run after an extractor/normalizer version bump |
| sift purge | Drop manifest rows whose plan decision is TOMBSTONE_PURGE (--dry-run to preview) |
| sift backup [--to PATH] [--keep N] | Online SQLite backup, safe under concurrent writes |
| sift verify-backup BACKUP | PRAGMA integrity_check + schema sanity on a backup |

Integrity & read access

| Command | Purpose |
| --- | --- |
| sift verify [--skip-signature] | Merkle root + changelog chain + optional GPG, in one |
| sift verify-snapshot / verify-changelog / verify-signature | The individual integrity checks |
| sift manifest-query "SELECT ..." | Read-only SQL against manifest.db (refuses non-SELECT/WITH) |
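For readers who want the same read-only guarantee from their own scripts, here is a minimal sketch using Python's sqlite3 module. It mirrors the behavior described for sift manifest-query (reject anything but SELECT/WITH) and adds SQLite's read-only URI mode as a second layer; the function name is hypothetical, and no manifest schema is assumed.

```python
import sqlite3
from pathlib import Path


def query_manifest(db_path: Path, sql: str) -> list[tuple]:
    # First layer: accept only statements that start with SELECT or WITH.
    words = sql.split(None, 1)
    if not words or words[0].upper() not in {"SELECT", "WITH"}:
        raise ValueError("only SELECT/WITH statements are allowed")
    # Second layer: mode=ro makes SQLite itself refuse any write, which
    # also covers a WITH clause that wraps a data-modifying statement.
    conn = sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

A schema-agnostic first query is `SELECT name FROM sqlite_master WHERE type='table'`, which lists the tables without assuming any column names.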

Configuration

A single TOML file (sift.toml in cwd, or --config PATH) controls everything tunable:

[site]
profile = "sift.sites.ato:ATOProfile"   # or sift.sites.generic:GenericProfile

[fy]
current_start_year = 2025                # FY cutoff for the FROZEN tier

[crawl]
rate_per_sec = 5.0                       # per-host token bucket
concurrency  = 8
timeout_sec  = 30.0
retries      = 3

[publish]
coverage_floor   = 0.99                  # fraction of seeded URLs that must reach a terminal state
hash_sample_rate = 0.01                  # 1% of md files re-hashed each publish
gpg_key_id       = ""                    # optional: detach-sign snapshot.json

[seed]
host_allow             = ["www.ato.gov.au"]
use_default_excludes   = true
extra_exclude_patterns = ["^/other-languages/"]

[browser]                                # optional; only used if a profile opts a URL in
enabled        = false                   # default off → SPAs become SKIPPED_BROWSER_DISABLED
wait_until     = "domcontentloaded"      # profiles can override (ATO uses "networkidle")

# [tiers.NEWS] / [tiers.LIVING] / [tiers.CURRENT_FORMS] / [tiers.FROZEN]
# each: floor_days, ceiling_days, tombstone_ttl_days, max_failures

MCP server

sift-mcp --root /path/to/index exposes 7 read-only tools over stdio, for grep-first agents:

| Tool | Purpose |
| --- | --- |
| snapshot_status | Published yes/no, run_id, gate results, artifact inventory. Call first. Never errors. |
| grep_corpus | Regex over the markdown tree; best for identifiers/exact phrases (capped at 200 matches) |
| read_md | Read one markdown file (offset/limit to page; verify=true re-hashes before you cite) |
| read_facts | Read one facts/<schema>/*.json with $schema + source_url + content_hash provenance |
| glob_corpus | List files by fnmatch glob (capped at 500) |
| list_dir | Cheap directory enumeration |
| query_manifest | Read-only SQL against manifest.db for cross-cutting queries |

Read-only by default; hard-fails with an actionable message if no current/ snapshot exists. Output is capped per tool — locate with grep_corpus, then drill in with read_md (offset/limit).
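The locate-then-drill-in pattern can be pictured with a local stand-in for grep_corpus. This sketch is not the server's code: the function signature, the returned tuple shape, and the sorted traversal order are assumptions; only the regex search over markdown files and the 200-match cap come from the table above.

```python
import re
from pathlib import Path


def grep_corpus(md_root: Path, pattern: str, max_matches: int = 200) -> list[tuple[str, int, str]]:
    # Returns (relative path, 1-based line number, line) triples and stops
    # as soon as the cap is reached.
    rx = re.compile(pattern)
    hits: list[tuple[str, int, str]] = []
    for path in sorted(md_root.rglob("*.md")):  # sorted for stable output
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((path.relative_to(md_root).as_posix(), lineno, line))
                if len(hits) >= max_matches:
                    return hits
    return hits
```

The line number from a hit is what an agent would pass on as an offset when reading the surrounding text.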

Multi-index mode — point --root at a parent directory of several index roots and the server auto-exposes list_indexes plus an index=<slug> parameter on every content tool (index="*" fans out the read tools).

Write tools: add --enable-index to expose index_url (seeds allow-listed URLs and triggers a background crawl, returning a run_id immediately) and index_status (poll by run_id). Each index allows one in-flight crawl, and --max-concurrent-crawls (default 4) caps the total across indexes. Each crawl runs as an isolated sift seed && sift run subprocess, so a failed fetch can't take down the read server. Write tools are off by default; the standard deployment is strictly read-only.
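The two concurrency limits on write tools (one in-flight crawl per index, plus a global cap) can be modeled with a small bookkeeping class. This is a sketch under stated assumptions: `CrawlSlots` and its methods are hypothetical names, and the real server may track crawls quite differently.

```python
import threading


class CrawlSlots:
    """Tracks in-flight crawls: one per index, at most max_concurrent overall."""

    def __init__(self, max_concurrent: int = 4) -> None:
        self._max = max_concurrent
        self._active: set[str] = set()
        self._lock = threading.Lock()

    def try_acquire(self, index: str) -> bool:
        # Refuse if this index is already crawling or the global cap is hit.
        with self._lock:
            if index in self._active or len(self._active) >= self._max:
                return False
            self._active.add(index)
            return True

    def release(self, index: str) -> None:
        with self._lock:
            self._active.discard(index)
```

A caller would acquire a slot, launch the crawl subprocess, and release the slot when that process exits, whether it succeeded or failed.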

Wire into Claude Code / Cursor / Codex:

{
  "mcpServers": {
    "sift": { "command": "sift-mcp", "args": ["--root", "/abs/path/to/index"] }
  }
}

Integrity guarantees

| Property | Mechanism | Verified by |
| --- | --- | --- |
| Same input → same content_hash | Deterministic extract + versioned normalize_for_hash | tests/test_integrity.py, sift-evals determinism |
| Snapshot is bit-identical to publish time | Merkle root over all (url, content_hash) in snapshot.json | sift verify-snapshot |
| Changelog hasn't been tampered with | SHA-256 chain: entry_hash = sha256(prev_hash ‖ canonical(entry)) | sift verify-changelog |
| Per-file integrity on agent reads | read_md verify=true re-hashes the body vs. frontmatter | MCP returns isError on mismatch |
| Every FRESH row has a real md file | Publish gate manifest_fs_integrity | publish blocks on orphan/missing files |
| Every facts/*.json validates against its $schema | Publish gate facts_validation (Draft 2020-12) | publish blocks on invalid facts |
| Optional cryptographic signature | [publish].gpg_key_id → gpg --detach-sign | sift verify-signature |
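The changelog chain formula, entry_hash = sha256(prev_hash ‖ canonical(entry)), can be exercised in a few lines. The sketch below makes several assumptions that the source does not specify: canonical form is JSON with sorted keys and compact separators, the first entry chains from 64 zeros, hashes are concatenated as hex text, and each record stores `entry`, `prev_hash`, and `entry_hash` fields.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first entry


def canonical(entry: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")


def entry_hash(prev_hash: str, entry: dict) -> str:
    return hashlib.sha256(prev_hash.encode("ascii") + canonical(entry)).hexdigest()


def append(chain: list[dict], entry: dict) -> None:
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    chain.append({"entry": entry, "prev_hash": prev, "entry_hash": entry_hash(prev, entry)})


def verify_chain(chain: list[dict]) -> bool:
    # Walk from the start; any edited, dropped, or reordered record breaks
    # either the prev_hash link or the recomputed entry_hash.
    prev = GENESIS
    for record in chain:
        if record["prev_hash"] != prev:
            return False
        if record["entry_hash"] != entry_hash(prev, record["entry"]):
            return False
        prev = record["entry_hash"]
    return True
```

Because each hash covers the previous one, changing an early entry invalidates every later record, which is what makes tampering detectable from the file alone.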

Known gaps: no content-pinning against the source server (TLS is the fetch-time root of trust); the MCP per-read hash isn't chained back to the GPG signature automatically; no built-in off-machine storage (pair sift backup with rclone/rsync).
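For intuition about the Merkle root over (url, content_hash) pairs, here is a minimal sketch. The leaf encoding, the sort by URL, and the rule of duplicating the last node on odd levels are all assumptions chosen for illustration; sift's actual tree construction is pinned by its own INTEGRITY_VERSION and may differ.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(pairs: dict[str, str]) -> str:
    # Sort by URL so the root does not depend on crawl order.
    level = [
        _h(f"{url}\n{content_hash}".encode("utf-8"))
        for url, content_hash in sorted(pairs.items())
    ]
    if not level:
        return _h(b"").hex()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()
```

A single changed content_hash changes the root, so comparing one 64-character value is enough to detect any drift in the snapshot.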


Site profiles

Every site-specific decision lives in a SiteProfile subclass under sift/sites/ — URL→tier classification, parent_guide extraction, default excludes, dynamic-content patterns stripped before hashing, section taxonomy, facts schemas, and browser routing. The core pipeline never names a site. Ships generic (every URL LIVING, no facts, HTTP only — the right starting point for any site), generic_browser, and reference profiles (ato, augov, mdn, python_docs, stripe); the default is sift.sites.ato:ATOProfile (~330 lines).

Adding a site is usually a small subclass — no core changes:

# sift/sites/irs.py
import re
from . import SiteProfile

class IRSProfile(SiteProfile):
    name = "irs"
    primary_host = "www.irs.gov"

    @property
    def default_excludes(self):
        return (r"^/coronavirus/", r"^/spanish/")

    def classify_tier(self, url, current_year_start):
        ...   # IRS uses calendar years, not FY

Then set profile = "sift.sites.irs:IRSProfile" in sift.toml, reseed, and run.
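As a rough picture of how exclude patterns like the ones in the profile above could be applied during seeding, consider the sketch below. It assumes the patterns are regular expressions matched against the URL path only; the `is_excluded` helper is hypothetical and not part of the SiteProfile API.

```python
import re
from urllib.parse import urlsplit

# Same patterns as the IRSProfile example.
DEFAULT_EXCLUDES = (r"^/coronavirus/", r"^/spanish/")


def is_excluded(url: str, patterns: tuple[str, ...] = DEFAULT_EXCLUDES) -> bool:
    # Assumption: patterns are matched against the URL path, not the full URL.
    path = urlsplit(url).path or "/"
    return any(re.search(p, path) for p in patterns)
```

The `^` anchor matters here: `/spanish/forms` is excluded, while `/forms/spanish/` is not, because only paths that start with the prefix match.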


Development

pip install -e ".[dev,evals]"   # runtime + test + eval-suite deps
pytest -q                        # full suite — hermetic (HTTP mocked), no network needed
ruff check . && ruff format .    # lint + format

The optional eval harness is the sift-evals CLI (installed via the [evals] extra) — performance, determinism, structural-fidelity, facts, and agent-in-the-loop benchmarks (sift-evals --help). See CONTRIBUTING.md for the full guide: conventional commits, the SiteProfile extension path, the determinism invariant, and CI (every PR runs the suite on Python 3.11 / 3.12 / 3.13).


Project status

0.1.0 — initial public release. Full test suite green on Python 3.11–3.13. Known limitations (PRs welcome):

  • No run-dir / raw-blob garbage collection yet — storage grows; reclaim with rm -rf runs/<old> + manifest VACUUM.
  • Logging is stdout-only (no structured logging); no alerting beyond cron exit codes.
  • MCP transport is stdio only — wrap with an HTTP/MCP proxy to host it.
  • One facts extractor is wired (rate tables); other schemas exist without extractors.
  • Kasada-class anti-bot remains out of reach; the Firecrawl path handles most Cloudflare/Akamai.

Contributing

Bug reports and features via GitHub Issues; see CONTRIBUTING.md. Found a security issue? Follow the private disclosure process in SECURITY.md — please don't open a public issue.

License

Apache-2.0 — Copyright © 2026 Deval Shah.
