schemabrain

The SQL firewall between AI agents and your production database.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

arunkc

These details have not been verified by PyPI

Project description

schemabrain — the safety and schema intelligence layer for AI systems that interact with databases

Works with Claude Desktop · Claude Code · Cursor · Zed · any MCP host

Stop giving AI agents raw database connection strings.
SchemaBrain is the SQL firewall between AI agents and your production database — twelve read-only tools, validated metrics, tamper-evident audit.

The agent never writes SQL. SchemaBrain does, from definitions you control.

Three guarantees that close the trust gap between AI agents and your database:

Read-only by architecture — twelve MCP tools, none of which can write. No execute() tool, no query() tool, no path from agent prompt to a write at your database.
PII refusal at retrieval — PII tags propagate from the physical schema through joins and metrics. If a query touches a blocked category, SchemaBrain refuses before the database is queried.
Cryptographic audit chain — every call, refusal, and recovery lands in a SHA256-hashed append-only log. audit verify exits non-zero if any past row was rewritten.

pip install schemabrain
schemabrain init
# then: Cmd+Q Claude Desktop, relaunch, and ask: "list the entities SchemaBrain knows about"

Cost: ~$0.03 to index the bundled demo · $0 to re-index unchanged schemas. Detail in Sample session.

Status: 0.4.0 (alpha). Postgres + SQLite supported today. Snowflake / BigQuery / MySQL on the roadmap.

Quickstart — 3 steps from pip install to a working agent
The firewall — what SchemaBrain enforces at the SQL boundary
Sample session — real Claude Desktop interaction against the bundled fixture
Where it's going — honest disclaimer about what's not built yet
Roadmap — shipped + in progress + planned
Troubleshooting — the five most common first-run failures
Documentation — deeper guides

Read next based on what you need:

Goal	Where to go
Try it on the bundled fixture	Quickstart
Understand the firewall properties	The firewall
Plug into your own agent loop	`docs/setup.md`
Build a semantic layer	`docs/semantic-layer.md`
Run in production (audit, drift, Docker)	`docs/operations.md`
Observe the agent (tail, audit log, OTel)	`docs/observability.md`
Compare with Vanna / Atlan / dbt-mcp / WrenAI	`docs/landscape.md`

Quickstart

Three steps from pip install to a working Claude Desktop integration. If you paste your own Postgres URL — no Docker needed, ~30s. Press Enter for the bundled demo and init invokes Docker + downloads a ~67 MB embedding model first time; ~45s once cached.

1. Install

pip install schemabrain
schemabrain --version

Source install (git clone + uv sync --extra dev) is documented in docs/setup.md.

2. Run the activation wizard

schemabrain init

init is a seven-stage wizard that takes you from "I have a Postgres database" to "Claude Desktop can answer questions about it" in one command. On first run it prompts for what it needs:

A Postgres URL — paste your own connection string, or press Enter to spin up a local demo Postgres container with the bundled e-commerce fixture (Docker is invoked automatically; idempotent on re-runs).
An ANTHROPIC_API_KEY — optional. Skip and the wizard still wires Claude Desktop; entity curation can run later.

SchemaBrain init — activation wizard

  [1/7] Source check       ✓ source reachable + read-only
  [2/7] Index schema       ✓ 7 tables, 30 columns indexed
  [3/7] Curate entities    ✓ 6 entities suggested + applied (cost: $0.01)
  [4/7] Curate metrics     ✓ 10 metrics suggested + applied (cost: $0.03)
  [5/7] Curate joins       ✓ 5 canonical joins created (FK-mined, no LLM)
  [6/7] Wire host          ✓ wrote schemabrain entry to claude_desktop_config.json
  [7/7] Next               ✓ restart your MCP host, then ask: "list the entities SchemaBrain knows about"

Full wizard reference (stages explained, flags, dbt auto-detection, --print-only for non-Claude-Desktop hosts, --no-entities / --no-metrics / --no-joins opt-outs, cost-cap pauses): docs/setup.md.

3. Restart Claude Desktop and ask

Quit Claude Desktop fully — Cmd+Q, not just close the window. The MCP config is only read on cold start.
Relaunch.
New conversation:

list the entities SchemaBrain knows about

If Claude calls list_entities and reports user, order, etc., you're done. If not, see Troubleshooting.

After the wizard, schemabrain inspect shows what the agent has and schemabrain tail streams every tool call live — see docs/operations.md.

The firewall

schemabrain architecture: agent talks to schemabrain over MCP stdio (12 read-only tools); schemabrain emits parameterized SQL to Postgres; the schemabrain boundary is the trust boundary; audit log is tamper-evident.

Six properties SchemaBrain enforces at the SQL boundary today:

1. Read-only by architecture, not configuration

The MCP surface exposes twelve tools — none of which can write. There is no execute() tool, no query() tool, no path from agent prompt to a write at your database, regardless of session state. The read-only guarantee is structural, not a session flag the agent can flip. Stage 1 of init additionally pins default_transaction_read_only=on on the connection as belt-and-suspenders against a misconfigured downstream.

2. PII-aware refusal at the `get_metric` tool boundary

Any get_metric touching a blocked PII category returns a refused envelope; the compiled SQL never runs and the refusal lands in mcp_audit as status='refused', refusal_reason='pii_blocked'. describe_entity enforces the same policy at the column level — the agent still sees the entity and its non-PII columns, but blocked columns ship with redacted=True and the LLM-enriched description cleared. schemabrain init writes --pii-block contact into the Claude Desktop snippet by default so email / phone / address columns are blocked on a fresh install; widen with --pii-block contact,health and other categories as needed.

schemabrain serve --pii-block contact,health

Twelve categories from GDPR, CCPA/CPRA, HIPAA, PCI DSS, ISO 27018 — tagged per-column at index time. Detection is column-name patterns + redacted-sample inspection — see schemabrain/pii/classifier.py for the full rule set.

Enforcement scope: binding is at the get_metric compile path today. Lower-level tools (describe_entity, resolve_join, describe_table) surface the redacted=True flag + pii_categories as advisory metadata so the agent can self-regulate, but they don't refuse. Uniform SQL-layer enforcement against agent-emitted SQL ships in v2 — see Where it's going.

PII classification →

3. Tamper-evident audit log

Every tool call under schemabrain serve writes one row to an append-only mcp_audit table — PII categories per row, content-addressable fingerprints, sha256 hash chain. audit verify re-walks the chain and exits non-zero if any past row was rewritten. Detects post-hoc tampering by any process without write access to the audit table; streaming the chain to an external immutable store is on the v3 roadmap (mcp_audit is local SQLite today).

schemabrain audit verify   # exit 0 = chain clean

Tamper-evident audit log →

4. Failure is a contract, not a string

Refused and degraded calls return a structured recovery.suggested_args block — not a message agents have to parse. PII blocks ship the entity name to retry; ambiguous time dimensions ship the candidate to pick; unknown joins ship the next tool to call. Agents act on the contract programmatically.

{
  "status": "refused",
  "kind": "ambiguous_time_dimension",
  "recovery": {
    "suggested_tool": "get_metric",
    "suggested_args": {"time_dimension": "order.placed_at"}
  }
}

Charter v1.2 envelope reference →

5. Compile path: definitions → parameterized SQL

Entities, metrics, and canonical joins compile to parameterized SQL SchemaBrain runs on its side. The agent sees rows + the SQL that was run — never arbitrary statements at your database. LLM-suggested definitions during init are reviewed and applied explicitly; nothing reaches the runtime store until you accept it.

Build your semantic layer →

6. Pluggable into any agent loop

The same MCP stdio surface Claude Desktop sees is exposed to any host that speaks MCP — your own Anthropic, OpenAI, or LangGraph loop included. examples/anthropic_demo.py is a 230-LOC drop-in that wires Claude Haiku 4.5 to schemabrain serve and prints exactly which tools the agent chose to call.

python examples/anthropic_demo.py \
    --url-env DATABASE_URL \
    --question "..."

~$0.005–0.02 per run on Haiku 4.5, bounded by --max-turns.

Anthropic SDK walkthrough →

Sample session

Real Claude Desktop session against the bundled e-commerce fixture (7 tables, 30 columns, 6 entities curated for ~$0.03):

You: Using SchemaBrain, write me a SQL query to compute each customer's total spend by product category.

Claude: (calls describe_entity for order_item, then chains resolve_join along user → order → order_item → product. Each FK-derived hop returns inference_method: fk_constraint / confidence: HIGH. The last hop — resolve_join(product, category) — refuses:)
{ "status": "error",
  "error": {
    "kind": "no_canonical_join",
    "message": "No canonical join exists between 'product' and 'category'...",
    "recovery": { "suggested_tool": "suggest_joins", ... } } }
(Claude verifies at the physical layer — describe_table(public.products) returns zero foreign keys; describe_table(public.categories) has only id, name, parent_id self-ref; find_relevant_tables for "junction" returns empty.)

Verdict up front: I can't write that query correctly, and you shouldn't want me to — the data model can't answer it as it stands. product has no category_id and no junction table exists. So there is no relationship anywhere in the schema between a product and a category. "Total spend by category" is unanswerable until that link exists — and I'm not going to invent it.

What I can give you — total spend per customer (this path is fully resolved: user → order → order_item, all fk_constraint joins, confidence: HIGH):
SELECT u.id                                            AS user_id,
       u.full_name                                     AS customer,     -- ⚠ pii / contact / redacted: true in the semantic layer
       SUM(oi.quantity * oi.unit_price_cents) / 100.0  AS total_spend   -- prices stored in integer cents
FROM   public.order_items oi
JOIN   public.orders      o ON oi.order_id = o.id
JOIN   public.users       u ON o.user_id   = u.id
GROUP BY u.id, u.full_name
ORDER BY total_spend DESC;
Two honest caveats:

Status filter omitted. orders.status is free TEXT and I haven't inspected its values; this counts every order. If you only want paid/fulfilled spend, add WHERE o.status = … once you confirm the allowed values.

PII flag. full_name is pii_sensitivity: pii / categories: [contact] / redacted: true in the semantic layer. Fine if you're running this yourself; not fine if it's flowing into an agent context.

The differentiator is what didn't happen: most LLM-over-database tools, asked this exact question, would confidently emit JOIN categories c ON p.category_id = c.id and produce SQL that errors against a column that isn't there. SchemaBrain refused — resolve_join returned kind: no_canonical_join with recovery.suggested_tool: suggest_joins, not prose. The agent acted on the structured recovery contract programmatically instead of fabricating a join. Refusal-not-fabrication is the safety mechanism, demonstrated live.

Cost. $0.001/column with Claude Haiku 4.5 + Sonnet 4.6 (Sonnet for the structured curation prompt). The bundled 7-table fixture (30 columns, 6 entities + 10 metrics + 5 joins) indexes + curates for **$0.03 in ~85s**. The Pagila DVD-rental sample (87 columns after partition deduplication) runs for $0.0299 in 105s. Re-indexing an unchanged schema is $0 — content-addressable fingerprinting skips the LLM call entirely.

To verify Claude's SQL is mechanically correct (and that flagged caveats are the actual data behavior), see Validating SQL Claude generates.

Run this exact session yourself: schemabrain init walks you to a wired Claude Desktop in one command; then ask Claude "Using SchemaBrain, write me a SQL query to compute each customer's total spend by product category." and watch the refuse-then-pivot live.

Where it's going

SchemaBrain is being built as the SQL-boundary safety layer for AI agents — the layer that parses what your agent is about to ask the database and refuses (or rewrites) before it runs.

That layer needs a semantic substrate underneath it. You can't refuse "this query touches PII" without knowing which columns are PII. You can't rewrite "join through this junction" without canonical-join definitions. You can't validate a metric without knowing its grain.

So the engineering order is schema intelligence → semantic substrate → safety primitives. The first two are shipped (v0.5 + v1); the third — validate_query for agent-emitted SQL and execute with hard caps — is the next major milestone. Today the product gives you PII-aware refusal at the get_metric boundary, structured recovery on every refused or degraded call, and tamper-evident audit — all running against parameterized SQL the agent never sees. If you need parse-before-execute over arbitrary agent-emitted SQL, track the roadmap.

Roadmap

The v0.5 / v1 / v2 / v3 labels are roadmap milestone names, not package versions. The package follows strict semver — 1.0.0 is reserved for an API that's been battle-tested by external users without a forced break. See ADR-0003.

v0.5 — schema intelligence (shipped):

Agent-UX charter v1.0 retrofit on existing tools + CI enforcement ✓
Dev-UX foundations: rich progress UI, guided errors, --dry-run ✓
Query log mining via pg_stat_statements (schemabrain mine-queries) ✓
5 physical-schema MCP tools including get_example_queries ✓

v1 — semantic substrate (shipped):

Entities, metrics, canonical joins as first-class persisted definitions ✓
LLM-suggested entity / metric / join definitions from FK graph + column descriptions ✓
5 semantic-layer MCP tools (find_relevant_entities, list_entities, describe_entity, resolve_join, get_metric) ✓
Composite-expression measures — SUM(unit_price * quantity) over the same anchor table ✓
Multi-hop canonical-join chains — BFS over the join graph with via= disambiguation ✓
Drift detection (schemabrain check) ✓
PII-aware refusal at the get_metric boundary ✓
Tamper-evident audit log with sha256 chain ✓

v1.x — engine breadth (in progress):

One additional engine: Snowflake / BigQuery / MySQL
BIRD Mini-Dev automated eval harness
Pre-built multi-platform Docker image on a public registry

v2 — SQL-boundary safety wedge:

validate_query — agent-emitted SQL parsed and judged against policy before execution
execute with hard caps — read-only role enforced at the database layer, statement timeouts, row caps, per-call cost guards
Sub-query refusal with recovery — parse the SQL, identify the unsafe fragment, refuse just that fragment with a suggested rewrite

v3 — multi-engine + control plane (commercial, gated on hosted demand):

Remaining engines (BigQuery / Snowflake / Redshift breadth)
Learning loop from telemetry and reformulation patterns
Hosted control plane with fleet-wide adversarial-signature aggregation

Troubleshooting

The five most common first-run failures. Full troubleshooter in docs/setup.md.

pip install schemabrain gave me an older version. Check schemabrain --version. If it doesn't match the latest release your pip cache is stale — run pip install --upgrade schemabrain. schemabrain init writes the same version into the Claude Desktop snippet (uvx schemabrain==<pin>) so it stays reproducible across restarts; bump the pin in the snippet manually after upgrading via pip.
init reports source unreachable. Postgres may not be ready on first run — wait a few seconds and re-run. For your own database, verify host, port, and credentials. Connection URLs in any form are accepted (postgresql://, postgres://, postgresql+psycopg://).
The first init or schemabrain index hangs for ~60 seconds. Normal. The first index downloads the ONNX embedding model (~67 MB) and makes one LLM call per column. Subsequent runs are fast.
init fails at stage 6 "wire host". Claude Desktop must be installed first — SchemaBrain writes into its config file, which doesn't exist until Claude Desktop has launched at least once.
Claude Desktop doesn't show SchemaBrain after restart. Cmd+Q is required (close-window doesn't trigger a re-read of MCP config). Run schemabrain doctor to verify the config landed. If doctor says everything's good but Claude Desktop still doesn't see the tool, check ~/Library/Logs/Claude/mcp*.log.

Documentation

Doc	What's inside
`docs/setup.md`	Activation wizard, Claude Desktop / Code / Cursor wiring, Anthropic SDK demo, troubleshooting, validating Claude's SQL
`docs/semantic-layer.md`	Building entities, metrics (incl. composite expressions), canonical joins (incl. multi-hop), dbt import
`docs/operations.md`	`inspect`, `check` (drift), `index --dry-run`, Docker compose
`docs/observability.md`	`tail`, audit log, OTel export, PII classification
`docs/mcp-tools.md`	Full reference for all 12 MCP tools
`docs/architecture.md`	Pipeline, retrieval contract, cache logic, cost model, eval
`docs/landscape.md`	Comparison vs Vanna / Atlan / dbt-mcp / WrenAI; "is this a semantic layer?"
`docs/threat-model.md`	Security model + boundaries
`docs/adr/`	Architecture decision records (audit/PII taxonomy, store protocol, versioning policy, observability bus)
`examples/`	Copy-paste-ready MCP configs, headless agent loop, end-to-end ecommerce walkthrough

FAQ

Does my data leave my machine? Only LLM-enriched column descriptions and the redacted sample values that feed them. Three regex passes (email, US SSN, credit-card-shaped digit runs) run on every sample before it leaves the profiler module — see schemabrain/profiler/stats.py. The Anthropic API call sends column metadata + redacted samples + sibling-column context — no raw rows. Embeddings are generated locally via fastembed (BAAI/bge-small-en-v1.5, ONNX, ~67 MB).

What databases work today? Postgres 16+ (primary target) and SQLite (for development and demos). Adding Snowflake / BigQuery / MySQL is mostly a new DataSource implementation plus a profiler tweak — on the v1.x roadmap.

Why MCP and not a REST API? The consumer is an agent, not a service. MCP standardizes tool registration, schema description, and request/response transport. Agents discover SchemaBrain natively and get its tool surface — no API wrapper, no SDK to maintain per language.

Is this a semantic layer like Cube or dbt Semantic Layer? Partially. SchemaBrain ships entities, metrics, and canonical joins as first-class persisted definitions today — agents call them via list_entities, describe_entity, resolve_join, get_metric. But the semantic layer isn't the headline; it's the substrate that makes the SQL-boundary safety primitives possible. Full comparison vs Cube / dbt-mcp / Vanna / WrenAI in docs/landscape.md.

More questions answered in docs/setup.md (why local embeddings, more troubleshooting).

Contributors

Contributing & License

PRs welcome. The bar is high — see CONTRIBUTING.md for the test-first / 99%-coverage / conventional-commits / architecture-invariants checklist. CI enforces all of it.

Bugs and feature requests use the structured templates in .github/ISSUE_TEMPLATE/. Issues without a reproduction (bugs) or a clear underlying problem (features) get closed with a request to re-open with the right info.

MIT.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

arunkc

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.6.0

Jun 13, 2026

0.5.0

Jun 1, 2026

This version

0.4.0

May 25, 2026

0.3.0

May 20, 2026

0.2.0a1 pre-release

May 15, 2026

0.1.0a1 pre-release

May 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

schemabrain-0.4.0.tar.gz (10.5 MB view details)

Uploaded May 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

schemabrain-0.4.0-py3-none-any.whl (625.3 kB view details)

Uploaded May 25, 2026 Python 3

File details

Details for the file schemabrain-0.4.0.tar.gz.

File metadata

Download URL: schemabrain-0.4.0.tar.gz
Upload date: May 25, 2026
Size: 10.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for schemabrain-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`b245b7abbfe8eda43585ea962b7c20d6b46ad6782a9ce76e1485d589af9589e3`
MD5	`97771478086fd53a6d30bb3df9fe51ee`
BLAKE2b-256	`6a97dbf0dfc8d21dbfa51dc4090a072556abdcff55bf8bf60f8d6a014336cd78`

See more details on using hashes here.

Provenance

The following attestation bundles were made for schemabrain-0.4.0.tar.gz:

Publisher: publish.yml on Arun-kc/schemabrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: schemabrain-0.4.0.tar.gz
- Subject digest: b245b7abbfe8eda43585ea962b7c20d6b46ad6782a9ce76e1485d589af9589e3
- Sigstore transparency entry: 1629255740
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: Arun-kc/schemabrain@28a30f53760953c1c4136be15af41ab7101fae1c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Arun-kc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28a30f53760953c1c4136be15af41ab7101fae1c
- Trigger Event: workflow_dispatch

File details

Details for the file schemabrain-0.4.0-py3-none-any.whl.

File metadata

Download URL: schemabrain-0.4.0-py3-none-any.whl
Upload date: May 25, 2026
Size: 625.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for schemabrain-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d77599bb3c63deb4d90411b0071692099c577955cb994477afaad2e572d28390`
MD5	`35a1bf57c3347adca6328703d781965d`
BLAKE2b-256	`80ff4e0e737a159eecf7e030a3ceaba84f15047c98155d8a2c0c819e5451e585`

See more details on using hashes here.

Provenance

The following attestation bundles were made for schemabrain-0.4.0-py3-none-any.whl:

Publisher: publish.yml on Arun-kc/schemabrain

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: schemabrain-0.4.0-py3-none-any.whl
- Subject digest: d77599bb3c63deb4d90411b0071692099c577955cb994477afaad2e572d28390
- Sigstore transparency entry: 1629255755
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: Arun-kc/schemabrain@28a30f53760953c1c4136be15af41ab7101fae1c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Arun-kc
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@28a30f53760953c1c4136be15af41ab7101fae1c
- Trigger Event: workflow_dispatch

schemabrain 0.4.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Contents

Quickstart

1. Install

2. Run the activation wizard

3. Restart Claude Desktop and ask

The firewall

1. Read-only by architecture, not configuration

2. PII-aware refusal at the get_metric tool boundary

3. Tamper-evident audit log

4. Failure is a contract, not a string

5. Compile path: definitions → parameterized SQL

6. Pluggable into any agent loop

Sample session

Where it's going

Roadmap

Troubleshooting

Documentation

FAQ

Contributors

Contributing & License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

2. PII-aware refusal at the `get_metric` tool boundary