The SQL firewall between AI agents and your production database.
Project description
Works with Claude Desktop · Claude Code · Cursor · Zed · any MCP host
Stop giving AI agents raw database connection strings.
SchemaBrain is the SQL firewall between AI agents and your production database — twelve read-only tools, validated metrics, tamper-evident audit.
The agent never writes SQL. SchemaBrain does, from definitions you control.
Three guarantees that close the trust gap between AI agents and your database:
- Read-only by architecture — twelve MCP tools, none of which can write. No
execute()tool, noquery()tool, no path from agent prompt to a write at your database. - PII refusal at retrieval — PII tags propagate from the physical schema through joins and metrics. If a query touches a blocked category, SchemaBrain refuses before the database is queried.
- Cryptographic audit chain — every call, refusal, and recovery lands in a SHA256-hashed append-only log.
audit verifyexits non-zero if any past row was rewritten.
pip install schemabrain
schemabrain init
# then: Cmd+Q Claude Desktop, relaunch, and ask: "list the entities SchemaBrain knows about"
Cost: ~$0.03 to index the bundled demo · $0 to re-index unchanged schemas. Detail in Sample session.
Status: 0.4.0 (alpha). Postgres + SQLite supported today. Snowflake / BigQuery / MySQL on the roadmap.
Contents
- Quickstart — 3 steps from
pip installto a working agent - The firewall — what SchemaBrain enforces at the SQL boundary
- Sample session — real Claude Desktop interaction against the bundled fixture
- Where it's going — honest disclaimer about what's not built yet
- Roadmap — shipped + in progress + planned
- Troubleshooting — the five most common first-run failures
- Documentation — deeper guides
Read next based on what you need:
| Goal | Where to go |
|---|---|
| Try it on the bundled fixture | Quickstart |
| Understand the firewall properties | The firewall |
| Plug into your own agent loop | docs/setup.md |
| Build a semantic layer | docs/semantic-layer.md |
| Run in production (audit, drift, Docker) | docs/operations.md |
| Observe the agent (tail, audit log, OTel) | docs/observability.md |
| Compare with Vanna / Atlan / dbt-mcp / WrenAI | docs/landscape.md |
Quickstart
Three steps from pip install to a working Claude Desktop integration. If you paste your own Postgres URL — no Docker needed, ~30s. Press Enter for the bundled demo and init invokes Docker + downloads a ~67 MB embedding model first time; ~45s once cached.
1. Install
pip install schemabrain
schemabrain --version
Source install (git clone + uv sync --extra dev) is documented in docs/setup.md.
2. Run the activation wizard
schemabrain init
init is a seven-stage wizard that takes you from "I have a Postgres database" to "Claude Desktop can answer questions about it" in one command. On first run it prompts for what it needs:
- A Postgres URL — paste your own connection string, or press Enter to spin up a local demo Postgres container with the bundled e-commerce fixture (Docker is invoked automatically; idempotent on re-runs).
- An
ANTHROPIC_API_KEY— optional. Skip and the wizard still wires Claude Desktop; entity curation can run later.
SchemaBrain init — activation wizard
[1/7] Source check ✓ source reachable + read-only
[2/7] Index schema ✓ 7 tables, 30 columns indexed
[3/7] Curate entities ✓ 6 entities suggested + applied (cost: $0.01)
[4/7] Curate metrics ✓ 10 metrics suggested + applied (cost: $0.03)
[5/7] Curate joins ✓ 5 canonical joins created (FK-mined, no LLM)
[6/7] Wire host ✓ wrote schemabrain entry to claude_desktop_config.json
[7/7] Next ✓ restart your MCP host, then ask: "list the entities SchemaBrain knows about"
Full wizard reference (stages explained, flags, dbt auto-detection, --print-only for non-Claude-Desktop hosts, --no-entities / --no-metrics / --no-joins opt-outs, cost-cap pauses): docs/setup.md.
3. Restart Claude Desktop and ask
-
Quit Claude Desktop fully — Cmd+Q, not just close the window. The MCP config is only read on cold start.
-
Relaunch.
-
New conversation:
list the entities SchemaBrain knows about
If Claude calls list_entities and reports user, order, etc., you're done. If not, see Troubleshooting.
After the wizard, schemabrain inspect shows what the agent has and schemabrain tail streams every tool call live — see docs/operations.md.
The firewall
Six properties SchemaBrain enforces at the SQL boundary today:
1. Read-only by architecture, not configurationThe MCP surface exposes twelve tools — none of which can write. There is no |
2. PII-aware refusal at the
|
3. Tamper-evident audit logEvery tool call under schemabrain audit verify # exit 0 = chain clean
|
4. Failure is a contract, not a stringRefused and degraded calls return a structured {
"status": "refused",
"kind": "ambiguous_time_dimension",
"recovery": {
"suggested_tool": "get_metric",
"suggested_args": {"time_dimension": "order.placed_at"}
}
}
|
5. Compile path: definitions → parameterized SQLEntities, metrics, and canonical joins compile to parameterized SQL SchemaBrain runs on its side. The agent sees rows + the SQL that was run — never arbitrary statements at your database. LLM-suggested definitions during |
6. Pluggable into any agent loopThe same MCP stdio surface Claude Desktop sees is exposed to any host that speaks MCP — your own Anthropic, OpenAI, or LangGraph loop included. python examples/anthropic_demo.py \
--url-env DATABASE_URL \
--question "..."
~$0.005–0.02 per run on Haiku 4.5, bounded by |
Sample session
Real Claude Desktop session against the bundled e-commerce fixture (7 tables, 30 columns, 6 entities curated for ~$0.03):
You: Using SchemaBrain, write me a SQL query to compute each customer's total spend by product category.
Claude: (calls
describe_entityfororder_item, then chainsresolve_joinalonguser → order → order_item → product. Each FK-derived hop returnsinference_method: fk_constraint/confidence: HIGH. The last hop —resolve_join(product, category)— refuses:){ "status": "error", "error": { "kind": "no_canonical_join", "message": "No canonical join exists between 'product' and 'category'...", "recovery": { "suggested_tool": "suggest_joins", ... } } }(Claude verifies at the physical layer —
describe_table(public.products)returns zero foreign keys;describe_table(public.categories)has onlyid, name, parent_idself-ref;find_relevant_tablesfor "junction" returns empty.)Verdict up front: I can't write that query correctly, and you shouldn't want me to — the data model can't answer it as it stands.
producthas nocategory_idand no junction table exists. So there is no relationship anywhere in the schema between a product and a category. "Total spend by category" is unanswerable until that link exists — and I'm not going to invent it.What I can give you — total spend per customer (this path is fully resolved:
user → order → order_item, allfk_constraintjoins,confidence: HIGH):SELECT u.id AS user_id, u.full_name AS customer, -- ⚠ pii / contact / redacted: true in the semantic layer SUM(oi.quantity * oi.unit_price_cents) / 100.0 AS total_spend -- prices stored in integer cents FROM public.order_items oi JOIN public.orders o ON oi.order_id = o.id JOIN public.users u ON o.user_id = u.id GROUP BY u.id, u.full_name ORDER BY total_spend DESC;Two honest caveats:
- Status filter omitted.
orders.statusis free TEXT and I haven't inspected its values; this counts every order. If you only want paid/fulfilled spend, addWHERE o.status = …once you confirm the allowed values.- PII flag.
full_nameispii_sensitivity: pii / categories: [contact] / redacted: truein the semantic layer. Fine if you're running this yourself; not fine if it's flowing into an agent context.
The differentiator is what didn't happen: most LLM-over-database tools, asked this exact question, would confidently emit JOIN categories c ON p.category_id = c.id and produce SQL that errors against a column that isn't there. SchemaBrain refused — resolve_join returned kind: no_canonical_join with recovery.suggested_tool: suggest_joins, not prose. The agent acted on the structured recovery contract programmatically instead of fabricating a join. Refusal-not-fabrication is the safety mechanism, demonstrated live.
Cost. $0.001/column with Claude Haiku 4.5 + Sonnet 4.6 (Sonnet for the structured curation prompt). The bundled 7-table fixture (30 columns, 6 entities + 10 metrics + 5 joins) indexes + curates for **$0.03 in ~85s**. The Pagila DVD-rental sample (87 columns after partition deduplication) runs for $0.0299 in 105s. Re-indexing an unchanged schema is $0 — content-addressable fingerprinting skips the LLM call entirely.
To verify Claude's SQL is mechanically correct (and that flagged caveats are the actual data behavior), see Validating SQL Claude generates.
Run this exact session yourself: schemabrain init walks you to a wired Claude Desktop in one command; then ask Claude "Using SchemaBrain, write me a SQL query to compute each customer's total spend by product category." and watch the refuse-then-pivot live.
Where it's going
SchemaBrain is being built as the SQL-boundary safety layer for AI agents — the layer that parses what your agent is about to ask the database and refuses (or rewrites) before it runs.
That layer needs a semantic substrate underneath it. You can't refuse "this query touches PII" without knowing which columns are PII. You can't rewrite "join through this junction" without canonical-join definitions. You can't validate a metric without knowing its grain.
So the engineering order is schema intelligence → semantic substrate → safety primitives. The first two are shipped (v0.5 + v1); the third — validate_query for agent-emitted SQL and execute with hard caps — is the next major milestone. Today the product gives you PII-aware refusal at the get_metric boundary, structured recovery on every refused or degraded call, and tamper-evident audit — all running against parameterized SQL the agent never sees. If you need parse-before-execute over arbitrary agent-emitted SQL, track the roadmap.
Roadmap
The
v0.5/v1/v2/v3labels are roadmap milestone names, not package versions. The package follows strict semver —1.0.0is reserved for an API that's been battle-tested by external users without a forced break. See ADR-0003.
v0.5 — schema intelligence (shipped):
- Agent-UX charter v1.0 retrofit on existing tools + CI enforcement ✓
- Dev-UX foundations: rich progress UI, guided errors,
--dry-run✓ - Query log mining via
pg_stat_statements(schemabrain mine-queries) ✓ - 5 physical-schema MCP tools including
get_example_queries✓
v1 — semantic substrate (shipped):
- Entities, metrics, canonical joins as first-class persisted definitions ✓
- LLM-suggested entity / metric / join definitions from FK graph + column descriptions ✓
- 5 semantic-layer MCP tools (
find_relevant_entities,list_entities,describe_entity,resolve_join,get_metric) ✓ - Composite-expression measures —
SUM(unit_price * quantity)over the same anchor table ✓ - Multi-hop canonical-join chains — BFS over the join graph with
via=disambiguation ✓ - Drift detection (
schemabrain check) ✓ - PII-aware refusal at the
get_metricboundary ✓ - Tamper-evident audit log with sha256 chain ✓
v1.x — engine breadth (in progress):
- One additional engine: Snowflake / BigQuery / MySQL
- BIRD Mini-Dev automated eval harness
- Pre-built multi-platform Docker image on a public registry
v2 — SQL-boundary safety wedge:
validate_query— agent-emitted SQL parsed and judged against policy before executionexecutewith hard caps — read-only role enforced at the database layer, statement timeouts, row caps, per-call cost guards- Sub-query refusal with recovery — parse the SQL, identify the unsafe fragment, refuse just that fragment with a suggested rewrite
v3 — multi-engine + control plane (commercial, gated on hosted demand):
- Remaining engines (BigQuery / Snowflake / Redshift breadth)
- Learning loop from telemetry and reformulation patterns
- Hosted control plane with fleet-wide adversarial-signature aggregation
Troubleshooting
The five most common first-run failures. Full troubleshooter in docs/setup.md.
pip install schemabraingave me an older version. Checkschemabrain --version. If it doesn't match the latest release your pip cache is stale — runpip install --upgrade schemabrain.schemabrain initwrites the same version into the Claude Desktop snippet (uvx schemabrain==<pin>) so it stays reproducible across restarts; bump the pin in the snippet manually after upgrading via pip.initreportssource unreachable. Postgres may not be ready on first run — wait a few seconds and re-run. For your own database, verify host, port, and credentials. Connection URLs in any form are accepted (postgresql://,postgres://,postgresql+psycopg://).- The first
initorschemabrain indexhangs for ~60 seconds. Normal. The first index downloads the ONNX embedding model (~67 MB) and makes one LLM call per column. Subsequent runs are fast. initfails at stage 6 "wire host". Claude Desktop must be installed first — SchemaBrain writes into its config file, which doesn't exist until Claude Desktop has launched at least once.- Claude Desktop doesn't show SchemaBrain after restart. Cmd+Q is required (close-window doesn't trigger a re-read of MCP config). Run
schemabrain doctorto verify the config landed. Ifdoctorsays everything's good but Claude Desktop still doesn't see the tool, check~/Library/Logs/Claude/mcp*.log.
Documentation
| Doc | What's inside |
|---|---|
docs/setup.md |
Activation wizard, Claude Desktop / Code / Cursor wiring, Anthropic SDK demo, troubleshooting, validating Claude's SQL |
docs/semantic-layer.md |
Building entities, metrics (incl. composite expressions), canonical joins (incl. multi-hop), dbt import |
docs/operations.md |
inspect, check (drift), index --dry-run, Docker compose |
docs/observability.md |
tail, audit log, OTel export, PII classification |
docs/mcp-tools.md |
Full reference for all 12 MCP tools |
docs/architecture.md |
Pipeline, retrieval contract, cache logic, cost model, eval |
docs/landscape.md |
Comparison vs Vanna / Atlan / dbt-mcp / WrenAI; "is this a semantic layer?" |
docs/threat-model.md |
Security model + boundaries |
docs/adr/ |
Architecture decision records (audit/PII taxonomy, store protocol, versioning policy, observability bus) |
examples/ |
Copy-paste-ready MCP configs, headless agent loop, end-to-end ecommerce walkthrough |
FAQ
Does my data leave my machine?
Only LLM-enriched column descriptions and the redacted sample values that feed them. Three regex passes (email, US SSN, credit-card-shaped digit runs) run on every sample before it leaves the profiler module — see schemabrain/profiler/stats.py. The Anthropic API call sends column metadata + redacted samples + sibling-column context — no raw rows. Embeddings are generated locally via fastembed (BAAI/bge-small-en-v1.5, ONNX, ~67 MB).
What databases work today?
Postgres 16+ (primary target) and SQLite (for development and demos). Adding Snowflake / BigQuery / MySQL is mostly a new DataSource implementation plus a profiler tweak — on the v1.x roadmap.
Why MCP and not a REST API? The consumer is an agent, not a service. MCP standardizes tool registration, schema description, and request/response transport. Agents discover SchemaBrain natively and get its tool surface — no API wrapper, no SDK to maintain per language.
Is this a semantic layer like Cube or dbt Semantic Layer?
Partially. SchemaBrain ships entities, metrics, and canonical joins as first-class persisted definitions today — agents call them via list_entities, describe_entity, resolve_join, get_metric. But the semantic layer isn't the headline; it's the substrate that makes the SQL-boundary safety primitives possible. Full comparison vs Cube / dbt-mcp / Vanna / WrenAI in docs/landscape.md.
More questions answered in docs/setup.md (why local embeddings, more troubleshooting).
Contributors
Contributing & License
PRs welcome. The bar is high — see CONTRIBUTING.md for the test-first / 99%-coverage / conventional-commits / architecture-invariants checklist. CI enforces all of it.
Bugs and feature requests use the structured templates in .github/ISSUE_TEMPLATE/. Issues without a reproduction (bugs) or a clear underlying problem (features) get closed with a request to re-open with the right info.
MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file schemabrain-0.4.0.tar.gz.
File metadata
- Download URL: schemabrain-0.4.0.tar.gz
- Upload date:
- Size: 10.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b245b7abbfe8eda43585ea962b7c20d6b46ad6782a9ce76e1485d589af9589e3
|
|
| MD5 |
97771478086fd53a6d30bb3df9fe51ee
|
|
| BLAKE2b-256 |
6a97dbf0dfc8d21dbfa51dc4090a072556abdcff55bf8bf60f8d6a014336cd78
|
Provenance
The following attestation bundles were made for schemabrain-0.4.0.tar.gz:
Publisher:
publish.yml on Arun-kc/schemabrain
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
schemabrain-0.4.0.tar.gz -
Subject digest:
b245b7abbfe8eda43585ea962b7c20d6b46ad6782a9ce76e1485d589af9589e3 - Sigstore transparency entry: 1629255740
- Sigstore integration time:
-
Permalink:
Arun-kc/schemabrain@28a30f53760953c1c4136be15af41ab7101fae1c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Arun-kc
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@28a30f53760953c1c4136be15af41ab7101fae1c -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file schemabrain-0.4.0-py3-none-any.whl.
File metadata
- Download URL: schemabrain-0.4.0-py3-none-any.whl
- Upload date:
- Size: 625.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d77599bb3c63deb4d90411b0071692099c577955cb994477afaad2e572d28390
|
|
| MD5 |
35a1bf57c3347adca6328703d781965d
|
|
| BLAKE2b-256 |
80ff4e0e737a159eecf7e030a3ceaba84f15047c98155d8a2c0c819e5451e585
|
Provenance
The following attestation bundles were made for schemabrain-0.4.0-py3-none-any.whl:
Publisher:
publish.yml on Arun-kc/schemabrain
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
schemabrain-0.4.0-py3-none-any.whl -
Subject digest:
d77599bb3c63deb4d90411b0071692099c577955cb994477afaad2e572d28390 - Sigstore transparency entry: 1629255755
- Sigstore integration time:
-
Permalink:
Arun-kc/schemabrain@28a30f53760953c1c4136be15af41ab7101fae1c -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Arun-kc
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@28a30f53760953c1c4136be15af41ab7101fae1c -
Trigger Event:
workflow_dispatch
-
Statement type: