The SQL firewall between AI agents and your production database.
Project description
Works with Claude Desktop · Claude Code · Cursor · Windsurf · any MCP host
Stop giving AI agents raw database connection strings.
SchemaBrain is the SQL firewall between AI agents and your production database — twelve read-only tools, validated metrics, tamper-evident audit.
The agent never writes SQL. SchemaBrain does, from definitions you control.
Three guarantees that close the trust gap between AI agents and your database:
- Read-only by architecture — twelve MCP tools, none of which can write. No
execute()tool, noquery()tool, no path from agent prompt to a write at your database. - PII refusal at retrieval — PII tags propagate from the physical schema through joins and metrics. If a query touches a blocked category, SchemaBrain refuses before the database is queried.
- Cryptographic audit chain — every call, refusal, and recovery lands in a SHA256-hashed append-only log.
audit verifyexits non-zero if any past row was rewritten.
pip install schemabrain
schemabrain init
# then: Cmd+Q Claude Desktop, relaunch, and ask: "list the entities SchemaBrain knows about"
Cost: ~$0.03 to index the bundled demo · $0 to re-index unchanged schemas. Detail in Sample session.
Status: 0.4.0 (alpha). Postgres + SQLite supported today. Snowflake / BigQuery / MySQL on the roadmap.
Contents
- Quickstart — 3 steps from
pip installto a working agent - The firewall — what SchemaBrain enforces at the SQL boundary
- Observability dashboard — read-only UI for PII, refusals, audit
- Sample session — real Claude Desktop interaction against the bundled fixture
- Where it's going — honest disclaimer about what's not built yet
- Roadmap — shipped + in progress + planned
- Troubleshooting — the five most common first-run failures
- Documentation — deeper guides
Read next based on what you need:
| Goal | Where to go |
|---|---|
| Try it on the bundled fixture | Quickstart |
| Understand the firewall properties | The firewall |
| Wire up your MCP client | Claude Desktop · Claude Code · Cursor · Windsurf · ChatGPT (roadmap) |
| Plug into your own agent loop | docs/setup/manual.md |
| Build a semantic layer | docs/semantic-layer.md |
| Run in production (audit, drift, Docker) | docs/operations.md |
| Observe the agent (tail, audit log, OTel) | docs/observability.md |
| Compare with Querybear / Anthropic reference Postgres MCP | vs Querybear · vs Anthropic reference |
| Compare with Vanna / Atlan / dbt-mcp / WrenAI | docs/landscape.md |
Quickstart
Three steps from pip install to a working Claude Desktop integration. If you paste your own Postgres URL — no Docker needed, ~30s. Press Enter for the bundled demo and init invokes Docker + downloads a ~67 MB embedding model first time; ~45s once cached.
1. Install
pip install schemabrain
schemabrain --version
Source install (git clone + uv sync --extra dev) is documented in docs/setup.md.
2. Run the activation wizard
schemabrain init
init is a seven-stage wizard that takes you from "I have a Postgres database" to "Claude Desktop can answer questions about it" in one command. On first run it prompts for what it needs:
- A Postgres URL — paste your own connection string, or press Enter to spin up a local demo Postgres container with the bundled SaaS fixture (Docker is invoked automatically; idempotent on re-runs).
- An
ANTHROPIC_API_KEY— optional. Skip and the wizard still wires Claude Desktop. On the demo path, entities + metrics + joins are pre-curated from a bundled YAML pack — the semantic layer works zero-config. On your own database, entity curation can run later viaschemabrain entities suggest --applyonce you have a key.
SchemaBrain init — activation wizard
[1/7] Source check ✓ source reachable + read-only
[2/7] Index schema ✓ 12 tables, 84 columns indexed
[3/7] Curate entities ✓ 12 entities applied (bundled demo pack)
[4/7] Curate metrics ✓ 5 metrics applied (bundled demo pack)
[5/7] Curate joins ✓ 8 canonical joins applied (bundled demo pack)
[6/7] Wire host ✓ wrote schemabrain entry to claude_desktop_config.json
(default; switch with --host claude-code|cursor|windsurf|manual)
[7/7] Next ✓ restart your MCP host, then ask: "list the entities SchemaBrain knows about"
Full wizard reference (stages explained, flags, dbt auto-detection, --print-only for non-Claude-Desktop hosts, --no-entities / --no-metrics / --no-joins opt-outs, cost-cap pauses): docs/setup.md.
3. Restart Claude Desktop and ask
-
Quit Claude Desktop fully — Cmd+Q, not just close the window. The MCP config is only read on cold start.
-
Relaunch.
-
New conversation:
list the entities SchemaBrain knows about
If Claude calls list_entities and reports user, order, etc., you're done. If not, see Troubleshooting.
After the wizard, schemabrain inspect shows what the agent has and schemabrain tail streams every tool call live — see docs/operations.md.
The firewall
Six properties SchemaBrain enforces at the SQL boundary today:
1. Read-only by architecture, not configurationThe MCP surface exposes twelve tools — none of which can write. There is no |
2. PII-aware refusal at the
|
3. Tamper-evident audit logEvery tool call under schemabrain audit verify # exit 0 = chain clean
|
4. Failure is a contract, not a stringRefused and degraded calls return a structured {
"status": "refused",
"kind": "ambiguous_time_dimension",
"recovery": {
"suggested_tool": "get_metric",
"suggested_args": {"time_dimension": "order.placed_at"}
}
}
Structured recovery → · Trust signal → · Charter v1.2 reference → |
5. Compile path: definitions → parameterized SQLEntities, metrics, and canonical joins compile to parameterized SQL SchemaBrain runs on its side. The agent sees rows + the SQL that was run — never arbitrary statements at your database. LLM-suggested definitions during |
6. Pluggable into any agent loopThe same MCP stdio surface Claude Desktop sees is exposed to any host that speaks MCP — your own Anthropic, OpenAI, or LangGraph loop included. python examples/anthropic_demo.py \
--url-env DATABASE_URL \
--question "..."
~$0.005–0.02 per run on Haiku 4.5, bounded by |
Observability dashboard
SchemaBrain ships an opt-in read-only dashboard for the same audit + PII + refusal data the firewall is already writing.
pip install "schemabrain[ui]"
schemabrain dashboard
# → http://127.0.0.1:7878
Three surfaces:
- PII Ledger — entity-by-category matrix; catastrophic-leak markers identify which entities will trip the default
--pii-blockpolicy. - Refusals — every
refusedenvelope with its recovery hint, so you can see exactly what the agent was blocked from doing. - Audit Viewer — append-only
mcp_auditchain with hash-verify cursor; click any row to see the compiled SQL, parameters, and PII categories.
The dashboard binds 127.0.0.1 only — there is no --host flag, by design. It's read-only and reads the same SQLite store serve writes to. No agent talks to it.
Dashboard guide → · PII Ledger → · Refusals → · Audit Viewer →
Works with
SchemaBrain speaks the Model Context Protocol over stdio. schemabrain init --host <X> writes first-party config for four MCP clients; everything else that speaks MCP stdio works via --host manual (prints the snippet, you paste).
First-party wiring
schemabrain init --host <X> writes the MCP entry directly into the host's config file.
| Client | Setup guide | Config path |
|---|---|---|
| Claude Desktop | /setup/claude-desktop |
macOS: ~/Library/Application Support/Claude/claude_desktop_config.jsonWindows: %APPDATA%\Claude\claude_desktop_config.json |
| Claude Code | /setup/claude-code |
Shells out to claude mcp add |
| Cursor | /setup/cursor |
~/.cursor/mcp.json |
| Windsurf | /setup/windsurf |
~/.codeium/windsurf/mcp_config.json |
Any other MCP stdio host
schemabrain init --host manual prints the JSON entry to stdout — paste it into whatever host config you're using. Any client that launches a subprocess and speaks MCP stdio should work in principle; we have not exhaustively tested each. Common targets:
- Zed — full walkthrough at
docs/setup/zed.md - Codex CLI (working path for ChatGPT users) — full walkthrough at
docs/setup/codex.md - Cline (VS Code extension) — paste into the MCP server settings
- Continue — paste into
~/.continue/config.json - Your own agent loop — see
examples/anthropic_demo.pyfor a ~250-LOC Anthropic-SDK reference
The 12-tool surface, PII firewall, audit chain, and recovery contracts are transport-agnostic — any compliant stdio MCP client gets the same guarantees.
Agent frameworks
The same stdio MCP surface is reachable from any framework that can spawn an MCP server. The Anthropic SDK path is first-party-tested; the others work in principle if the framework's MCP integration speaks stdio.
- Anthropic SDK — first-party walkthrough at
docs/setup/manual.md; reference loop atexamples/anthropic_demo.py - LangChain / LangGraph — via
langchain-mcp-adapters - Pydantic AI — via its built-in MCP support
- CrewAI / AutoGen / Agno / custom loops — any framework with a stdio MCP client works in principle; we have not tested each
We don't ship per-framework adapters; the framework's standard MCP client is sufficient.
Not yet supported (cloud / HTTPS hosts)
SchemaBrain v0.4 ships stdio only — no HTTPS / SSE transport. Clients that require a cloud HTTPS endpoint do not work today:
- ChatGPT Connectors — see the honest gap page for workarounds and the v0.5+ roadmap
- Hosted MCP gateways — by design (local-first wedge; see vs Querybear)
If you need ChatGPT support today, a community stdio→HTTPS bridge (mcp-remote, mcp-proxy, etc.) may work; we have not validated any specific bridge against the PII / audit / recovery semantics.
Security posture for procurement / review →
Sample session
Real Claude Desktop session against the bundled SaaS fixture (12 tables, 84 columns, 12 entities):
You: Using SchemaBrain, compute our usage volume broken down by plan tier.
Claude: (calls
get_metric(name="usage_volume", group_by=["plan.title"]). The metric anchors onusage_event; reachingplanwould need a join that isn't in the schema, so it refuses:){ "status": "error", "error": { "kind": "unreachable_entity", "message": "entity 'plan' is not reachable from metric anchor 'usage_event'; no canonical join exists. Run `schemabrain joins suggest` to surface candidate joins.", "recovery": { "suggested_tool": "resolve_join", "suggested_args": { "entity_a": "usage_event", "entity_b": "plan" } } } }(Claude confirms at the join layer —
resolve_join("usage_event", "plan")returnskind: no_canonical_join.usage_eventscarries onlyworkspace_id, never aplan_id, and no table bridges the two.)Verdict up front: I can't compute usage-by-plan, and you shouldn't want me to fake it — there is no modeled relationship between usage events and plans in this schema.
usage_eventsrecords aworkspace_id, not a plan, and nothing joins the two. "Usage volume by plan tier" is unanswerable until that link exists — and I'm not going to invent it.What I can give you — contracted revenue by plan tier (this path fully resolves:
subscription_item → subscription → plan, bothfk_constraintjoins,confidence: HIGH):SELECT p.title AS plan_tier, SUM(si.unit_price_cents * si.seats) / 100.0 AS revenue -- prices stored in integer cents FROM public.subscription_items si JOIN public.subscriptions s ON si.subscription_id = s.id JOIN public.plans p ON s.plan_id = p.id GROUP BY p.title ORDER BY revenue DESC;
plan_tier revenue Enterprise $379,620.00 Pro $539.00 Free $54.00 Two honest caveats:
- Contracted, not collected. This is subscription line-item revenue (
seats × unit_price_cents), not paid invoices. For billed totals, thetotal_revenuemetric anchors oninvoicesinstead.- Integer cents.
unit_price_centsis stored as an integer; the/ 100.0converts to currency.
The differentiator is what didn't happen: most LLM-over-database tools, asked for usage-by-plan, would confidently emit JOIN plans p ON usage_events.plan_id = p.id against a plan_id column that doesn't exist. SchemaBrain refused — get_metric returned kind: unreachable_entity with recovery.suggested_tool: resolve_join, not prose. The agent acted on the structured recovery contract programmatically instead of fabricating a join. Refusal-not-fabrication is the safety mechanism, demonstrated live.
Cost. $0.001/column with Claude Haiku 4.5 + Sonnet 4.6 (Sonnet for the structured curation prompt). The bundled 12-table fixture (84 columns, 12 entities + 5 metrics + 8 joins) ships pre-curated, so the demo path applies it for $0 — no API key. Indexing those 84 columns with LLM column descriptions measured **$0.034**. The Pagila DVD-rental sample (87 columns after partition deduplication) runs for $0.0299 in 105s. Re-indexing an unchanged schema is $0 — content-addressable fingerprinting skips the LLM call entirely.
To verify Claude's SQL is mechanically correct (and that flagged caveats are the actual data behavior), see Validating SQL Claude generates.
Run this exact session yourself: schemabrain init walks you to a wired Claude Desktop in one command; then ask Claude "Using SchemaBrain, compute our usage volume broken down by plan tier." and watch the refuse-then-pivot live.
Where it's going
SchemaBrain is being built as the SQL-boundary safety layer for AI agents — the layer that parses what your agent is about to ask the database and refuses (or rewrites) before it runs.
That layer needs a semantic substrate underneath it. You can't refuse "this query touches PII" without knowing which columns are PII. You can't rewrite "join through this junction" without canonical-join definitions. You can't validate a metric without knowing its grain.
So the engineering order is schema intelligence → semantic substrate → safety primitives. The first two are shipped (v0.5 + v1); the third — validate_query for agent-emitted SQL and execute with hard caps — is the next major milestone. Today the product gives you PII-aware refusal at the get_metric boundary, structured recovery on every refused or degraded call, and tamper-evident audit — all running against parameterized SQL the agent never sees. If you need parse-before-execute over arbitrary agent-emitted SQL, track the roadmap.
Roadmap
The
v0.5/v1/v2/v3labels are roadmap milestone names, not package versions. The package follows strict semver —1.0.0is reserved for an API that's been battle-tested by external users without a forced break. See ADR-0003.
v0.5 — schema intelligence (shipped):
- Agent-UX charter v1.0 retrofit on existing tools + CI enforcement ✓
- Dev-UX foundations: rich progress UI, guided errors,
--dry-run✓ - Query log mining via
pg_stat_statements(schemabrain mine-queries) ✓ - 5 physical-schema MCP tools including
get_example_queries✓
v1 — semantic substrate (shipped):
- Entities, metrics, canonical joins as first-class persisted definitions ✓
- LLM-suggested entity / metric / join definitions from FK graph + column descriptions ✓
- 5 semantic-layer MCP tools (
find_relevant_entities,list_entities,describe_entity,resolve_join,get_metric) ✓ - Composite-expression measures —
SUM(unit_price * quantity)over the same anchor table ✓ - Multi-hop canonical-join chains — BFS over the join graph with
via=disambiguation ✓ - Drift detection (
schemabrain check) ✓ - PII-aware refusal at the
get_metricboundary ✓ - Tamper-evident audit log with sha256 chain ✓
v1.x — engine breadth (in progress):
- One additional engine: Snowflake / BigQuery / MySQL
- BIRD Mini-Dev automated eval harness
- Pre-built multi-platform Docker image on a public registry
v2 — SQL-boundary safety wedge:
validate_query— agent-emitted SQL parsed and judged against policy before executionexecutewith hard caps — read-only role enforced at the database layer, statement timeouts, row caps, per-call cost guards- Sub-query refusal with recovery — parse the SQL, identify the unsafe fragment, refuse just that fragment with a suggested rewrite
v3 — multi-engine + control plane (commercial, gated on hosted demand):
- Remaining engines (BigQuery / Snowflake / Redshift breadth)
- Learning loop from telemetry and reformulation patterns
- Hosted control plane with fleet-wide adversarial-signature aggregation
Troubleshooting
The five most common first-run failures. Full troubleshooter in docs/setup/manual.md.
pip install schemabraingave me an older version. Checkschemabrain --version. If it doesn't match the latest release your pip cache is stale — runpip install --upgrade schemabrain.schemabrain initwrites the same version into the Claude Desktop snippet (uvx schemabrain==<pin>) so it stays reproducible across restarts; bump the pin in the snippet manually after upgrading via pip.initreportssource unreachable. Postgres may not be ready on first run — wait a few seconds and re-run. For your own database, verify host, port, and credentials. Connection URLs in any form are accepted (postgresql://,postgres://,postgresql+psycopg://).- The first
initorschemabrain indexhangs for ~60 seconds. Normal. The first index downloads the ONNX embedding model (~67 MB) and makes one LLM call per column. Subsequent runs are fast. initfails at stage 6 "wire host". Claude Desktop must be installed first — SchemaBrain writes into its config file, which doesn't exist until Claude Desktop has launched at least once.- Claude Desktop doesn't show SchemaBrain after restart. Cmd+Q is required (close-window doesn't trigger a re-read of MCP config). Run
schemabrain doctorto verify the config landed. Ifdoctorsays everything's good but Claude Desktop still doesn't see the tool, check~/Library/Logs/Claude/mcp*.log.
Documentation
| Doc | What's inside |
|---|---|
docs/setup.md |
Activation wizard (recommended) — pick a host, run the wizard, ask the agent (~60s) |
docs/setup/docker.md |
Docker install (image with embedding model baked in, no first-run download) |
docs/setup/manual.md |
Manual index, mine-queries, logs config, troubleshooting, MCP Inspector, SQL-validation ladder |
docs/first-5-queries.md |
What to actually do after init — five queries that exercise read-only, PII refusal, audit chain, and structured recovery |
docs/semantic-layer.md |
Building entities, metrics (incl. composite expressions), canonical joins (incl. multi-hop), dbt import |
docs/operations.md |
inspect, check (drift), index --dry-run, Docker compose |
docs/observability.md |
tail, audit log, OTel export, PII classification |
docs/reference/mcp-tools/ |
Full reference for all 12 MCP tools (overview + 12 per-tool pages) |
docs/architecture.mdx |
Pipeline, retrieval contract, cache logic, cost model, eval |
docs/dashboard/overview.mdx |
Read-only observability dashboard — PII ledger, refusals, audit viewer |
docs/landscape.md |
Comparison vs Vanna / Atlan / dbt-mcp / WrenAI; "is this a semantic layer?" |
docs/threat-model.md |
Security model + boundaries |
docs/adr/ |
Architecture decision records (audit/PII taxonomy, store protocol, versioning policy, observability bus) |
examples/ |
Copy-paste-ready MCP configs, headless agent loop, end-to-end ecommerce walkthrough |
FAQ
Does my data leave my machine?
Only LLM-enriched column descriptions and the redacted sample values that feed them. Three regex passes (email, US SSN, credit-card-shaped digit runs) run on every sample before it leaves the profiler module — see schemabrain/profiler/stats.py. The Anthropic API call sends column metadata + redacted samples + sibling-column context — no raw rows. Embeddings are generated locally via fastembed (BAAI/bge-small-en-v1.5, ONNX, ~67 MB).
What databases work today?
Postgres 16+ (primary target) and SQLite (for development and demos). Adding Snowflake / BigQuery / MySQL is mostly a new DataSource implementation plus a profiler tweak — on the v1.x roadmap.
Why MCP and not a REST API? The consumer is an agent, not a service. MCP standardizes tool registration, schema description, and request/response transport. Agents discover SchemaBrain natively and get its tool surface — no API wrapper, no SDK to maintain per language.
Is this a semantic layer like Cube or dbt Semantic Layer?
No — SchemaBrain is a SQL firewall built on a semantic-layer substrate. Entities, metrics, and canonical joins are first-class persisted definitions (list_entities, describe_entity, resolve_join, get_metric), but they exist to make the safety primitives possible — read-only-by-architecture, PII refusal, audit chain. The substrate is the means; the firewall is the headline. Full comparison vs Cube / dbt-mcp / Vanna / WrenAI in docs/landscape.md.
More questions answered in docs/setup/manual.md (why local embeddings, more troubleshooting).
Contributors
Contributing & License
PRs welcome. The bar is high — see CONTRIBUTING.md for the test-first / 99%-coverage / conventional-commits / architecture-invariants checklist. CI enforces all of it.
Bugs and feature requests use the structured templates in .github/ISSUE_TEMPLATE/. Issues without a reproduction (bugs) or a clear underlying problem (features) get closed with a request to re-open with the right info.
MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file schemabrain-0.5.0-py3-none-any.whl.
File metadata
- Download URL: schemabrain-0.5.0-py3-none-any.whl
- Upload date:
- Size: 1.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9ccfd5a0c882044f501eba7e4ac02596a4cadd8e71c2a2ed8cd0c7ae181138a7
|
|
| MD5 |
f1cfbfff0c497ba3b9d09f7233ec6935
|
|
| BLAKE2b-256 |
711042a32bb64a48dc1363310e1bf768b20b2b7514c228a2f375b06cfc9192ad
|
Provenance
The following attestation bundles were made for schemabrain-0.5.0-py3-none-any.whl:
Publisher:
publish.yml on Arun-kc/schemabrain
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
schemabrain-0.5.0-py3-none-any.whl -
Subject digest:
9ccfd5a0c882044f501eba7e4ac02596a4cadd8e71c2a2ed8cd0c7ae181138a7 - Sigstore transparency entry: 1691080358
- Sigstore integration time:
-
Permalink:
Arun-kc/schemabrain@14759ae24b200c1b1afd8cbac249df3cce43c924 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Arun-kc
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@14759ae24b200c1b1afd8cbac249df3cce43c924 -
Trigger Event:
workflow_dispatch
-
Statement type: