An MCP server that gives AI agents deep semantic understanding of any production database.
Project description
Schema Brain
The SQL-boundary safety layer for AI agents that touch real databases. Schema intelligence and LLM-enriched semantics today; validate-before-execute, PII-tagged refusal, and sub-query rewrite landing in v2.
Status: 0.2.0a1 (alpha preview). Postgres + SQLite supported today. Snowflake / BigQuery / MySQL on the v1 roadmap. APIs may change before v1 — pin the version (pip install schemabrain==0.2.0a1) if you need stability.
The problem
AI agents fail when querying real production databases:
- Schemas don't fit in context — a 300-table schema is 50k+ tokens of
CREATE TABLEalone. - Column names are cryptic —
acct_dim_v3,pmt_fct_h,cust_id_v2_legacy. - Joins aren't obvious — which FK is the "right" one when there are three?
- Data has shapes —
statuscould be 5 enum values, 50, or a free-text mess.
Schema Brain fixes all four and serves the result through a stable MCP tool surface that any agent (Claude Desktop, Anthropic SDK, custom) can call.
The bigger problem behind these — database MCPs running as the credentialed role, prompt injection escalating to SQLi (Anthropic Postgres MCP's published NPM/Docker artifacts shipped an unpatched SQL injection at archival per Datadog Security Labs; Supabase MCP enables data exfil under documented conditions), no PII-aware refusal at the SQL boundary — is what Schema Brain is being built to address at the SQL-boundary safety layer in v2. The schema intelligence shipping today is the substrate that layer needs. See Where this is going.
What it does
- Indexes your database schema, profiles each column, and generates a one-paragraph LLM description per column (Claude Haiku 4.5 by default; Sonnet 4.6 for cryptic abbreviations).
- Embeds the descriptions locally with
BAAI/bge-small-en-v1.5viafastembed— no second API vendor. - Stores everything in a single SQLite file. No Qdrant, no Redis, no ops.
- Serves five MCP tools:
find_relevant_tables,describe_table,describe_column,suggest_joins,get_example_queries. Every response includes a token estimate so agents can budget context. - Mines observed queries from
pg_stat_statementssoget_example_queriesreturns the SQL agents (or humans) have actually run against your tables — not invented examples.
Where this is going
Schema Brain is being built as the SQL-boundary safety layer for AI agents — the layer that parses what your agent is about to ask the database and refuses (or rewrites) before it runs.
That layer needs a semantic substrate underneath it. You can't refuse "this query touches PII" without knowing which columns are PII. You can't rewrite "join through this junction" without canonical-join definitions. You can't validate a metric without knowing its grain.
So the engineering order is schema intelligence → semantic substrate → safety primitives:
- v0 / v0.5 — schema intelligence (shipping now): schema introspection, LLM-enriched column descriptions, embedding retrieval, query-log mining via
pg_stat_statements, and 5 MCP tools includingget_example_queriesreturning observed SQL. - v1 — semantic substrate: entities, metrics, canonical joins as first-class persisted definitions. LLM-suggested from observed data; user-confirmed in YAML.
- v2 — safety wedge: PII-tagged refusal,
validate_querybefore execute,executewith row/cost/timeout caps, sub-query refusal with recovery (parse agent SQL, refuse just the unsafe fragment with a suggested rewrite). No shipped competitor as of mid-2026.
Today the product is schema intelligence. The safety layer is the trajectory, not a current claim. If you need safety primitives now, this isn't ready yet — track the v2 roadmap.
How it compares
The open-source landscape thinned in 2026: Vanna's public repo was frozen as the project went commercial, and the reference Postgres MCP server was archived in 2025 with no first-party successor named. The live landscape today:
| Project | License | First-party MCP | Status |
|---|---|---|---|
| Schema Brain | MIT | ✅ | Active — 0.2.0a1 alpha |
| Vanna AI | MIT (repo frozen) | ❌ | OSS archived 2026-03; project moved commercial (Vanna 2.0 / Cloud / Enterprise) |
| Reference Postgres MCP | MIT | ✅ | Archived 2025-05; no first-party successor named |
| Atlan | Closed-source | ✅ | SaaS-only, enterprise pricing |
| dbt-mcp | Apache-2.0 | ✅ | Active — requires a dbt project |
| WrenAI | Apache-2.0 | ❌ (roadmap) | Active — uses MDL modeling layer |
Schema Brain sits where none of these cover cleanly: OSS + MIT + first-party MCP + no modeling layer required + introspects a live Postgres in one Python process + mines pg_stat_statements to surface observed SQL as agent context.
The longer-term position is the SQL-boundary safety layer for agents (see Where this is going). None of the projects above operate at the parse-agent-SQL-and-judge-against-policy layer; that's the v2 wedge.
What it looks like in practice
Real Claude Desktop session, validated 2026-05-11 against the bundled e-commerce fixture (6 tables, 24 columns, indexed for $0.0074):
You: Using Schema Brain, write me a SQL query to compute each customer's total spend by product category.
Claude: (calls
describe_tableon users, products, orders, order_items, product_categories — in parallel)Here's the query. Schema is fully verified via Schema Brain.
SELECT u.id AS customer_id, u.full_name, c.name AS category_name, SUM(oi.quantity * oi.unit_price_cents) / 100.0 AS total_spend FROM public.users u JOIN public.orders o ON o.user_id = u.id JOIN public.order_items oi ON oi.order_id = o.id JOIN public.products p ON p.id = oi.product_id JOIN public.product_categories pc ON pc.product_id = p.id JOIN public.categories c ON c.id = pc.category_id GROUP BY u.id, u.full_name, c.name;Things you should know:
- Facts (from the schema): Prices live on
order_items.unit_price_cents, notorders.product_categoriesis a many-to-many composite-PK join table.- Assumptions: No status filter applied —
statusis free TEXT with default'pending'.categories.parent_idexists (nested categories) but I'm grouping by leaf — say if you want a recursive CTE for roll-up.- Caveat: A product in N categories has its spend counted N times (per category). Per-customer cross-category sums will exceed actual spend if products are multi-category in your data.
That's senior-engineer-grade output. None of the caveats — M:N double-counting, recursive-CTE awareness, free-text-status flag — is hardcoded; it falls out of letting Claude reason over the indexed descriptions. To verify Claude's SQL is mechanically correct (and that flagged caveats are the actual data behavior), see Validating SQL Claude generates.
And when the answer doesn't exist
You: Using Schema Brain, where do I find payment processor transaction IDs?
Claude: Short answer: they're not in this database — at least not in anything Schema Brain has indexed.
Most likely: payments live in a different system (Stripe/Adyen dashboard), or there's a payments table that wasn't indexed, or payment IDs are stored externally and only
orders.statusreflects the result.What to do next:
SELECT table_schema, table_name FROM information_schema.tables WHERE table_name ILIKE '%payment%'to confirm.
The qualifier — "at least not in anything Schema Brain has indexed" — matters. Most LLM-over-database tools confidently invent a payments table or shoehorn the answer into orders.total_cents. Schema Brain doesn't.
Cost. ~$0.0003/column with Claude Haiku 4.5. The bundled 6-table fixture indexes for $0.0074 in 38s; the Pagila DVD-rental sample (87 columns after partition deduplication) for $0.0299 in 105s. Re-indexing an unchanged schema costs $0 — content-addressable fingerprinting skips the LLM call entirely.
Quickstart
Five minutes from pip install to a working Claude Desktop integration. Three caveats up front — they tripped real users:
| Gotcha | Fix |
|---|---|
psql is not on macOS by default |
We use docker exec -i sb-pg psql ... instead — runs psql inside the postgres container, no host install needed |
pip install schemabrain and the first schemabrain index are each silent for ~30–60s |
Don't kill them. pip resolves ~75 wheels; the first index downloads the ONNX embedding model (~67 MB) and makes 24 LLM calls. Progress bars land in v0. |
ANTHROPIC_API_KEY propagation |
Run export ANTHROPIC_API_KEY=sk-ant-... in the same terminal you'll run index from |
1. Install
pip install schemabrain
Or from source if you want to hack on it:
git clone git@github.com:Arun-kc/schemabrain.git
cd schemabrain && uv sync --extra dev
2. Boot Postgres + apply the bundled fixture (or point at your own DB)
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=local --name sb-pg postgres:16-alpine
docker exec -i sb-pg psql -U postgres -d postgres \
< $(schemabrain fixture-path ecommerce.sql)
For your own database, skip docker and use your real postgresql+psycopg:// URL.
3. Index it
export ANTHROPIC_API_KEY=sk-ant-...
export DATABASE_URL="postgresql+psycopg://postgres:local@localhost:5432/postgres"
schemabrain index --url-env DATABASE_URL --store-path ./schemabrain.db
--url-env keeps the password out of ps, shell history, and journald.
The older schemabrain index "<url>" form still works for backwards
compatibility, but emits a deprecation warning when the URL contains a
password.
Expect ~30–60 seconds of silence on the first run, then:
Indexed 6 table(s): 6 changed, 0 unchanged, 0 removed.
Columns: +24/~0/-0. LLM: 24 descriptions ($0.0074). Embeddings: 24
4. Wire into Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS):
{
"mcpServers": {
"schemabrain": {
"command": "/ABSOLUTE/PATH/TO/.venv/bin/schemabrain",
"args": [
"serve",
"--url-env",
"DATABASE_URL",
"--store-path",
"/ABSOLUTE/PATH/TO/schemabrain.db"
],
"env": {
"DATABASE_URL": "postgresql+psycopg://postgres:local@localhost:5432/postgres"
}
}
}
}
Both paths must be absolute. Quit Claude Desktop fully (Cmd+Q) and relaunch. The 🔌 tools panel should now show "schemabrain" with 4 tools.
Or for Cursor: drop the same mcpServers block into ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-scoped). Template at examples/cursor_mcp_config.example.json. Restart Cursor.
For the headless Anthropic-SDK path, see examples/anthropic_demo.py and docs/setup.md.
Roadmap
v0.5 — finish schema intelligence (shipped):
- Agent-UX charter v1.0 retrofit on existing tools + CI enforcement ✓
- Dev-UX foundations: rich progress UI, guided errors,
--dry-run✓ - Query log mining via
pg_stat_statements(schemabrain mine-queries) ✓ - 5th MCP tool:
get_example_queries— returns real SQL from your query log matching agent intent ✓
v1 — semantic substrate:
- Entities, metrics, canonical joins as first-class persisted definitions
- LLM-suggested entity/metric definitions from existing column descriptions + FK graph (the wedge: Cube/dbt require multi-week hand-authoring; Schema Brain collapses bootstrap to ~30 min)
- BIRD Mini-Dev automated eval harness
- Drift CLI:
schemabrain reindex --diff - One additional engine: Snowflake / BigQuery / MySQL
- Typer + rich CLI migration
v2 — SQL-boundary safety wedge:
- PII tagging beyond pattern redaction (column-level classification, agent-visible refusal at the tool boundary)
validate_query— agent-emitted SQL parsed and judged against policy before executionexecutewith hard caps — read-only Postgres role enforced at the database layer (not just SQL string inspection), statement timeouts, row caps, per-call cost guards- Sub-query refusal with recovery — parse the SQL, identify the unsafe fragment, refuse just that fragment with a suggested rewrite or alternative-tool call
- Append-only
mcp_auditlog + response provenance on every tool call
v3 — multi-engine + control plane (commercial, gated on hosted demand):
- Remaining engines (BigQuery / Snowflake / Redshift breadth)
- Learning loop from telemetry and reformulation patterns
- Hosted control plane with fleet-wide adversarial-signature aggregation (per-deployment refusal patterns propagate across tenants — Cloudflare-WAF model)
Documentation
docs/architecture.md— pipeline, retrieval contract, cache logic, cost model, eval, what's validateddocs/mcp-tools.md— full reference for the 4 MCP tools with example responsesdocs/setup.md— Claude Desktop wiring + Anthropic SDK demo, with troubleshootingCONTRIBUTING.md— dev setup, TDD expectations, conventional commits, architecture invariantsexamples/— copy-paste-ready Claude Desktop config + headless agent loop using the officialmcpPython SDK
FAQ
Does my data leave my machine?
Only LLM-enriched column descriptions and the redacted sample values that feed them. Three regex passes (email, US SSN, credit-card-shaped digit runs) run on every sample before it leaves the profiler module — see schemabrain/profiler/stats.py. The Anthropic API call sends column metadata + redacted samples + sibling-column context — no raw rows, no full result sets. Embeddings are generated locally via fastembed (BAAI/bge-small-en-v1.5, ONNX, ~67 MB).
Is this a semantic layer like Cube or dbt Semantic Layer?
Today, no — Schema Brain is schema intelligence (LLM-enriched descriptions + retrieval over your physical schema). Agents see schema.table.column, not entity.metric.
The semantic substrate (first-class entities like customer instead of public.users, metrics with grain + units, canonical joins as versioned definitions) lands in v1. But the semantic layer is the substrate, not the headline — it's what makes the v2 SQL-boundary safety primitives possible (refuse-by-PII-tag, validate-before-execute, sub-query refusal). If you already run dbt or Cube, Schema Brain will complement them at the safety layer rather than replace them at the semantic layer; if you don't, the v1 substrate is generated for you (LLM-suggested, user-confirmed).
What databases work today?
Postgres 16+ (primary target) and SQLite (for development and demos). Adding Snowflake / BigQuery / MySQL is mostly a new DataSource implementation plus a profiler tweak — on the v1 roadmap.
Why MCP and not a REST API? The consumer is an agent, not a service. MCP standardizes tool registration, schema description, and request/response transport. Agents (Claude Desktop, the Anthropic SDK, custom ones) discover Schema Brain natively and get four tools — no API wrapper, no SDK to maintain per language.
Why local embeddings instead of OpenAI / Voyage? One LLM provider (Anthropic) and one local vector model is simpler than two API vendors. Embeddings change rarely, the model is bounded (one short description per column), and ~30 ms per query embed on a laptop is fast enough. Local-first also means you can index a private schema without exposing it to a second vendor.
Contributing
PRs welcome. The bar is high — see CONTRIBUTING.md for the test-first / 99%-coverage / conventional-commits / architecture-invariants checklist. CI enforces all of it.
Bugs and feature requests use the structured templates in .github/ISSUE_TEMPLATE/. Issues without a reproduction (bugs) or a clear underlying problem (features) get closed with a request to re-open with the right info.
License
MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file schemabrain-0.2.0a1.tar.gz.
File metadata
- Download URL: schemabrain-0.2.0a1.tar.gz
- Upload date:
- Size: 626.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
db7ee650802303cf9739113420a788953c9235dd8e75299574e38aa17ffec415
|
|
| MD5 |
27689efd5abd7edac753eb3d2487a5c5
|
|
| BLAKE2b-256 |
f33b8102633e9d068bd4fffb75f744fcf6e6be7365fefb4d8d9ab2af30d98e91
|
File details
Details for the file schemabrain-0.2.0a1-py3-none-any.whl.
File metadata
- Download URL: schemabrain-0.2.0a1-py3-none-any.whl
- Upload date:
- Size: 156.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.12 {"installer":{"name":"uv","version":"0.11.12","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3a3f27b46bb945336758829920ccb3ba368d1462af615e8492b4c016ce88a0d6
|
|
| MD5 |
08fd5c7333e31b3a1f21029c95946aec
|
|
| BLAKE2b-256 |
450ee6a8e646523798be290a2dfa7c302efda1d924720ce3d4d8671130a0596e
|