The data layer for AI agents. Extract, transform, and expose data via MCP.
# Substrate
Canonicalize data from any source — APIs, databases, PDFs, agent outputs, ML models — into structured, joinable, traceable, portable bundles your AI agent can query.
Extract once, query forever, trace everything.
## The Thesis
Most useful data lives behind APIs, in databases, in PDFs, in agent transcripts, in model outputs — and most of it stays trapped there. Every Claude session that needs real data either hallucinates or scrapes something live. Slow, expensive, no provenance, different results every time.
Substrate is a small kernel that turns any of these sources into typed, canonical, joinable rows in a portable DuckDB bundle, with per-row provenance preserved through every transform. Three ops, two extensibility layers, one bundle format. Any agent can attach and query.
## The Four Properties
| Property | What it means |
|---|---|
| Structured | Typed rows, declared schemas, asserts that fail loudly |
| Joinable | Canonical-key registry types cross-source joins |
| Traceable | Per-row provenance (run_id, source, model, cost, sql_hash) follows every transform |
| Portable | DuckDB bundle is one file, attaches over HTTP, no infrastructure |
dbt has structured + joinable. Iceberg has structured + portable. Fivetran has structured. Substrate is the only thing that brings all four together.
## Install

```bash
pip install substrate-data
```
The PyPI distribution is `substrate-data`; the Python import and CLI remain `substrate`. The bare `substrate` package name is owned by an unrelated archived AI SDK.
Domain libraries are separate packages — install what you need:

```bash
pip install substrate-govdata   # 26 US agency connectors + 21 datasets
pip install substrate-llm       # LLM extraction recipes (OpenRouter / Anthropic)
pip install substrate-pdf       # PDF text + section extraction recipes
```
## Hello World (no external services required)
```python
from substrate import DuckDBStore
from substrate.connectors.csv_file import CSVConnector

csv = CSVConnector({
    "path": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
})

store = DuckDBStore("./hello.duckdb")
store.sync(csv, "titanic")
store.derive(
    "survival_by_class",
    "SELECT Pclass, AVG(CAST(Survived AS DOUBLE)) AS survival_rate "
    "FROM titanic GROUP BY Pclass ORDER BY Pclass",
    depends_on=["titanic"],
)

for row in store.query("SELECT * FROM survival_by_class"):
    print(row)
```
Then expose it to Claude:
```bash
substrate serve ./hello.duckdb
```
## The Kernel: Three Ops
- `sync` — external source → table (connectors live here)
- `transform` — table → table via callable (recipes live here)
- `derive` — SQL transform across tables (joins, canonicalize, project)

Plus three orchestrators: `materialize` runs the DAG, `publish` ships a bundle to a Registry (R2, GitHub Releases, local dir), and `pull`/`from_url` fetches one and ATTACHes it via DuckDB httpfs.
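The `materialize` DAG walk can be illustrated with a small standalone sketch — all names here are hypothetical, not substrate's actual implementation. Each step declares what it depends on, and a step runs only after every dependency has run:

```python
# Minimal illustration of a materialize-style DAG walk: execute each
# step only after everything it depends_on has executed.

def materialize(steps: dict[str, list[str]]) -> list[str]:
    """Return a valid execution order for steps keyed name -> depends_on."""
    order: list[str] = []
    done: set[str] = set()

    def run(name: str, chain: tuple[str, ...] = ()) -> None:
        if name in done:
            return
        if name in chain:
            raise ValueError(f"dependency cycle at {name!r}")
        for dep in steps[name]:
            run(dep, chain + (name,))
        order.append(name)  # the step "executes" once its deps are done
        done.add(name)

    for name in steps:
        run(name)
    return order

# Mirrors the Hello World pipeline: sync titanic, then derive from it.
plan = materialize({"titanic": [], "survival_by_class": ["titanic"]})
print(plan)  # ['titanic', 'survival_by_class']
```

The same walk generalizes to any mix of syncs, transforms, and derives, since all three ops declare their inputs.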
**Compatibility.** `extract_text`, `section_extract`, and `llm_extract` remain available as convenience methods. They are now wrappers over `transform`; the next package split moves their implementation into `substrate-pdf` and `substrate-llm`.
## Operational Stores vs Bundle Format
Do not confuse where a pipeline runs with what gets distributed.
| Concern | What it is for | Substrate contract |
|---|---|---|
| Execution store | Runs `sync`, `transform`, `derive`, asserts, and `materialize` | `ExecutionStore` |
| Bundle publisher | Ships a portable DuckDB artifact to agents, analysts, MCP/API consumers | `BundlePublisher` |
`DuckDBStore` satisfies both contracts today: it can run pipelines locally and publish the same file as a bundle. Flightdeck-style apps should keep Postgres as the operational store for concurrency, tenancy, RLS, jobs, and hot reads. That path is `PostgresStore`: it implements `ExecutionStore` against one schema in an application Postgres database, with an explicit export bridge that produces DuckDB bundles for distribution.

`pull()` and `from_url()` stay DuckDB-specific because they are consumer-side bundle conveniences, not requirements for every operational backend.
```python
from substrate import PostgresStore

store = PostgresStore(settings.DATABASE_URL, schema=f"tenant_{workspace_id}")
result = store.materialize(my_dataset)

# Operational tables live in Postgres; portable consumers still get DuckDB.
bundle = store.export_to_duckdb("./workspace_bundle.duckdb", overwrite=True)
```
## Two Extensibility Layers
### Connectors — for "I have a data source"
Implement three methods on `BaseConnector` and you have a substrate connector. The kernel handles state, manifests, retries with backoff, `fan_out` across parameter combos, partial-failure resumption, MCP exposure, and provenance.
```python
# StreamInfo import path assumed to sit alongside BaseConnector.
from substrate.connectors.base import BaseConnector, StreamInfo

class MyAPIConnector(BaseConnector):
    @classmethod
    def connector_type(cls) -> str:
        return "myapi"

    def list_streams(self):
        return [StreamInfo(name="orders"), StreamInfo(name="customers")]

    def get_stream_schema(self, stream):
        ...

    def extract_streaming(self, stream, stream_filters=None):
        # self.paginate is your own pagination helper, not a kernel API.
        for page in self.paginate(stream, stream_filters):
            yield from page
```
Built-in: `CSVConnector`, `PostgresConnector`, `RESTAPIConnector`. Domain packages add their own — `substrate-govdata` ships 26 US federal agency connectors with canonical join keys (FIPS, NAICS, year).
### Recipes — for "I have unstructured stuff that needs structure"
A recipe is a batch-capable callable that turns source rows into one or more output rows, with declared output schema and provenance facets. Simple per-row recipes subclass `RowRecipe`; cost-sensitive recipes can process batches directly. PDFs, OCR, LLM extraction, vision classification, agent loops, audio transcription — every modality becomes a recipe on the same `transform` op.
```python
from substrate import ColumnSpec, RowRecipe, TransformContext

class UpperTitle(RowRecipe):
    output_schema = [ColumnSpec("upper_title", "VARCHAR")]
    recipe_name = "examples.upper_title"

    def transform_row(self, row: dict, context: TransformContext) -> dict:
        return {"upper_title": row["title"].upper()}

store.transform(
    "upper_docs",
    source="docs",
    recipe=UpperTitle(),
    pass_through_columns=["id"],
)
```
Transform outputs get common provenance (`__substrate_recipe`, `__substrate_source_row_id`, `__substrate_input_hash`, etc.). Row or batch failures can be recorded in `_substrate_transform_errors`; error thresholds decide whether the transform blocks.
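How a threshold-gated transform might behave can be sketched in a few lines of standalone Python. This is illustrative only — hypothetical names, not the kernel's actual policy: failures are recorded per row instead of aborting, and the transform blocks only when the failure rate crosses a ceiling.

```python
# Illustrative sketch of a threshold-gated transform: collect per-row
# failures, then block only if the failure rate exceeds the ceiling.

def apply_with_threshold(rows, transform_row, max_error_rate=0.05):
    outputs, errors = [], []
    for i, row in enumerate(rows):
        try:
            outputs.append(transform_row(row))
        except Exception as exc:  # recorded, not raised immediately
            errors.append({"row_index": i, "error": str(exc)})
    rate = len(errors) / len(rows) if rows else 0.0
    if rate > max_error_rate:
        raise RuntimeError(f"transform blocked: {rate:.1%} of rows failed")
    return outputs, errors

rows = [{"title": "a"}, {"title": None}, {"title": "c"}]
out, errs = apply_with_threshold(
    rows, lambda r: {"upper_title": r["title"].upper()}, max_error_rate=0.5
)
print(len(out), len(errs))  # 2 1
```

With a tighter ceiling the same one-in-three failure rate would raise instead of returning, which is the "blocks" behavior described above.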
```python
from substrate import DuckDBStore
from substrate_pdf import extract_text
from substrate_llm import llm_extract

store.sync(pdf_connector, "pdfs")         # URLs / paths → table
extract_text(store, "pdfs", "acfr_text")  # PDF rows → text rows
llm_extract(                              # text rows → structured rows
    store, "acfr_text", "acfr_financials",
    prompt="Extract total_revenue_usd, total_expenses_usd, unfunded_opeb_usd.",
    schema={
        "total_revenue_usd": "DOUBLE",
        "total_expenses_usd": "DOUBLE",
        "unfunded_opeb_usd": "DOUBLE",
    },
)
```
Real example: `govdata/examples/acfr_extraction.py` extracts 56 financial fields from 168 municipal audit PDFs for $1.87 total. Every row traces to its source PDF page, LLM model, prompt hash, and cost.
## Provenance by Default
Every substrate-managed table carries provenance columns, but the exact column set depends on the operation that emitted the row:
Sync tables record source-call lineage:

- `__substrate_run_id`
- `__substrate_synced_at`
- `__substrate_source_stream`
- `__substrate_source_filter`
- `__substrate_connector_type`
Derive tables record SQL lineage:

- `__substrate_run_id`
- `__substrate_sql_hash`
- `__substrate_depends_on`
- `__substrate_derived_at`
- `__substrate_contributor`
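For a hash like `__substrate_sql_hash` to be useful as lineage, it has to be stable when the SQL is merely reformatted. One plausible construction — an assumption for illustration, since substrate may normalize differently — hashes whitespace-normalized, lowercased SQL:

```python
import hashlib

def sql_hash(sql: str) -> str:
    """Fingerprint SQL after collapsing whitespace and case, so that
    reformatting a query does not change its lineage hash.
    The normalization rule here is illustrative, not substrate's."""
    normalized = " ".join(sql.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

a = sql_hash("SELECT Pclass FROM titanic GROUP BY Pclass")
b = sql_hash("select pclass\n  from titanic\n  group by pclass")
print(a == b)  # True: same query, same hash
```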
Recipe outputs add recipe-specific facets (cost, model, prompt_hash, confidence, bbox, source page, extraction status, etc.) — declared by each recipe, no hand-maintenance.
```python
store.query("""
    SELECT __substrate_model, __substrate_cost_usd
    FROM acfr_financials
    LIMIT 1
""")
# [{'__substrate_model': 'anthropic/claude-sonnet-4', '__substrate_cost_usd': 0.011}]
```
## Datasets and Bundles
A `Dataset` is a declarative recipe — syncs + derives + asserts + canonical join keys. `materialize(dataset)` runs the whole DAG. `publish(dataset, registry)` ships a versioned bundle to R2 (or GitHub Releases, or a local dir) plus an `index.json` entry. Anyone with the URL can attach it from DuckDB without installing substrate:
```sql
INSTALL httpfs; LOAD httpfs;
ATTACH 'https://substrate.example.com/datasets/economy/state_panel/v1/store.duckdb'
    AS state_panel (READ_ONLY);
SELECT * FROM state_panel.state_panel WHERE year = 2023;
```
Or via the substrate library:
```python
from substrate import DuckDBStore

store = DuckDBStore.from_url("https://substrate.example.com/.../store.duckdb")
store.query("""
    SELECT period_label, value
    FROM fred_unemployment_rate
    ORDER BY date DESC
    LIMIT 5
""")
```
## Build Plane vs Serve Plane
Substrate has two separate hardening surfaces:
| Plane | Goal | Hardening that belongs here |
|---|---|---|
| Build plane | Create correct bundles | connector contracts, fan-out provenance, min-row checks, canonical-key conformance, join-key column contract checks, failed-assert blocking |
| Serve plane | Make bundles safe and easy to use | catalog metadata, raw vs curated, live vs draft vs broken, MCP attach/query workflow, API schema, honest status/quality signals |
Keeping these separate matters. A dataset can be a good recipe but not yet materialized locally. A bundle can be queryable but still draft. MCP should never hide those states: agents need to know both whether data is available now and whether the recipe has passed its build checks.
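That independence of the two planes can be made concrete: a bundle's serve-plane state and its build-plane check result are separate axes, and a catalog entry should expose both rather than collapse them. A hypothetical sketch (these names are illustrative, not the platform's API):

```python
from dataclasses import dataclass
from enum import Enum

class BundleState(Enum):  # serve-plane axis
    LIVE = "live"
    DRAFT = "draft"
    BROKEN = "broken"

@dataclass
class CatalogEntry:
    """What an MCP catalog might report: both axes, never just one."""
    name: str
    state: BundleState          # can I query it now?
    build_checks_passed: bool   # did the recipe pass its build checks?

    def queryable_now(self) -> bool:
        return self.state in (BundleState.LIVE, BundleState.DRAFT)

# A good recipe that is not yet materialized: checks pass, bundle broken.
entry = CatalogEntry("state_panel", BundleState.BROKEN, build_checks_passed=True)
print(entry.queryable_now(), entry.build_checks_passed)  # False True
```

Collapsing these into one boolean is exactly the dishonesty the text warns against: an agent would either trust a broken bundle or skip a perfectly good draft.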
### Real-data smoke checks
The repo includes smoke scripts that hit real upstream sources and exercise both planes:
```bash
# Build plane: materialize raw, heavy fan-out, and curated cross-agency bundles.
uv run python scripts/smoke_real_datasets.py

# Serve plane: start the platform, then call MCP with a real MCP client.
cd platform/backend
SUBSTRATE_PLATFORM_HOME=/tmp/substrate-e2e/platform \
  uv run uvicorn main:app --host 127.0.0.1 --port 8790

# In another terminal:
uv run python scripts/smoke_mcp_client.py \
  --url http://127.0.0.1:8790/mcp/mcp \
  --dataset census_acs1_state

# Operational Postgres path: run in a temp schema, export to DuckDB, query both.
SUBSTRATE_POSTGRES_TEST_DSN=postgresql://... \
  uv run python scripts/smoke_postgres_store.py
```
The default build smoke materializes `census_acs1_state`, `fred_macro_indicators`, and `state_panel`. The FRED tables use `period_label`, not `period`, as their canonical time key.
## Cross-Source Joins via Canonical Keys
The killer feature of the connector ecosystem: joins across sources are typed. govdata declares `state_fips`, `state_abbrev`, `county_fips`, `cbsa`, `year`, `naics`, `cik`, `ein`, etc. as canonical kinds. Any dataset declaring `JoinKey("state_fips", ...)` gets per-row IN-list conformance asserts for free, and joins to any other dataset claiming the same kind.
```python
JOINS_ON = {
    "bls_qcew_state": [
        JoinKey("state_fips", "state_fips", "VARCHAR"),
        JoinKey("year", "year", "INTEGER"),
    ],
}
```
The same pattern works for private-sector data — match Stripe customers to Salesforce accounts via email, match QBO vendors to HubSpot companies via EIN. Each domain package ships its own canonical-key registry.
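The per-row IN-list conformance assert amounts to checking every value of a declared join-key column against the canonical registry's allowed values. A standalone sketch — registry contents truncated and names hypothetical, not substrate's API:

```python
# Illustrative conformance check: every row's join-key value must
# appear in the registry's allow-list for that canonical kind.

CANONICAL_KINDS = {
    # Real two-digit state FIPS codes, truncated for the example.
    "state_fips": {"01", "02", "04", "05", "06"},
}

def assert_conforms(rows, column, kind):
    allowed = CANONICAL_KINDS[kind]
    bad = [r[column] for r in rows if r[column] not in allowed]
    if bad:
        raise AssertionError(
            f"{column}: {len(bad)} rows outside {kind} IN-list: {bad[:5]}"
        )

rows = [{"state_fips": "06"}, {"state_fips": "01"}]
assert_conforms(rows, "state_fips", "state_fips")  # passes silently
```

Because the check is keyed by kind rather than by dataset, any two datasets declaring the same kind are guaranteed to draw from the same value space, which is what makes the cross-source join safe.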
## CLI
```bash
substrate version
substrate serve <store.duckdb>             # MCP server over stdio
substrate tables <store.duckdb>            # list tables + row counts
substrate query <store.duckdb> "<sql>"     # run a SELECT/WITH query
substrate inspect <module:Class> <stream>  # preview a connector stream
substrate doctor [--package P]             # health-check installed datasets
```
`substrate serve` is what you point Claude Desktop at:
```json
{
  "mcpServers": {
    "substrate": {
      "command": "substrate",
      "args": ["serve", "/absolute/path/to/store.duckdb"]
    }
  }
}
```
## What Substrate Is NOT
- **Not dbt.** dbt is the SQL-transformation graph. Substrate's unique value is `sync` and `transform` — getting data IN from APIs, models, documents. dbt has zero LLM ops, no portable bundles, no canonical-key join registry. Use both: substrate produces tables, dbt transforms them.
- **Not Fivetran/Airbyte.** Those ship to your warehouse. Substrate ships to a portable DuckDB bundle. Different audience: agent builders and analysts who don't run a warehouse.
- **Not HuggingFace datasets.** HF is for ML training. Substrate is for analytical SQL/MCP queries.
- **Not Iceberg/Delta Sharing.** Those are table formats for already-existing data at scale. Substrate is end-to-end, opinionated about how data gets created.
- **Not LangChain document loaders.** Loaders are stateless converters with no provenance. Substrate's `transform` op is the durable, auditable, idempotent equivalent.
- **Not a hosted platform (yet).** Substrate is a library you install. Hosting is an option, not a requirement.
## Architecture
```
Connectors                     Recipes
(sync — APIs, DBs, PDFs)       (transform — LLM, OCR,
                                vision, agent loops, ML)
      │                              │
      ▼                              ▼
┌────────────────────────────────────┐
│   3-op kernel + ExecutionStore     │
│   sync → transform → derive        │
│   DuckDB today; Postgres path      │
│   (per-row provenance throughout)  │
└────────────────────────────────────┘
                  │
           export / publish
                  ▼
┌────────────────────────────────────┐
│   DuckDB bundle + Registry         │
│   index.json + versioned bundles   │
└────────────────────────────────────┘
                  │
            pull / from_url
                  │
                  ▼
      MCP server / DuckDB ATTACH
      Any agent queries with SQL
```
Three ops. Two layers (connectors, recipes). One bundle format. Provenance everywhere.
## Domain Libraries
| Package | What it adds | Status |
|---|---|---|
| `substrate-govdata` | 26 US agency connectors, 21 published datasets, FIPS/NAICS canonical keys | shipping |
| `substrate-acfr` | Municipal audit PDF extraction pipeline | shipping |
| `substrate-llm` | LLM extraction recipe (OpenRouter, Anthropic, OpenAI) | factoring out of kernel |
| `substrate-pdf` | PDF text + section extraction recipes | factoring out of kernel |
| `substrate-stripe` | Stripe connector + canonical schemas | planned |
| `substrate-salesforce` | Salesforce connector + canonical schemas | planned |
| `substrate-ocr` | Tesseract / Textract OCR recipes | planned |
| `substrate-vision` | CLIP / vision-model classification recipes | planned |
| `substrate-agent` | Tool-using agent loop recipes | planned |
Same kernel, same primitives, different sources and recipes.
## Where This Is Going
See `research/substrate_thesis_2026_04_28.md` for the full thesis and roadmap. The short version:
- Phase 0 — taxonomy hygiene (plumb `kind`/`health`, lint tag prefixes)
- Phase 1 — UI collapse (drop hardcoded packs, derive groupings from data)
- Phase 2 — kernel reshape (`transform` landed; extract LLM/PDF into separate packages next)
- Phase 3 — hierarchical dataset namespaces (`economy.state_panel` etc.)
- Phase 4 — first private-sector connector + first new modality recipe (proves the package ecosystem)
## License

MIT.