The data layer for AI agents. Extract, transform, and expose data via MCP.
# Substrate
Canonicalize data from any source — APIs, databases, PDFs, agent outputs, ML models — into structured, joinable, traceable, portable bundles your AI agent can query.
Extract once, query forever, trace everything.
## The Thesis
Most useful data lives behind APIs, in databases, in PDFs, in agent transcripts, in model outputs — and most of it stays trapped there. Every Claude session that needs real data either hallucinates or scrapes something live. Slow, expensive, no provenance, different results every time.
Substrate is a small kernel that turns any of these sources into typed, canonical, joinable rows in a portable DuckDB bundle, with per-row provenance preserved through every transform. Three ops, two extensibility layers, one bundle format. Any agent can attach and query.
## The Four Properties
| Property | What it means |
|---|---|
| Structured | Typed rows, declared schemas, asserts that fail loudly |
| Joinable | Canonical-key registry types cross-source joins |
| Traceable | Per-row provenance (run_id, source, model, cost, sql_hash) follows every transform |
| Portable | DuckDB bundle is one file, attaches over HTTP, no infrastructure |
dbt has structured + joinable. Iceberg has structured + portable. Fivetran has structured. Substrate is the only thing that brings all four together.
## Install

```bash
pip install substrate-data
```
The PyPI distribution is `substrate-data`; the Python import and CLI remain `substrate`. The bare `substrate` package name is owned by an unrelated archived AI SDK.
Domain libraries are separate packages — install what you need:

```bash
pip install substrate-govdata   # 26 US agency connectors + 21 datasets
pip install substrate-llm       # LLM extraction recipes (OpenRouter / Anthropic)
pip install substrate-pdf       # PDF text + section extraction recipes
```
## Hello World (no external services required)
```python
from substrate import DuckDBStore
from substrate.connectors.csv_file import CSVConnector

csv = CSVConnector({
    "path": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
})

store = DuckDBStore("./hello.duckdb")
store.sync(csv, "titanic")
store.derive(
    "survival_by_class",
    "SELECT Pclass, AVG(CAST(Survived AS DOUBLE)) AS survival_rate "
    "FROM titanic GROUP BY Pclass ORDER BY Pclass",
    depends_on=["titanic"],
)

for row in store.query("SELECT * FROM survival_by_class"):
    print(row)
```
Then expose it to Claude:
```bash
substrate serve ./hello.duckdb
```
## The Kernel: Three Ops
- `sync` — external source → table (connectors live here)
- `transform` — table → table via callable (recipes live here)
- `derive` — SQL transform across tables (joins, canonicalize, project)

Plus three orchestrators: `materialize` runs the DAG, `publish` ships a bundle to a Registry (R2, GitHub Releases, local dir), and `pull`/`from_url` fetches one and ATTACHes it via DuckDB httpfs.
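The `materialize` DAG walk can be illustrated with a small standalone sketch — all names here are hypothetical, not substrate's actual implementation. Each step declares what it depends on, and a step runs only after every dependency has run:

```python
# Minimal illustration of a materialize-style DAG walk: execute each
# step only after everything it depends_on has executed.

def materialize(steps: dict[str, list[str]]) -> list[str]:
    """Return a valid execution order for steps keyed name -> depends_on."""
    order: list[str] = []
    done: set[str] = set()

    def run(name: str, chain: tuple[str, ...] = ()) -> None:
        if name in done:
            return
        if name in chain:
            raise ValueError(f"dependency cycle at {name!r}")
        for dep in steps[name]:
            run(dep, chain + (name,))
        order.append(name)  # the step "executes" once its deps are done
        done.add(name)

    for name in steps:
        run(name)
    return order

# Mirrors the Hello World pipeline: sync titanic, then derive from it.
plan = materialize({"titanic": [], "survival_by_class": ["titanic"]})
print(plan)  # ['titanic', 'survival_by_class']
```

The same walk generalizes to any mix of syncs, transforms, and derives, since all three ops declare their inputs.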
**Compatibility.** `extract_text`, `section_extract`, and `llm_extract` remain available as convenience methods. They are now wrappers over `transform`; the next package split moves their implementation into `substrate-pdf` and `substrate-llm`.
## Operational Stores vs Bundle Format
Do not confuse where a pipeline runs with what gets distributed.
| Concern | What it is for | Substrate contract |
|---|---|---|
| Execution store | Runs `sync`, `transform`, `derive`, asserts, and `materialize` | `ExecutionStore` |
| Bundle publisher | Ships a portable DuckDB artifact to agents, analysts, MCP/API consumers | `BundlePublisher` |
`DuckDBStore` satisfies both contracts today: it can run pipelines locally and publish the same file as a bundle. Flightdeck-style apps should keep Postgres as the operational store for concurrency, tenancy, RLS, jobs, and hot reads. That path is `PostgresStore`: it implements `ExecutionStore` against one schema in an application Postgres database, with an explicit export bridge that produces DuckDB bundles for distribution.

`pull()` and `from_url()` stay DuckDB-specific because they are consumer-side bundle conveniences, not requirements for every operational backend.
```python
from substrate import PostgresStore

store = PostgresStore(settings.DATABASE_URL, schema=f"tenant_{workspace_id}")
result = store.materialize(my_dataset)

# Operational tables live in Postgres; portable consumers still get DuckDB.
bundle = store.export_to_duckdb("./workspace_bundle.duckdb", overwrite=True)
```
## Two Extensibility Layers
### Connectors — for "I have a data source"
Implement three methods on `BaseConnector` and you have a substrate connector. The kernel handles state, manifests, retries with backoff, `fan_out` across parameter combos, partial-failure resumption, MCP exposure, and provenance.
```python
# StreamInfo import path assumed to sit alongside BaseConnector.
from substrate.connectors.base import BaseConnector, StreamInfo

class MyAPIConnector(BaseConnector):
    @classmethod
    def connector_type(cls) -> str:
        return "myapi"

    def list_streams(self):
        return [StreamInfo(name="orders"), StreamInfo(name="customers")]

    def get_stream_schema(self, stream):
        ...

    def extract_streaming(self, stream, stream_filters=None):
        # self.paginate is your own pagination helper, not a kernel API.
        for page in self.paginate(stream, stream_filters):
            yield from page
```
Built-in: `CSVConnector`, `PostgresConnector`, `RESTAPIConnector`. Domain packages add their own — `substrate-govdata` ships 26 US federal agency connectors with canonical join keys (FIPS, NAICS, year).
### Recipes — for "I have unstructured stuff that needs structure"
A recipe is a batch-capable callable that turns source rows into one or more output rows, with declared output schema and provenance facets. Simple per-row recipes subclass `RowRecipe`; cost-sensitive recipes can process batches directly. PDFs, OCR, LLM extraction, vision classification, agent loops, audio transcription — every modality becomes a recipe on the same `transform` op.
```python
from substrate import ColumnSpec, RowRecipe, TransformContext

class UpperTitle(RowRecipe):
    output_schema = [ColumnSpec("upper_title", "VARCHAR")]
    recipe_name = "examples.upper_title"

    def transform_row(self, row: dict, context: TransformContext) -> dict:
        return {"upper_title": row["title"].upper()}

store.transform(
    "upper_docs",
    source="docs",
    recipe=UpperTitle(),
    pass_through_columns=["id"],
)
```
Transform outputs get common provenance (`__substrate_recipe`, `__substrate_source_row_id`, `__substrate_input_hash`, etc.). Row or batch failures can be recorded in `_substrate_transform_errors`; error thresholds decide whether the transform blocks.
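How a threshold-gated transform might behave can be sketched in a few lines of standalone Python. This is illustrative only — hypothetical names, not the kernel's actual policy: failures are recorded per row instead of aborting, and the transform blocks only when the failure rate crosses a ceiling.

```python
# Illustrative sketch of a threshold-gated transform: collect per-row
# failures, then block only if the failure rate exceeds the ceiling.

def apply_with_threshold(rows, transform_row, max_error_rate=0.05):
    outputs, errors = [], []
    for i, row in enumerate(rows):
        try:
            outputs.append(transform_row(row))
        except Exception as exc:  # recorded, not raised immediately
            errors.append({"row_index": i, "error": str(exc)})
    rate = len(errors) / len(rows) if rows else 0.0
    if rate > max_error_rate:
        raise RuntimeError(f"transform blocked: {rate:.1%} of rows failed")
    return outputs, errors

rows = [{"title": "a"}, {"title": None}, {"title": "c"}]
out, errs = apply_with_threshold(
    rows, lambda r: {"upper_title": r["title"].upper()}, max_error_rate=0.5
)
print(len(out), len(errs))  # 2 1
```

With a tighter ceiling the same one-in-three failure rate would raise instead of returning, which is the "blocks" behavior described above.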
```python
from substrate import DuckDBStore
from substrate_pdf import extract_text
from substrate_llm import llm_extract

store.sync(pdf_connector, "pdfs")         # URLs / paths → table
extract_text(store, "pdfs", "acfr_text")  # PDF rows → text rows
llm_extract(                              # text rows → structured rows
    store, "acfr_text", "acfr_financials",
    prompt="Extract total_revenue_usd, total_expenses_usd, unfunded_opeb_usd.",
    schema={
        "total_revenue_usd": "DOUBLE",
        "total_expenses_usd": "DOUBLE",
        "unfunded_opeb_usd": "DOUBLE",
    },
)
```
Real example: `govdata/examples/acfr_extraction.py` extracts 56 financial fields from 168 municipal audit PDFs for $1.87 total. Every row traces to its source PDF page, LLM model, prompt hash, and cost.
## Provenance by Default
Every substrate-managed table carries provenance columns, but the exact column set depends on the operation that emitted the row:
Sync tables record source-call lineage:

- `__substrate_run_id`
- `__substrate_synced_at`
- `__substrate_source_stream`
- `__substrate_source_filter`
- `__substrate_connector_type`
Derive tables record SQL lineage:

- `__substrate_run_id`
- `__substrate_sql_hash`
- `__substrate_depends_on`
- `__substrate_derived_at`
- `__substrate_contributor`
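For a hash like `__substrate_sql_hash` to be useful as lineage, it has to be stable when the SQL is merely reformatted. One plausible construction — an assumption for illustration, since substrate may normalize differently — hashes whitespace-normalized, lowercased SQL:

```python
import hashlib

def sql_hash(sql: str) -> str:
    """Fingerprint SQL after collapsing whitespace and case, so that
    reformatting a query does not change its lineage hash.
    The normalization rule here is illustrative, not substrate's."""
    normalized = " ".join(sql.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

a = sql_hash("SELECT Pclass FROM titanic GROUP BY Pclass")
b = sql_hash("select pclass\n  from titanic\n  group by pclass")
print(a == b)  # True: same query, same hash
```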
Recipe outputs add recipe-specific facets (cost, model, prompt_hash, confidence, bbox, source page, extraction status, etc.) — declared by each recipe, no hand-maintenance.
```python
store.query("""
    SELECT __substrate_model, __substrate_cost_usd
    FROM acfr_financials
    LIMIT 1
""")
# [{'__substrate_model': 'anthropic/claude-sonnet-4', '__substrate_cost_usd': 0.011}]
```
## Datasets and Bundles
A `Dataset` is a declarative recipe — syncs + derives + asserts + canonical join keys. `materialize(dataset)` runs the whole DAG. `publish(dataset, registry)` ships a versioned bundle to R2 (or GitHub Releases, or a local dir) plus an `index.json` entry. Anyone with the URL can attach it from DuckDB without installing substrate:
```sql
INSTALL httpfs; LOAD httpfs;
ATTACH 'https://substrate.example.com/datasets/economy/state_panel/v1/store.duckdb'
    AS state_panel (READ_ONLY);
SELECT * FROM state_panel.state_panel WHERE year = 2023;
```
Or via the substrate library:
```python
from substrate import DuckDBStore

store = DuckDBStore.from_url("https://substrate.example.com/.../store.duckdb")
store.query("""
    SELECT period_label, value
    FROM fred_unemployment_rate
    ORDER BY date DESC
    LIMIT 5
""")
```
## Build Plane vs Serve Plane
Substrate has two separate hardening surfaces:
| Plane | Goal | Hardening that belongs here |
|---|---|---|
| Build plane | Create correct bundles | connector contracts, fan-out provenance, min-row checks, canonical-key conformance, join-key column contract checks, failed-assert blocking |
| Serve plane | Make bundles safe and easy to use | catalog metadata, raw vs curated, live vs draft vs broken, MCP attach/query workflow, API schema, honest status/quality signals |
Keeping these separate matters. A dataset can be a good recipe but not yet materialized locally. A bundle can be queryable but still draft. MCP should never hide those states: agents need to know both whether data is available now and whether the recipe has passed its build checks.
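That independence of the two planes can be made concrete: a bundle's serve-plane state and its build-plane check result are separate axes, and a catalog entry should expose both rather than collapse them. A hypothetical sketch (these names are illustrative, not the platform's API):

```python
from dataclasses import dataclass
from enum import Enum

class BundleState(Enum):  # serve-plane axis
    LIVE = "live"
    DRAFT = "draft"
    BROKEN = "broken"

@dataclass
class CatalogEntry:
    """What an MCP catalog might report: both axes, never just one."""
    name: str
    state: BundleState          # can I query it now?
    build_checks_passed: bool   # did the recipe pass its build checks?

    def queryable_now(self) -> bool:
        return self.state in (BundleState.LIVE, BundleState.DRAFT)

# A good recipe that is not yet materialized: checks pass, bundle broken.
entry = CatalogEntry("state_panel", BundleState.BROKEN, build_checks_passed=True)
print(entry.queryable_now(), entry.build_checks_passed)  # False True
```

Collapsing these into one boolean is exactly the dishonesty the text warns against: an agent would either trust a broken bundle or skip a perfectly good draft.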
### Real-data smoke checks
The repo includes smoke scripts that hit real upstream sources and exercise both planes:
```bash
# Build plane: materialize raw, heavy fan-out, and curated cross-agency bundles.
uv run python scripts/smoke_real_datasets.py

# Serve plane: start the platform, then call MCP with a real MCP client.
cd platform/backend
SUBSTRATE_PLATFORM_HOME=/tmp/substrate-e2e/platform \
  uv run uvicorn main:app --host 127.0.0.1 --port 8790

# In another terminal:
uv run python scripts/smoke_mcp_client.py \
  --url http://127.0.0.1:8790/mcp/mcp \
  --dataset census_acs1_state

# Operational Postgres path: run in a temp schema, export to DuckDB, query both.
SUBSTRATE_POSTGRES_TEST_DSN=postgresql://... \
  uv run python scripts/smoke_postgres_store.py
```
The default build smoke materializes `census_acs1_state`, `fred_macro_indicators`, and `state_panel`. The FRED tables use `period_label`, not `period`, as their canonical time key.
## Cross-Source Joins via Canonical Keys
The killer feature of the connector ecosystem: joins across sources are typed. govdata declares `state_fips`, `state_abbrev`, `county_fips`, `cbsa`, `year`, `naics`, `cik`, `ein`, etc. as canonical kinds. Any dataset declaring `JoinKey("state_fips", ...)` gets per-row IN-list conformance asserts for free, and joins to any other dataset claiming the same kind.
```python
JOINS_ON = {
    "bls_qcew_state": [
        JoinKey("state_fips", "state_fips", "VARCHAR"),
        JoinKey("year", "year", "INTEGER"),
    ],
}
```
The same pattern works for private-sector data — match Stripe customers to Salesforce accounts via email, match QBO vendors to HubSpot companies via EIN. Each domain package ships its own canonical-key registry.
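The per-row IN-list conformance assert amounts to checking every value of a declared join-key column against the canonical registry's allowed values. A standalone sketch — registry contents truncated and names hypothetical, not substrate's API:

```python
# Illustrative conformance check: every row's join-key value must
# appear in the registry's allow-list for that canonical kind.

CANONICAL_KINDS = {
    # Real two-digit state FIPS codes, truncated for the example.
    "state_fips": {"01", "02", "04", "05", "06"},
}

def assert_conforms(rows, column, kind):
    allowed = CANONICAL_KINDS[kind]
    bad = [r[column] for r in rows if r[column] not in allowed]
    if bad:
        raise AssertionError(
            f"{column}: {len(bad)} rows outside {kind} IN-list: {bad[:5]}"
        )

rows = [{"state_fips": "06"}, {"state_fips": "01"}]
assert_conforms(rows, "state_fips", "state_fips")  # passes silently
```

Because the check is keyed by kind rather than by dataset, any two datasets declaring the same kind are guaranteed to draw from the same value space, which is what makes the cross-source join safe.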
## CLI
```bash
substrate version
substrate serve <store.duckdb>             # MCP server over stdio
substrate tables <store.duckdb>            # list tables + row counts
substrate query <store.duckdb> "<sql>"     # run a SELECT/WITH query
substrate inspect <module:Class> <stream>  # preview a connector stream
substrate doctor [--package P]             # health-check installed datasets
```
`substrate serve` is what you point Claude Desktop at:
```json
{
  "mcpServers": {
    "substrate": {
      "command": "substrate",
      "args": ["serve", "/absolute/path/to/store.duckdb"]
    }
  }
}
```
## What Substrate Is NOT
- **Not dbt.** dbt is the SQL-transformation graph. Substrate's unique value is `sync` and `transform` — getting data IN from APIs, models, documents. dbt has zero LLM ops, no portable bundles, no canonical-key join registry. Use both: substrate produces tables, dbt transforms them.
- **Not Fivetran/Airbyte.** Those ship to your warehouse. Substrate ships to a portable DuckDB bundle. Different audience: agent builders and analysts who don't run a warehouse.
- **Not HuggingFace datasets.** HF is for ML training. Substrate is for analytical SQL/MCP queries.
- **Not Iceberg/Delta Sharing.** Those are table formats for already-existing data at scale. Substrate is end-to-end, opinionated about how data gets created.
- **Not LangChain document loaders.** Loaders are stateless converters with no provenance. Substrate's `transform` op is the durable, auditable, idempotent equivalent.
- **Not a hosted platform (yet).** Substrate is a library you install. Hosting is an option, not a requirement.
## Architecture
```
Connectors                     Recipes
(sync — APIs, DBs, PDFs)       (transform — LLM, OCR,
                                vision, agent loops, ML)
      │                              │
      ▼                              ▼
┌────────────────────────────────────┐
│   3-op kernel + ExecutionStore     │
│   sync → transform → derive        │
│   DuckDB today; Postgres path      │
│   (per-row provenance throughout)  │
└────────────────────────────────────┘
                  │
           export / publish
                  ▼
┌────────────────────────────────────┐
│   DuckDB bundle + Registry         │
│   index.json + versioned bundles   │
└────────────────────────────────────┘
                  │
            pull / from_url
                  │
                  ▼
      MCP server / DuckDB ATTACH
      Any agent queries with SQL
```
Three ops. Two layers (connectors, recipes). One bundle format. Provenance everywhere.
## Domain Libraries
| Package | What it adds | Status |
|---|---|---|
| `substrate-govdata` | 26 US agency connectors, 21 published datasets, FIPS/NAICS canonical keys | shipping |
| `substrate-acfr` | Municipal audit PDF extraction pipeline | shipping |
| `substrate-llm` | LLM extraction recipe (OpenRouter, Anthropic, OpenAI) | factoring out of kernel |
| `substrate-pdf` | PDF text + section extraction recipes | factoring out of kernel |
| `substrate-stripe` | Stripe connector + canonical schemas | planned |
| `substrate-salesforce` | Salesforce connector + canonical schemas | planned |
| `substrate-ocr` | Tesseract / Textract OCR recipes | planned |
| `substrate-vision` | CLIP / vision-model classification recipes | planned |
| `substrate-agent` | Tool-using agent loop recipes | planned |
Same kernel, same primitives, different sources and recipes.
## Where This Is Going
See `research/substrate_thesis_2026_04_28.md` for the full thesis and roadmap. The short version:
- Phase 0 — taxonomy hygiene (plumb `kind`/`health`, lint tag prefixes)
- Phase 1 — UI collapse (drop hardcoded packs, derive groupings from data)
- Phase 2 — kernel reshape (`transform` landed; extract LLM/PDF into separate packages next)
- Phase 3 — hierarchical dataset namespaces (`economy.state_panel` etc.)
- Phase 4 — first private-sector connector + first new modality recipe (proves the package ecosystem)
## License

MIT.