
The data layer for AI agents. Extract, transform, and expose data via MCP.


Substrate

Canonicalize data from any source — APIs, databases, PDFs, agent outputs, ML models — into structured, joinable, traceable, portable bundles your AI agent can query.

Extract once, query forever, trace everything.

The Thesis

Most useful data lives behind APIs, in databases, in PDFs, in agent transcripts, in model outputs — and most of it stays trapped there. Every Claude session that needs real data either hallucinates or scrapes something live. Slow, expensive, no provenance, different results every time.

Substrate is a small kernel that turns any of these sources into typed, canonical, joinable rows in a portable DuckDB bundle, with per-row provenance preserved through every transform. Three ops, two extensibility layers, one bundle format. Any agent can attach and query.

The Four Properties

Property     What it means
-----------  ---------------------------------------------------------------------------------
Structured   Typed rows, declared schemas, asserts that fail loudly
Joinable     Canonical-key registry types cross-source joins
Traceable    Per-row provenance (run_id, source, model, cost, sql_hash) follows every transform
Portable     DuckDB bundle is one file, attaches over HTTP, no infrastructure

dbt has structured + joinable. Iceberg has structured + portable. Fivetran has structured. Substrate is the only thing that brings all four together.

Install

pip install substrate-data

The PyPI distribution is substrate-data; the Python import and CLI remain substrate. The bare substrate package name is owned by an unrelated archived AI SDK.

Domain libraries are separate packages — install what you need:

pip install substrate-govdata     # 26 US agency connectors + 21 datasets
pip install substrate-llm         # LLM extraction recipes (OpenRouter / Anthropic)
pip install substrate-pdf         # PDF text + section extraction recipes

Hello World (no external services required)

from substrate import DuckDBStore
from substrate.connectors.csv_file import CSVConnector

csv = CSVConnector({
    "path": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
})

store = DuckDBStore("./hello.duckdb")
store.sync(csv, "titanic")

store.derive(
    "survival_by_class",
    "SELECT Pclass, AVG(CAST(Survived AS DOUBLE)) AS survival_rate "
    "FROM titanic GROUP BY Pclass ORDER BY Pclass",
    depends_on=["titanic"],
)

for row in store.query("SELECT * FROM survival_by_class"):
    print(row)

Then expose it to Claude:

substrate serve ./hello.duckdb

The Kernel: Three Ops

sync       — external source → table          (connectors live here)
transform  — table → table via callable       (recipes live here)
derive     — SQL transform across tables      (joins, canonicalize, project)

Plus three orchestrators: materialize runs the DAG, publish ships a bundle to a Registry (R2, GitHub Releases, local dir), and pull/from_url fetches one and ATTACHes it via DuckDB httpfs.

Compatibility. extract_text, section_extract, and llm_extract remain available as convenience methods. They are now wrappers over transform; the next package split moves their implementation into substrate-pdf and substrate-llm.

Operational Stores vs Bundle Format

Do not confuse where a pipeline runs with what gets distributed.

Concern            What it is for                                                            Substrate contract
-----------------  ------------------------------------------------------------------------  ------------------
Execution store    Runs sync, transform, derive, asserts, and materialize                     ExecutionStore
Bundle publisher   Ships a portable DuckDB artifact to agents, analysts, MCP/API consumers    BundlePublisher

DuckDBStore satisfies both contracts today: it can run pipelines locally and publish the same file as a bundle. Flightdeck-style apps should keep Postgres as the operational store for concurrency, tenancy, RLS, jobs, and hot reads. That path is PostgresStore: it implements ExecutionStore against one schema in an application Postgres database, with an explicit export bridge that produces DuckDB bundles for distribution.

pull() and from_url() stay DuckDB-specific because they are consumer-side bundle conveniences, not requirements for every operational backend.

from substrate import PostgresStore

store = PostgresStore(settings.DATABASE_URL, schema=f"tenant_{workspace_id}")
result = store.materialize(my_dataset)

# Operational tables live in Postgres; portable consumers still get DuckDB.
bundle = store.export_to_duckdb("./workspace_bundle.duckdb", overwrite=True)

Two Extensibility Layers

Connectors — for "I have a data source"

Implement three methods on BaseConnector — list_streams, get_stream_schema, and extract_streaming — and you have a substrate connector. The kernel handles state, manifests, retries with backoff, fan_out across parameter combos, partial-failure resumption, MCP exposure, and provenance.

from substrate.connectors.base import BaseConnector, StreamInfo  # StreamInfo assumed to live alongside BaseConnector

class MyAPIConnector(BaseConnector):
    @classmethod
    def connector_type(cls) -> str:
        return "myapi"

    def list_streams(self):
        return [StreamInfo(name="orders"), StreamInfo(name="customers")]

    def get_stream_schema(self, stream):
        ...

    def extract_streaming(self, stream, stream_filters=None):
        for page in self.paginate(stream, stream_filters):
            yield from page
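
Once the class exists, you can preview a stream from the CLI before wiring it into a pipeline — a minimal sketch, assuming the connector lives at the hypothetical module path mypackage.connectors:

substrate inspect mypackage.connectors:MyAPIConnector orders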

Built-in: CSVConnector, PostgresConnector, RESTAPIConnector. Domain packages add their own — substrate-govdata ships 26 US federal agency connectors with canonical join keys (FIPS, NAICS, year).

Recipes — for "I have unstructured stuff that needs structure"

A recipe is a batch-capable callable that turns source rows into one or more output rows, with declared output schema and provenance facets. Simple per-row recipes subclass RowRecipe; cost-sensitive recipes can process batches directly. PDFs, OCR, LLM extraction, vision classification, agent loops, audio transcription — every modality becomes a recipe on the same transform op.

from substrate import ColumnSpec, RowRecipe, TransformContext

class UpperTitle(RowRecipe):
    output_schema = [ColumnSpec("upper_title", "VARCHAR")]
    recipe_name = "examples.upper_title"

    def transform_row(self, row: dict, context: TransformContext) -> dict:
        return {"upper_title": row["title"].upper()}

store.transform(
    "upper_docs",
    source="docs",
    recipe=UpperTitle(),
    pass_through_columns=["id"],
)

Transform outputs get common provenance (__substrate_recipe, __substrate_source_row_id, __substrate_input_hash, etc.). Row or batch failures can be recorded in _substrate_transform_errors; error thresholds decide whether the transform blocks.
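
When a transform does record failures, they are ordinary rows you can query. A minimal sketch — the error table's exact columns are not documented here, so this just samples whatever was recorded:

for err in store.query("SELECT * FROM _substrate_transform_errors LIMIT 5"):
    print(err)

The chain below runs heavier modalities on the same transform op — connector rows in, PDF text extracted, structured financials out: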

from substrate import DuckDBStore
from substrate_pdf import extract_text
from substrate_llm import llm_extract

store.sync(pdf_connector, "pdfs")              # URLs / paths → table
extract_text(store, "pdfs", "acfr_text")       # PDF rows → text rows
llm_extract(                                   # text rows → structured rows
    store, "acfr_text", "acfr_financials",
    prompt="Extract total_revenue_usd, total_expenses_usd, unfunded_opeb_usd.",
    schema={
        "total_revenue_usd": "DOUBLE",
        "total_expenses_usd": "DOUBLE",
        "unfunded_opeb_usd": "DOUBLE",
    },
)

Real example: govdata/examples/acfr_extraction.py extracts 56 financial fields from 168 municipal audit PDFs for $1.87 total. Every row traces to its source PDF page, LLM model, prompt hash, and cost.

Provenance by Default

Every substrate-managed table carries provenance columns, but the exact column set depends on the operation that emitted the row:

Sync tables record source-call lineage:

__substrate_run_id
__substrate_synced_at
__substrate_source_stream
__substrate_source_filter
__substrate_connector_type

Derive tables record SQL lineage:

__substrate_run_id
__substrate_sql_hash
__substrate_depends_on
__substrate_derived_at
__substrate_contributor

Recipe outputs add recipe-specific facets (cost, model, prompt_hash, confidence, bbox, source page, extraction status, etc.) — declared by each recipe, no hand-maintenance.

store.query("""
  SELECT __substrate_model, __substrate_cost_usd
  FROM acfr_financials
  LIMIT 1
""")
# [{'__substrate_model': 'anthropic/claude-sonnet-4', '__substrate_cost_usd': 0.011}]

Datasets and Bundles

A Dataset is a declarative recipe — syncs + derives + asserts + canonical join keys. materialize(dataset) runs the whole DAG. publish(dataset, registry) ships a versioned bundle to R2 (or GitHub Releases, or a local dir) plus an index.json entry. Anyone with the URL can attach it from DuckDB without installing substrate:

INSTALL httpfs; LOAD httpfs;
ATTACH 'https://substrate.example.com/datasets/economy/state_panel/v1/store.duckdb'
       AS state_panel (READ_ONLY);
SELECT * FROM state_panel.state_panel WHERE year = 2023;

Or via the substrate library:

from substrate import DuckDBStore
store = DuckDBStore.from_url("https://substrate.example.com/.../store.duckdb")
store.query("""
  SELECT period_label, value
  FROM fred_unemployment_rate
  ORDER BY date DESC
  LIMIT 5
""")

Build Plane vs Serve Plane

Substrate has two separate hardening surfaces:

Plane         Goal                               Hardening that belongs here
------------  ---------------------------------  -------------------------------------------------------------------------
Build plane   Create correct bundles             connector contracts, fan-out provenance, min-row checks, canonical-key conformance, join-key column contract checks, failed-assert blocking
Serve plane   Make bundles safe and easy to use  catalog metadata, raw vs curated, live vs draft vs broken, MCP attach/query workflow, API schema, honest status/quality signals

Keeping these separate matters. A dataset can be a good recipe but not yet materialized locally. A bundle can be queryable but still draft. MCP should never hide those states: agents need to know both whether data is available now and whether the recipe has passed its build checks.

Real-data smoke checks

The repo includes smoke scripts that hit real upstream sources and exercise both planes:

# Build plane: materialize raw, heavy fan-out, and curated cross-agency bundles.
uv run python scripts/smoke_real_datasets.py

# Serve plane: start the platform, then call MCP with a real MCP client.
cd platform/backend
SUBSTRATE_PLATFORM_HOME=/tmp/substrate-e2e/platform \
  uv run uvicorn main:app --host 127.0.0.1 --port 8790

# In another terminal:
uv run python scripts/smoke_mcp_client.py \
  --url http://127.0.0.1:8790/mcp/mcp \
  --dataset census_acs1_state

# Operational Postgres path: run in a temp schema, export to DuckDB, query both.
SUBSTRATE_POSTGRES_TEST_DSN=postgresql://... \
  uv run python scripts/smoke_postgres_store.py

The default build smoke materializes census_acs1_state, fred_macro_indicators, and state_panel. The FRED tables use period_label, not period, as their canonical time key.

Cross-Source Joins via Canonical Keys

The killer feature of the connector ecosystem: joins across sources are typed. govdata declares state_fips, state_abbrev, county_fips, cbsa, year, naics, cik, ein, etc. as canonical kinds. Any dataset declaring JoinKey("state_fips", ...) gets per-row IN-list conformance asserts for free, and joins to any other dataset claiming the same kind.

JOINS_ON = {
    "bls_qcew_state": [
        JoinKey("state_fips", "state_fips", "VARCHAR"),
        JoinKey("year", "year", "INTEGER"),
    ],
}

The same pattern works for private-sector data — match Stripe customers to Salesforce accounts via email, match QBO vendors to HubSpot companies via EIN. Each domain package ships its own canonical-key registry.
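
Concretely, once two datasets claim the same kinds, the join is just a derive. A sketch, assuming a census table that also declares state_fips and year — the measure columns (total_wages, population) are illustrative, not a published schema:

store.derive(
    "wages_vs_population",
    "SELECT q.state_fips, q.year, q.total_wages, c.population "
    "FROM bls_qcew_state q "
    "JOIN census_acs1_state c "
    "  ON q.state_fips = c.state_fips AND q.year = c.year",
    depends_on=["bls_qcew_state", "census_acs1_state"],
)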

CLI

substrate version
substrate serve   <store.duckdb>           # MCP server over stdio
substrate tables  <store.duckdb>           # list tables + row counts
substrate query   <store.duckdb> "<sql>"   # run a SELECT/WITH query
substrate inspect <module:Class> <stream>  # preview a connector stream
substrate doctor  [--package P]            # health-check installed datasets

substrate serve is what you point Claude Desktop at:

{
  "mcpServers": {
    "substrate": {
      "command": "substrate",
      "args": ["serve", "/absolute/path/to/store.duckdb"]
    }
  }
}

What Substrate Is NOT

  • Not dbt. dbt is the SQL-transformation graph. Substrate's unique value is sync and transform — getting data IN from APIs, models, documents. dbt has zero LLM ops, no portable bundles, no canonical-key join registry. Use both: substrate produces tables, dbt transforms them.
  • Not Fivetran/Airbyte. Those ship to your warehouse. Substrate ships to a portable DuckDB bundle. Different audience: agent builders and analysts who don't run a warehouse.
  • Not HuggingFace datasets. HF is for ML training. Substrate is for analytical SQL/MCP queries.
  • Not Iceberg/Delta Sharing. Those are table formats for already-existing data at scale. Substrate is end-to-end, opinionated about how data gets created.
  • Not LangChain document loaders. Loaders are stateless converters with no provenance. Substrate's transform op is the durable, auditable, idempotent equivalent.
  • Not a hosted platform (yet). Substrate is a library you install. Hosting is an option, not a requirement.

Architecture

                         Connectors                   Recipes
                  (sync — APIs, DBs, PDFs)    (transform — LLM, OCR,
                                               vision, agent loops, ML)
                              │                           │
                              ▼                           ▼
                          ┌────────────────────────────────────┐
                          │   3-op kernel + ExecutionStore     │
                          │   sync → transform → derive        │
                          │   DuckDB today; Postgres path      │
                          │   (per-row provenance throughout)  │
                          └────────────────────────────────────┘
                                          │
                                  export / publish
                                          ▼
                          ┌────────────────────────────────────┐
                          │   DuckDB bundle + Registry         │
                          │   index.json + versioned bundles   │
                          └────────────────────────────────────┘
                                          │
                                  pull / from_url
                                          │
                                          ▼
                              MCP server / DuckDB ATTACH
                              Any agent queries with SQL

Three ops. Two layers (connectors, recipes). One bundle format. Provenance everywhere.

Domain Libraries

Package                What it adds                                                                 Status
---------------------  ---------------------------------------------------------------------------  -----------------------
substrate-govdata      26 US agency connectors, 21 published datasets, FIPS/NAICS canonical keys     shipping
substrate-acfr         Municipal audit PDF extraction pipeline                                       shipping
substrate-llm          LLM extraction recipe (OpenRouter, Anthropic, OpenAI)                         factoring out of kernel
substrate-pdf          PDF text + section extraction recipes                                         factoring out of kernel
substrate-stripe       Stripe connector + canonical schemas                                          planned
substrate-salesforce   Salesforce connector + canonical schemas                                      planned
substrate-ocr          Tesseract / Textract OCR recipes                                              planned
substrate-vision       CLIP / vision-model classification recipes                                    planned
substrate-agent        Tool-using agent loop recipes                                                 planned

Same kernel, same primitives, different sources and recipes.

Where This Is Going

See research/substrate_thesis_2026_04_28.md for the full thesis and roadmap. The short version:

  1. Phase 0 — taxonomy hygiene (plumb kind/health, lint tag prefixes)
  2. Phase 1 — UI collapse (drop hardcoded packs, derive groupings from data)
  3. Phase 2 — kernel reshape (transform landed; extract LLM/PDF into separate packages next)
  4. Phase 3 — hierarchical dataset namespaces (economy.state_panel etc.)
  5. Phase 4 — first private-sector connector + first new modality recipe (proves the package ecosystem)

License

MIT.
