Validate data contracts between dbt models and FastAPI/Pydantic APIs with accurate, low-false-positive schema checks
🛡️ Data Contract Validator
Catch breaking changes between your dbt models and your FastAPI/Pydantic APIs before they hit production.
🎯 What it solves
Your analytics team changes a dbt model. Your API team's FastAPI service still expects the old shape. Nobody notices until production 500s at 2 AM.
This tool sits on that boundary. It extracts the schema your dbt models produce and the schema your Pydantic models expect, compares them, and fails CI when the data side can no longer satisfy the API side.
dbt models               Data Contract Validator               FastAPI / Pydantic
(what the pipeline ──▶ extract → normalize → compare ◀── (what the API expects)
 produces)                          │
                    critical issues block the build
Built for trust
A check that gates a deploy is only useful if it doesn't cry wolf. v1.1 re-architected extraction around that principle:
- Canonical types: dbt varchar and Pydantic str are understood to be the same thing, so you don't get drowned in fake "type mismatch" warnings.
- A real SQL parser (sqlglot) instead of regex: CTEs, || concatenation, window functions and quoted identifiers are parsed correctly.
- Confidence-aware: if the tool can't fully resolve a model's columns (e.g. SELECT *), it will warn rather than falsely block your build.
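To make the canonical-types idea concrete, here is a minimal sketch of normalizing both sides to one shared vocabulary before comparing. The lookup tables and the `canonicalize` / `types_compatible` helpers are hypothetical illustrations, not the package's actual type system.

```python
# Hypothetical lookup tables: the names and entries below are assumptions for
# illustration, not the mappings shipped in data-contract-validator.
SQL_TO_CANONICAL = {
    "varchar": "string", "text": "string", "char": "string",
    "int": "integer", "integer": "integer", "bigint": "integer",
    "float": "float", "double": "float", "numeric": "decimal",
    "boolean": "boolean", "timestamp": "datetime", "date": "date",
}
PYTHON_TO_CANONICAL = {
    "str": "string", "int": "integer", "float": "float",
    "bool": "boolean", "datetime": "datetime", "date": "date",
    "Decimal": "decimal",
}


def canonicalize(raw_type: str, side: str) -> str:
    """Map a raw warehouse ("dbt") or Python ("pydantic") type name to a canonical one."""
    table = SQL_TO_CANONICAL if side == "dbt" else PYTHON_TO_CANONICAL
    # Drop length/precision suffixes such as varchar(255) or numeric(10, 2).
    base = raw_type.split("(", 1)[0].strip()
    key = base.lower() if side == "dbt" else base
    return table.get(key, "unknown")


def types_compatible(dbt_type: str, pydantic_type: str) -> bool:
    """Treat an unknown type on either side as 'cannot tell', not as a mismatch."""
    a = canonicalize(dbt_type, "dbt")
    b = canonicalize(pydantic_type, "pydantic")
    return "unknown" in (a, b) or a == b
```

Under these assumptions `types_compatible("VARCHAR(255)", "str")` holds, while `bigint` against `str` is reported as a real mismatch.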
⚡ Quick start
pip install data-contract-validator
# Initialize config + CI workflow in your dbt project
contract-validator init --interactive
# Sanity-check the setup
contract-validator test
# Validate
contract-validator validate
Getting started, step by step
If you're setting this up on a project for the first time, the order below avoids the sharp edges:
1. Install into the same environment dbt runs in (not a separate venv), because the tool needs to see your dbt project:
   pip install data-contract-validator
   Already have .retl-validator.yml committed by a teammate? Skip to step 5.
2. Generate the config + CI workflow (one-time):
   contract-validator init --interactive
   You'll be asked: where your dbt project is, which API framework you use, whether your models live in this local project or a different GitHub repo, and then the local path (or the org/repo + path within it, plus an optional branch/tag/commit; blank reads the repo's default branch).
   Local-vs-GitHub is asked explicitly rather than guessed from the path's shape: a local path like app/models is syntactically identical to a GitHub org/repo string, so there's no reliable way to infer which one you mean. If you pick GitHub, it checks the path actually exists before writing the config, so a typo surfaces here instead of at validate time.
   init refuses to touch an existing .retl-validator.yml or workflow file, so it won't clobber hand-added mapping entries just because you upgraded the package and re-ran init. Pass --force if you really want to regenerate them from the new version's defaults.
3. Pre-commit hook: init --interactive asks whether you want one set up right after creating the config and CI workflow; say yes there and it's done. To add one later (or if you used non-interactive init, which doesn't prompt), run it standalone:
   contract-validator setup-precommit --install-hooks
4. If the target repo is private, set a token before running anything that talks to GitHub locally:
   export GITHUB_TOKEN=$(gh auth token) # or a PAT with repo read access
   See Private GitHub repos need GITHUB_TOKEN below for why this is easy to miss.
5. Sanity-check the setup:
   contract-validator test
   Confirms the config parses, the dbt project is found, and the target (local path or GitHub path) is reachable. If this fails, validate will fail the same way, so fix it here first.
6. Run it:
   contract-validator validate
7. When it reports a critical issue, diagnose before assuming your dbt model is wrong:
   - Real missing column/table → fix the dbt model.
   - Target name doesn't match the dbt model by convention (renamed/prefixed) → add an entry under mapping.tables in .retl-validator.yml (see When do I need mapping?).
   - A table that's genuinely populated by something other than dbt (e.g. a separate streaming pipeline) and has no source model on purpose → add it to mapping.exclude. table=True alone is not used to infer this automatically; see FastAPI side for why.
8. For accurate type-checking (not just column-presence checks), run dbt docs generate before validate so it picks up catalog.json (Tier 1, real warehouse types) instead of inferring from SQL text; see How extraction works below.
One-off validation (no config file)
# Local dbt project against a local Pydantic models file or directory
contract-validator validate \
--dbt-project ./my-dbt-project \
--fastapi-local ./my-api/app/models.py
# dbt project against models in another GitHub repo (microservices)
contract-validator validate \
--dbt-project . \
--fastapi-repo "my-org/my-api" \
--fastapi-path "app/models.py"
# ...against a dev/staging branch of that repo instead of its default branch
contract-validator validate \
--dbt-project . \
--fastapi-repo "my-org/my-api" \
--fastapi-path "app/models.py" \
--fastapi-ref "dev"
--fastapi-ref accepts a branch, tag, or commit SHA. It's useful for
validating an in-progress API change (on a dev or feature branch) against
dbt before it merges to main, so you catch the break in the PR that's about
to introduce it, not after. Omit it to read the repo's default branch, same
as before.
How extraction works (and why it's accurate)
dbt side: tiered, best-source-wins
| Tier | Source | Types | Confidence | Notes |
|---|---|---|---|---|
| 1 | target/catalog.json | Real warehouse types | high | Produced by dbt docs generate. Most accurate. |
| 2 | sqlglot SQL parse | Inferred (often unknown) | medium | Trusted column names; enriched with documented types from manifest.json. Detects SELECT *. |
| 3 | regex parse | Guessed | low | Last resort. Never used to hard-fail a build. |
The tool auto-detects what's available and degrades gracefully, so it works offline in pre-commit and with full type fidelity in a warehouse-connected CI job.
💡 Tip: run dbt docs generate in CI before validating to unlock Tier 1 (real types). Without it, you still get accurate column-presence checks from Tier 2. The workflow init generates includes this step already, commented out: it needs your warehouse adapter and credentials filled in, which can't be guessed, so it isn't active by default.
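As a rough illustration of the Tier 1 path, the sketch below reads column types from target/catalog.json when it exists and returns None otherwise, so a caller can fall back to SQL parsing. The function name `load_catalog_columns` is hypothetical and this is not the package's extractor; it only relies on the `nodes` → `columns` → `type` layout that dbt writes into catalog.json.

```python
import json
from pathlib import Path


def load_catalog_columns(project_path):
    """Return {model_name: {column_name: warehouse_type}}, or None without a catalog.

    Hypothetical helper for illustration only.
    """
    catalog = Path(project_path) / "target" / "catalog.json"
    if not catalog.is_file():
        return None  # no Tier 1 data: the caller would fall back to Tier 2/3

    data = json.loads(catalog.read_text())
    schemas = {}
    for unique_id, node in data.get("nodes", {}).items():
        # Keep models only; seeds and snapshots use other prefixes.
        if not unique_id.startswith("model."):
            continue
        # Simplification: take the last dotted segment as the model name.
        model_name = unique_id.split(".")[-1]
        schemas[model_name] = {
            name.lower(): column.get("type", "unknown")
            for name, column in node.get("columns", {}).items()
        }
    return schemas
```

Column names are lower-cased here because some warehouses report them upper-cased; whether the real tool does the same is an assumption.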
FastAPI side
Pydantic / SQLModel classes are parsed from source with Python's ast (no
imports executed). Optional[...] controls whether a field is required.
An explicit __tablename__ is used as the table name when present;
otherwise the class name is converted to snake_case.
table=True SQLModel classes are validated the same as any other class;
they are not skipped. Whether a table is meant to come from dbt is
business knowledge that isn't recoverable from the Python source: two
structurally identical table=True classes can need opposite treatment (one
is a normal dbt-fed table your API also returns directly; another is
populated by a Kafka stream and was never meant to have a dbt model). Use
mapping.exclude to state the latter case explicitly rather than relying on
table=True to imply it.
🚦 What gets flagged
| Severity | Meaning | Example |
|---|---|---|
| 🚨 Critical | Blocks the build | API requires a column the dbt model no longer produces |
| ⚠️ Warning | Worth a look, non-blocking | A real type mismatch, or a missing column on a model we couldn't fully resolve |
$ contract-validator validate
🛡️ Data Contract Validation Results:
Status: ❌ FAILED
Critical: 1 | Warnings: 0
🚨 Critical Issues (Must Fix):
💥 user_analytics
   Column: total_orders
   Problem: Target REQUIRES column 'total_orders' but source doesn't provide it
   🔧 Fix: Add column 'total_orders' to source model for table 'user_analytics'
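The severity rules in the table can be sketched as a small comparison function. This is an illustration under stated assumptions, not the validator's real code: `Issue` and `compare_table` are hypothetical names, and `fully_resolved` stands in for the extractor's confidence signal.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    severity: str  # "critical" or "warning"
    table: str
    column: str
    message: str


def compare_table(table, source_columns, target_fields, fully_resolved=True):
    """Compare one table.

    source_columns: set of column names the dbt model produces.
    target_fields:  {field_name: required} taken from the API model.
    """
    issues = []
    for column, required in target_fields.items():
        if column in source_columns:
            continue
        # Only a required column missing from a fully resolved model blocks
        # the build; everything else is downgraded to a warning.
        severity = "critical" if required and fully_resolved else "warning"
        need = "REQUIRES" if required else "optionally expects"
        issues.append(Issue(
            severity, table, column,
            f"Target {need} column '{column}' but source doesn't provide it",
        ))
    return issues
```

Passing `fully_resolved=False` for a model built on `SELECT *` turns what would have been a build-blocking issue into a warning, which is the low-false-positive behavior described earlier.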
🔧 Configuration (.retl-validator.yml)
version: "1.0"
name: "my-project-contracts"
source:
  dbt:
    project_path: "."
    auto_compile: true
    # Force Tier 2/3 SQL parsing even if catalog/manifest exist:
    disable_manifest: false
target:
  fastapi:
    # GitHub repo:
    type: "github"
    repo: "my-org/my-api"
    path: "app/models.py"
    # Optional: branch, tag, or commit to read from. Omit for the repo's
    # default branch. Handy for validating a dev/staging branch instead of
    # main -- e.g. to catch a break in a PR before it merges.
    # ref: "dev"
    # ...or local:
    # type: "local"
    # path: "../my-api/app/models.py"
# Optional: explicit mapping for when names don't line up by convention.
mapping:
  tables:
    # target (Pydantic) table : source (dbt) model
    user_analytics: user_analytics_summary
  columns:
    user_analytics:
      # target column : source column
      userId: user_id
  # Target tables with no source model on purpose (e.g. Kafka-populated,
  # not dbt) -- see "When do I need mapping?" below.
  exclude:
    - feed_interaction
validation:
  fail_on: ["missing_tables", "missing_required_columns"]
  warn_on: ["type_mismatches", "missing_optional_columns"]
Private GitHub repos need GITHUB_TOKEN
If target.*.repo points at a private repository, contract-validator
needs a token with read access to it. Where that token comes from is
different locally vs. in CI, and the CI case has a sharp edge worth
understanding before it silently fails on a PR.
Locally, set the GITHUB_TOKEN environment variable before running the
CLI. On bash/zsh that's export (there's nothing to install; export just
makes the variable visible to the contract-validator process you run
next):
export GITHUB_TOKEN=$(gh auth token) # or a PAT with repo read access
contract-validator validate
GitHub's API 404s (not 403s) an unauthenticated request to a private path,
so without a token this looks identical to a plain typo in path.
contract-validator init --interactive and contract-validator test both
check target.*.path actually exists and will point you at this if the
lookup 404s with no token set.
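The ambiguity is easier to see as a decision table. The sketch below is a hypothetical helper, not the CLI's actual code or wording: it turns an HTTP status plus "is a token set?" into a hint, and makes no network call itself.

```python
import os


def token_is_set() -> bool:
    """True when a non-empty GITHUB_TOKEN is visible to this process."""
    return bool(os.environ.get("GITHUB_TOKEN"))


def explain_lookup_failure(status_code: int, token_set: bool) -> str:
    """Hypothetical hint text for a failed lookup of the target path."""
    if status_code == 404 and not token_set:
        # GitHub answers 404 for unauthenticated requests to private paths.
        return ("Path not found. If the repo is private, set GITHUB_TOKEN: "
                "without a token a private path looks exactly like a typo.")
    if status_code == 404:
        # With a token, a 404 can still mean the token cannot see the repo.
        return ("Path not found. Check path and ref for typos, and that the "
                "token has read access to the repo.")
    if status_code in (401, 403):
        return "Token rejected or not permitted to read the target repo."
    return f"Unexpected response from GitHub: HTTP {status_code}"
```

A caller would pass the status it received together with `token_is_set()` and show the returned hint next to the failing path.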
In CI, the workflow init generates for a GitHub target wires up
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}, the token Actions
auto-provides. That's fine and needs no setup if your target repo is
public. But that token only has access to the repository the workflow
is running in, so if your dbt repo and your API repo are different repos
and the target is private, it silently can't read it, and validation fails
on every PR with no clue why.
⚠️ Strong recommendation: if your target repo is private, switch this to a token you create yourself before you rely on this workflow. The generated workflow carries a loud comment for exactly this, so don't wait to discover it the hard way on a PR.
To make that switch:
1. Create a token with read access to the target repo. A fine-grained PAT scoped to just that repo's Contents (read-only) is the least-privilege option; a classic PAT with the repo scope also works.
2. In the repo running the workflow (your dbt repo): Settings → Secrets and variables → Actions → New repository secret. Name it API_REPO_TOKEN (or similar) and paste the token as the value.
   ⚠️ GitHub rejects any secret name starting with GITHUB_ because it's a reserved prefix. You cannot create a secret literally called GITHUB_TOKEN; that's not a naming suggestion, the UI will refuse it. That's why the secret needs a different name, even though the environment variable it feeds is GITHUB_TOKEN. They are two different things with confusingly similar names:
   env:
     GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}
     # ^^^^^^^^^^^ environment variable name -- the CLI
     #             needs it called GITHUB_TOKEN to find it
     #                         ^^^^^^^^^^^^^^ the *secret's* name --
     #                                        this is what GitHub restricts
3. Replace secrets.GITHUB_TOKEN with secrets.API_REPO_TOKEN (or whatever you named it) in the workflow's env: block.
Skip all of this for a local target: init omits the whole env: block
since a local target never talks to the GitHub API at all.
When do I need mapping?
Most of the time you don't. Names are matched automatically across:
- snake_case / camelCase / casing: UserAnalytics ↔ user_analytics, userId ↔ user_id
- plural ↔ singular: dbt's plural users matches Pydantic's User (→ user) with no config (and it won't over-match: address is never confused with addres).
Reach for mapping.tables / mapping.columns only when a model or column is
named so differently that convention can't bridge it (e.g. Pydantic
user_id ↔ dbt customer_identifier).
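A minimal sketch of convention-based matching, assuming a deliberately naive singularizer, is shown here. The helper names are hypothetical, and the real matcher may handle more cases (for example words like "status", which this version would mangle).

```python
import re


def to_snake_case(name: str) -> str:
    """UserAnalytics -> user_analytics, userId -> user_id."""
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


def singularize(word: str) -> str:
    """Naive singular form: enough for users -> user, not a full inflector."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    # Leave a double "s" alone so "address" does not become "addres".
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(name: str) -> str:
    return singularize(to_snake_case(name))


def names_match(a: str, b: str) -> bool:
    """True when two names agree after casing and plural normalization."""
    return normalize(a) == normalize(b)
```

With these rules `names_match("users", "User")` and `names_match("userId", "user_id")` both hold, while `user_id` and `customer_identifier` do not match, which is exactly the case an explicit mapping entry is for.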
mapping.exclude is different. It's not about renamed models; it's for a
target table that has no source model on purpose, because it's
populated by something other than dbt (a Kafka stream, a cron job, etc.).
This can't be inferred from the code (a table=True SQLModel class looks
identical whether or not dbt is supposed to feed it), so it has to be a
deliberate, human-stated exception:
mapping:
exclude:
- feed_interaction
- affiliate_reward
Anything not listed is validated normally, including table=True classes,
which are treated the same as any other target and are not silently skipped.
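How the two kinds of mapping entries differ can be sketched in a few lines. `resolve_source_model` is a hypothetical helper that mirrors the config keys shown above; it is not the validator's internal API.

```python
def resolve_source_model(target_table, mapping):
    """Return the dbt model to compare against, or None if the table is excluded.

    mapping follows the config shape:
    {"tables": {target: source}, "exclude": [target, ...]}
    """
    if target_table in mapping.get("exclude", []):
        return None  # intentionally has no dbt source, so skip validation
    # An explicit rename wins; otherwise fall back to the target's own name
    # (convention-based matching would then take over).
    return mapping.get("tables", {}).get(target_table, target_table)
```

A table absent from both `tables` and `exclude` falls through to its own name, so it is still validated.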
Python API
from data_contract_validator import ContractValidator, DBTExtractor, FastAPIExtractor

dbt = DBTExtractor(project_path="./dbt-project")
fastapi = FastAPIExtractor.from_github_repo("my-org/my-api", "app/models.py")
validator = ContractValidator(
    source_extractor=dbt,
    target_extractor=fastapi,
    mapping={"tables": {"user_analytics": "user_analytics_summary"}},  # optional
)
result = validator.validate()
if not result.success:
    for issue in result.critical_issues:
        print(f"💥 {issue.table}.{issue.column}: {issue.message}")
CI / pre-commit integration
GitHub Actions
contract-validator init generates a workflow for you. Minimal version:
name: 🛡️ Data Contract Validation
on:
  pull_request:
    paths: ["models/**/*.sql", "dbt_project.yml", "**/*models*.py"]
jobs:
  validate-contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: { python-version: "3.11" }
      - run: pip install data-contract-validator
      # Optional: `dbt docs generate` here for real warehouse types (Tier 1)
      - run: contract-validator validate --output github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_TOKEN here is only needed if target is a github repo (init
omits the whole env: block for a local target). The default above works
as-is for a public target repo. For a private one, strongly
recommended: swap it for a token you create yourself. See
Private GitHub repos need GITHUB_TOKEN
above for why the default silently can't read a private target, and how to
set up the replacement.
Pre-commit
contract-validator setup-precommit --install-hooks
repos:
  - repo: https://github.com/OGsiji/data-contract-validator
    rev: v1.1.0
    hooks:
      - id: contract-validation
🧪 Output formats
contract-validator validate --output terminal # human-friendly (default)
contract-validator validate --output json # machine-readable for CI
contract-validator validate --output github # GitHub Actions annotations
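For context on the github format: GitHub Actions turns lines of the form `::error ...::message` or `::warning ...::message` printed to stdout into annotations. The sketch below shows how an issue could be rendered that way. It is an assumption about the general shape, not the exact text `--output github` emits, and `to_annotation` is a hypothetical name.

```python
def to_annotation(severity: str, table: str, column: str, message: str) -> str:
    """Render one issue as a GitHub Actions workflow command."""
    level = "error" if severity == "critical" else "warning"
    # Workflow commands require these characters to be percent-escaped in
    # the message part.
    escaped = (message.replace("%", "%25")
                      .replace("\r", "%0D")
                      .replace("\n", "%0A"))
    # Simplification: the title is assumed to contain no ":" or ",", which
    # would need their own escaping in a property value.
    return f"::{level} title={table}.{column}::{escaped}"


if __name__ == "__main__":
    print(to_annotation("critical", "user_analytics", "total_orders",
                        "Target REQUIRES column 'total_orders'"))
```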
Supported frameworks
Source: dbt (all adapters: Snowflake, BigQuery, Redshift, Postgres, …). Target: FastAPI (Pydantic v2 + SQLModel).
The extractor architecture is intentionally pluggable (BaseExtractor →
Dict[str, Schema] with canonical types), so additional sources/targets can be
added without touching the validator. Open an issue
to request one.
🛠️ Development & testing
git clone https://github.com/OGsiji/data-contract-validator
cd data-contract-validator
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" # or: pip install -e ".[test]"
# Run the suite
pytest
# Lint / format
black data_contract_validator tests
The test suite covers the canonical type system (tests/test_core/test_types.py),
the tiered dbt extractor including sqlglot CTE handling and catalog.json
(tests/test_extractors/test_dbt.py), and the confidence/mapping behavior of
the validator (tests/test_core/test_validator.py).
Adding an extractor
from data_contract_validator.extractors.base import BaseExtractor
from data_contract_validator.core.types import CanonicalType

class MyExtractor(BaseExtractor):
    def extract_schemas(self):
        # return Dict[str, Schema]; use self._make_column(...) so each column
        # carries a canonical_type the validator can compare.
        ...
🗺️ Roadmap
- Real compatibility semantics (nullability, additive vs. breaking changes)
- Reporter/logging abstraction (quiet/embeddable core)
- A canonical, language-neutral contract artifact + baseline/snapshot diffing
- More targets (Django, SQLAlchemy, GraphQL, OpenAPI)
License
MIT; see LICENSE.
Support
- Issues: https://github.com/OGsiji/data-contract-validator/issues
- Email: ogunniransiji@gmail.com
If this saves you a production incident, please ⭐ the repo.