Validate data contracts between dbt models and FastAPI/Pydantic APIs with accurate, low-false-positive schema checks

🛡️ Data Contract Validator

Catch breaking changes between your dbt models and your FastAPI/Pydantic APIs before they hit production.

🎯 What it solves

Your analytics team changes a dbt model. Your API team's FastAPI service still expects the old shape. Nobody notices until production 500s at 2 AM.

This tool sits on that boundary. It extracts the schema your dbt models produce and the schema your Pydantic models expect, compares them, and fails CI when the data side can no longer satisfy the API side.

   dbt models                 Data Contract Validator                FastAPI / Pydantic
(what the pipeline   ──▶   extract → normalize → compare   ◀──   (what the API expects)
    produces)                     ↓
                          critical issues block the build

Built for trust

A check that gates a deploy is only useful if it doesn't cry wolf. v1.1 re-architected extraction around that principle:

  • Canonical types: dbt varchar and Pydantic str are treated as the same type, so you don't get drowned in fake "type mismatch" warnings.
  • A real SQL parser (sqlglot) instead of regex: CTEs, || concatenation, window functions and quoted identifiers are parsed correctly.
  • Confidence-aware: if the tool can't fully resolve a model's columns (e.g. SELECT *), it warns rather than falsely blocking your build.
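
To make the canonical-type idea concrete, here is a minimal sketch of how a warehouse type and a Python annotation could be normalized to a shared vocabulary before they are compared. The `CANONICAL` table and the `canonicalize` and `types_compatible` helpers are hypothetical illustrations, not the package's actual API or its real mapping.

```python
import re

# Hypothetical lookup table: warehouse types and Python annotations collapse
# to a small shared vocabulary. This is NOT the package's real mapping.
CANONICAL = {
    "varchar": "string", "text": "string", "string": "string", "str": "string",
    "int": "integer", "integer": "integer", "bigint": "integer",
    "float": "float", "double": "float",
    "numeric": "decimal", "decimal": "decimal",
    "bool": "boolean", "boolean": "boolean",
    "timestamp": "timestamp", "datetime": "timestamp",
}


def canonicalize(raw_type: str) -> str:
    """Map a raw type name to a canonical one, or 'unknown' if unrecognized."""
    # Drop length/precision suffixes such as varchar(255) or numeric(10, 2).
    base = re.sub(r"\(.*\)", "", raw_type).strip().lower()
    return CANONICAL.get(base, "unknown")


def types_compatible(source_type: str, target_type: str) -> bool:
    """Compare canonical types; an unresolved type never counts as a mismatch."""
    a, b = canonicalize(source_type), canonicalize(target_type)
    return "unknown" in (a, b) or a == b
```

Treating an unrecognized type as compatible is a design choice in this sketch: it trades a possible missed mismatch for fewer false alarms, which matches the low-false-positive goal stated above.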

⚡ Quick start

pip install data-contract-validator
# Initialize config + CI workflow in your dbt project
contract-validator init --interactive

# Sanity-check the setup
contract-validator test

# Validate
contract-validator validate

🚀 Getting started, step by step

If you're setting this up on a project for the first time, the order below avoids the sharp edges:

  1. Install into the same environment dbt runs in (not a separate venv), because the tool needs to see your dbt project:

    pip install data-contract-validator
    

    Already have .retl-validator.yml committed by a teammate? Skip to step 5.

  2. Generate the config + CI workflow (one-time):

    contract-validator init --interactive
    

    You'll be asked where your dbt project is, which API framework you use, whether your models live in this local project or a different GitHub repo, and then the local path (or the org/repo + path within it). The location is asked explicitly rather than guessed from the path's shape: a local path like app/models is syntactically identical to a GitHub org/repo string, so there's no reliable way to infer which one you mean. If you pick GitHub, init checks that the path actually exists before writing the config, so a typo surfaces here instead of at validate time.

    init refuses to touch an existing .retl-validator.yml or workflow file, so it won't clobber hand-added mapping entries just because you upgraded the package and re-ran init. Pass --force if you really want to regenerate them from the new version's defaults.

  3. Pre-commit hook: init --interactive asks whether you want one set up right after creating the config and CI workflow. Say yes there and it's done. To add one later (or if you used non-interactive init, which doesn't prompt), run it standalone:

    contract-validator setup-precommit --install-hooks
    
  4. If the target repo is private, set a token before running anything that talks to GitHub locally:

    export GITHUB_TOKEN=$(gh auth token)   # or a PAT with repo read access
    

    See Private GitHub repos need GITHUB_TOKEN below for why this is easy to miss.

  5. Sanity-check the setup:

    contract-validator test
    

    Confirms the config parses, the dbt project is found, and the target (local path or GitHub path) is reachable. If this fails, validate will fail the same way, so fix it here first.

  6. Run it:

    contract-validator validate
    
  7. When it reports a critical issue, diagnose before assuming your dbt model is wrong:

    • Real missing column/table → fix the dbt model.
    • Target name doesn't match the dbt model by convention (renamed/prefixed) → add an entry under mapping.tables in .retl-validator.yml (see When do I need mapping?).
    • A table that's genuinely populated by something other than dbt (e.g. a separate streaming pipeline) and has no source model on purpose → add it to mapping.exclude. table=True alone is not used to infer this automatically; see FastAPI side for why.
  8. For accurate type-checking (not just column-presence checks), run dbt docs generate before validate so it picks up catalog.json (Tier 1, real warehouse types) instead of inferring from SQL text. See How extraction works below.

One-off validation (no config file)

# Local dbt project against a local Pydantic models file or directory
contract-validator validate \
  --dbt-project ./my-dbt-project \
  --fastapi-local ./my-api/app/models.py

# dbt project against models in another GitHub repo (microservices)
contract-validator validate \
  --dbt-project . \
  --fastapi-repo "my-org/my-api" \
  --fastapi-path "app/models.py"

🔍 How extraction works (and why it's accurate)

dbt side: tiered, best-source-wins

| Tier | Source | Types | Confidence | Notes |
|------|--------|-------|------------|-------|
| 1 | target/catalog.json | Real warehouse types | high | Produced by dbt docs generate. Most accurate. |
| 2 | sqlglot SQL parse | Inferred (often unknown) | medium | Trusted column names; enriched with documented types from manifest.json. Detects SELECT *. |
| 3 | regex parse | Guessed | low | Last resort. Never used to hard-fail a build. |

The tool auto-detects what's available and degrades gracefully, so it works offline in pre-commit and with full type fidelity in a warehouse-connected CI job.

💡 Tip: run dbt docs generate in CI before validating to unlock Tier 1 (real types). Without it, you still get accurate column-presence checks from Tier 2. The workflow that init generates already includes this step, commented out: it needs your warehouse adapter and credentials, which can't be guessed, so it isn't active by default.
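
The best-source-wins selection can be pictured with a short sketch. It assumes the tier is chosen from which artifacts exist on disk; `pick_tier` and its `sql_parser_available` parameter are hypothetical names, and only `target/catalog.json` and the `disable_manifest` option come from this README.

```python
from pathlib import Path


def pick_tier(project_path, disable_manifest=False, sql_parser_available=True):
    """Return (tier, confidence) for the best schema source available.

    Hypothetical helper: the real tool's detection logic may differ.
    """
    catalog = Path(project_path) / "target" / "catalog.json"
    if not disable_manifest and catalog.is_file():
        return 1, "high"    # real warehouse types from `dbt docs generate`
    if sql_parser_available:
        return 2, "medium"  # column names from a real SQL parse
    return 3, "low"         # regex fallback, never allowed to block a build
```

Returning the confidence alongside the tier lets a later comparison step decide whether a finding is strong enough to fail the build.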

FastAPI side

Pydantic / SQLModel classes are parsed from source with Python's ast (no imports executed). Optional[...] controls whether a field is required. An explicit __tablename__ is used as the table name when present; otherwise the class name is converted to snake_case.

table=True SQLModel classes are validated the same as any other class; they are not skipped. Whether a table is meant to come from dbt is business knowledge that isn't recoverable from the Python source: two structurally identical table=True classes can need opposite treatment (one is a normal dbt-fed table your API also returns directly; another is populated by a Kafka stream and was never meant to have a dbt model). Use mapping.exclude to state the latter case explicitly rather than relying on table=True to imply it.

🚦 What gets flagged

| Severity | Meaning | Example |
|----------|---------|---------|
| 🚨 Critical | Blocks the build | API requires a column the dbt model no longer produces |
| ⚠️ Warning | Worth a look, non-blocking | A real type mismatch, or a missing column on a model we couldn't fully resolve |

$ contract-validator validate

🛡️ Data Contract Validation Results:
Status: ❌ FAILED
Critical: 1 | Warnings: 0

🚨 Critical Issues (Must Fix):
  💥 user_analytics
     Column: total_orders
     Problem: Target REQUIRES column 'total_orders' but source doesn't provide it
     🔧 Fix: Add column 'total_orders' to source model for table 'user_analytics'
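
One way to read the severity rules is as a small decision function over a single target column. The sketch below is an interpretation of the table and the tier notes in this README, not the validator's actual code; `classify` and its parameters are hypothetical.

```python
from typing import Optional


def classify(required: bool, column_present: bool, confidence: str,
             fully_resolved: bool = True) -> Optional[str]:
    """Return 'critical', 'warning', or None for one target column.

    Hypothetical helper. `confidence` is 'high', 'medium' or 'low';
    `fully_resolved` is False when the source model used e.g. SELECT *.
    """
    if column_present:
        return None
    if not required:
        return "warning"  # missing optional column: worth a look only
    # A required column is missing. Block only when the source schema is
    # trustworthy: never on low-confidence or partially resolved models.
    if confidence == "low" or not fully_resolved:
        return "warning"
    return "critical"
```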

🔧 Configuration (.retl-validator.yml)

version: "1.0"
name: "my-project-contracts"

source:
  dbt:
    project_path: "."
    auto_compile: true
    # Force Tier 2/3 SQL parsing even if catalog/manifest exist:
    disable_manifest: false

target:
  fastapi:
    # GitHub repo:
    type: "github"
    repo: "my-org/my-api"
    path: "app/models.py"
    # ...or local:
    # type: "local"
    # path: "../my-api/app/models.py"

# Optional: explicit mapping for when names don't line up by convention.
mapping:
  tables:
    # target (Pydantic) table : source (dbt) model
    user_analytics: user_analytics_summary
  columns:
    user_analytics:
      # target column : source column
      userId: user_id
  # Target tables with no source model on purpose (e.g. Kafka-populated,
  # not dbt) -- see "When do I need mapping?" below.
  exclude:
    - feed_interaction

validation:
  fail_on: ["missing_tables", "missing_required_columns"]
  warn_on: ["type_mismatches", "missing_optional_columns"]

Private GitHub repos need GITHUB_TOKEN

If target.*.repo points at a private repository, contract-validator needs a token with read access to it. Where that token comes from differs locally vs. in CI, and the CI case has a sharp edge worth understanding before it silently fails on a PR.

Locally, set the GITHUB_TOKEN environment variable before running the CLI. On bash/zsh that's export (there's nothing to install; export just makes the variable visible to the contract-validator process you run next):

export GITHUB_TOKEN=$(gh auth token)   # or a PAT with repo read access
contract-validator validate

GitHub's API answers an unauthenticated request to a private path with 404 (not 403), so without a token this looks identical to a plain typo in path. contract-validator init --interactive and contract-validator test both check that target.*.path actually exists, and will point you at this if the lookup 404s with no token set.
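
The ambiguity of a 404 is easier to see as code. The following is a minimal sketch of the kind of hint a tool could derive from the HTTP status and the presence of a token; `explain_lookup_failure` is a hypothetical name and the messages are illustrative, not the CLI's actual output.

```python
import os


def explain_lookup_failure(status_code: int, env=os.environ) -> str:
    """Turn a GitHub API status code into an actionable hint (hypothetical)."""
    has_token = bool(env.get("GITHUB_TOKEN"))
    if status_code == 404 and not has_token:
        # Private repos look exactly like missing ones to anonymous callers.
        return "Not found and no GITHUB_TOKEN set: private repo, or a typo in repo/path."
    if status_code == 404:
        return "Not found: check repo/path, and that the token can read this repo."
    if status_code in (401, 403):
        return "Token rejected or rate-limited: check its validity and scopes."
    return f"Unexpected HTTP status {status_code}."
```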

In CI, the workflow that init generates for a GitHub target wires up GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}, a token you create, not the auto-provided secrets.GITHUB_TOKEN. The auto-provided token only has access to the repository the workflow runs in, so if your dbt repo and your API repo are different repos, it cannot read a private target, and the failure is silent. A PAT works identically for a public target, so there's no reason to default to the token that only sometimes works. To finish the setup the generated workflow expects:

  1. Create a token with read access to the target repo. A fine-grained PAT scoped to just that repo's Contents (read-only) is the least-privilege option; a classic PAT with the repo scope also works.

  2. In the repo running the workflow (your dbt repo): Settings → Secrets and variables → Actions → New repository secret. Name it API_REPO_TOKEN exactly (that's the name the generated workflow already references) and paste the token as the value.

    ⚠️ GitHub rejects any secret name starting with GITHUB_, which is a reserved prefix. You cannot create a secret literally called GITHUB_TOKEN; that's not a naming suggestion, the UI will refuse it. That's exactly why the workflow's secret is named API_REPO_TOKEN instead, even though the environment variable it feeds is GITHUB_TOKEN. They are two different things with confusingly similar names:

    env:
      GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}
    # ^^^^^^^^^^^^  environment variable name: GitHub does not restrict
    #               it, but the CLI looks for exactly GITHUB_TOKEN
    #                           ^^^^^^^^^^^^^^ the *secret's* name:
    #                           this is what GitHub restricts
    

Skip all of this for a local target: init omits the whole env: block, since a local target never talks to the GitHub API at all.

When do I need mapping?

Most of the time you don't. Names are matched automatically across:

  • snake_case / camelCase / casing: UserAnalytics → user_analytics, userId → user_id
  • plural ↔ singular: dbt's plural users matches Pydantic's User (→ user) with no config, and it won't over-match (address is never confused with addres).

Reach for mapping.tables / mapping.columns only when a model or column is named so differently that convention can't bridge it (e.g. Pydantic user_id ↔ dbt customer_identifier).

mapping.exclude is different. It's not about renamed models; it's for a target table that has no source model on purpose, because it's populated by something other than dbt (a Kafka stream, a cron job, etc.). This can't be inferred from the code (a table=True SQLModel class looks identical whether or not dbt is supposed to feed it), so it has to be a deliberate, human-stated exception:

mapping:
  exclude:
    - feed_interaction
    - affiliate_reward

Anything not listed is validated normally, including table=True classes, which are treated the same as any other target and are not silently skipped.
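
The matching order described in this section (exclusions, then explicit mapping, then naming convention) can be sketched as follows. The helpers `normalize`, `singular` and `resolve_source` and the `EXCLUDED` sentinel are hypothetical, and the singularization rule is intentionally crude; the package's own matcher may behave differently.

```python
import re
from typing import Optional

EXCLUDED = "<excluded>"  # sentinel: target is intentionally not fed by dbt


def normalize(name: str) -> str:
    """UserAnalytics / userId -> user_analytics / user_id."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def singular(name: str) -> str:
    """Crude rule: strip one trailing 's', but leave words ending in 'ss'."""
    return name[:-1] if name.endswith("s") and not name.endswith("ss") else name


def resolve_source(target: str, sources: set, mapping: dict) -> Optional[str]:
    """Find the dbt model that should feed a target table, if any."""
    t = normalize(target)
    if t in mapping.get("exclude", []):
        return EXCLUDED
    explicit = mapping.get("tables", {}).get(t)
    if explicit:
        return explicit if explicit in sources else None
    for s in sorted(sources):
        if normalize(s) == t or singular(normalize(s)) == singular(t):
            return s
    return None
```

Checking exclusions first keeps the exception explicit: a table listed under exclude is never matched by accident, even if a similarly named dbt model exists.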

🐍 Python API

from data_contract_validator import ContractValidator, DBTExtractor, FastAPIExtractor

dbt = DBTExtractor(project_path="./dbt-project")
fastapi = FastAPIExtractor.from_github_repo("my-org/my-api", "app/models.py")

validator = ContractValidator(
    source_extractor=dbt,
    target_extractor=fastapi,
    mapping={"tables": {"user_analytics": "user_analytics_summary"}},  # optional
)
result = validator.validate()

if not result.success:
    for issue in result.critical_issues:
        print(f"💥 {issue.table}.{issue.column}: {issue.message}")

🪝 CI / pre-commit integration

GitHub Actions

contract-validator init generates a workflow for you. Minimal version:

name: 🛡️ Data Contract Validation
on:
  pull_request:
    paths: ["models/**/*.sql", "dbt_project.yml", "**/*models*.py"]
jobs:
  validate-contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with: { python-version: "3.11" }
      - run: pip install data-contract-validator
      # Optional: `dbt docs generate` here for real warehouse types (Tier 1)
      - run: contract-validator validate --output github
        env:
          GITHUB_TOKEN: ${{ secrets.API_REPO_TOKEN }}

GITHUB_TOKEN here is only needed if target is a github repo (init omits the whole env: block for a local target). secrets.API_REPO_TOKEN is a token you create yourself, not GitHub's auto-provided secrets.GITHUB_TOKEN. See Private GitHub repos need GITHUB_TOKEN above for why, and how to set it up.

Pre-commit

contract-validator setup-precommit --install-hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/OGsiji/data-contract-validator
    rev: v1.1.0
    hooks:
      - id: contract-validation

🧪 Output formats

contract-validator validate --output terminal   # human-friendly (default)
contract-validator validate --output json        # machine-readable for CI
contract-validator validate --output github       # GitHub Actions annotations

🚀 Supported frameworks

Source: dbt (all adapters: Snowflake, BigQuery, Redshift, Postgres, …). Target: FastAPI (Pydantic v2 + SQLModel).

The extractor architecture is intentionally pluggable (BaseExtractor → Dict[str, Schema] with canonical types), so additional sources/targets can be added without touching the validator. Open an issue to request one.

🛠️ Development & testing

git clone https://github.com/OGsiji/data-contract-validator
cd data-contract-validator

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"     # or: pip install -e ".[test]"

# Run the suite
pytest

# Lint / format
black data_contract_validator tests

The test suite covers the canonical type system (tests/test_core/test_types.py), the tiered dbt extractor including sqlglot CTE handling and catalog.json (tests/test_extractors/test_dbt.py), and the confidence/mapping behavior of the validator (tests/test_core/test_validator.py).

Adding an extractor

from data_contract_validator.extractors.base import BaseExtractor
from data_contract_validator.core.types import CanonicalType

class MyExtractor(BaseExtractor):
    def extract_schemas(self):
        # return Dict[str, Schema]; use self._make_column(...) so each column
        # carries a canonical_type the validator can compare.
        ...

🗺️ Roadmap

  • Real compatibility semantics (nullability, additive vs. breaking changes)
  • Reporter/logging abstraction (quiet/embeddable core)
  • A canonical, language-neutral contract artifact + baseline/snapshot diffing
  • More targets (Django, SQLAlchemy, GraphQL, OpenAPI)

📄 License

MIT, see LICENSE.

🆘 Support

If this saves you a production incident, please ⭐ the repo.
