
Define data contracts as YAML, validate pandas/polars/arrow DataFrames, and gate CI on breaking schema changes


datalasi

Schema contracts for your data pipelines — defined as code, enforced at runtime.


The problem

Data pipelines break silently. A column gets renamed. A type changes from string to integer. An enum gains a new value that downstream code doesn't handle. By the time anyone notices, bad data has already flowed into dashboards, ML features, or production databases.

The standard fixes all have gaps:

| Approach | What it misses |
|---|---|
| Unit tests | Schema drift between services and data sources |
| dbt tests | Run after data lands in the warehouse (too late) |
| Great Expectations | Heavyweight: server, UI, significant setup |
| Informal docs | Forgotten the moment they're written |

datalasi takes a different approach: treat data schemas like code. Define them as versioned YAML files, commit them to Git, validate DataFrames against them inline, and gate your CI pipeline on breaking changes.


How it works

1. Define your schema as a contract

# contracts/transactions-v1.0.0.yaml
name: transactions
version: 1.0.0
owner: data-eng@company.com

schema:
  transaction_id:
    type: Int64
    nullable: false
    pk: true

  amount:
    type: Float64
    nullable: false
    min: 0.01

  status:
    type: Enum
    allowed_values: [PENDING, COMPLETED, FAILED]
    nullable: false

expectations:
  - "amount > 0"
  - column: amount
    rule: gt
    value: 0
    description: "Amount must be positive"
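
To make the contract semantics concrete, here is a minimal, library-free sketch of what checking rows against the schema above could involve. It is not datalasi's implementation: the `SCHEMA` dict and the `check_row` helper are hypothetical names that simply mirror the YAML keys (`type`, `nullable`, `min`, `allowed_values`).

```python
# Illustrative only: a plain-Python mirror of the YAML contract above.
# SCHEMA and check_row are hypothetical names, not part of datalasi.
SCHEMA = {
    "transaction_id": {"type": int, "nullable": False},
    "amount": {"type": float, "nullable": False, "min": 0.01},
    "status": {
        "type": str,
        "nullable": False,
        "allowed_values": ["PENDING", "COMPLETED", "FAILED"],
    },
}


def check_row(row: dict) -> list:
    """Return a list of human-readable violations for one row."""
    violations = []
    for column, spec in SCHEMA.items():
        if column not in row:
            violations.append(f"{column}: missing column")
            continue
        value = row[column]
        if value is None:
            if not spec["nullable"]:
                violations.append(f"{column}: null not allowed")
            continue
        if not isinstance(value, spec["type"]):
            violations.append(f"{column}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            violations.append(f"{column}: {value} is below min {spec['min']}")
        if "allowed_values" in spec and value not in spec["allowed_values"]:
            violations.append(f"{column}: {value!r} not in allowed values")
    return violations


good = {"transaction_id": 1, "amount": 19.99, "status": "COMPLETED"}
bad = {"transaction_id": 2, "amount": 0.0, "status": "REFUNDED"}

print(check_row(good))
print(check_row(bad))
```

The point of the sketch is that every rule lives in data rather than in scattered `if` statements, which is what lets a contract be versioned and diffed like any other file.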

2. Validate your DataFrame

from datalasi import DataContract

contract = DataContract.load("contracts/transactions-v1.0.0.yaml")
result = contract.validate(df)   # pandas, polars, or pyarrow

if not result.success:
    print(result)   # schema violations, expectation failures, row counts

3. Gate CI on breaking changes

# In your CI pipeline — exits 1 if this version breaks consumers
datalasi check contracts/ transactions

4. Detect what changed

datalasi diff contracts/ transactions 1.0.0 1.1.0
# transactions 1.0.0 → 1.1.0: 1 breaking change(s), 1 non-breaking change(s)
# Breaking changes:
#   ✗ Column 'amount' changed from nullable to non-nullable
# Non-breaking changes:
#   + Column 'currency' added
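
The classification shown in that output can be sketched in plain Python. This is an illustration of the idea, not datalasi's diff algorithm: `diff_schemas` is a hypothetical helper, and treating removed columns and type changes as breaking is an assumption of this sketch. Only the nullable-to-non-nullable and column-added cases come from the output above.

```python
def diff_schemas(old: dict, new: dict):
    """Compare two {column: {"type": str, "nullable": bool}} mappings.

    Returns (breaking, non_breaking) lists of messages.
    Hypothetical helper; the rules below are assumptions of this sketch.
    """
    breaking, non_breaking = [], []
    for column, old_spec in old.items():
        if column not in new:
            breaking.append(f"Column '{column}' removed")
            continue
        new_spec = new[column]
        if old_spec["type"] != new_spec["type"]:
            breaking.append(
                f"Column '{column}' changed type "
                f"{old_spec['type']} -> {new_spec['type']}"
            )
        if old_spec["nullable"] and not new_spec["nullable"]:
            breaking.append(
                f"Column '{column}' changed from nullable to non-nullable"
            )
    for column in new:
        if column not in old:
            non_breaking.append(f"Column '{column}' added")
    return breaking, non_breaking


v1_0 = {
    "transaction_id": {"type": "Int64", "nullable": False},
    "amount": {"type": "Float64", "nullable": True},
}
v1_1 = {
    "transaction_id": {"type": "Int64", "nullable": False},
    "amount": {"type": "Float64", "nullable": False},
    "currency": {"type": "String", "nullable": True},
}

breaking, non_breaking = diff_schemas(v1_0, v1_1)
exit_code = 1 if breaking else 0  # the same idea a CI gate relies on
print(breaking, non_breaking, exit_code)
```

A CI gate then reduces to a non-zero exit status whenever the breaking list is non-empty.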

Installation

pip install datalasi                    # core only
pip install "datalasi[pandas]"          # + pandas validation
pip install "datalasi[polars]"          # + polars validation
pip install "datalasi[arrow]"           # + pyarrow validation
pip install "datalasi[sql]"             # + sqlalchemy validation
pip install "datalasi[git]"             # + git-backed versioning
pip install "datalasi[all]"             # everything

Features

DataFrame validation — pandas, polars, pyarrow, SQLAlchemy

from datalasi import DataContract
from datalasi.adapters.pandas_adapter import PandasAdapter
from datalasi.adapters.polars_adapter import PolarsAdapter
from datalasi.adapters.arrow_adapter import ArrowAdapter
from datalasi.adapters.sqlalchemy_adapter import SQLAlchemyAdapter
from sqlalchemy import text

# The three DataFrame adapters share the same call signature
result = PandasAdapter.validate(pandas_df, contract)
result = PolarsAdapter.validate(polars_df, contract)
result = ArrowAdapter.validate(arrow_table, contract)

# SQLAlchemy: pass a CursorResult or Table (engine is an existing sqlalchemy Engine)
with engine.connect() as conn:
    result = SQLAlchemyAdapter.validate(
        conn.execute(text("SELECT * FROM orders")), contract
    )

Coercion mode

# Attempt to cast columns to their declared types before validation
result = contract.validate(df, coerce=True)
print(result.coercions_applied)
# ["amount: object → Float64", "quantity: float64 → Int64"]
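
As a rough mental model of coercion, the sketch below casts each column to its declared type and records which casts were applied. It is a simplified, assumption-laden illustration in plain Python: `cast_value` and `coerce_columns` are hypothetical helpers, and refusing lossy float-to-int casts is a design choice of this sketch, not documented datalasi behavior.

```python
def cast_value(value, target):
    """Cast one value, refusing float -> int casts that would drop a fraction."""
    if target is int and isinstance(value, float) and not value.is_integer():
        raise ValueError(f"cannot cast {value} to int without losing data")
    return target(value)


def coerce_columns(columns: dict, targets: dict):
    """Cast each column's non-null values and log every source -> target cast."""
    coerced, applied = {}, []
    for name, values in columns.items():
        target = targets[name]
        coerced[name] = [
            None if v is None else cast_value(v, target) for v in values
        ]
        # Record each distinct source type that actually needed a cast.
        source_types = {
            type(v).__name__
            for v in values
            if v is not None and type(v) is not target
        }
        for source in sorted(source_types):
            applied.append(f"{name}: {source} -> {target.__name__}")
    return coerced, applied


columns = {"amount": ["1.50", "20", None], "quantity": [3.0, 4.0]}
targets = {"amount": float, "quantity": int}

coerced, applied = coerce_columns(columns, targets)
print(coerced)
print(applied)
```

Keeping a log of applied casts matters because silent coercion can hide upstream problems; the log makes each cast visible to whoever reviews the validation result.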

Structured expectations DSL

from datalasi import DataContract, ExpectationRule

contract = DataContract(
    name="orders",
    version="1.0.0",
    schema={...},
    expectations=[
        "amount > 0",                                               # plain string
        ExpectationRule("status", "in", ["OPEN", "CLOSED"]),        # structured
        ExpectationRule("email", "regex", r".*@.*\..*",
                        description="Valid email format",
                        severity="WARNING"),                         # warning-only
    ],
)
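
One way to picture how structured rules and severities interact is the standalone sketch below. It does not use datalasi: `Rule`, `OPS`, and `evaluate` are hypothetical names, and matching the regex against the whole string is an assumption of this sketch.

```python
import re
from dataclasses import dataclass
from typing import Any


@dataclass
class Rule:
    """Hypothetical stand-in for a structured expectation."""
    column: str
    op: str  # one of "gt", "in", "regex"
    value: Any
    severity: str = "ERROR"


# Each operator takes (cell, rule_value) and returns True when the rule holds.
OPS = {
    "gt": lambda cell, value: cell > value,
    "in": lambda cell, value: cell in value,
    "regex": lambda cell, value: re.fullmatch(value, cell) is not None,
}


def evaluate(rules, rows):
    """Return (errors, warnings) as lists of (row_index, column, op)."""
    errors, warnings = [], []
    for index, row in enumerate(rows):
        for rule in rules:
            if not OPS[rule.op](row[rule.column], rule.value):
                bucket = warnings if rule.severity == "WARNING" else errors
                bucket.append((index, rule.column, rule.op))
    return errors, warnings


rules = [
    Rule("amount", "gt", 0),
    Rule("status", "in", ["OPEN", "CLOSED"]),
    Rule("email", "regex", r".*@.*\..*", severity="WARNING"),
]
rows = [
    {"amount": 10.0, "status": "OPEN", "email": "a@example.com"},
    {"amount": -5.0, "status": "OPEN", "email": "not-an-email"},
]

errors, warnings = evaluate(rules, rows)
print(errors, warnings)
```

Separating warnings from errors lets a pipeline fail hard on some rules while only reporting on others.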

Contract inheritance

# child contract inherits all fields from parent and adds its own
express_contract = DataContract(
    name="orders_express",
    version="1.0.0",
    extends="orders",               # inherits orders schema + expectations
    schema={
        "delivery_date": Field("delivery_date", Date(), nullable=False),
    },
)

# Resolve merged schema using a registry
from datalasi.io import ContractRegistry
registry = ContractRegistry("contracts/")
merged = express_contract.resolve(registry)
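
Conceptually, resolving an inherited contract means walking the `extends` chain and merging schemas from the root ancestor down. The sketch below shows that idea with plain dicts. It is not datalasi's `resolve` implementation: the `resolve` function here is a hypothetical helper, and letting child fields override parent fields on a name clash is an assumption.

```python
def resolve(child: dict, contracts: dict) -> dict:
    """Merge a contract with its ancestors; later (child) fields win on conflict."""
    chain, seen = [child], set()
    current = child
    while current.get("extends"):
        parent_name = current["extends"]
        if parent_name in seen:
            raise ValueError(f"inheritance cycle at {parent_name!r}")
        seen.add(parent_name)
        current = contracts[parent_name]
        chain.append(current)

    merged_schema, merged_expectations = {}, []
    for contract in reversed(chain):  # root ancestor first, child last
        merged_schema.update(contract.get("schema", {}))
        merged_expectations.extend(contract.get("expectations", []))
    return {
        "name": child["name"],
        "schema": merged_schema,
        "expectations": merged_expectations,
    }


contracts = {
    "orders": {
        "name": "orders",
        "schema": {
            "order_id": {"type": "Int64"},
            "amount": {"type": "Float64"},
        },
        "expectations": ["amount > 0"],
    }
}
express = {
    "name": "orders_express",
    "extends": "orders",
    "schema": {"delivery_date": {"type": "Date"}},
}

merged = resolve(express, contracts)
print(list(merged["schema"]))
```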

Schema export

# JSON Schema (draft-07) — for APIs, form validators, documentation
js = contract.to_json_schema()

# Apache Avro — for Kafka, data lake ingestion
avro = contract.to_avro_schema()
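
For a sense of what a JSON Schema export involves, the sketch below maps the contract types onto draft-07 keywords. The actual output of `to_json_schema()` may differ. The `TYPE_MAP` table, the standalone `to_json_schema` function, and the choice to list non-nullable columns under `required` are all assumptions of this illustration.

```python
import json

# Assumed mapping from contract types to JSON Schema draft-07 fragments.
TYPE_MAP = {
    "Int64": {"type": "integer"},
    "Int32": {"type": "integer"},
    "Float64": {"type": "number"},
    "String": {"type": "string"},
    "Boolean": {"type": "boolean"},
    "Date": {"type": "string", "format": "date"},
    "Timestamp": {"type": "string", "format": "date-time"},
}


def to_json_schema(name: str, schema: dict) -> dict:
    """Build a draft-07 object schema from a {column: spec} mapping."""
    properties, required = {}, []
    for column, spec in schema.items():
        if spec["type"] == "Enum":
            prop = {"enum": list(spec["allowed_values"])}
        else:
            prop = dict(TYPE_MAP[spec["type"]])
        if "min" in spec:
            prop["minimum"] = spec["min"]
        if "max" in spec:
            prop["maximum"] = spec["max"]
        if spec.get("nullable", True):
            # Nullable columns also accept JSON null.
            if "enum" in prop:
                prop["enum"].append(None)
            else:
                prop["type"] = [prop["type"], "null"]
        else:
            required.append(column)
        properties[column] = prop
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": name,
        "type": "object",
        "properties": properties,
        "required": required,
    }


transactions = {
    "transaction_id": {"type": "Int64", "nullable": False},
    "amount": {"type": "Float64", "nullable": False, "min": 0.01},
    "status": {
        "type": "Enum",
        "allowed_values": ["PENDING", "COMPLETED", "FAILED"],
        "nullable": False,
    },
}

doc = to_json_schema("transactions", transactions)
print(json.dumps(doc, indent=2))
```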

Git-backed versioning

from datalasi.io import GitBackend

backend = GitBackend(".")
sha = backend.commit_contract(contract, "contracts/orders-v1.1.0.yaml")

# query history
for entry in backend.history("contracts/orders-v1.1.0.yaml"):
    print(entry["sha"], entry["date"], entry["message"])

# load a contract as it was at a specific commit
past_contract = backend.get_at_commit("contracts/orders-v1.1.0.yaml", sha)

VS Code YAML autocomplete

Add this to your workspace .vscode/settings.json (the yaml.schemas setting is provided by the Red Hat YAML extension for VS Code):

{
  "yaml.schemas": {
    "./datalasi-schema.json": ["contracts/**/*.yaml"]
  }
}

Contract registry

from datalasi.io import ContractRegistry

registry = ContractRegistry("contracts/")
contract = registry.get("transactions")            # latest version
v1 = registry.get("transactions", version="1.0.0")
diff = registry.diff("transactions", "1.0.0", "1.1.0")

print(diff.has_breaking_changes)  # True / False
print(diff.breaking_changes)      # ["Column 'amount' changed nullable→non-nullable"]
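
Picking the "latest version" is easy to get wrong if versions are compared as strings, since "1.10.0" sorts before "1.9.2" lexically. The sketch below compares versions numerically. It assumes contract files follow the `name-vX.Y.Z.yaml` naming used in the examples above; `latest` and `parse_version` are hypothetical helpers, and datalasi may instead read the name and version from the file contents.

```python
import re

# Assumed filename convention: <name>-v<major>.<minor>.<patch>.yaml
FILENAME = re.compile(r"^(?P<name>.+)-v(?P<version>\d+\.\d+\.\d+)\.ya?ml$")


def parse_version(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest(filenames, name: str):
    """Return the highest version string for `name`, or None if absent."""
    versions = []
    for filename in filenames:
        match = FILENAME.match(filename)
        if match and match.group("name") == name:
            versions.append(match.group("version"))
    if not versions:
        return None
    return max(versions, key=parse_version)


files = [
    "transactions-v1.0.0.yaml",
    "transactions-v1.10.0.yaml",
    "transactions-v1.9.2.yaml",
    "orders-v2.0.0.yaml",
]

print(latest(files, "transactions"))
print(latest(files, "customers"))
```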

CLI

# Interactively create a contract
datalasi init

# Validate a data file against a contract
datalasi validate contracts/transactions-v1.0.0.yaml data/tx.csv

# Infer a contract from a data file
datalasi infer data/transactions.parquet --name transactions --output contracts/tx-v1.0.0.yaml

# List all contracts in a registry
datalasi list --registry contracts/

# Diff two versions
datalasi diff contracts/ transactions 1.0.0 1.1.0

# CI gate — exit 1 if latest version has breaking changes vs predecessor
datalasi check contracts/ transactions

Supported types

| Type | Description | Constraints |
|---|---|---|
| Int64 | 64-bit integer | min, max |
| Int32 | 32-bit integer | min, max |
| Float64 | 64-bit float | min, max |
| String | Text | max_length, pattern |
| Boolean | True/False | |
| Date | YYYY-MM-DD | |
| Timestamp | ISO datetime | timezone |
| Enum | Fixed value set | allowed_values |

Development

git clone https://github.com/Malodeity/datalasi
cd datalasi
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pandas,polars,arrow,sql,git]"

pytest tests/ -v --cov=datalasi
ruff check datalasi tests
black datalasi tests

License

Apache 2.0 — see LICENSE.
