
Delphi

Probabilistic data test framework for Databricks. Test terabyte-scale Delta Lake tables in seconds using statistical sampling and confidence intervals instead of exhaustive scans.

from delphi import datatest, col
from delphi import functions as F

@datatest("catalog.schema.revenue")
def test_revenue_quality(dt):
    dt.expect(col("revenue").null_rate < 0.01)
    dt.expect(col("revenue").mean.between(1000, 5000), confidence=0.99)
    dt.expect(col("customer_id").uniqueness > 0.99)
    dt.expect(F.row_count() > 1_000_000)

Why Delphi?

Full row-level scans are infeasible on large Delta tables. Delphi samples intelligently and uses statistical confidence intervals to determine pass/fail, giving you fast, reliable data quality checks with quantified uncertainty.

  • Fast -- Adaptive sampling reads thousands of rows, not billions
  • Statistically rigorous -- Wilson, t-distribution, and bootstrap confidence intervals
  • PySpark-native -- col(), operator overloading, and functions as F feel like PySpark
  • Two-layer API -- Python DSL for engineers, YAML for analysts
  • Multi-runtime -- Terminal, notebook, CI/CD (JSON + JUnit XML), and agentic output
  • Databricks-first -- Free pre-scan via Delta file stats, Unity Catalog native

Install

pip install dbx-delphi

Or with uv:

uv add dbx-delphi

Requires Python 3.10+ and a Databricks workspace with Unity Catalog.

Quick Start

1. Configure connection

delphi setup

This walks you through connecting to your Databricks workspace. Alternatively, set environment variables:

export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...

2. Write a test

# tests/test_revenue.py
from delphi import datatest, col
from delphi import functions as F

@datatest("catalog.schema.revenue")
def test_nulls(dt):
    dt.expect(col("revenue").null_rate < 0.01)

@datatest("catalog.schema.revenue")
def test_distribution(dt):
    dt.expect(col("revenue").mean.between(1000, 5000), confidence=0.99)
    dt.expect(col("revenue").stddev < 2000)
    dt.expect(F.row_count() > 100_000)

3. Run

delphi run tests/

DSL Reference

Column Metrics

Use col("name") to start a column expression, then chain a metric:

from delphi import col

col("revenue").null_rate < 0.01       # Null rate below 1%
col("revenue").mean.between(100, 500) # Mean within range
col("revenue").min > 0                # Minimum above 0
col("revenue").max < 1_000_000        # Maximum below 1M
col("revenue").stddev < 100           # Standard deviation below 100
col("id").uniqueness > 0.99           # 99%+ distinct values

Available metrics: null_rate, uniqueness, mean, min, max, stddev
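Expressions like col("revenue").null_rate < 0.01 build a deferred check rather than evaluating immediately: the metric accessor returns an object whose comparison operators are overloaded. A minimal sketch of the idea (illustrative only; the Metric, Expectation, and Column classes here are assumptions, not Delphi's internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expectation:
    """A deferred check: column, metric, operator, and threshold."""
    column: str
    metric: str
    op: str
    threshold: float

@dataclass(frozen=True)
class Metric:
    column: str
    name: str

    # Comparing a metric builds an Expectation instead of returning a bool.
    def __lt__(self, threshold):
        return Expectation(self.column, self.name, "<", threshold)

    def __gt__(self, threshold):
        return Expectation(self.column, self.name, ">", threshold)

class Column:
    def __init__(self, name):
        self.name = name

    @property
    def null_rate(self):
        return Metric(self.name, "null_rate")

    @property
    def uniqueness(self):
        return Metric(self.name, "uniqueness")

def col(name):
    return Column(name)

# The expression produces a data structure the runner can evaluate later:
e = col("revenue").null_rate < 0.01
print(e)  # Expectation(column='revenue', metric='null_rate', op='<', threshold=0.01)
```

Deferring evaluation this way is what lets the framework batch all expectations on a table against one shared sample.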

Dataset-Level Functions

from delphi import functions as F

F.row_count() > 1_000_000                       # Minimum row count
F.approx_percentile("revenue", 0.95) < 10_000   # 95th percentile cap

Confidence Levels

Every expectation defaults to 95% confidence. Override per-expectation:

dt.expect(col("revenue").null_rate < 0.01)                  # 95% (default)
dt.expect(col("revenue").mean.between(100, 500), confidence=0.99)  # 99%

A test passes only when the entire confidence interval satisfies the threshold. This is conservative -- if the CI straddles the threshold, the test fails.
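The conservative decision rule can be illustrated with the Wilson score interval Delphi uses for rates (a sketch of the statistics, not Delphi's actual code):

```python
import math

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    # Two-sided z for the confidence level (1.96 for 95%, 2.576 for 99%).
    z = {0.95: 1.959964, 0.99: 2.575829}[confidence]
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

def passes(successes, n, threshold, confidence=0.95):
    """Pass only if the ENTIRE interval sits below the threshold."""
    lo, hi = wilson_interval(successes, n, confidence)
    return hi < threshold

# 32 nulls observed in 1,000 sampled rows:
lo, hi = wilson_interval(32, 1000)   # roughly (0.023, 0.045)
print(passes(32, 1000, 0.05))        # True: the whole CI is below 0.05
print(passes(32, 1000, 0.035))       # False: the CI straddles 0.035
```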

Time Column for Sampling

Delphi auto-detects the time column for stratified sampling (partition keys > clustering keys > well-known names like date, timestamp, created_at). When your table has multiple date/timestamp columns and auto-detection is ambiguous, set it explicitly:

Per-test (decorator):

@datatest("catalog.schema.events", time_column="event_date")
def test_events(dt):
    dt.expect(col("status").null_rate < 0.01)

In delphi.toml (global):

[delphi]
time_column = "event_date"

CLI (per-run):

delphi run tests/ --time-column event_date
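The detection priority described above (partition keys, then clustering keys, then well-known names) can be sketched as follows. This is a hypothetical helper to show the order; Delphi's real detection logic is not shown in this README:

```python
# Illustrative name list -- an assumption, not Delphi's exact set.
WELL_KNOWN = ("date", "timestamp", "created_at", "event_date", "event_time")

def detect_time_column(partition_cols, clustering_cols, all_cols):
    """Return the time column by priority, or None when ambiguous/absent."""
    if partition_cols:
        return partition_cols[0]        # highest priority: partition keys
    if clustering_cols:
        return clustering_cols[0]       # next: clustering keys
    named = [c for c in all_cols if c.lower() in WELL_KNOWN]
    # Multiple candidates means auto-detection is ambiguous: set it explicitly.
    return named[0] if len(named) == 1 else None

print(detect_time_column(["event_date"], [], ["event_date", "revenue"]))  # event_date
print(detect_time_column([], [], ["id", "created_at"]))                   # created_at
print(detect_time_column([], [], ["date", "timestamp", "x"]))             # None
```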

YAML Checks

For analysts who prefer configuration over code:

# checks/revenue.yaml
table: catalog.schema.revenue
time_column: event_date  # optional: explicit time column for sampling
checks:
  - column: revenue
    null_rate: "< 0.01"
  - column: revenue
    mean: "between 1000 and 5000"
  - column: customer_id
    uniqueness: "> 0.99"

Confidence defaults to 0.95 in YAML. Override per-check:

  - column: revenue
    mean: "between 1000 and 5000"
    confidence: 0.99

Run YAML checks:

delphi run checks/revenue.yaml
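Threshold strings like "< 0.01" and "between 1000 and 5000" follow a small grammar; a hedged sketch of parsing them into structured form (illustrative, not Delphi's parser):

```python
import re

def parse_threshold(expr):
    """Parse '< 0.01', '> 0.99', or 'between 1000 and 5000' into a tuple."""
    expr = expr.strip()
    m = re.fullmatch(r"(<=?|>=?)\s*([\d_.]+)", expr)
    if m:
        return (m.group(1), float(m.group(2)))
    m = re.fullmatch(r"between\s+([\d_.]+)\s+and\s+([\d_.]+)", expr)
    if m:
        return ("between", float(m.group(1)), float(m.group(2)))
    raise ValueError(f"Unrecognized threshold: {expr!r}")

print(parse_threshold("< 0.01"))                 # ('<', 0.01)
print(parse_threshold("between 1000 and 5000"))  # ('between', 1000.0, 5000.0)
```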

Dataset Comparison

Compare a table against a reference:

from delphi import datatest, col, compare
from delphi import functions as F

@datatest("catalog.schema.output")
def test_matches_expected(dt):
    expected = compare("catalog.schema.expected")
    dt.expect(col("revenue").mean_diff(expected) < 0.05)
    dt.expect(F.row_count_ratio(expected).between(0.99, 1.01))

CLI

delphi setup                          # Interactive connection setup
delphi setup --verify                 # Test current connection
delphi setup --profile staging        # Configure a named profile

delphi run tests/                     # Run all tests in directory
delphi run tests/test_revenue.py      # Run specific file
delphi run checks/revenue.yaml        # Run YAML checks
delphi run tests/ --profile staging   # Use named profile
delphi run tests/ --output json       # JSON output
delphi run tests/ --confidence 0.99   # Override confidence
delphi run tests/ --sample-ceiling 200000
delphi run tests/ --evidence-rows 20  # More evidence rows
delphi run tests/ --no-evidence       # Suppress evidence
delphi run tests/ --time-column event_date  # Explicit time column

delphi inspect catalog.schema.table   # Table profile (no sampling)

delphi --version

Configuration

Create delphi.toml in your project root (or use delphi setup):

[delphi]
default_confidence = 0.95
sample_floor = 1000
sample_ceiling = 100000
evidence_rows = 10
redact_columns = ["ssn", "email"]
connection_retries = 3
connection_timeout = 300
time_column = "event_date"  # optional: explicit time column for sampling

# Serverless (recommended)
[delphi.connection]
host = "https://your-workspace.cloud.databricks.com"
serverless = true
auth_type = "env"
default_catalog = "main"
default_schema = "default"

# Classic cluster (alternative)
# [delphi.connection]
# host = "https://your-workspace.cloud.databricks.com"
# cluster_id = "0123-456789-abcdef"
# auth_type = "env"

Named Profiles

[delphi.connection.profiles.staging]
host = "https://staging.cloud.databricks.com"
serverless = true
auth_type = "env"

Authentication

Method                         auth_type   How
Environment variables          env         DATABRICKS_HOST + DATABRICKS_TOKEN
Personal Access Token          pat         Token stored in delphi.toml
OAuth (U2M)                    oauth       Browser-based flow
Databricks SDK unified auth    (any)       Auto-discovers from env, ~/.databrickscfg, or cloud identity

How It Works

Delphi runs a five-stage pipeline for each test:

Table ref --> Pre-scan --> Sample --> Metrics --> Confidence --> Result

  1. Pre-scan -- Reads Delta file stats (DESCRIBE DETAIL) at no scan cost: column-level null counts, min/max, and row count. Short-circuits trivially passing checks without scanning a single row.

  2. Adaptive Sampling -- Computes the minimum sample size needed for the desired confidence and margin of error. Floors at 1,000 rows, caps at 100,000. For timeseries tables, auto-detects the time column and applies stratified sampling.

  3. Metric Computation -- Runs PySpark aggregations on the sampled DataFrame. Multiple expectations on the same table share one sample.

  4. Confidence Intervals -- Routes each metric to the appropriate statistical method:

    Metric type                      Method
    Rates (null_rate, uniqueness)    Wilson score interval
    Means                            t-distribution
    Distributions, percentiles       Bootstrap (B=1000)
    Row count, min, max              Exact (no CI needed)

  5. Evidence -- On failure, collects up to 10 violating rows from the already-sampled data (no extra scan). Sensitive columns can be redacted.
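Step 2's minimum sample size follows the standard formula for estimating a proportion, n = z² · p(1-p) / e², clamped to the configured floor and ceiling. A sketch under those assumptions (Delphi's internals may differ):

```python
import math

def required_sample_size(confidence=0.95, margin=0.01, p=0.5,
                         floor=1_000, ceiling=100_000):
    """Minimum n to estimate a proportion within +/-margin, then clamped.

    p=0.5 is the worst case (it maximizes p*(1-p)), so it is a safe default
    when the true rate is unknown.
    """
    z = {0.95: 1.959964, 0.99: 2.575829}[confidence]
    n = math.ceil(z**2 * p * (1 - p) / margin**2)
    return max(floor, min(n, ceiling))

print(required_sample_size())                 # 9604 rows at 95%, +/-1%
print(required_sample_size(confidence=0.99))  # 16588 rows at 99%
print(required_sample_size(margin=0.05))      # 1000 (clamped to the floor)
```

This is why reading thousands of rows suffices: the required n depends on the confidence level and margin of error, not on the table's size.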

Output Formats

Delphi auto-detects your environment:

Environment     Renderer               Details
Terminal        rich                   Color tables, confidence bars
CI/CD           JSON + JUnit XML       delphi-results.xml for GitHub Actions, Jenkins
Notebook        plotly (coming soon)   Inline charts
Programmatic    Structured dict        For agentic/orchestration use

Override with --output terminal|ci|json.
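Environment auto-detection of this kind typically keys off CI environment variables and the notebook runtime; a hedged sketch of the idea (not Delphi's actual detection):

```python
import os
import sys

def detect_output_mode(env=None):
    """Pick a renderer: 'ci' on CI systems, 'notebook' in IPython, else 'terminal'."""
    env = os.environ if env is None else env
    if env.get("CI") or env.get("GITHUB_ACTIONS") or env.get("JENKINS_URL"):
        return "ci"
    if "ipykernel" in sys.modules:  # running inside a Jupyter/Databricks notebook
        return "notebook"
    return "terminal"

print(detect_output_mode({"GITHUB_ACTIONS": "true"}))  # ci
print(detect_output_mode({}))  # terminal (or notebook, inside a kernel)
```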

Error Handling

Every error includes a suggestion:

 FAIL  test_nulls    null_rate=0.032  threshold=<0.01  CI=[0.028, 0.036]

 ERROR test_typo     Column "revnue" not found
                     -> Did you mean "revenue"?

 INCONCLUSIVE test_x Sample size (847) too small for confidence=0.99
                     -> Increase ceiling or lower confidence to 0.95

Connection errors retry up to 3 times with exponential backoff (configurable).
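Retry with exponential backoff can be sketched like this (illustrative; Delphi's retry internals are not shown in this README):

```python
import time

def with_retries(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Example: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```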

Documentation

  • Tutorial -- Step-by-step guide from setup to CI/CD
  • Statistics Guide -- Plain-language explanation of confidence intervals, sampling methods, and every statistical concept used in Delphi
  • Databricks Connect Guide -- Serverless vs cluster, version matching, and troubleshooting

Development

git clone https://github.com/egde/delphi.git
cd delphi
uv sync

# Run unit tests (no Databricks needed)
uv run pytest tests/unit/ -v

# Run integration tests (requires Databricks credentials)
uv run pytest tests/integration/ -v -m integration

License

MIT
