
Delphi

Probabilistic data test framework for Databricks. Test terabyte-scale Delta Lake tables in seconds using statistical sampling and confidence intervals instead of exhaustive scans.

from delphi import datatest, col
from delphi import functions as F

@datatest("catalog.schema.revenue")
def test_revenue_quality(dt):
    dt.expect(col("revenue").null_rate < 0.01)
    dt.expect(col("revenue").mean.between(1000, 5000), confidence=0.99)
    dt.expect(col("customer_id").uniqueness > 0.99)
    dt.expect(F.row_count() > 1_000_000)

Why Delphi?

Full row-level scans are infeasible on large Delta tables. Delphi samples intelligently and uses statistical confidence intervals to determine pass/fail, giving you fast, reliable data quality checks with quantified uncertainty.

  • Fast -- Adaptive sampling reads thousands of rows, not billions
  • Statistically rigorous -- Wilson, t-distribution, and bootstrap confidence intervals
  • PySpark-native -- col(), operator overloading, and functions as F feel like PySpark
  • Two-layer API -- Python DSL for engineers, YAML for analysts
  • Multi-runtime -- Terminal, notebook, CI/CD (JSON + JUnit XML), and agentic output
  • Databricks-first -- Free pre-scan via Delta file stats, Unity Catalog native

Install

pip install dbx-delphi

Or with uv:

uv add dbx-delphi

Requires Python 3.10+ and a Databricks workspace with Unity Catalog.

Quick Start

1. Configure connection

delphi setup

This walks you through connecting to your Databricks workspace. Alternatively, set environment variables:

export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...

2. Write a test

# tests/test_revenue.py
from delphi import datatest, col
from delphi import functions as F

@datatest("catalog.schema.revenue")
def test_nulls(dt):
    dt.expect(col("revenue").null_rate < 0.01)

@datatest("catalog.schema.revenue")
def test_distribution(dt):
    dt.expect(col("revenue").mean.between(1000, 5000), confidence=0.99)
    dt.expect(col("revenue").stddev < 2000)
    dt.expect(F.row_count() > 100_000)

3. Run

delphi run tests/

DSL Reference

Column Metrics

Use col("name") to start a column expression, then chain a metric:

from delphi import col

col("revenue").null_rate < 0.01       # Null rate below 1%
col("revenue").mean.between(100, 500) # Mean within range
col("revenue").min > 0                # Minimum above 0
col("revenue").max < 1_000_000        # Maximum below 1M
col("revenue").stddev < 100           # Standard deviation below 100
col("id").uniqueness > 0.99           # 99%+ distinct values

Available metrics: null_rate, uniqueness, mean, min, max, stddev
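Expressions like col("revenue").null_rate < 0.01 build a deferred check rather than evaluating immediately: the metric accessor returns an object whose comparison operators are overloaded. A minimal sketch of the idea (illustrative only; the Metric, Expectation, and Column classes here are assumptions, not Delphi's internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expectation:
    """A deferred check: column, metric, operator, and threshold."""
    column: str
    metric: str
    op: str
    threshold: float

@dataclass(frozen=True)
class Metric:
    column: str
    name: str

    # Comparing a metric builds an Expectation instead of returning a bool.
    def __lt__(self, threshold):
        return Expectation(self.column, self.name, "<", threshold)

    def __gt__(self, threshold):
        return Expectation(self.column, self.name, ">", threshold)

class Column:
    def __init__(self, name):
        self.name = name

    @property
    def null_rate(self):
        return Metric(self.name, "null_rate")

    @property
    def uniqueness(self):
        return Metric(self.name, "uniqueness")

def col(name):
    return Column(name)

# The expression produces a data structure the runner can evaluate later:
e = col("revenue").null_rate < 0.01
print(e)  # Expectation(column='revenue', metric='null_rate', op='<', threshold=0.01)
```

Deferring evaluation this way is what lets the framework batch all expectations on a table against one shared sample.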

Dataset-Level Functions

from delphi import functions as F

F.row_count() > 1_000_000                       # Minimum row count
F.approx_percentile("revenue", 0.95) < 10_000   # 95th percentile cap

Confidence Levels

Every expectation defaults to 95% confidence. Override per-expectation:

dt.expect(col("revenue").null_rate < 0.01)                  # 95% (default)
dt.expect(col("revenue").mean.between(100, 500), confidence=0.99)  # 99%

A test passes only when the entire confidence interval satisfies the threshold. This is conservative -- if the CI straddles the threshold, the test fails.
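The conservative decision rule can be illustrated with the Wilson score interval Delphi uses for rates (a sketch of the statistics, not Delphi's actual code):

```python
import math

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    # Two-sided z for the confidence level (1.96 for 95%, 2.576 for 99%).
    z = {0.95: 1.959964, 0.99: 2.575829}[confidence]
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

def passes(successes, n, threshold, confidence=0.95):
    """Pass only if the ENTIRE interval sits below the threshold."""
    lo, hi = wilson_interval(successes, n, confidence)
    return hi < threshold

# 32 nulls observed in 1,000 sampled rows:
lo, hi = wilson_interval(32, 1000)   # roughly (0.023, 0.045)
print(passes(32, 1000, 0.05))        # True: the whole CI is below 0.05
print(passes(32, 1000, 0.035))       # False: the CI straddles 0.035
```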

Time Column for Sampling

Delphi auto-detects the time column for stratified sampling (partition keys > clustering keys > well-known names like date, timestamp, created_at). When your table has multiple date/timestamp columns and auto-detection is ambiguous, set it explicitly:

Per-test (decorator):

@datatest("catalog.schema.events", time_column="event_date")
def test_events(dt):
    dt.expect(col("status").null_rate < 0.01)

In delphi.toml (global):

[delphi]
time_column = "event_date"

CLI (per-run):

delphi run tests/ --time-column event_date
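The detection priority described above (partition keys, then clustering keys, then well-known names) can be sketched as follows. This is a hypothetical helper to show the order; Delphi's real detection logic is not shown in this README:

```python
# Illustrative name list -- an assumption, not Delphi's exact set.
WELL_KNOWN = ("date", "timestamp", "created_at", "event_date", "event_time")

def detect_time_column(partition_cols, clustering_cols, all_cols):
    """Return the time column by priority, or None when ambiguous/absent."""
    if partition_cols:
        return partition_cols[0]        # highest priority: partition keys
    if clustering_cols:
        return clustering_cols[0]       # next: clustering keys
    named = [c for c in all_cols if c.lower() in WELL_KNOWN]
    # Multiple candidates means auto-detection is ambiguous: set it explicitly.
    return named[0] if len(named) == 1 else None

print(detect_time_column(["event_date"], [], ["event_date", "revenue"]))  # event_date
print(detect_time_column([], [], ["id", "created_at"]))                   # created_at
print(detect_time_column([], [], ["date", "timestamp", "x"]))             # None
```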

YAML Checks

For analysts who prefer configuration over code:

# checks/revenue.yaml
table: catalog.schema.revenue
time_column: event_date  # optional: explicit time column for sampling
checks:
  - column: revenue
    null_rate: "< 0.01"
  - column: revenue
    mean: "between 1000 and 5000"
  - column: customer_id
    uniqueness: "> 0.99"

Confidence defaults to 0.95 in YAML. Override per-check:

  - column: revenue
    mean: "between 1000 and 5000"
    confidence: 0.99

Run YAML checks:

delphi run checks/revenue.yaml
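Threshold strings like "< 0.01" and "between 1000 and 5000" follow a small grammar; a hedged sketch of parsing them into structured form (illustrative, not Delphi's parser):

```python
import re

def parse_threshold(expr):
    """Parse '< 0.01', '> 0.99', or 'between 1000 and 5000' into a tuple."""
    expr = expr.strip()
    m = re.fullmatch(r"(<=?|>=?)\s*([\d_.]+)", expr)
    if m:
        return (m.group(1), float(m.group(2)))
    m = re.fullmatch(r"between\s+([\d_.]+)\s+and\s+([\d_.]+)", expr)
    if m:
        return ("between", float(m.group(1)), float(m.group(2)))
    raise ValueError(f"Unrecognized threshold: {expr!r}")

print(parse_threshold("< 0.01"))                 # ('<', 0.01)
print(parse_threshold("between 1000 and 5000"))  # ('between', 1000.0, 5000.0)
```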

Dataset Comparison

Compare a table against a reference:

from delphi import datatest, col, compare
from delphi import functions as F

@datatest("catalog.schema.output")
def test_matches_expected(dt):
    expected = compare("catalog.schema.expected")
    dt.expect(col("revenue").mean_diff(expected) < 0.05)
    dt.expect(F.row_count_ratio(expected).between(0.99, 1.01))

CLI

delphi setup                          # Interactive connection setup
delphi setup --verify                 # Test current connection
delphi setup --profile staging        # Configure a named profile

delphi run tests/                     # Run all tests in directory
delphi run tests/test_revenue.py      # Run specific file
delphi run checks/revenue.yaml        # Run YAML checks
delphi run tests/ --profile staging   # Use named profile
delphi run tests/ --output json       # JSON output
delphi run tests/ --confidence 0.99   # Override confidence
delphi run tests/ --sample-ceiling 200000
delphi run tests/ --evidence-rows 20  # More evidence rows
delphi run tests/ --no-evidence       # Suppress evidence
delphi run tests/ --time-column event_date  # Explicit time column

delphi inspect catalog.schema.table   # Table profile (no sampling)

delphi --version

Configuration

Create delphi.toml in your project root (or use delphi setup):

[delphi]
default_confidence = 0.95
sample_floor = 1000
sample_ceiling = 100000
evidence_rows = 10
redact_columns = ["ssn", "email"]
connection_retries = 3
connection_timeout = 300
time_column = "event_date"  # optional: explicit time column for sampling

# Serverless (recommended)
[delphi.connection]
host = "https://your-workspace.cloud.databricks.com"
serverless = true
auth_type = "env"
default_catalog = "main"
default_schema = "default"

# Classic cluster (alternative)
# [delphi.connection]
# host = "https://your-workspace.cloud.databricks.com"
# cluster_id = "0123-456789-abcdef"
# auth_type = "env"

Named Profiles

[delphi.connection.profiles.staging]
host = "https://staging.cloud.databricks.com"
serverless = true
auth_type = "env"

Authentication

Method                         auth_type   How
Environment variables          env         DATABRICKS_HOST + DATABRICKS_TOKEN
Personal Access Token          pat         Token stored in delphi.toml
OAuth (U2M)                    oauth       Browser-based flow
Databricks SDK unified auth    (any)       Auto-discovers from env, ~/.databrickscfg, or cloud identity

How It Works

Delphi runs a five-stage pipeline for each test:

Table ref --> Pre-scan --> Sample --> Metrics --> Confidence --> Result

  1. Pre-scan -- Reads Delta file stats (DESCRIBE DETAIL) at no scan cost: column-level null counts, min/max, and row count. Short-circuits trivially passing checks without scanning a single row.

  2. Adaptive Sampling -- Computes the minimum sample size needed for the desired confidence and margin of error. Floors at 1,000 rows, caps at 100,000. For timeseries tables, auto-detects the time column and applies stratified sampling.

  3. Metric Computation -- Runs PySpark aggregations on the sampled DataFrame. Multiple expectations on the same table share one sample.

  4. Confidence Intervals -- Routes each metric to the appropriate statistical method:

    Metric type                      Method
    Rates (null_rate, uniqueness)    Wilson score interval
    Means                            t-distribution
    Distributions, percentiles       Bootstrap (B=1000)
    Row count, min, max              Exact (no CI needed)

  5. Evidence -- On failure, collects up to 10 violating rows from the already-sampled data (no extra scan). Sensitive columns can be redacted.
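Step 2's minimum sample size follows the standard formula for estimating a proportion, n = z² · p(1-p) / e², clamped to the configured floor and ceiling. A sketch under those assumptions (Delphi's internals may differ):

```python
import math

def required_sample_size(confidence=0.95, margin=0.01, p=0.5,
                         floor=1_000, ceiling=100_000):
    """Minimum n to estimate a proportion within +/-margin, then clamped.

    p=0.5 is the worst case (it maximizes p*(1-p)), so it is a safe default
    when the true rate is unknown.
    """
    z = {0.95: 1.959964, 0.99: 2.575829}[confidence]
    n = math.ceil(z**2 * p * (1 - p) / margin**2)
    return max(floor, min(n, ceiling))

print(required_sample_size())                 # 9604 rows at 95%, +/-1%
print(required_sample_size(confidence=0.99))  # 16588 rows at 99%
print(required_sample_size(margin=0.05))      # 1000 (clamped to the floor)
```

This is why reading thousands of rows suffices: the required n depends on the confidence level and margin of error, not on the table's size.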

Output Formats

Delphi auto-detects your environment:

Environment     Renderer               Details
Terminal        rich                   Color tables, confidence bars
CI/CD           JSON + JUnit XML       delphi-results.xml for GitHub Actions, Jenkins
Notebook        plotly (coming soon)   Inline charts
Programmatic    Structured dict        For agentic/orchestration use

Override with --output terminal|ci|json.
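Environment auto-detection of this kind typically keys off CI environment variables and the notebook runtime; a hedged sketch of the idea (not Delphi's actual detection):

```python
import os
import sys

def detect_output_mode(env=None):
    """Pick a renderer: 'ci' on CI systems, 'notebook' in IPython, else 'terminal'."""
    env = os.environ if env is None else env
    if env.get("CI") or env.get("GITHUB_ACTIONS") or env.get("JENKINS_URL"):
        return "ci"
    if "ipykernel" in sys.modules:  # running inside a Jupyter/Databricks notebook
        return "notebook"
    return "terminal"

print(detect_output_mode({"GITHUB_ACTIONS": "true"}))  # ci
print(detect_output_mode({}))  # terminal (or notebook, inside a kernel)
```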

Error Handling

Every error includes a suggestion:

 FAIL  test_nulls    null_rate=0.032  threshold=<0.01  CI=[0.028, 0.036]

 ERROR test_typo     Column "revnue" not found
                     -> Did you mean "revenue"?

 INCONCLUSIVE test_x Sample size (847) too small for confidence=0.99
                     -> Increase ceiling or lower confidence to 0.95

Connection errors retry up to 3 times with exponential backoff (configurable).
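Retry with exponential backoff can be sketched like this (illustrative; Delphi's retry internals are not shown in this README):

```python
import time

def with_retries(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Example: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```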

Documentation

  • Tutorial -- Step-by-step guide from setup to CI/CD
  • Statistics Guide -- Plain-language explanation of confidence intervals, sampling methods, and every statistical concept used in Delphi
  • Databricks Connect Guide -- Serverless vs cluster, version matching, and troubleshooting

Development

git clone https://github.com/egde/delphi.git
cd delphi
uv sync

# Run unit tests (no Databricks needed)
uv run pytest tests/unit/ -v

# Run integration tests (requires Databricks credentials)
uv run pytest tests/integration/ -v -m integration

License

MIT
