GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.

Requires Python 3.11+. License: MIT.

Great Expectations, Pandera, and Pointblank all have you write rules first. GoldenCheck flips it: validate first, keep the rules you care about.

Why GoldenCheck?

| | GoldenCheck | Great Expectations | Pandera | Pointblank |
|---|---|---|---|---|
| Rules | Discovered from data | Written by hand | Written by hand | Written by hand |
| Config | Zero to start | Heavy YAML/Python setup | Decorators/schemas | YAML/Python |
| Interface | CLI + interactive TUI | HTML reports | Exceptions | HTML/notebook |
| Learning curve | One command | Hours/days | Moderate | Moderate |
| LLM enhancement | Yes ($0.01/scan) | No | No | No |

Install

pip install goldencheck

With LLM boost support:

pip install goldencheck[llm]

Quick Start

# Scan a file — discovers issues, launches interactive TUI
goldencheck data.csv

# CLI-only output (no TUI)
goldencheck data.csv --no-tui

# With LLM enhancement (requires API key)
goldencheck data.csv --llm-boost --no-tui

# Validate against saved rules (for CI/pipelines)
goldencheck validate data.csv

# JSON output for CI integration
goldencheck data.csv --no-tui --json
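
If you want to drive these commands from a Python script, a minimal sketch could look like the following. It uses only the flags shown above. The helper names `build_scan_command` and `run_scan_json` are hypothetical, and because the shape of the JSON output is not described here, the sketch returns the parsed payload without assuming any keys.

```python
import json
import subprocess
from typing import Any, List


def build_scan_command(path: str, llm_boost: bool = False, as_json: bool = True) -> List[str]:
    """Assemble a goldencheck argv list using only flags from the Quick Start."""
    cmd = ["goldencheck", path, "--no-tui"]
    if llm_boost:
        cmd.append("--llm-boost")
    if as_json:
        cmd.append("--json")
    return cmd


def run_scan_json(path: str) -> Any:
    """Run a scan and parse stdout (assumes --json writes JSON to stdout)."""
    result = subprocess.run(build_scan_command(path), capture_output=True, text=True)
    return json.loads(result.stdout)
```

Calling `run_scan_json("data.csv")` requires `goldencheck` to be installed and on your PATH.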

How It Works

1. SCAN     →  goldencheck data.csv
                GoldenCheck profiles your data and discovers what "healthy" looks like

2. REVIEW   →  Interactive TUI shows findings sorted by severity
                Each finding has: description, affected rows, sample values

3. PIN      →  Press Space to promote findings into permanent rules
                Dismiss false positives — they won't come back

4. EXPORT   →  Press F2 to save rules to goldencheck.yml
                Human-readable YAML with your pinned rules

5. VALIDATE →  goldencheck validate data.csv
                Enforce rules in CI with exit codes (0 = pass, 1 = fail)
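
Step 5 can be wired into a Python-based pipeline with a small wrapper. This is a sketch that relies only on the documented exit codes (0 = pass, 1 = fail); the function name `validate_passed` and the injectable `run` parameter are illustrative choices, not part of GoldenCheck.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def validate_passed(path: str, run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Return True when `goldencheck validate <path>` exits with code 0."""
    return run(["goldencheck", "validate", path]) == 0
```

Accepting `run` as a parameter lets you unit-test the gate without invoking the real CLI, for example `validate_passed("data.csv", run=lambda cmd: 1)` returns `False`.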

What It Detects

Column-Level Profilers

| Profiler | What It Catches | Example |
|---|---|---|
| Type inference | String columns that are actually numeric | "Column age is string but 98% are integer" |
| Nullability | Required vs. optional columns | "0 nulls across 50k rows - likely required" |
| Uniqueness | Primary key candidates, near-duplicates | "100% unique - likely primary key" |
| Format detection | Emails, phones, URLs, dates | "94% email format, 6% malformed" |
| Range & distribution | Outliers, min/max bounds | "3 rows have values >10,000" |
| Cardinality | Low-cardinality enum suggestions | "4 unique values - possible enum" |
| Pattern consistency | Mixed formats within a column | "3 phone formats detected" |
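
To give a feel for the statistics behind findings like these, here is a plain-Python sketch. It is not GoldenCheck's implementation (which is built on Polars); the function names and the deliberately loose regular expressions are assumptions made for illustration.

```python
import re
from typing import Callable, List, Optional

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose
INT_RE = re.compile(r"^[+-]?\d+$")


def share(values: List[Optional[str]], predicate: Callable[[str], bool]) -> float:
    """Fraction of non-null values for which predicate is true."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if predicate(v)) / len(present)


def profile_column(values: List[Optional[str]]) -> dict:
    """Compute a few per-column statistics a profiler could turn into findings."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    return {
        "nulls": len(values) - len(present),
        "integer_share": share(values, lambda v: bool(INT_RE.match(v.strip()))),
        "email_share": share(values, lambda v: bool(EMAIL_RE.match(v.strip()))),
        "unique_share": len(distinct) / len(present) if present else 0.0,
        "cardinality": len(distinct),
    }


ages = ["34", "41", "n/a", "29", None]
print(profile_column(ages))
```

A high `integer_share` on a string column corresponds to the type inference finding, and a small `cardinality` corresponds to the enum suggestion.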

Cross-Column Profilers

| Profiler | What It Catches |
|---|---|
| Temporal ordering | start_date > end_date violations |
| Null correlation | Columns that are null together (e.g., address + city + zip) |
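
Both cross-column checks can be sketched in a few lines of plain Python. This illustrates the idea only and is not GoldenCheck's code; the function names, the row-as-dict representation, and the way null correlation is scored are assumptions.

```python
from datetime import date
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def temporal_violations(rows: List[Row], start_key: str, end_key: str) -> List[int]:
    """Indices of rows where start > end; rows missing either value are skipped."""
    bad = []
    for i, row in enumerate(rows):
        start, end = row.get(start_key), row.get(end_key)
        if start is not None and end is not None and start > end:
            bad.append(i)
    return bad


def null_together_share(rows: List[Row], columns: Sequence[str]) -> float:
    """Among rows with at least one null in `columns`, the share where all are null."""
    any_null = [r for r in rows if any(r.get(c) is None for c in columns)]
    if not any_null:
        return 0.0
    all_null = sum(1 for r in any_null if all(r.get(c) is None for c in columns))
    return all_null / len(any_null)


ROWS = [
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1), "city": "Oslo", "zip": "0150"},
    {"start_date": date(2024, 3, 1), "end_date": date(2024, 2, 1), "city": None, "zip": None},
    {"start_date": None, "end_date": date(2024, 2, 1), "city": None, "zip": "0151"},
]
print(temporal_violations(ROWS, "start_date", "end_date"))
print(null_together_share(ROWS, ["city", "zip"]))
```

A share close to 1.0 suggests the columns tend to be null together and probably form a unit.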

LLM Boost

Add --llm-boost to enhance profiler findings with LLM intelligence. The LLM receives a representative sample of your data and:

  1. Finds issues profilers miss: semantic understanding (e.g., "12345" in a name column)
  2. Upgrades severity: knows "emails should be required" even if the profiler only says "INFO"
  3. Discovers relationships: identifies temporal ordering between columns like signup_date and last_login
  4. Downgrades false positives: "mixed phone formats are common, not an error"

# Using OpenAI
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui

# Using Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui

Cost: ~$0.01 per scan (one API call with representative samples, not per-row).

Budget control:

export GOLDENCHECK_LLM_BUDGET=0.50  # max spend per scan in USD
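
When launching an LLM-boosted scan from Python rather than a shell, the same environment variables can be prepared in code. The sketch below uses only the variable names shown above; the helper `llm_env` and its fail-fast check for a missing key are illustrative assumptions.

```python
from typing import Dict, Mapping

# Environment variable that holds the API key for each provider.
KEY_VARS = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def llm_env(base: Mapping[str, str], provider: str = "anthropic",
            budget_usd: float = 0.50) -> Dict[str, str]:
    """Copy `base`, check the provider's API key is set, and add the budget cap."""
    key_var = KEY_VARS[provider]
    if not base.get(key_var):
        raise RuntimeError(f"{key_var} is not set")
    env = dict(base)
    env["GOLDENCHECK_LLM_BUDGET"] = f"{budget_usd:.2f}"  # max spend per scan, USD
    return env
```

You could then pass `env=llm_env(os.environ, "openai")` to `subprocess.run` together with the `goldencheck data.csv --llm-boost --llm-provider openai --no-tui` command.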

Configuration (goldencheck.yml)

version: 1

settings:
  sample_size: 100000
  fail_on: error

columns:
  email:
    type: string
    required: true
    format: email
    unique: true

  age:
    type: integer
    range: [0, 120]

  status:
    type: string
    enum: [active, inactive, pending, closed]

relations:
  - type: temporal_order
    columns: [start_date, end_date]

ignore:
  - column: notes
    check: nullability

Only pinned rules appear in this file — not every finding. The ignore list prevents dismissed findings from reappearing.
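
To make the column rules concrete, the sketch below applies the `required`, `format`, `range`, `enum`, and `unique` keys from the example to a few rows. It is a plain-Python illustration of what such rules mean, not GoldenCheck's validation engine. The inclusive treatment of `range`, the loose email pattern, and the message wording are assumptions, and values are assumed to be already typed.

```python
import re
from typing import Any, Dict, List

# Mirrors the `columns` section of the example goldencheck.yml as a plain dict.
COLUMNS: Dict[str, Dict[str, Any]] = {
    "email": {"type": "string", "required": True, "format": "email", "unique": True},
    "age": {"type": "integer", "range": [0, 120]},
    "status": {"type": "string", "enum": ["active", "inactive", "pending", "closed"]},
}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose


def check_rows(rows: List[Dict[str, Any]], columns: Dict[str, Dict[str, Any]] = COLUMNS) -> List[str]:
    """Return one message per rule violation, in row order."""
    problems: List[str] = []
    seen = {name: set() for name, rule in columns.items() if rule.get("unique")}
    for i, row in enumerate(rows):
        for name, rule in columns.items():
            value = row.get(name)
            if value is None:
                if rule.get("required"):
                    problems.append(f"row {i}: {name} is required")
                continue
            if rule.get("format") == "email" and not EMAIL_RE.match(value):
                problems.append(f"row {i}: {name} is not a valid email")
            if "range" in rule:
                lo, hi = rule["range"]
                if not lo <= value <= hi:  # bounds treated as inclusive (assumption)
                    problems.append(f"row {i}: {name} is outside [{lo}, {hi}]")
            if "enum" in rule and value not in rule["enum"]:
                problems.append(f"row {i}: {name} is not an allowed value")
            if name in seen:
                if value in seen[name]:
                    problems.append(f"row {i}: {name} is duplicated")
                seen[name].add(value)
    return problems
```

The `type` key is not checked here; a fuller version would verify or coerce types before applying the other rules.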

CLI Reference

| Command | Description |
|---|---|
| goldencheck <file> | Scan and launch TUI |
| goldencheck scan <file> | Explicit scan |
| goldencheck validate <file> | Validate against goldencheck.yml |
| goldencheck review <file> | Scan + validate, launch TUI |

Flags

| Flag | Description |
|---|---|
| --no-tui | Print results to console |
| --json | JSON output |
| --fail-on <level> | Exit with code 1 at this severity: error or warning |
| --llm-boost | Enable LLM enhancement |
| --llm-provider <name> | LLM provider: anthropic (default) or openai |
| --verbose | Show info-level logs |
| --debug | Show debug-level logs |
| --version | Show version |
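
One plausible reading of `--fail-on` is a severity threshold. The sketch below assumes three severity names (info, warning, error) and assumes that `--fail-on warning` also fails on errors; neither detail is confirmed by this page, so treat the sketch as an illustration of the idea rather than the CLI's exact behavior.

```python
from typing import Iterable

# Assumed ordering of severities, lowest to highest.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def exit_code(severities: Iterable[str], fail_on: str = "error") -> int:
    """Return 1 if any finding is at or above the fail_on level, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    return 1 if any(SEVERITY_RANK[s] >= threshold for s in severities) else 0
```

Under these assumptions, `exit_code(["info", "warning"])` is 0, while the same findings with `fail_on="warning"` give 1.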

Benchmarks

Speed

| Dataset | Time | Throughput |
|---|---|---|
| 1K rows | 0.05s | 19K rows/sec |
| 10K rows | 0.23s | 43K rows/sec |
| 100K rows | 2.29s | 44K rows/sec |
| 1M rows | 2.07s | 482K rows/sec |
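
Throughput here is rows divided by elapsed time. The short sketch below recomputes it from the table's figures. Because it starts from the rounded times shown, the results land close to, but not exactly on, the published values (the 1K row comes out at 20K rather than 19K), presumably because the original measurements had more decimal places.

```python
# (rows, seconds) pairs copied from the Speed table.
MEASUREMENTS = [(1_000, 0.05), (10_000, 0.23), (100_000, 2.29), (1_000_000, 2.07)]


def throughput(rows: int, seconds: float) -> float:
    """Rows processed per second."""
    return rows / seconds


for rows, seconds in MEASUREMENTS:
    print(f"{rows:>9,} rows / {seconds:.2f}s = {throughput(rows, seconds):,.0f} rows/sec")
```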

Detection Accuracy

| Mode | Column Recall | Cost |
|---|---|---|
| Profiler-only | 87% | $0 |
| With LLM Boost | 100% | ~$0.01 |

Tested on a custom benchmark with 341 planted data quality issues across 9 categories.

Raha Benchmark Datasets

| Dataset | Column Recall |
|---|---|
| Flights (2,376 rows) | 100% (4/4 columns) |
| Beers (2,410 rows) | 80% (4/5 columns) |

Tech Stack

| Dependency | Purpose |
|---|---|
| Polars | All data operations |
| Typer | CLI framework |
| Textual | Interactive TUI |
| Rich | CLI output formatting |
| Pydantic 2 | Config validation |

Optional: Anthropic SDK / OpenAI SDK for LLM Boost

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE


From the maker of GoldenMatch — entity resolution toolkit.
