
GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.


Every competitor makes you write rules first. GoldenCheck flips it: validate first, keep the rules you care about.

Why GoldenCheck?

| | GoldenCheck | Great Expectations | Pandera | Pointblank |
|---|---|---|---|---|
| Rules | Discovered from data | Written by hand | Written by hand | Written by hand |
| Config | Zero to start | Heavy YAML/Python setup | Decorators/schemas | YAML/Python |
| Interface | CLI + interactive TUI | HTML reports | Exceptions | HTML/notebook |
| Learning curve | One command | Hours/days | Moderate | Moderate |
| LLM enhancement | Yes ($0.01/scan) | No | No | No |
| Fix suggestions | Yes, in TUI | No | No | No |
| Confidence scoring | Yes (H/M/L per finding) | No | No | No |
| DQBench Score | 72.00 | 21.68 (best-effort) | 32.51 (best-effort) | 6.94 (auto) |

Install

```shell
pip install goldencheck
```

With LLM boost support:

```shell
pip install goldencheck[llm]
```

Quick Start

```shell
# Scan a file — discovers issues, launches interactive TUI
goldencheck data.csv

# CLI-only output (no TUI)
goldencheck data.csv --no-tui

# With LLM enhancement (requires API key)
goldencheck data.csv --llm-boost --no-tui

# Validate against saved rules (for CI/pipelines)
goldencheck validate data.csv

# JSON output for CI integration
goldencheck data.csv --no-tui --json
```

How It Works

```text
1. SCAN     →  goldencheck data.csv
                GoldenCheck profiles your data and discovers what "healthy" looks like

2. REVIEW   →  Interactive TUI shows findings sorted by severity
                Each finding has: description, affected rows, sample values

3. PIN      →  Press Space to promote findings into permanent rules
                Dismiss false positives — they won't come back

4. EXPORT   →  Press F2 to save rules to goldencheck.yml
                Human-readable YAML with your pinned rules

5. VALIDATE →  goldencheck validate data.csv
                Enforce rules in CI with exit codes (0 = pass, 1 = fail)
```
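Because `goldencheck validate` communicates pass/fail through its exit code, a pipeline can gate on it directly with no extra glue. A minimal GitHub Actions sketch (the step name and data path are illustrative, not part of GoldenCheck):

```yaml
# Hypothetical CI step: the job fails automatically when
# `goldencheck validate` exits 1 (a pinned rule was violated).
- name: Validate data quality
  run: |
    pip install goldencheck
    goldencheck validate data/latest.csv
```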

What It Detects

Column-Level Profilers

| Profiler | What It Catches | Example |
|---|---|---|
| Type inference | String columns that are actually numeric | "Column age is string but 98% are integer" |
| Nullability | Required vs. optional columns | "0 nulls across 50k rows — likely required" |
| Uniqueness | Primary key candidates, near-duplicates | "100% unique — likely primary key" |
| Format detection | Emails, phones, URLs, dates | "94% email format, 6% malformed" |
| Range & distribution | Outliers, min/max bounds | "3 rows have values >10,000" |
| Cardinality | Low-cardinality enum suggestions | "4 unique values — possible enum" |
| Pattern consistency | Mixed formats within a column | "3 phone formats detected" |

Cross-Column Profilers

| Profiler | What It Catches |
|---|---|
| Temporal ordering | `start_date > end_date` violations |
| Null correlation | Columns that are null together (e.g., address + city + zip) |
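As a rough illustration of what these checks look for (this is not GoldenCheck's code; the function names and regex are invented for the example), a format profiler reports the share of values matching a known pattern, and a temporal-ordering check flags rows whose start date falls after their end date:

```python
import re
from datetime import date

# Illustrative only -- not GoldenCheck internals.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def format_share(values: list[str]) -> float:
    """Fraction of values matching the email pattern."""
    return sum(1 for v in values if EMAIL_RE.match(v)) / len(values)

def temporal_violations(rows: list[tuple[date, date]]) -> list[int]:
    """Row indices where start_date > end_date."""
    return [i for i, (start, end) in enumerate(rows) if start > end]

emails = ["a@x.com", "b@y.org", "not-an-email", "c@z.net"]
print(format_share(emails))  # 0.75 -> "75% email format, 25% malformed"

spans = [(date(2024, 1, 1), date(2024, 2, 1)),
         (date(2024, 3, 5), date(2024, 3, 1))]
print(temporal_violations(spans))  # [1] -> second row has start > end
```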

LLM Boost

Add --llm-boost to enhance profiler findings with LLM intelligence. The LLM receives a representative sample of your data and:

  1. Finds issues profilers miss — semantic understanding (e.g., "12345" in a name column)
  2. Upgrades severity — knows "emails should be required" even if the profiler only says "INFO"
  3. Discovers relationships — identifies temporal ordering between columns like signup_date and last_login
  4. Downgrades false positives — "mixed phone formats are common, not an error"

```shell
# Using OpenAI
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui

# Using Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui
```

Cost: ~$0.01 per scan (one API call with representative samples, not per-row).

Budget control:

```shell
export GOLDENCHECK_LLM_BUDGET=0.50  # max spend per scan in USD
```

Configuration (goldencheck.yml)

```yaml
version: 1

settings:
  sample_size: 100000
  fail_on: error

columns:
  email:
    type: string
    required: true
    format: email
    unique: true

  age:
    type: integer
    range: [0, 120]

  status:
    type: string
    enum: [active, inactive, pending, closed]

relations:
  - type: temporal_order
    columns: [start_date, end_date]

ignore:
  - column: notes
    check: nullability
```

Only pinned rules appear in this file — not every finding. The ignore list prevents dismissed findings from reappearing.
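To make the rule shapes concrete, here is a hand-rolled sketch of enforcing `required`, `range`, and `enum` on a single row (this is not GoldenCheck's validator; the `rules` dict simply mirrors the YAML above):

```python
# Illustrative only -- a hand-rolled version of three rule shapes
# from goldencheck.yml: required, range, and enum.
rules = {
    "email": {"required": True},
    "age": {"range": (0, 120)},
    "status": {"enum": {"active", "inactive", "pending", "closed"}},
}

def validate_row(row: dict[str, str]) -> list[str]:
    errors = []
    for col, rule in rules.items():
        value = row.get(col, "")
        if rule.get("required") and not value:
            errors.append(f"{col}: required but empty")
        if "range" in rule and value:
            lo, hi = rule["range"]
            if not lo <= int(value) <= hi:
                errors.append(f"{col}: {value} outside [{lo}, {hi}]")
        if "enum" in rule and value and value not in rule["enum"]:
            errors.append(f"{col}: {value!r} not allowed")
    return errors

bad = {"email": "", "age": "130", "status": "archived"}
print(validate_row(bad))
# ['email: required but empty', 'age: 130 outside [0, 120]',
#  "status: 'archived' not allowed"]
```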

CLI Reference

| Command | Description |
|---|---|
| `goldencheck <file>` | Scan and launch TUI |
| `goldencheck scan <file>` | Explicit scan |
| `goldencheck validate <file>` | Validate against `goldencheck.yml` |
| `goldencheck review <file>` | Scan + validate, launch TUI |

Flags

| Flag | Description |
|---|---|
| `--no-tui` | Print results to console |
| `--json` | JSON output |
| `--fail-on <level>` | Exit 1 on severity `error` or `warning` |
| `--llm-boost` | Enable LLM enhancement |
| `--llm-provider <name>` | LLM provider: `anthropic` (default) or `openai` |
| `--verbose` | Show info-level logs |
| `--debug` | Show debug-level logs |
| `--version` | Show version |

Benchmarks

Speed

| Dataset | Time | Throughput |
|---|---|---|
| 1K rows | 0.05s | 19K rows/sec |
| 10K rows | 0.23s | 43K rows/sec |
| 100K rows | 2.29s | 44K rows/sec |
| 1M rows | 2.07s | 482K rows/sec |

DQBench v1.0 — Head-to-Head

| Tool | Mode | DQBench Score |
|---|---|---|
| GoldenCheck | zero-config | 72.00 |
| Pandera | best-effort rules | 32.51 |
| Soda Core | best-effort rules | 22.36 |
| Great Expectations | best-effort rules | 21.68 |

GoldenCheck's zero-config discovery outperforms every competitor — even when they have hand-written rules.

Run the benchmark yourself:

```shell
pip install dqbench goldencheck
dqbench run goldencheck
```

Detection Accuracy

| Mode | Column Recall | Cost |
|---|---|---|
| Profiler-only (v0.1.0) | 87% | $0 |
| Profiler-only (v0.2.0 with confidence) | 100% | $0 |
| With LLM Boost | 100% | ~$0.003-0.01 |

Tested on a custom benchmark with 341 planted data quality issues across 9 categories.

v0.2.0 improvements: minority wrong-type detection, range profiler chaining, broader temporal heuristics, and confidence scoring pushed profiler-only recall from 87% to 100%.

Raha Benchmark Datasets

| Dataset | Column Recall |
|---|---|
| Flights (2,376 rows) | 100% (4/4 columns) |
| Beers (2,410 rows) | 80% (4/5 columns) |

Tech Stack

| Dependency | Purpose |
|---|---|
| Polars | All data operations |
| Typer | CLI framework |
| Textual | Interactive TUI |
| Rich | CLI output formatting |
| Pydantic 2 | Config validation |

Optional: Anthropic SDK or OpenAI SDK for LLM Boost; MCP SDK for the MCP server.

MCP Server (Claude Desktop)

GoldenCheck includes an MCP server for Claude Desktop integration:

```shell
pip install goldencheck[mcp]
```

Add to your Claude Desktop config (claude_desktop_config.json):

```json
{
  "mcpServers": {
    "goldencheck": {
      "command": "goldencheck",
      "args": ["mcp-serve"]
    }
  }
}
```

Available tools:

| Tool | Description |
|---|---|
| `scan` | Scan a file for data quality issues (with optional LLM boost) |
| `validate` | Validate against pinned rules in `goldencheck.yml` |
| `profile` | Get column-level statistics and health score |
| `health_score` | Quick A-F grade for a data file |
| `get_column_detail` | Deep-dive into a specific column |
| `list_checks` | List all available profiler checks |

Jupyter / Colab

GoldenCheck renders rich HTML in Jupyter notebooks:

```python
from goldencheck.engine.scanner import scan_file
from goldencheck.engine.confidence import apply_confidence_downgrade
from goldencheck.notebook import ScanResult

findings, profile = scan_file("data.csv")
findings = apply_confidence_downgrade(findings, llm_boost=False)

# Rich HTML display in notebooks
ScanResult(findings=findings, profile=profile)
```


Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE


From the maker of GoldenMatch, an entity resolution toolkit.
