
GoldenCheck

Data validation that discovers rules from your data so you don't have to write them.

Python 3.11+ · MIT License

Every competitor makes you write rules first. GoldenCheck flips it: validate first, keep the rules you care about.

Why GoldenCheck?

|                    | GoldenCheck              | Great Expectations      | Pandera              | Pointblank    |
|--------------------|--------------------------|-------------------------|----------------------|---------------|
| Rules              | Discovered from data     | Written by hand         | Written by hand      | Written by hand |
| Config             | Zero to start            | Heavy YAML/Python setup | Decorators/schemas   | YAML/Python   |
| Interface          | CLI + interactive TUI    | HTML reports            | Exceptions           | HTML/notebook |
| Learning curve     | One command              | Hours/days              | Moderate             | Moderate      |
| LLM enhancement    | Yes ($0.01/scan)         | No                      | No                   | No            |
| Fix suggestions    | Yes, in TUI              | No                      | No                   | No            |
| Confidence scoring | Yes (H/M/L per finding)  | No                      | No                   | No            |
| DQBench Score      | 88.40                    | 21.68 (best-effort)     | 32.51 (best-effort)  | 6.94 (auto)   |

Install

pip install goldencheck

With LLM boost support:

pip install goldencheck[llm]

Quick Start

# Scan a file — discovers issues, launches interactive TUI
goldencheck data.csv

# CLI-only output (no TUI)
goldencheck data.csv --no-tui

# With LLM enhancement (requires API key)
goldencheck data.csv --llm-boost --no-tui

# Validate against saved rules (for CI/pipelines)
goldencheck validate data.csv

# JSON output for CI integration
goldencheck data.csv --no-tui --json

How It Works

1. SCAN     →  goldencheck data.csv
                GoldenCheck profiles your data and discovers what "healthy" looks like

2. REVIEW   →  Interactive TUI shows findings sorted by severity
                Each finding has: description, affected rows, sample values

3. PIN      →  Press Space to promote findings into permanent rules
                Dismiss false positives — they won't come back

4. EXPORT   →  Press F2 to save rules to goldencheck.yml
                Human-readable YAML with your pinned rules

5. VALIDATE →  goldencheck validate data.csv
                Enforce rules in CI with exit codes (0 = pass, 1 = fail)

What It Detects

Column-Level Profilers

| Profiler             | What It Catches                          | Example                                      |
|----------------------|------------------------------------------|----------------------------------------------|
| Type inference       | String columns that are actually numeric | "Column age is string but 98% are integer"   |
| Nullability          | Required vs. optional columns            | "0 nulls across 50k rows — likely required"  |
| Uniqueness           | Primary key candidates, near-duplicates  | "100% unique — likely primary key"           |
| Format detection     | Emails, phones, URLs, dates              | "94% email format, 6% malformed"             |
| Range & distribution | Outliers, min/max bounds                 | "3 rows have values >10,000"                 |
| Cardinality          | Low-cardinality enum suggestions         | "4 unique values — possible enum"            |
| Pattern consistency  | Mixed formats within a column            | "3 phone formats detected"                   |
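
As an illustration of the first row, a minimal type-inference check might measure what fraction of a string column parses as an integer. This is a sketch with invented names, not GoldenCheck's implementation:

```python
# Toy type-inference profiler: flag a string column whose values
# are overwhelmingly parseable as integers.
def integer_ratio(values: list[str]) -> float:
    parseable = sum(1 for v in values if v.strip().lstrip("-").isdigit())
    return parseable / len(values) if values else 0.0

ages = ["34", "27", "51", "n/a", "19", "42", "38", "60", "22", "45"]
ratio = integer_ratio(ages)
if ratio >= 0.9:  # illustrative threshold
    print(f'Column "age" is string but {ratio:.0%} are integer')
```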

Cross-Column Profilers

| Profiler             | What It Catches                                              |
|----------------------|--------------------------------------------------------------|
| Temporal ordering    | start_date > end_date violations                             |
| Null correlation     | Columns that are null together (e.g., address + city + zip)  |
| Numeric cross-column | value > max violations (e.g., claim_amount > policy_max)     |
| Age vs DOB           | Age column doesn't match calculated age from date_of_birth   |
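
A cross-column check like temporal ordering reduces to comparing two fields per row. The row shape and names below are our own, not GoldenCheck's model:

```python
from datetime import date

# Toy temporal-ordering check: report row indices where start > end.
def temporal_violations(rows, start="start_date", end="end_date"):
    return [i for i, r in enumerate(rows) if r[start] > r[end]]

rows = [
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 6, 1)},
    {"start_date": date(2024, 9, 1), "end_date": date(2024, 3, 1)},  # violation
]
print(temporal_violations(rows))  # [1]
```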

Domain Packs

Improve detection accuracy with domain-specific type definitions:

goldencheck scan data.csv --domain healthcare   # NPI, ICD, insurance, patient types
goldencheck scan data.csv --domain finance      # accounts, routing, CUSIP, transactions
goldencheck scan data.csv --domain ecommerce    # SKUs, orders, tracking, products

Domain packs add semantic types that reduce false positives and improve classification for industry-specific data.

Schema Diff

Compare two versions of a data file:

goldencheck diff data.csv                  # compare against git HEAD
goldencheck diff old.csv new.csv           # compare two files
goldencheck diff data.csv --ref main       # compare against a branch

Auto-Fix

Apply automated fixes to clean your data:

goldencheck fix data.csv                          # safe: trim, normalize, fix encoding
goldencheck fix data.csv --mode moderate          # + standardize case
goldencheck fix data.csv --mode aggressive --force # + coerce types
goldencheck fix data.csv --dry-run                # preview changes
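
The idea behind the "safe" tier is transformations that cannot change a value's meaning, such as trimming whitespace and normalizing Unicode. A minimal sketch (not GoldenCheck's fixer):

```python
import unicodedata

# Safe-tier fix: NFC-normalize combining characters and strip whitespace.
def safe_fix(value: str) -> str:
    return unicodedata.normalize("NFC", value).strip()

print(safe_fix("  Cafe\u0301  "))  # "Café" (e + combining accent composed)
```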

Watch Mode

Continuously monitor a directory for data quality:

goldencheck watch data/ --interval 30        # re-scan every 30s
goldencheck watch data/ --exit-on error      # CI mode: fail on first error

LLM Boost

Add --llm-boost to enhance profiler findings with LLM intelligence. The LLM receives a representative sample of your data and:

  1. Finds issues profilers miss — semantic understanding (e.g., "12345" in a name column)
  2. Upgrades severity — knows "emails should be required" even if the profiler only says "INFO"
  3. Discovers relationships — identifies temporal ordering between columns like signup_date and last_login
  4. Downgrades false positives — "mixed phone formats are common, not an error"

# Using OpenAI
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui

# Using Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui

Cost: ~$0.01 per scan (one API call with representative samples, not per-row).

Budget control:

export GOLDENCHECK_LLM_BUDGET=0.50  # max spend per scan in USD
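
The cost stays flat because only small, representative samples are sent, not rows. The sampler below is our sketch of one way to do that (first k distinct values per column); GoldenCheck's actual sampling strategy may differ:

```python
# Toy representative sampler: first k distinct values of a column,
# preserving encounter order.
def representative_sample(column: list, k: int = 5) -> list:
    seen, sample = set(), []
    for v in column:
        if v not in seen:
            seen.add(v)
            sample.append(v)
        if len(sample) == k:
            break
    return sample

emails = ["a@x.com", "a@x.com", "bad-email", "b@y.org", "c@z.net"]
print(representative_sample(emails, k=3))  # ['a@x.com', 'bad-email', 'b@y.org']
```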

Configuration (goldencheck.yml)

version: 1

settings:
  sample_size: 100000
  fail_on: error

columns:
  email:
    type: string
    required: true
    format: email
    unique: true

  age:
    type: integer
    range: [0, 120]

  status:
    type: string
    enum: [active, inactive, pending, closed]

relations:
  - type: temporal_order
    columns: [start_date, end_date]

ignore:
  - column: notes
    check: nullability

Only pinned rules appear in this file — not every finding. The ignore list prevents dismissed findings from reappearing.
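
To make the semantics of the example config concrete, here is a sketch of enforcing its range and enum rules against a row. The rule representation mirrors the YAML but is ours, not GoldenCheck's internal model:

```python
# Rules transcribed from the example goldencheck.yml above.
RULES = {
    "age": {"range": (0, 120)},
    "status": {"enum": {"active", "inactive", "pending", "closed"}},
}

def violations(row: dict) -> list[str]:
    """Return a human-readable message per broken rule."""
    out = []
    for col, rule in RULES.items():
        v = row.get(col)
        if "range" in rule and not (rule["range"][0] <= v <= rule["range"][1]):
            out.append(f"{col}: {v} outside {rule['range']}")
        if "enum" in rule and v not in rule["enum"]:
            out.append(f"{col}: {v!r} not in enum")
    return out

print(violations({"age": 130, "status": "archived"}))  # two violations
```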

CLI Reference

| Command                        | Description                                       |
|--------------------------------|---------------------------------------------------|
| goldencheck <file>             | Scan and launch TUI                               |
| goldencheck scan <file>        | Explicit scan (supports --smart, --guided)        |
| goldencheck validate <file>    | Validate against goldencheck.yml                  |
| goldencheck review <file>      | Scan + validate, launch TUI                       |
| goldencheck init <file>        | Interactive setup wizard (scan → config → CI)     |
| goldencheck diff <file> [file2] | Compare two files or against git HEAD            |
| goldencheck watch <dir>        | Poll directory, re-scan on change                 |
| goldencheck fix <file>         | Auto-fix data quality issues                      |
| goldencheck learn <file>       | Generate LLM validation rules                     |
| goldencheck history            | Show scan history and trends                      |
| goldencheck mcp-serve          | Start MCP server (9 tools)                        |

Flags

| Flag                  | Description                                         |
|-----------------------|-----------------------------------------------------|
| --no-tui              | Print results to console                            |
| --json                | JSON output                                         |
| --fail-on <level>     | Exit 1 on severity: error or warning                |
| --domain <name>       | Domain pack: healthcare, finance, ecommerce         |
| --llm-boost           | Enable LLM enhancement                              |
| --llm-provider <name> | LLM provider: anthropic (default) or openai         |
| --mode <level>        | Fix mode: safe, moderate, aggressive                |
| --smart               | Auto-triage: pin high-confidence, dismiss low       |
| --guided              | Walk through findings one-by-one                    |
| --webhook <url>       | POST findings to Slack/PagerDuty/any URL            |
| --notify-on <trigger> | Webhook trigger: grade-drop, any-error, any-warning |
| --version             | Show version                                        |

Benchmarks

Speed

| Dataset   | Time  | Throughput    |
|-----------|-------|---------------|
| 1K rows   | 0.05s | 19K rows/sec  |
| 10K rows  | 0.23s | 43K rows/sec  |
| 100K rows | 2.29s | 44K rows/sec  |
| 1M rows   | 2.07s | 482K rows/sec |

DQBench v1.0 — Head-to-Head

| Tool               | Mode              | DQBench Score |
|--------------------|-------------------|---------------|
| GoldenCheck        | zero-config       | 88.40         |
| Pandera            | best-effort rules | 32.51         |
| Soda Core          | best-effort rules | 22.36         |
| Great Expectations | best-effort rules | 21.68         |

GoldenCheck's zero-config discovery outperforms every competitor — even when they have hand-written rules.

Run the benchmark yourself:

pip install dqbench goldencheck
dqbench run goldencheck

Detection Accuracy

| Mode                                    | Column Recall | Cost         |
|-----------------------------------------|---------------|--------------|
| Profiler-only (v0.1.0)                  | 87%           | $0           |
| Profiler-only (v0.2.0 with confidence)  | 100%          | $0           |
| With LLM Boost                          | 100%          | ~$0.003-0.01 |

Tested on a custom benchmark with 341 planted data quality issues across 9 categories.

v0.2.0 improvements: minority wrong-type detection, range profiler chaining, broader temporal heuristics, and confidence scoring pushed profiler-only recall from 87% to 100%.

Raha Benchmark Datasets

| Dataset              | Column Recall      |
|----------------------|--------------------|
| Flights (2,376 rows) | 100% (4/4 columns) |
| Beers (2,410 rows)   | 80% (4/5 columns)  |

Tech Stack

| Dependency | Purpose               |
|------------|-----------------------|
| Polars     | All data operations   |
| Typer      | CLI framework         |
| Textual    | Interactive TUI       |
| Rich       | CLI output formatting |
| Pydantic 2 | Config validation     |

Optional: Anthropic SDK / OpenAI SDK for LLM Boost | MCP SDK for MCP server

MCP Server (Claude Desktop)

GoldenCheck includes an MCP server for Claude Desktop integration:

pip install goldencheck[mcp]

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "goldencheck": {
      "command": "goldencheck",
      "args": ["mcp-serve"]
    }
  }
}

Available tools:

| Tool              | Description                                                 |
|-------------------|-------------------------------------------------------------|
| scan              | Scan a file for data quality issues (with optional LLM boost) |
| validate          | Validate against pinned rules in goldencheck.yml            |
| profile           | Get column-level statistics and health score                |
| health_score      | Quick A-F grade for a data file                             |
| get_column_detail | Deep-dive into a specific column                            |
| list_checks       | List all available profiler checks                          |
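
For intuition, an A-F health grade like the one health_score returns can be derived from the share of passing checks. The cutoffs below are hypothetical; GoldenCheck's actual grading formula is not documented here:

```python
# Hypothetical A-F grading from the fraction of checks that pass.
def grade(passed: int, total: int) -> str:
    pct = passed / total
    for cutoff, letter in [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]:
        if pct >= cutoff:
            return letter
    return "F"

print(grade(46, 50))  # A (92% of checks passing)
```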

Jupyter / Colab

GoldenCheck renders rich HTML in Jupyter notebooks:

from goldencheck.engine.scanner import scan_file
from goldencheck.engine.confidence import apply_confidence_downgrade
from goldencheck.notebook import ScanResult

findings, profile = scan_file("data.csv")
findings = apply_confidence_downgrade(findings, llm_boost=False)

# Rich HTML display in notebooks
ScanResult(findings=findings, profile=profile)


Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE


From the maker of GoldenMatch — entity resolution toolkit.
