GoldenCheck
Data validation that discovers rules from your data so you don't have to write them.
Tools like Great Expectations, Pandera, and Pointblank make you write rules first. GoldenCheck flips it: validate first, keep the rules you care about.
Why GoldenCheck?
| | GoldenCheck | Great Expectations | Pandera | Pointblank |
|---|---|---|---|---|
| Rules | Discovered from data | Written by hand | Written by hand | Written by hand |
| Config | Zero to start | Heavy YAML/Python setup | Decorators/schemas | YAML/Python |
| Interface | CLI + interactive TUI | HTML reports | Exceptions | HTML/notebook |
| Learning curve | One command | Hours/days | Moderate | Moderate |
| LLM enhancement | Yes ($0.01/scan) | No | No | No |
Install
pip install goldencheck
With LLM boost support:
pip install goldencheck[llm]
Quick Start
# Scan a file — discovers issues, launches interactive TUI
goldencheck data.csv
# CLI-only output (no TUI)
goldencheck data.csv --no-tui
# With LLM enhancement (requires API key)
goldencheck data.csv --llm-boost --no-tui
# Validate against saved rules (for CI/pipelines)
goldencheck validate data.csv
# JSON output for CI integration
goldencheck data.csv --no-tui --json
How It Works
1. SCAN → goldencheck data.csv
   GoldenCheck profiles your data and discovers what "healthy" looks like
2. REVIEW → Interactive TUI shows findings sorted by severity
   Each finding has: description, affected rows, sample values
3. PIN → Press Space to promote findings into permanent rules
   Dismiss false positives; they won't come back
4. EXPORT → Press F2 to save rules to goldencheck.yml
   Human-readable YAML with your pinned rules
5. VALIDATE → goldencheck validate data.csv
   Enforce rules in CI with exit codes (0 = pass, 1 = fail)
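Step 5 lends itself to a small CI wrapper. The sketch below is one possible shape, assuming `goldencheck` is installed and on `PATH`. The helper names are hypothetical, and it is an assumption that `--fail-on` can be combined with `validate`; only the command form and the exit codes (0 = pass, 1 = fail) come from this README.

```python
import subprocess


def build_validate_command(data_path, fail_on=None):
    """Assemble the documented `goldencheck validate <file>` invocation."""
    command = ["goldencheck", "validate", data_path]
    if fail_on is not None:
        # The flag table lists "error" and "warning" as levels.
        # Assumption: `validate` honors --fail-on like a scan does.
        command += ["--fail-on", fail_on]
    return command


def describe_exit_code(code):
    """Map the documented exit codes to a label (0 = pass, 1 = fail)."""
    if code == 0:
        return "pass"
    if code == 1:
        return "fail"
    return "unexpected"  # outside the contract described in this README


def run_validation(data_path, fail_on=None):
    """Run the CLI and return (exit_code, label)."""
    result = subprocess.run(build_validate_command(data_path, fail_on))
    return result.returncode, describe_exit_code(result.returncode)
```

A pipeline step could call `run_validation("data.csv")` and pass the returned code to `sys.exit` so the job fails when the rules do.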
What It Detects
Column-Level Profilers
| Profiler | What It Catches | Example |
|---|---|---|
| Type inference | String columns that are actually numeric | "Column age is string but 98% are integer" |
| Nullability | Required vs. optional columns | "0 nulls across 50k rows — likely required" |
| Uniqueness | Primary key candidates, near-duplicates | "100% unique — likely primary key" |
| Format detection | Emails, phones, URLs, dates | "94% email format, 6% malformed" |
| Range & distribution | Outliers, min/max bounds | "3 rows have values >10,000" |
| Cardinality | Low-cardinality enum suggestions | "4 unique values — possible enum" |
| Pattern consistency | Mixed formats within a column | "3 phone formats detected" |
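To make the table concrete, here is a minimal sketch of the kind of per-column statistics that could sit behind the type inference, nullability, uniqueness, and cardinality findings. It is an illustration in plain Python, not GoldenCheck's implementation (which, per the Tech Stack section, runs on Polars); the function name and the returned keys are made up for this example.

```python
def profile_column(values):
    """Summarize one column given as a list of strings, with None for nulls."""
    total = len(values)
    present = [v for v in values if v is not None]

    def looks_like_int(text):
        try:
            int(text.strip())
            return True
        except ValueError:
            return False

    integer_like = sum(1 for v in present if looks_like_int(v))
    distinct = len(set(present))
    return {
        "rows": total,
        "nulls": total - len(present),
        # Share of non-null values that parse as integers.
        "integer_share": integer_like / len(present) if present else 0.0,
        "distinct": distinct,
        "all_unique": bool(present) and distinct == len(present),
    }


ages = ["34", "51", "n/a", "27", None, "42"]
summary = profile_column(ages)
print(summary)
```

A high `integer_share` on a string column corresponds to a finding like "Column age is string but 98% are integer", and a small `distinct` count hints at an enum.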
Cross-Column Profilers
| Profiler | What It Catches |
|---|---|
| Temporal ordering | start_date > end_date violations |
| Null correlation | Columns that are null together (e.g., address + city + zip) |
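The two cross-column checks can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the library's code: rows are plain dicts, the function names are hypothetical, and "null together" is measured as the share of rows where the columns are either all null or all present.

```python
from datetime import date


def temporal_order_violations(rows, start_key, end_key):
    """Return indexes of rows whose start value falls after the end value."""
    violations = []
    for index, row in enumerate(rows):
        start, end = row.get(start_key), row.get(end_key)
        # Rows with a missing date cannot be compared, so they are skipped.
        if start is not None and end is not None and start > end:
            violations.append(index)
    return violations


def null_agreement(rows, columns):
    """Share of rows where the given columns are all null or all present."""
    if not rows:
        return 0.0
    consistent = 0
    for row in rows:
        null_flags = {row.get(column) is None for column in columns}
        if len(null_flags) == 1:
            consistent += 1
    return consistent / len(rows)


rows = [
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1), "city": "Oslo", "zip": "0150"},
    {"start_date": date(2024, 5, 9), "end_date": date(2024, 3, 1), "city": None, "zip": None},
    {"start_date": date(2024, 6, 1), "end_date": None, "city": "Bergen", "zip": None},
]
violations = temporal_order_violations(rows, "start_date", "end_date")
agreement = null_agreement(rows, ["city", "zip"])
print(violations, agreement)
```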
LLM Boost
Add --llm-boost to enhance profiler findings with LLM intelligence. The LLM receives a representative sample of your data and:
- Finds issues profilers miss: semantic understanding (e.g., "12345" in a name column)
- Upgrades severity: knows "emails should be required" even if the profiler only says "INFO"
- Discovers relationships: identifies temporal ordering between columns like signup_date and last_login
- Downgrades false positives: "mixed phone formats are common, not an error"
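This README does not spell out how the representative sample is chosen. As a rough mental model only, one plausible strategy is to send a handful of frequent values plus a few rare ones per column, so that oddities such as "12345" in a name column are likely to be included. The function and its parameters below are hypothetical.

```python
from collections import Counter


def sample_column(values, common=3, rare=2):
    """Pick a few frequent and a few rare distinct values from one column."""
    counts = Counter(v for v in values if v is not None)
    ranked = [value for value, _ in counts.most_common()]
    frequent = ranked[:common]
    # Rare values are taken from the tail, skipping anything already chosen.
    unusual = [v for v in reversed(ranked) if v not in frequent][:rare]
    return frequent + unusual


names = ["Ana", "Ana", "Ben", "Ben", "Ben", "Cy", "Cy", "12345", "Dee"]
sample = sample_column(names)
print(sample)
```

Sampling distinct values rather than rows keeps the payload small, which fits the "one API call, not per-row" cost model described below.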
# Using OpenAI
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui
# Using Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui
Cost: ~$0.01 per scan (one API call with representative samples, not per-row).
Budget control:
export GOLDENCHECK_LLM_BUDGET=0.50 # max spend per scan in USD
Configuration (goldencheck.yml)
version: 1
settings:
  sample_size: 100000
  fail_on: error
columns:
  email:
    type: string
    required: true
    format: email
    unique: true
  age:
    type: integer
    range: [0, 120]
  status:
    type: string
    enum: [active, inactive, pending, closed]
relations:
  - type: temporal_order
    columns: [start_date, end_date]
ignore:
  - column: notes
    check: nullability
Only pinned rules appear in this file — not every finding. The ignore list prevents dismissed findings from reappearing.
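To show what rules like these mean for a single record, here is a minimal sketch that applies the `required`, `range`, and `enum` keys to a row. It assumes ranges are inclusive and mirrors part of the `columns` section as a plain dict instead of parsing YAML; `RULES` and `check_row` are hypothetical names, and the real validator may behave differently.

```python
# Hypothetical in-memory mirror of part of the `columns` section above.
RULES = {
    "email": {"required": True},
    "age": {"range": [0, 120]},
    "status": {"enum": ["active", "inactive", "pending", "closed"]},
}


def check_row(row, rules=RULES):
    """Return a list of human-readable violations for one row (a dict)."""
    problems = []
    for column, rule in rules.items():
        value = row.get(column)
        if value is None:
            if rule.get("required"):
                problems.append(f"{column}: missing required value")
            continue  # the remaining checks need a value
        if "range" in rule:
            low, high = rule["range"]
            # Assumption: both bounds are inclusive.
            if not low <= value <= high:
                problems.append(f"{column}: {value} outside [{low}, {high}]")
        if "enum" in rule and value not in rule["enum"]:
            problems.append(f"{column}: {value!r} not in allowed values")
    return problems


good = check_row({"email": "a@example.com", "age": 34, "status": "active"})
bad = check_row({"email": None, "age": 130, "status": "archived"})
print(good)
print(bad)
```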
CLI Reference
| Command | Description |
|---|---|
| goldencheck <file> | Scan and launch TUI |
| goldencheck scan <file> | Explicit scan |
| goldencheck validate <file> | Validate against goldencheck.yml |
| goldencheck review <file> | Scan + validate, launch TUI |
Flags
| Flag | Description |
|---|---|
| --no-tui | Print results to console |
| --json | JSON output |
| --fail-on <level> | Exit 1 on severity: error or warning |
| --llm-boost | Enable LLM enhancement |
| --llm-provider <name> | LLM provider: anthropic (default) or openai |
| --verbose | Show info-level logs |
| --debug | Show debug-level logs |
| --version | Show version |
Benchmarks
Speed
| Dataset | Time | Throughput |
|---|---|---|
| 1K rows | 0.05s | 19K rows/sec |
| 10K rows | 0.23s | 43K rows/sec |
| 100K rows | 2.29s | 44K rows/sec |
| 1M rows | 2.07s | 482K rows/sec |
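The Throughput column is rows divided by elapsed time. The short sketch below recomputes it from the table; because the listed times are rounded to two decimals, the recomputed figures can differ slightly from the published ones.

```python
# (rows, seconds) pairs copied from the Speed table.
runs = [(1_000, 0.05), (10_000, 0.23), (100_000, 2.29), (1_000_000, 2.07)]


def throughput(rows, seconds):
    """Rows processed per second."""
    return rows / seconds


for rows, seconds in runs:
    print(f"{rows:>9,} rows: {throughput(rows, seconds):,.0f} rows/sec")
```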
Detection Accuracy
| Mode | Column Recall | Cost |
|---|---|---|
| Profiler-only | 87% | $0 |
| With LLM Boost | 100% | ~$0.01 |
Tested on a custom benchmark with 341 planted data quality issues across 9 categories.
Raha Benchmark Datasets
| Dataset | Column Recall |
|---|---|
| Flights (2,376 rows) | 100% (4/4 columns) |
| Beers (2,410 rows) | 80% (4/5 columns) |
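Column recall, as used in these tables, appears to be the fraction of columns with known issues that the tool flagged. A minimal sketch of that calculation follows; the definition is inferred from the "4/5 columns" notation, and the column names are placeholders rather than the real Beers columns.

```python
def column_recall(detected, expected):
    """Fraction of columns with known issues that were flagged."""
    expected = set(expected)
    if not expected:
        raise ValueError("expected must contain at least one column")
    return len(expected & set(detected)) / len(expected)


# Placeholder names; only the 4-of-5 ratio comes from the table.
flagged = {"col_a", "col_b", "col_c", "col_d"}
with_issues = {"col_a", "col_b", "col_c", "col_d", "col_e"}
recall = column_recall(flagged, with_issues)
print(f"{recall:.0%}")
```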
Tech Stack
| Dependency | Purpose |
|---|---|
| Polars | All data operations |
| Typer | CLI framework |
| Textual | Interactive TUI |
| Rich | CLI output formatting |
| Pydantic 2 | Config validation |
Optional: Anthropic SDK / OpenAI SDK for LLM Boost
Contributing
See CONTRIBUTING.md for development setup and guidelines.
License
MIT — see LICENSE
From the maker of GoldenMatch — entity resolution toolkit.
File details
Details for the file goldencheck-0.1.0.tar.gz.
File metadata
- Download URL: goldencheck-0.1.0.tar.gz
- Upload date:
- Size: 100.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1c5d9d76ab44d3794bc30a8af4dd8be870e8d818d5141199f291834baa5e08a4 |
| MD5 | 2432c8ceef27c05978535d0b7ed15a40 |
| BLAKE2b-256 | 1db07565e55971c7826df8f084b03f01f31ed319a71825a723f697f7ca6aaa8c |
File details
Details for the file goldencheck-0.1.0-py3-none-any.whl.
File metadata
- Download URL: goldencheck-0.1.0-py3-none-any.whl
- Upload date:
- Size: 38.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f201fd4350af4bf31034079236af0c0a7ac912b14d0a89380d61b47102e5f637 |
| MD5 | 8f76da5d4b08804d469ffa11f0ef6602 |
| BLAKE2b-256 | 4ba59fe1ec30f35ed50ca10d5c8bffa6122a9edcf7456d8c058b596f484fd96c |