GoldenCheck
Data validation that discovers rules from your data so you don't have to write them. Built by Ben Severn.
Every competitor makes you write rules first. GoldenCheck flips it: validate first, keep the rules you care about.
Why GoldenCheck?
| | GoldenCheck | Great Expectations | Pandera | Pointblank |
|---|---|---|---|---|
| Rules | Discovered from data | Written by hand | Written by hand | Written by hand |
| Config | Zero to start | Heavy YAML/Python setup | Decorators/schemas | YAML/Python |
| Interface | CLI + interactive TUI | HTML reports | Exceptions | HTML/notebook |
| Learning curve | One command | Hours/days | Moderate | Moderate |
| LLM enhancement | Yes ($0.01/scan) | No | No | No |
| Fix suggestions | Yes, in TUI | No | No | No |
| Confidence scoring | Yes (H/M/L per finding) | No | No | No |
| DQBench Score | 88.40 | 21.68 (best-effort) | 32.51 (best-effort) | 6.94 (auto) |
Install
pip install goldencheck
With LLM boost support:
pip install goldencheck[llm]
With deep profiling & baseline support (scipy, numpy):
pip install goldencheck[baseline]
With semantic type inference for baseline (sentence-transformers):
pip install goldencheck[baseline,semantic]
JavaScript / TypeScript
npm install goldencheck
Edge-safe core (browsers, Cloudflare Workers, Vercel Edge):
import { scanData, TabularData } from "goldencheck/core";
Node.js (file reading, CLI, MCP):
import { readFile, scanData } from "goldencheck/node";
Quick Start
# Scan a file — discovers issues, launches interactive TUI
goldencheck data.csv
# CLI-only output (no TUI)
goldencheck data.csv --no-tui
# With LLM enhancement (requires API key)
goldencheck data.csv --llm-boost --no-tui
# Validate against saved rules (for CI/pipelines)
goldencheck validate data.csv
# JSON output for CI integration
goldencheck data.csv --no-tui --json
# Learn baseline (one-time, deep analysis)
goldencheck baseline data.csv
# Scan with drift detection (fast, uses saved baseline)
goldencheck scan new_data.csv
TypeScript Quick Start
// Scan an array of records (edge-safe — works anywhere)
import { scanData, TabularData, Severity } from "goldencheck";
const data = new TabularData([
{ id: 1, email: "alice@example.com", age: 30, status: "active" },
{ id: 2, email: "bob@test.com", age: -5, status: "inactive" },
{ id: 3, email: "not-an-email", age: 25, status: "active" },
]);
const { findings, profile } = scanData(data);
for (const f of findings) {
console.log(`[${f.severity === Severity.ERROR ? "ERROR" : "WARNING"}] ${f.column}: ${f.message}`);
}
// Scan a CSV file (Node.js)
import { readFile, scanData, applyConfidenceDowngrade, healthScore } from "goldencheck/node";
const data = readFile("data.csv");
const result = scanData(data, { domain: "healthcare" });
const findings = applyConfidenceDowngrade(result.findings, false);
// Health score
const byCol: Record<string, { errors: number; warnings: number }> = {};
for (const f of findings) {
if (f.severity >= 2) {
byCol[f.column] ??= { errors: 0, warnings: 0 };
byCol[f.column][f.severity === 3 ? "errors" : "warnings"]++;
}
}
const { grade, points } = healthScore(byCol);
console.log(`Health: ${grade} (${points}/100)`);
// Validate against pinned rules
import { readFile, scanData, validateConfig, validateData } from "goldencheck/node";
import { readFileSync } from "node:fs";
import YAML from "yaml";
const config = validateConfig(YAML.parse(readFileSync("goldencheck.yml", "utf-8")));
const data = readFile("data.csv");
const findings = validateData(data, config);
// Create baseline and detect drift
import { readFile, createBaseline, serializeBaseline, scanData } from "goldencheck/node";
import { runDriftChecks, deserializeBaseline } from "goldencheck";
import { writeFileSync, readFileSync } from "node:fs";
// Learn baseline
const data = readFile("reference.csv");
const baseline = createBaseline(data);
writeFileSync("baseline.json", serializeBaseline(baseline));
// Later: detect drift
const newData = readFile("production.csv");
const saved = deserializeBaseline(readFileSync("baseline.json", "utf-8"));
const driftFindings = runDriftChecks(newData, saved);
// LLM-enhanced scanning (edge-safe)
import { scanData, TabularData, callLlm, parseLlmResponse, mergeLlmFindings, buildSampleBlocks } from "goldencheck";
const data = new TabularData(records);
const result = scanData(data, { returnSample: true });
const blocks = buildSampleBlocks(result.sample, result.findings);
const { text } = await callLlm("anthropic", JSON.stringify(blocks));
const llmResponse = parseLlmResponse(text);
if (llmResponse) {
const enhanced = mergeLlmFindings(result.findings, llmResponse);
}
How It Works
1. SCAN → goldencheck data.csv
GoldenCheck profiles your data and discovers what "healthy" looks like
2. REVIEW → Interactive TUI shows findings sorted by severity
Each finding has: description, affected rows, sample values
3. PIN → Press Space to promote findings into permanent rules
Dismiss false positives — they won't come back
4. EXPORT → Press F2 to save rules to goldencheck.yml
Human-readable YAML with your pinned rules
5. VALIDATE → goldencheck validate data.csv
Enforce rules in CI with exit codes (0 = pass, 1 = fail)
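Step 5 plugs straight into CI. Here is an illustrative GitHub Actions job — only the goldencheck commands are documented above; the surrounding workflow scaffolding is a generic sketch (see also goldencheck-action for a ready-made Action):

```yaml
# Illustrative CI job — workflow structure is a generic sketch;
# only the goldencheck commands come from the docs above.
jobs:
  data-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install goldencheck
      # Exit code 0 = pass, 1 = fail, so the job fails on violations
      - run: goldencheck validate data.csv
      # Optional: machine-readable findings for later steps
      - run: goldencheck data.csv --no-tui --json > findings.json
```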
What It Detects
Column-Level Profilers
| Profiler | What It Catches | Example |
|---|---|---|
| Type inference | String columns that are actually numeric | "Column age is string but 98% are integer" |
| Nullability | Required vs. optional columns | "0 nulls across 50k rows — likely required" |
| Uniqueness | Primary key candidates, near-duplicates | "100% unique — likely primary key" |
| Format detection | Emails, phones, URLs, dates | "94% email format, 6% malformed" |
| Range & distribution | Outliers, min/max bounds | "3 rows have values >10,000" |
| Cardinality | Low-cardinality enum suggestions | "4 unique values — possible enum" |
| Pattern consistency | Mixed formats within a column | "3 phone formats detected" |
Cross-Column Profilers
| Profiler | What It Catches |
|---|---|
| Temporal ordering | start_date > end_date violations |
| Null correlation | Columns that are null together (e.g., address + city + zip) |
| Numeric cross-column | value > max violations (e.g., claim_amount > policy_max) |
| Age vs DOB | Age column doesn't match calculated age from date_of_birth |
Baseline Deep Profiling & Drift Detection
Run goldencheck baseline once to build a statistical profile of healthy data. On every subsequent scan, GoldenCheck compares the new data against the saved baseline and reports drift across 13 check types:
| Check Type | What It Catches |
|---|---|
| `distribution_drift` | Value distribution has shifted significantly |
| `entropy_drift` | Entropy of column values has changed |
| `bound_violation` | Values exceed historical min/max bounds |
| `benford_drift` | Leading-digit distribution deviates from Benford's Law |
| `fd_violation` | Functional dependency between columns is broken |
| `key_uniqueness_loss` | Previously unique column now has duplicates |
| `temporal_order_drift` | Historical column ordering constraint violated |
| `type_drift` | Dominant semantic type of column has changed |
| `correlation_break` | Previously correlated columns are no longer correlated |
| `new_correlation` | New unexpected correlation appeared |
| `pattern_drift` | Value format/pattern distribution has shifted |
| `new_pattern` | New structural patterns appeared in a column |
The baseline is built using 6 techniques: statistical profiler (distributions, Benford's Law, entropy), constraint miner (functional dependencies, temporal orders), semantic type inferrer (embeddings + keywords), correlation analyzer (Pearson, Cramér's V), pattern grammar inducer, and confidence prior builder.
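To make the `benford_drift` idea concrete, here is a minimal, self-contained sketch of a leading-digit check — illustrative only; GoldenCheck's actual implementation and thresholds may differ:

```python
# Illustrative leading-digit check in the spirit of benford_drift.
import math
from collections import Counter

def benford_deviation(values: list[int]) -> float:
    """Total variation distance between observed leading-digit
    frequencies and Benford's Law (0 = perfect agreement, max 1)."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    n = len(digits)
    expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    observed = Counter(digits)
    return sum(abs(observed.get(d, 0) / n - p) for d, p in expected.items()) / 2

# Amounts that all start with 9 deviate far more than a Benford-like mix
skewed = [9100, 9200, 9300, 9450]
mixed = [1, 12, 1300, 2, 25, 3, 47, 5, 81]
print(benford_deviation(skewed) > benford_deviation(mixed))  # True
```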
Domain Packs
Improve detection accuracy with domain-specific type definitions:
goldencheck scan data.csv --domain healthcare # NPI, ICD, insurance, patient types
goldencheck scan data.csv --domain finance # accounts, routing, CUSIP, transactions
goldencheck scan data.csv --domain ecommerce # SKUs, orders, tracking, products
Domain packs add semantic types that reduce false positives and improve classification for industry-specific data.
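Conceptually, a domain pack contributes extra semantic types recognized by value patterns. The sketch below is a hypothetical illustration — the type names and regexes are simplified stand-ins, not GoldenCheck's actual definitions:

```python
# Hypothetical illustration of what a domain pack adds: extra semantic
# types matched against sampled values. Patterns here are simplified.
import re

healthcare_pack = {
    "npi": re.compile(r"^\d{10}$"),            # 10-digit provider IDs
    "icd10_code": re.compile(r"^[A-Z]\d{2}"),  # e.g. "E11", "J45.901"
}

def classify(sample_values, pack):
    """Return the first semantic type whose pattern matches every sample."""
    for type_name, pattern in pack.items():
        if all(pattern.match(str(v)) for v in sample_values):
            return type_name
    return None

print(classify(["1234567890", "9876543210"], healthcare_pack))  # npi
```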
Schema Diff
Compare two versions of a data file:
goldencheck diff data.csv # compare against git HEAD
goldencheck diff old.csv new.csv # compare two files
goldencheck diff data.csv --ref main # compare against a branch
Auto-Fix
Apply automated fixes to clean your data:
goldencheck fix data.csv # safe: trim, normalize, fix encoding
goldencheck fix data.csv --mode moderate # + standardize case
goldencheck fix data.csv --mode aggressive --force # + coerce types
goldencheck fix data.csv --dry-run # preview changes
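The "safe" tier roughly corresponds to lossless cleanups. A minimal sketch of what trim/normalize fixes do to a single value — the shipped fixer's exact rules may differ:

```python
# Illustrative "safe" fix: trim, collapse internal whitespace,
# and normalize Unicode to NFC. Lossless for ordinary text.
import unicodedata

def safe_fix(value: str) -> str:
    """Trim, collapse internal whitespace, normalize to NFC."""
    return unicodedata.normalize("NFC", " ".join(value.split()))

print(safe_fix("  Alice   Smith\t"))  # "Alice Smith"
```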
Watch Mode
Continuously monitor a directory for data quality:
goldencheck watch data/ --interval 30 # re-scan every 30s
goldencheck watch data/ --exit-on error # CI mode: fail on first error
REST API
Run GoldenCheck as a microservice:
goldencheck serve --port 8000
# Scan via file upload
curl -X POST http://localhost:8000/scan --data-binary @data.csv
# Scan via URL
curl -X POST http://localhost:8000/scan/url -d '{"url": "https://example.com/data.csv"}'
Database Scanning
Scan tables directly — no CSV export needed:
pip install goldencheck[db]
goldencheck scan-db "postgresql://user:pass@host/db" --table orders
goldencheck scan-db "snowflake://..." --query "SELECT * FROM orders WHERE date > '2024-01-01'"
Scheduled Runs
Cron-like scheduling with webhook notifications:
goldencheck schedule data/*.csv --interval hourly --webhook https://hooks.slack.com/...
goldencheck schedule data/*.csv --interval daily --notify-on grade-drop
LLM Boost
Add --llm-boost to enhance profiler findings with LLM intelligence. The LLM receives a representative sample of your data and:
- Finds issues profilers miss — semantic understanding (e.g., "12345" in a name column)
- Upgrades severity — knows "emails should be required" even if the profiler only says "INFO"
- Discovers relationships — identifies temporal ordering between columns like `signup_date` and `last_login`
- Downgrades false positives — "mixed phone formats are common, not an error"
# Using OpenAI
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui
# Using Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui
Cost: ~$0.01 per scan (one API call with representative samples, not per-row).
Budget control:
export GOLDENCHECK_LLM_BUDGET=0.50 # max spend per scan in USD
Configuration (goldencheck.yml)
version: 1
settings:
sample_size: 100000
fail_on: error
columns:
email:
type: string
required: true
format: email
unique: true
age:
type: integer
range: [0, 120]
status:
type: string
enum: [active, inactive, pending, closed]
relations:
- type: temporal_order
columns: [start_date, end_date]
ignore:
- column: notes
check: nullability
Only pinned rules appear in this file — not every finding. The ignore list prevents dismissed findings from reappearing.
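Conceptually, the ignore list suppresses findings by (column, check) pair. A minimal sketch — the real matching logic may be richer:

```python
# Conceptual sketch of how an ignore list filters findings.
ignore = [{"column": "notes", "check": "nullability"}]
findings = [
    {"column": "notes", "check": "nullability", "message": "87% null"},
    {"column": "email", "check": "format", "message": "6% malformed"},
]
ignored = {(i["column"], i["check"]) for i in ignore}
kept = [f for f in findings if (f["column"], f["check"]) not in ignored]
print([f["column"] for f in kept])  # ['email']
```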
CLI Reference
| Command | Description |
|---|---|
| `goldencheck <file>` | Scan and launch TUI |
| `goldencheck scan <file>` | Explicit scan (supports `--smart`, `--guided`) |
| `goldencheck validate <file>` | Validate against goldencheck.yml |
| `goldencheck review <file>` | Scan + validate, launch TUI |
| `goldencheck init <file>` | Interactive setup wizard (scan → config → CI) |
| `goldencheck diff <file> [file2]` | Compare two files or against git HEAD |
| `goldencheck watch <dir>` | Poll directory, re-scan on change |
| `goldencheck fix <file>` | Auto-fix data quality issues |
| `goldencheck baseline <file>` | Deep-profile data and save statistical baseline to YAML |
| `goldencheck learn <file>` | Generate LLM validation rules |
| `goldencheck history` | Show scan history and trends |
| `goldencheck serve` | Start REST API server |
| `goldencheck scan-db <conn>` | Scan a database table directly |
| `goldencheck schedule <files>` | Run scans on a cron schedule |
| `goldencheck mcp-serve` | Start MCP server (19 tools) |
Flags
| Flag | Description |
|---|---|
| `--no-tui` | Print results to console |
| `--json` | JSON output |
| `--fail-on <level>` | Exit 1 on severity: `error` or `warning` |
| `--domain <name>` | Domain pack: `healthcare`, `finance`, `ecommerce` |
| `--llm-boost` | Enable LLM enhancement |
| `--llm-provider <name>` | LLM provider: `anthropic` (default) or `openai` |
| `--mode <level>` | Fix mode: `safe`, `moderate`, `aggressive` |
| `--smart` | Auto-triage: pin high-confidence, dismiss low |
| `--guided` | Walk through findings one-by-one |
| `--webhook <url>` | POST findings to Slack/PagerDuty/any URL |
| `--notify-on <trigger>` | Webhook trigger: `grade-drop`, `any-error`, `any-warning` |
| `--baseline <path>` | Path to baseline YAML for drift detection |
| `--no-baseline` | Skip auto-discovery of goldencheck_baseline.yaml |
| `--skip <technique>` | Skip a baseline technique (can repeat) |
| `-o <path>` | Output path for baseline file (default: goldencheck_baseline.yaml) |
| `--update` | Update existing baseline instead of overwriting |
| `--version` | Show version |
TypeScript CLI
npx goldencheck-js scan data.csv --json
npx goldencheck-js scan data.csv --domain healthcare
npx goldencheck-js health-score data.csv
npx goldencheck-js profile data.csv
npx goldencheck-js validate data.csv --config goldencheck.yml
npx goldencheck-js baseline data.csv --output baseline.json
npx goldencheck-js fix data.csv --mode safe
npx goldencheck-js diff old.csv new.csv
npx goldencheck-js demo
TypeScript Architecture
goldencheck (npm)
├── goldencheck/core # Edge-safe: browsers, Workers, Edge Runtime
│ ├── types # Finding, Severity, DatasetProfile, Config types
│ ├── data # TabularData — zero-dep columnar abstraction
│ ├── profilers # 10 column profilers + 4 relation profilers
│ ├── semantic # Type classifier, suppression, 3 domain packs
│ ├── engine # Scanner, confidence, validator, triage, differ, fixer
│ ├── baseline # Statistical profiling, constraints, correlation, patterns
│ ├── drift # 13 drift checks against saved baseline
│ ├── llm # Anthropic + OpenAI via fetch(), merger, budget
│ ├── agent # Strategy, handoff, review queue
│ └── reporters # JSON, CI
└── goldencheck/node # Node.js >= 20
├── reader # CSV, Parquet (via nodejs-polars)
├── mcp # MCP server (7 tools)
├── a2a # Agent-to-Agent HTTP server
├── tui # ANSI terminal output
├── db-scanner # Postgres, MySQL, SQLite
└── watcher # Directory polling
Benchmarks
Speed
| Dataset | Time | Throughput |
|---|---|---|
| 1K rows | 0.05s | 19K rows/sec |
| 10K rows | 0.23s | 43K rows/sec |
| 100K rows | 2.29s | 44K rows/sec |
| 1M rows | 2.07s | 482K rows/sec |
DQBench v1.0 — Head-to-Head
| Tool | Mode | DQBench Score |
|---|---|---|
| GoldenCheck | zero-config | 88.40 |
| Pandera | best-effort rules | 32.51 |
| Soda Core | best-effort rules | 22.36 |
| Great Expectations | best-effort rules | 21.68 |
GoldenCheck's zero-config discovery outperforms every competitor — even when they have hand-written rules.
Run the benchmark yourself:
pip install dqbench goldencheck
dqbench run goldencheck
Detection Accuracy
| Mode | Column Recall | Cost |
|---|---|---|
| Profiler-only (v0.1.0) | 87% | $0 |
| Profiler-only (v0.2.0 with confidence) | 100% | $0 |
| With LLM Boost | 100% | ~$0.003-0.01 |
Tested on a custom benchmark with 341 planted data quality issues across 9 categories.
v0.2.0 improvements: minority wrong-type detection, range profiler chaining, broader temporal heuristics, and confidence scoring pushed profiler-only recall from 87% to 100%.
Raha Benchmark Datasets
| Dataset | Column Recall |
|---|---|
| Flights (2,376 rows) | 100% (4/4 columns) |
| Beers (2,410 rows) | 80% (4/5 columns) |
Tech Stack
| Dependency | Purpose |
|---|---|
| Polars | All data operations |
| Typer | CLI framework |
| Textual | Interactive TUI |
| Rich | CLI output formatting |
| Pydantic 2 | Config validation |
Optional:
- Anthropic SDK / OpenAI SDK — LLM Boost
- MCP SDK — MCP server
- scipy + numpy — deep baseline profiling (`[baseline]`)
- sentence-transformers — semantic type inference in baseline (`[semantic]`)
TypeScript / Node.js
| Dependency | Purpose |
|---|---|
| Zero runtime deps | Core package has no dependencies (edge-safe) |
| nodejs-polars | Parquet reading (optional, Node.js only) |
| csv-parse | CSV reading (Node.js only) |
| @modelcontextprotocol/sdk | MCP server (Node.js only) |
MCP Server (Claude Desktop)
GoldenCheck includes an MCP server for Claude Desktop integration:
pip install goldencheck[mcp]
Add to your Claude Desktop config (claude_desktop_config.json):
{
"mcpServers": {
"goldencheck": {
"command": "goldencheck",
"args": ["mcp-serve"]
}
}
}
Available tools:
| Tool | Description |
|---|---|
| `scan` | Scan a file for data quality issues (with optional LLM boost) |
| `validate` | Validate against pinned rules in goldencheck.yml |
| `profile` | Get column-level statistics and health score |
| `health_score` | Quick A-F grade for a data file |
| `get_column_detail` | Deep-dive into a specific column |
| `list_checks` | List all available profiler checks |
Remote MCP Server
GoldenCheck is available as a hosted MCP server on Smithery — connect from any MCP client without installing anything.
Claude Desktop / Claude Code:
{
"mcpServers": {
"goldencheck": {
"url": "https://goldencheck-mcp-production.up.railway.app/mcp/"
}
}
}
Local server:
pip install goldencheck[mcp]
goldencheck mcp-serve
19 tools available: scan files, validate rules, profile columns, health-score datasets, auto-configure validation, explain findings, compare domains, suggest fixes.
Jupyter / Colab
GoldenCheck renders rich HTML in Jupyter notebooks:
from goldencheck.engine.scanner import scan_file
from goldencheck.engine.confidence import apply_confidence_downgrade
from goldencheck.notebook import ScanResult
findings, profile = scan_file("data.csv")
findings = apply_confidence_downgrade(findings, llm_boost=False)
# Rich HTML display in notebooks
ScanResult(findings=findings, profile=profile)
API Quick Reference
Python
import goldencheck
# Scan a CSV for quality issues
findings, profile = goldencheck.scan_file("data.csv")
for f in findings:
print(f"[{f.severity}] {f.column}: {f.check} — {f.message}")
# Create baseline and detect drift
from goldencheck import create_baseline, scan_file
baseline = create_baseline("data.csv")
baseline.save("goldencheck_baseline.yaml")
findings, profile = scan_file("data.csv", baseline="goldencheck_baseline.yaml")
# Health score
score = goldencheck.health_score("data.csv")
print(score) # e.g. "B (78/100)"
TypeScript
import { scanData, TabularData, Severity } from "goldencheck";
// Scan records (edge-safe)
const data = new TabularData(records);
const { findings, profile } = scanData(data);
for (const f of findings) {
console.log(`[${f.severity === Severity.ERROR ? "ERROR" : "WARNING"}] ${f.column}: ${f.message}`);
}
import { readFile, scanData, applyConfidenceDowngrade, healthScore } from "goldencheck/node";
// Scan a CSV file (Node.js)
const data = readFile("data.csv");
const result = scanData(data, { domain: "healthcare" });
const findings = applyConfidenceDowngrade(result.findings, false);
// Health score
const byCol: Record<string, { errors: number; warnings: number }> = {};
for (const f of findings) {
if (f.severity >= 2) {
byCol[f.column] ??= { errors: 0, warnings: 0 };
byCol[f.column][f.severity === 3 ? "errors" : "warnings"]++;
}
}
const { grade, points } = healthScore(byCol);
console.log(`Health: ${grade} (${points}/100)`);
import { readFile, createBaseline, serializeBaseline } from "goldencheck/node";
import { runDriftChecks, deserializeBaseline } from "goldencheck";
import { writeFileSync, readFileSync } from "node:fs";
// Create baseline and detect drift
const data = readFile("reference.csv");
const baseline = createBaseline(data);
writeFileSync("baseline.json", serializeBaseline(baseline));
const newData = readFile("production.csv");
const saved = deserializeBaseline(readFileSync("baseline.json", "utf-8"));
const driftFindings = runDriftChecks(newData, saved);
Contributing
See CONTRIBUTING.md for development setup and guidelines.
Author
Ben Severn
License
MIT — see LICENSE
Part of the Golden Suite
| Tool | Purpose | Install |
|---|---|---|
| GoldenCheck | Validate & profile data quality | pip install goldencheck / npm install goldencheck |
| GoldenFlow | Transform & standardize data | pip install goldenflow |
| GoldenMatch | Deduplicate & match records | pip install goldenmatch |
| GoldenPipe | Orchestrate the full pipeline | pip install goldenpipe |
Companion projects:
- dbt-goldencheck — data validation as a dbt test.
- goldencheck-types — community-contributed domain type packs.
- goldencheck-action — GitHub Action for CI with PR comments.