Explains WHY two datasets differ, in plain English, with eval-backed accuracy claims.
Project description
wherefore
Tells you why two datasets differ — not just that they do.
Data diffing tools tell you that 40 rows mismatched in created_at.
None of them tell you why. wherefore is the layer on top of a diff
that answers that question — in plain English, with real example rows
cited — and honestly says "I don't recognize this pattern" rather than
confidently guessing wrong.
Free and key-free out of the box. No account, no server, no API
key required for the diffing and pattern-matching — only the optional
AI narrative (--explain) talks to an external API, and that's off by
default.
pip install -e .
wherefore compare old_export.csv new_export.csv
$ wherefore compare old_export.csv new_export.csv --explain
Calling Claude for 1 cluster(s)...
Compared 5 source rows against 5 target rows.
Matched rows: 5
hire_date: 5 mismatches, matches 'timezone_shift' (confidence 1.00)
AI: Every affected row is shifted forward by exactly 5 hours,
consistent with a UTC-vs-local-time mismatch introduced during
the export. Likely cause: the source system's timestamps were
re-interpreted in the wrong timezone during migration.
Full report written to report.md
That's a real run, real output — not a mockup. Try it on your own files in two minutes: see Quickstart.
Contents
Quickstart · Why this exists · What's built · Architecture · Evals · All flags · Contributing
Quickstart
git clone https://github.com/tracelore/wherefore.git
cd wherefore
./dev_setup.sh
This creates a .venv/, installs everything, and runs the test suite
(should show 316 passed, no API key needed — the test suite uses a
fake AI provider, zero network calls). Safe to re-run.
Then, on any two files of yours:
wherefore compare old_export.csv new_export.csv --output report.md
Works the same with .csv, .json, .parquet, or .xlsx/.xls —
mix and match freely, format is auto-detected per file. No key column
needed; wherefore finds one. If it picks wrong, or you have many
table pairs to check at once (a real migration is dozens of tables,
not one), see usage details below.
Don't have a Parquet or Excel file handy? Make one from the CSVs
above in two lines, run inside the same activated .venv from
dev_setup.sh (so pandas/pyarrow are already available) — note
parse_dates=[...] on any datetime column, since Parquet and Excel
store dates natively and pandas needs to know which column is one
before writing (without it, the column round-trips as plain text and
date-based patterns like timezone_shift won't be detected —
confirmed by testing this exact gap):
python3 -c "
import pandas as pd
df = pd.read_csv('old_export.csv', parse_dates=['hire_date']) # name your actual date column
df.to_parquet('old_export.parquet', index=False)
df.to_excel('old_export.xlsx', index=False)
"
wherefore compare old_export.parquet new_export.parquet # or .xlsx
Want the AI explanation, not just the statistical match?
export ANTHROPIC_API_KEY="sk-ant-..."
wherefore compare old_export.csv new_export.csv --explain
Sensitive-looking values (emails, SSNs, card numbers, phone numbers) are redacted before anything is sent — on by default, see Privacy & data handling.
Manual setup, if you'd rather not run the script
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e ".[dev]"
pytest tests/ -v
Requires Python 3.10+. Tested on 3.10–3.12. On very recent Python
(3.14+), pandas/numpy are compatible, but if pip install fails, a
smaller transitive dependency without 3.14 wheels yet is the likely
cause — try 3.11/3.12 if you hit this.
Why this exists
Imagine two boxes of identical LEGO sets. Someone copied box A into box B, but a few pieces are missing or the wrong color. Most tools that check this say: "12 pieces are different." That's it.
wherefore looks at those 12 differences and says: "These aren't
random — every one has the same color swapped the same way, consistent
with a colorblind sort. That's your root cause." It explains the
pattern behind the differences, not just the differences themselves.
To know if it's actually doing this well — not just sounding plausible — we build our own "messed-up" datasets on purpose, with a known, labeled answer, and grade whether the tool finds it. That's the eval harness, and it's first-class here, not an afterthought.
This is not a thin prompt wrapper around an LLM. The AI sits behind a deterministic clustering and statistical-signature step, and every accuracy claim below is backed by that eval harness against labeled ground truth — not vibes.
What's built
🚧 Actively built in public. The full pipeline is real, end-to-end: statistical detection, AI explanation, and a scored eval harness.
| Formats | CSV, JSON, Parquet, Excel — local or s3://, auto-detected, mix-and-match |
| Modes | One file pair (compare) or a whole directory (compare-dir) |
| Taxonomy | 8 failure patterns built & tested: timezone_shift, truncation, enum_drift, null_type_coercion, float_precision, encoding_mismatch, dedup_failure, key_mismatch |
| AI layer | Verified against the real Claude API twice — manually and via the scored eval harness — 100% match on a small (seven-fixture) sample |
| Privacy | Redacts emails/SSNs/cards/phones before any --explain call, on by default |
| Tests | 316 passing, including a real (mocked) S3 round-trip and end-to-end runs against real generated files |
dedup_failure and key_mismatch are structurally different from the
other six — dedup_failure detects duplicated rows (re-inserted with
a new key, not the same key twice); key_mismatch detects a row whose
join key was reformatted (EMP-1001 vs EMP1001) so it never matched
at all. Both show up as extra/missing rows rather than a column-level
mismatch, both have their own clustering path
(detect_row_presence_patterns), and both are verified by real,
dedicated tests — including a regression test confirming they don't
false-positive on each other's fixtures, and a regression test for a
real false positive caught while building key_mismatch (two
genuinely unrelated keys sharing a domain's ID prefix scored close
enough on a similarity heuristic to need a deterministic check
instead — see TAXONOMY.md). Neither is yet wired into the automated
eval harness above (that harness currently only scores column-mismatch
patterns) — tracked honestly as a gap, not hidden.
Not built yet: wiring dedup_failure/key_mismatch into the eval
harness, more fixture coverage at scale, and database connectivity
(Postgres, MySQL,
SQLite). File-based sources — local and s3:// — and CSV/JSON/Parquet/
Excel are all supported today. See TAXONOMY.md for
the current pattern list and what's planned next.
The harder bugs this surfaced, if you're curious
Building the 4th pattern (null_type_coercion) surfaced three real
bugs spanning the comparison engine, the file loaders, and the eval
harness itself. Building the 5th (float_precision) surfaced a
subtler one: a magnitude-based heuristic that looked right scored a
real false positive on an adversarial test case, fixed by checking the
underlying mechanism (an exact float32 round-trip) directly instead of
approximating its size. Full account, including how each was found and
fixed: TAXONOMY_TODO.md.
Architecture
source file, target file (CSV, JSON, Parquet, or Excel)
│
▼
loaders + key matching (exact by default; --fuzzy-keys for reformatted keys)
│
▼
comparison engine (wraps datacompy; schema-aware diffing)
│
▼
deterministic clustering (groups mismatches; runs cheap statistical
│ signature checks — NO causal claims here)
▼
candidate pattern(s), confidence-scored
│
├─── default: stop here, statistics only (free, no API key)
│
▼ with --explain
AI reasoning layer (Claude; redacts sensitive patterns by default;
│ writes the causal narrative, cites real rows,
│ honestly flags "unrecognized" when nothing fits)
▼
Markdown report (statistics always; AI narrative alongside
the evidence, never instead of it)
Failure patterns are data, not code. Each one is a YAML file under
src/wherefore/taxonomy/patterns/, validated against a strict schema.
Adding a new pattern means writing a YAML file and a small corruptor
function — never touching clustering or reasoning code. See
CONTRIBUTING.md for the contract.
Clustering and reasoning are deliberately separated. Clustering only ever produces statistical observations ("these 12 rows differ by exactly 5 hours"). Causal attribution ("this is a timezone bug") is the AI's job, every time — if clustering started asserting causes, the AI layer would become decorative and the evals would stop measuring anything meaningful.
Evals — why trust the explanations?
Because we control the ground truth. The synthetic data generator
creates clean datasets, then deliberately corrupts them using a known
failure pattern — recording exactly what it did in a committed
ground_truth.json. The eval harness runs the real pipeline against
these labeled fixtures and scores the result as precision/recall per
pattern, tracking "correctly said unrecognized" separately from
"confidently named the wrong pattern" — very different failure modes a
naive right/wrong scorer would conflate.
Statistical mode, free, no API key, against all 7 fixtures:
$ python3 -m evals.harness.run_eval
Total cases: 7
Overall accuracy (correct match + honest abstain): 100.00%
encoding_mismatch: precision=1.00 recall=1.00
enum_drift: precision=1.00 recall=1.00
float_precision: precision=1.00 recall=1.00
null_type_coercion: precision=1.00 recall=1.00
timezone_shift: precision=1.00 recall=1.00
truncation: precision=1.00 recall=1.00
LLM mode (python3 -m evals.harness.run_eval --llm, real API
calls, scores the AI's final answer instead of clustering's raw
statistics) — also 100%, including the one fixture designed to
test something the statistics alone can't: a cluster that legitimately
matches two patterns at once, where the AI correctly picked the right
one by reasoning about the actual values, not by defaulting to
whichever candidate came first.
Both are reproducible — clone the repo, run the commands yourself.
Seven fixtures proves the mechanism works end-to-end against the real
API; it doesn't prove either layer is bulletproof at scale. That's the
honest caveat, and expanding fixture coverage is the tracked next step
in TAXONOMY_TODO.md.
The multi-candidate case, if you want the detail
null_type_coercion and enum_drift can legitimately both match the
same cluster — a null consistently coerced to one sentinel string is,
statistically, also a "consistent value mapping." Clustering reports
both honestly rather than guessing which is "more right," since that's
a causal judgment that belongs to the AI layer, not clustering. The
eval harness scores this correctly too: a true pattern counts as found
if it appears anywhere among the reported candidates, not only if it's
listed first. Full story of how this was found and fixed:
TAXONOMY_TODO.md.
Usage
One file pair
wherefore compare old_export.csv new_export.csv --key employee_id
--key is optional — omit it and wherefore looks for a column that
looks like an identifier (mostly-unique values, often named with "id"
or "key"). If the same record has a differently-formatted key on each
side (e.g. EMP-1001 vs EMP1001, common after a migration), add
--fuzzy-keys.
Files can live in S3, not just on disk — mix and match freely:
pip install "wherefore[s3]" # boto3 is optional, only needed for s3:// paths
wherefore compare s3://old-bucket/accounts.csv s3://new-bucket/accounts.csv
Uses the standard AWS credential chain (env vars, ~/.aws/credentials,
IAM role, AWS_PROFILE) — wherefore doesn't invent its own.
A whole migration, not one table
$ wherefore compare-dir old_exports new_exports --output-dir reports
Found 3 matching file pair(s). Comparing...
[DIFF] accounts.csv: 1 finding(s) (timezone_shift)
[DIFF] patients.csv: 1 finding(s) (truncation)
[OK] transactions.csv: no mismatches
Done: 3 compared, 0 skipped. Reports written to reports/
Files are matched by identical filename between the two
directories — no fuzzy matching at the file level, since guessing
wrong about which two tables you're comparing is worse than guessing
wrong about a row key. A pair that can't be compared (bad format, no
detectable key) is skipped and reported, not fatal to the rest of the
batch. Every compare flag works here too, applied to every pair.
What you get without an API key
Real diffing, real grouping, real pattern matching — and a confidence
score that's a genuine deterministic measurement (e.g. "every
mismatched value differs by exactly the same 5-hour delta"), not an AI
guess. If nothing in the taxonomy matches, wherefore says
pattern unrecognized rather than forcing one.
What --explain adds
The plain-English why, shown alongside the statistical evidence it was reasoned from, not instead of it — so you can check the claim yourself. In testing, the AI correctly identified a genuinely random, non-matching corruption and proposed real alternative hypotheses (a bad join, a mis-wired column) instead of inventing a pattern that wasn't there.
Privacy & data handling
--explain sends mismatched cell values to the Claude API. Before
that happens, values are checked against a redaction layer — emails,
SSNs, credit card numbers, US phone numbers — on by default, no flag
needed. Anything masked is called out in the output
(Redacted before sending to Claude: email). Disable with
--no-redact if you've already vetted your data.
Be precise about scope: this is pattern-based detection of
structurally recognizable sensitive data, not a general PII scanner
— it won't know that a name or a home address is sensitive. Full
detail, including a documented false-positive case found during
testing (long numeric IDs can resemble card numbers):
SECURITY.md.
All flags
Expand
wherefore compare SOURCE TARGET [OPTIONS]
--key TEXT Join key column. Auto-detected if omitted.
--fuzzy-keys Allow approximate key matching (e.g. 'CUST-001' vs 'CUST001').
--output TEXT Where to write the report (default: report.md).
--confidence-threshold FLOAT Minimum confidence to count as a pattern match (default: 0.9).
--explain Generate plain-English AI explanations via the Claude API.
Requires ANTHROPIC_API_KEY. Makes real, billed API calls. Off by default.
--no-redact Disable automatic redaction of emails/SSNs/cards/phones before
sending data to Claude with --explain. Redaction is ON by default.
wherefore compare-dir SOURCE_DIR TARGET_DIR [OPTIONS]
--output-dir TEXT Directory for one report per pair (default: reports).
--key, --fuzzy-keys, --confidence-threshold, --explain, --no-redact Same as `compare`, applied to every pair.
Contributing
Contributions are welcome, especially new taxonomy patterns. Start
with CONTRIBUTING.md — the pattern contract, why
patterns are built corruptor-first rather than YAML-first, and the
design decisions worth knowing before you dig in.
Found a security issue? See SECURITY.md.
License
Apache License 2.0 — see LICENSE. Contributions are
accepted under the same license (see NOTICE for attribution).
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file wherefore-0.1.0.tar.gz.
File metadata
- Download URL: wherefore-0.1.0.tar.gz
- Upload date:
- Size: 82.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
86a3b93857508accc835704f8e79b55344c287cff1bcf234f05ca31572ce645a
|
|
| MD5 |
7bc0e9b3908911974f18c1fb03ea4b25
|
|
| BLAKE2b-256 |
fc922249b18cfb2ca83831695ed79d364cab7724c190168dd0e5cd344e08b1a1
|
File details
Details for the file wherefore-0.1.0-py3-none-any.whl.
File metadata
- Download URL: wherefore-0.1.0-py3-none-any.whl
- Upload date:
- Size: 95.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
519168afa3abebff2d5565b25995685cde2da57cdcf041ecc5f0810c03b698ee
|
|
| MD5 |
7ab7fb37c1b004f4bbda7b1d78100b19
|
|
| BLAKE2b-256 |
3776e03647bd80dcb102d3b9bc3a31990057eb4e040a7562d850c99df7bda4f0
|