Explains WHY two datasets differ, in plain English, with eval-backed accuracy claims.

These details have not been verified by PyPI

Project links

Project description

wherefore

Tells you why two datasets differ — not just that they do.

Data diffing tools tell you that 40 rows mismatched in created_at. None of them tell you why. wherefore is the layer on top of a diff that answers that question — in plain English, with real example rows cited — and honestly says "I don't recognize this pattern" rather than confidently guessing wrong.

Free and key-free out of the box. No account, no server, no API key required for the diffing and pattern-matching — only the optional AI narrative (--explain) talks to an external API, and that's off by default.

pip install -e .
wherefore compare old_export.csv new_export.csv

$ wherefore compare old_export.csv new_export.csv --explain
Calling Claude for 1 cluster(s)...
Compared 5 source rows against 5 target rows.
Matched rows: 5
  hire_date: 5 mismatches, matches 'timezone_shift' (confidence 1.00)
    AI: Every affected row is shifted forward by exactly 5 hours,
    consistent with a UTC-vs-local-time mismatch introduced during
    the export. Likely cause: the source system's timestamps were
    re-interpreted in the wrong timezone during migration.

Full report written to report.md

That's a real run, real output — not a mockup. Try it on your own files in two minutes: see Quickstart.

Quickstart · Why this exists · What's built · Architecture · Evals · All flags · Contributing

Quickstart

git clone https://github.com/tracelore/wherefore.git
cd wherefore
./dev_setup.sh

This creates a .venv/, installs everything, and runs the test suite (should show 316 passed, no API key needed — the test suite uses a fake AI provider, zero network calls). Safe to re-run.

Then, on any two files of yours:

wherefore compare old_export.csv new_export.csv --output report.md

Works the same with .csv, .json, .parquet, or .xlsx/.xls — mix and match freely, format is auto-detected per file. No key column needed; wherefore finds one. If it picks wrong, or you have many table pairs to check at once (a real migration is dozens of tables, not one), see usage details below.

Don't have a Parquet or Excel file handy? Make one from the CSVs above in two lines, run inside the same activated .venv from dev_setup.sh (so pandas/pyarrow are already available) — note parse_dates=[...] on any datetime column, since Parquet and Excel store dates natively and pandas needs to know which column is one before writing (without it, the column round-trips as plain text and date-based patterns like timezone_shift won't be detected — confirmed by testing this exact gap):

python3 -c "
import pandas as pd
df = pd.read_csv('old_export.csv', parse_dates=['hire_date'])  # name your actual date column
df.to_parquet('old_export.parquet', index=False)
df.to_excel('old_export.xlsx', index=False)
"
wherefore compare old_export.parquet new_export.parquet   # or .xlsx

Want the AI explanation, not just the statistical match?

export ANTHROPIC_API_KEY="sk-ant-..."
wherefore compare old_export.csv new_export.csv --explain

Sensitive-looking values (emails, SSNs, card numbers, phone numbers) are redacted before anything is sent — on by default, see Privacy & data handling.

Manual setup, if you'd rather not run the script

python3 -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e ".[dev]"
pytest tests/ -v

Requires Python 3.10+. Tested on 3.10–3.12. On very recent Python (3.14+), pandas/numpy are compatible, but if pip install fails, a smaller transitive dependency without 3.14 wheels yet is the likely cause — try 3.11/3.12 if you hit this.

Why this exists

Imagine two boxes of identical LEGO sets. Someone copied box A into box B, but a few pieces are missing or the wrong color. Most tools that check this say: "12 pieces are different." That's it.

wherefore looks at those 12 differences and says: "These aren't random — every one has the same color swapped the same way, consistent with a colorblind sort. That's your root cause." It explains the pattern behind the differences, not just the differences themselves.

To know if it's actually doing this well — not just sounding plausible — we build our own "messed-up" datasets on purpose, with a known, labeled answer, and grade whether the tool finds it. That's the eval harness, and it's first-class here, not an afterthought.

This is not a thin prompt wrapper around an LLM. The AI sits behind a deterministic clustering and statistical-signature step, and every accuracy claim below is backed by that eval harness against labeled ground truth — not vibes.

What's built

🚧 Actively built in public. The full pipeline is real, end-to-end: statistical detection, AI explanation, and a scored eval harness.


Formats	CSV, JSON, Parquet, Excel — local or `s3://`, auto-detected, mix-and-match
Modes	One file pair (`compare`) or a whole directory (`compare-dir`)
Taxonomy	8 failure patterns built & tested: `timezone_shift`, `truncation`, `enum_drift`, `null_type_coercion`, `float_precision`, `encoding_mismatch`, `dedup_failure`, `key_mismatch`
AI layer	Verified against the real Claude API twice — manually and via the scored eval harness — 100% match on a small (seven-fixture) sample
Privacy	Redacts emails/SSNs/cards/phones before any `--explain` call, on by default
Tests	316 passing, including a real (mocked) S3 round-trip and end-to-end runs against real generated files

dedup_failure and key_mismatch are structurally different from the other six — dedup_failure detects duplicated rows (re-inserted with a new key, not the same key twice); key_mismatch detects a row whose join key was reformatted (EMP-1001 vs EMP1001) so it never matched at all. Both show up as extra/missing rows rather than a column-level mismatch, both have their own clustering path (detect_row_presence_patterns), and both are verified by real, dedicated tests — including a regression test confirming they don't false-positive on each other's fixtures, and a regression test for a real false positive caught while building key_mismatch (two genuinely unrelated keys sharing a domain's ID prefix scored close enough on a similarity heuristic to need a deterministic check instead — see TAXONOMY.md). Neither is yet wired into the automated eval harness above (that harness currently only scores column-mismatch patterns) — tracked honestly as a gap, not hidden.

Not built yet: wiring dedup_failure/key_mismatch into the eval harness, more fixture coverage at scale, and database connectivity (Postgres, MySQL, SQLite). File-based sources — local and s3:// — and CSV/JSON/Parquet/ Excel are all supported today. See TAXONOMY.md for the current pattern list and what's planned next.

The harder bugs this surfaced, if you're curious

Building the 4th pattern (null_type_coercion) surfaced three real bugs spanning the comparison engine, the file loaders, and the eval harness itself. Building the 5th (float_precision) surfaced a subtler one: a magnitude-based heuristic that looked right scored a real false positive on an adversarial test case, fixed by checking the underlying mechanism (an exact float32 round-trip) directly instead of approximating its size. Full account, including how each was found and fixed: TAXONOMY_TODO.md.

Architecture

source file, target file        (CSV, JSON, Parquet, or Excel)
        │
        ▼
 loaders + key matching   (exact by default; --fuzzy-keys for reformatted keys)
        │
        ▼
 comparison engine        (wraps datacompy; schema-aware diffing)
        │
        ▼
 deterministic clustering  (groups mismatches; runs cheap statistical
        │                   signature checks — NO causal claims here)
        ▼
 candidate pattern(s), confidence-scored
        │
        ├─── default: stop here, statistics only (free, no API key)
        │
        ▼  with --explain
 AI reasoning layer        (Claude; redacts sensitive patterns by default;
        │                   writes the causal narrative, cites real rows,
        │                   honestly flags "unrecognized" when nothing fits)
        ▼
 Markdown report           (statistics always; AI narrative alongside
                             the evidence, never instead of it)

Failure patterns are data, not code. Each one is a YAML file under src/wherefore/taxonomy/patterns/, validated against a strict schema. Adding a new pattern means writing a YAML file and a small corruptor function — never touching clustering or reasoning code. See CONTRIBUTING.md for the contract.

Clustering and reasoning are deliberately separated. Clustering only ever produces statistical observations ("these 12 rows differ by exactly 5 hours"). Causal attribution ("this is a timezone bug") is the AI's job, every time — if clustering started asserting causes, the AI layer would become decorative and the evals would stop measuring anything meaningful.

Evals — why trust the explanations?

Because we control the ground truth. The synthetic data generator creates clean datasets, then deliberately corrupts them using a known failure pattern — recording exactly what it did in a committed ground_truth.json. The eval harness runs the real pipeline against these labeled fixtures and scores the result as precision/recall per pattern, tracking "correctly said unrecognized" separately from "confidently named the wrong pattern" — very different failure modes a naive right/wrong scorer would conflate.

Statistical mode, free, no API key, against all 7 fixtures:

$ python3 -m evals.harness.run_eval
Total cases: 7
Overall accuracy (correct match + honest abstain): 100.00%

  encoding_mismatch: precision=1.00 recall=1.00
  enum_drift: precision=1.00 recall=1.00
  float_precision: precision=1.00 recall=1.00
  null_type_coercion: precision=1.00 recall=1.00
  timezone_shift: precision=1.00 recall=1.00
  truncation: precision=1.00 recall=1.00

LLM mode (python3 -m evals.harness.run_eval --llm, real API calls, scores the AI's final answer instead of clustering's raw statistics) — also 100%, including the one fixture designed to test something the statistics alone can't: a cluster that legitimately matches two patterns at once, where the AI correctly picked the right one by reasoning about the actual values, not by defaulting to whichever candidate came first.

Both are reproducible — clone the repo, run the commands yourself. Seven fixtures proves the mechanism works end-to-end against the real API; it doesn't prove either layer is bulletproof at scale. That's the honest caveat, and expanding fixture coverage is the tracked next step in TAXONOMY_TODO.md.

The multi-candidate case, if you want the detail

null_type_coercion and enum_drift can legitimately both match the same cluster — a null consistently coerced to one sentinel string is, statistically, also a "consistent value mapping." Clustering reports both honestly rather than guessing which is "more right," since that's a causal judgment that belongs to the AI layer, not clustering. The eval harness scores this correctly too: a true pattern counts as found if it appears anywhere among the reported candidates, not only if it's listed first. Full story of how this was found and fixed: TAXONOMY_TODO.md.

Usage

One file pair

wherefore compare old_export.csv new_export.csv --key employee_id

--key is optional — omit it and wherefore looks for a column that looks like an identifier (mostly-unique values, often named with "id" or "key"). If the same record has a differently-formatted key on each side (e.g. EMP-1001 vs EMP1001, common after a migration), add --fuzzy-keys.

Files can live in S3, not just on disk — mix and match freely:

pip install "wherefore[s3]"   # boto3 is optional, only needed for s3:// paths
wherefore compare s3://old-bucket/accounts.csv s3://new-bucket/accounts.csv

Uses the standard AWS credential chain (env vars, ~/.aws/credentials, IAM role, AWS_PROFILE) — wherefore doesn't invent its own.

A whole migration, not one table

$ wherefore compare-dir old_exports new_exports --output-dir reports
Found 3 matching file pair(s). Comparing...

  [DIFF] accounts.csv: 1 finding(s) (timezone_shift)
  [DIFF] patients.csv: 1 finding(s) (truncation)
  [OK] transactions.csv: no mismatches

Done: 3 compared, 0 skipped. Reports written to reports/

Files are matched by identical filename between the two directories — no fuzzy matching at the file level, since guessing wrong about which two tables you're comparing is worse than guessing wrong about a row key. A pair that can't be compared (bad format, no detectable key) is skipped and reported, not fatal to the rest of the batch. Every compare flag works here too, applied to every pair.

What you get without an API key

Real diffing, real grouping, real pattern matching — and a confidence score that's a genuine deterministic measurement (e.g. "every mismatched value differs by exactly the same 5-hour delta"), not an AI guess. If nothing in the taxonomy matches, wherefore says pattern unrecognized rather than forcing one.

What `--explain` adds

The plain-English why, shown alongside the statistical evidence it was reasoned from, not instead of it — so you can check the claim yourself. In testing, the AI correctly identified a genuinely random, non-matching corruption and proposed real alternative hypotheses (a bad join, a mis-wired column) instead of inventing a pattern that wasn't there.

Privacy & data handling

--explain sends mismatched cell values to the Claude API. Before that happens, values are checked against a redaction layer — emails, SSNs, credit card numbers, US phone numbers — on by default, no flag needed. Anything masked is called out in the output (Redacted before sending to Claude: email). Disable with --no-redact if you've already vetted your data.

Be precise about scope: this is pattern-based detection of structurally recognizable sensitive data, not a general PII scanner — it won't know that a name or a home address is sensitive. Full detail, including a documented false-positive case found during testing (long numeric IDs can resemble card numbers): SECURITY.md.

All flags

Expand

wherefore compare SOURCE TARGET [OPTIONS]

  --key TEXT                   Join key column. Auto-detected if omitted.
  --fuzzy-keys                 Allow approximate key matching (e.g. 'CUST-001' vs 'CUST001').
  --output TEXT                Where to write the report (default: report.md).
  --confidence-threshold FLOAT Minimum confidence to count as a pattern match (default: 0.9).
  --explain                    Generate plain-English AI explanations via the Claude API.
                                Requires ANTHROPIC_API_KEY. Makes real, billed API calls. Off by default.
  --no-redact                  Disable automatic redaction of emails/SSNs/cards/phones before
                                sending data to Claude with --explain. Redaction is ON by default.

wherefore compare-dir SOURCE_DIR TARGET_DIR [OPTIONS]

  --output-dir TEXT             Directory for one report per pair (default: reports).
  --key, --fuzzy-keys, --confidence-threshold, --explain, --no-redact   Same as `compare`, applied to every pair.

Contributing

Contributions are welcome, especially new taxonomy patterns. Start with CONTRIBUTING.md — the pattern contract, why patterns are built corruptor-first rather than YAML-first, and the design decisions worth knowing before you dig in.

Found a security issue? See SECURITY.md.

License

Apache License 2.0 — see LICENSE. Contributions are accepted under the same license (see NOTICE for attribution).

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4.0

Jul 2, 2026

0.3.1

Jun 25, 2026

0.3.0

Jun 22, 2026

0.2.0

Jun 22, 2026

This version

0.1.0

Jun 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wherefore-0.1.0.tar.gz (82.3 kB view details)

Uploaded Jun 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

wherefore-0.1.0-py3-none-any.whl (95.3 kB view details)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file wherefore-0.1.0.tar.gz.

File metadata

Download URL: wherefore-0.1.0.tar.gz
Upload date: Jun 21, 2026
Size: 82.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for wherefore-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`86a3b93857508accc835704f8e79b55344c287cff1bcf234f05ca31572ce645a`
MD5	`7bc0e9b3908911974f18c1fb03ea4b25`
BLAKE2b-256	`fc922249b18cfb2ca83831695ed79d364cab7724c190168dd0e5cd344e08b1a1`

See more details on using hashes here.

File details

Details for the file wherefore-0.1.0-py3-none-any.whl.

File metadata

Download URL: wherefore-0.1.0-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 95.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for wherefore-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`519168afa3abebff2d5565b25995685cde2da57cdcf041ecc5f0810c03b698ee`
MD5	`7ab7fb37c1b004f4bbda7b1d78100b19`
BLAKE2b-256	`3776e03647bd80dcb102d3b9bc3a31990057eb4e040a7562d850c99df7bda4f0`

See more details on using hashes here.

wherefore 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

wherefore

Contents

Quickstart

Why this exists

What's built

Architecture

Evals — why trust the explanations?

Usage

One file pair

A whole migration, not one table

What you get without an API key

What --explain adds

Privacy & data handling

All flags

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

What `--explain` adds