Entity resolution toolkit – deduplicate records, match across sources, and maintain golden records
GoldenMatch
Find duplicate records in 30 seconds. No rules to write, no models to train.
Zero-config entity resolution for Python & TypeScript – with a self-verifying auto-config that tells you when it's unsure.
Pair drilldown in the web workbench: cluster members, field-level diff, and a one-line NL explanation per pair. `pip install goldenmatch[web]` then `goldenmatch serve-ui <project>`. More screenshots →
# Python
pip install goldenmatch && goldenmatch dedupe customers.csv
# TypeScript / Node.js
npm install goldenmatch
v1.6.0 (Python) + v0.4.0 (npm) – cross-language Learning Memory parity. A correction written by Python applies identically in TypeScript and vice versa: byte-identical SHA-256 hashes, the same SQLite schema, the same collision-safe re-anchor algorithm, verified every CI run by JSON + SQLite + apply-outcome parity tests on both sides. Steward decisions, unmerges, LLM votes, and agent approvals persist to a local SQLite store, re-anchor across row reorders via record-hash, and apply automatically on the next run. The pipeline reports `Memory: N applied, M stale, K stale-ambiguous, J unanchorable` in postflight. New CLI subgroup `goldenmatch memory` (and `goldenmatch-js memory` in TS), five new MCP tools per runtime, and `goldenmatch.add_correction()` / `learn()` / `memory_stats()`. Off by default. See Learning Memory.

v1.5.0 – Auto-config preflight + postflight verification layer (still on by default). See Auto-Config Verification.

Built by Ben Severn.
Why GoldenMatch?
- Zero-config – auto-detects columns, picks scorers, and runs. No training data needed
- 97.2% F1 on DBLP-ACM out of the box. DQBench ER score: 95.30
- Learning Memory – corrections from stewards, unmerges, and LLM votes persist to disk and apply automatically on the next run; survives row reorders via record-hash re-anchoring (v1.6.0)
- Privacy-preserving – match across organizations without sharing raw data (PPRL, 92.4% F1)
- 35 MCP tools – use from Claude Desktop, Claude Code, or any AI assistant (Smithery)
- Production-ready – Postgres sync, daemon mode, lineage tracking, review queues
Choose your path
| I want to... | Go here |
|---|---|
| Deduplicate a CSV right now | Quick Start |
| Use from Claude Desktop / AI assistant | MCP Server |
| Build AI agents that deduplicate | ER Agent (A2A) |
| Write Python code | Python API |
| Write TypeScript / Node.js | TypeScript API |
| Deploy to Vercel Edge / Cloudflare Workers | TypeScript API |
| Use the interactive TUI | TUI Guide |
| Train the system on my corrections | Learning Memory |
All features (click to expand)
Matching
- 10+ scoring methods – exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding, dice, jaccard + plugin extensible
- 8+ blocking strategies – static, adaptive, sorted neighborhood, multi-pass, ANN, ann_pairs, canopy, learned (data-driven predicate selection)
- Fellegi-Sunter probabilistic matching – EM-trained m/u probabilities, automatic threshold estimation
- LLM scorer with budget controls – GPT-4o-mini scores borderline pairs for just $0.04. Budget caps, model tiering, graceful degradation
- Cross-encoder reranking – re-score borderline pairs with a pre-trained cross-encoder for higher precision
- Schema-free matching – auto-maps columns between different schemas (full_name -> first_name + last_name)
Data Quality
- GoldenCheck integration – `pip install goldenmatch[quality]` adds data quality scanning (encoding, Unicode, format validation)
- GoldenFlow transforms – `pip install goldenmatch[transform]` normalizes phone numbers, dates, categorical spelling
- Anomaly detection – flag fake emails, placeholder data, suspicious records
Golden Records
- 5 merge strategies – most_complete, majority_vote, source_priority, most_recent, first_non_null
- Quality-weighted survivorship – fields scored by source quality from GoldenCheck
- Field-level provenance – tracks which source row contributed each field
- Cluster quality scoring – clusters labeled `strong`/`weak`/`split`; oversized clusters auto-split via MST
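The MST auto-split can be pictured as: build a maximum spanning tree over the cluster's pair similarities, then cut its weakest edge to break an oversized cluster into two tighter groups. An illustrative sketch of the technique, not GoldenMatch's internal code:

```python
from collections import defaultdict

def split_cluster(members, pair_scores):
    """Split a cluster by cutting the weakest edge of its maximum spanning tree.

    members: list of record ids
    pair_scores: dict mapping (id_a, id_b) -> similarity in [0, 1]
    Returns two sorted lists of record ids.
    """
    # Kruskal over edges in descending similarity -> maximum spanning tree
    parent = {m: m for m in members}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for (a, b), score in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            mst.append((a, b, score))

    # Drop the weakest MST edge, then collect one side via BFS
    weak_a, weak_b, _ = min(mst, key=lambda e: e[2])
    adj = defaultdict(set)
    for a, b, _ in mst:
        if {a, b} != {weak_a, weak_b}:
            adj[a].add(b)
            adj[b].add(a)
    side, queue = {weak_a}, [weak_a]
    while queue:
        node = queue.pop()
        for nxt in adj[node] - side:
            side.add(nxt)
            queue.append(nxt)
    return sorted(side), sorted(set(members) - side)

left, right = split_cluster(
    [1, 2, 3, 4],
    {(1, 2): 0.95, (2, 3): 0.9, (3, 4): 0.4, (1, 4): 0.35},
)
print(left, right)  # [1, 2, 3] [4]
```

The weakest MST edge (3–4 at 0.4) is the natural cut point: every other way to separate the records would sever stronger evidence.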
Privacy
- PPRL multi-party linkage – match across organizations without sharing raw data (92.4% F1 on FEBRL4)
- PPRL auto-configuration – profiles your data and picks optimal fields, bloom filter parameters, and threshold
Integration
- REST API + MCP Server – 30 tools for matching, explaining, reviewing, data quality, and transforms
- A2A Agent – 10 skills for AI-to-AI autonomous entity resolution
- Database sync – incremental Postgres matching with persistent ANN index
- Enterprise connectors – Snowflake, Databricks, BigQuery, HubSpot, Salesforce
- DuckDB backend – out-of-core processing for 10M+ records without Spark
- Ray distributed backend – scale to 50M+ records with `pip install goldenmatch[ray]`
- dbt integration – `dbt-goldenmatch` package for DuckDB-based ER in dbt pipelines
Learning Memory (v1.6.0)
- Persistent corrections – every steward decision, unmerge, boost-tab y/n, LLM vote, and agent approve/reject writes to a local SQLite (or Postgres) store
- Re-anchor via record_hash – corrections survive row reordering and refresh; ambiguous re-anchors report as `stale_ambiguous` rather than misapplying
- Automatic application – `dedupe_df` and `match_df` overlay learned thresholds before scoring and apply hard 1.0/0.0 overrides after; postflight reports impact
- Threshold learner – trust-weighted grid search auto-tunes matchkey thresholds once 10+ corrections accumulate
- CLI / Python / MCP triad – `goldenmatch memory stats|learn|export|import|show`, `goldenmatch.add_correction()` / `learn()` / `memory_stats()`, and 5 new MCP tools (`list_corrections`, `add_correction`, `learn_thresholds`, `memory_stats`, `memory_export`)
- Off by default – zero-config posture preserved; opt in via `config.memory.enabled = True`
Developer Experience
- Gold-themed TUI – interactive interface with keyboard shortcuts, live threshold tuning
- Active learning boost – label 10 borderline pairs in the TUI, retrain a classifier for 99% accuracy
- Review queue – REST endpoint surfaces borderline pairs for data steward approval
- Merge preview + undo – rollback any run or unmerge individual records
- Lineage tracking – every merge decision saved with per-field score breakdown
- Natural language explainability – template-based per-pair and per-cluster explanations at zero LLM cost
- Evaluation CLI – `goldenmatch evaluate` reports precision/recall/F1 against ground truth
- 7 domain packs – electronics, software, healthcare, financial, real estate, people, retail
- Plugin architecture – extend with custom scorers, transforms, connectors via pip
- Streaming / CDC mode – incremental record matching with micro-batch or immediate processing
- GitHub Actions "Try It" – zero-install demo via `workflow_dispatch`
- Codespaces ready – one-click dev environment
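The template-based explainer can be pictured as a small function over per-field scores. This is a hypothetical stand-in for illustration (the real `goldenmatch.core.explain.explain_pair_nl` has its own templates):

```python
def explain_pair_nl(field_scores, threshold=0.85):
    """Render a one-line natural-language explanation from per-field scores.

    field_scores: dict of field name -> similarity in [0, 1]
    """
    strong = [f for f, s in field_scores.items() if s >= 0.9]
    weak = [f for f, s in field_scores.items() if s < 0.5]
    overall = sum(field_scores.values()) / len(field_scores)
    verdict = "Likely match" if overall >= threshold else "Unlikely match"
    parts = [f"{verdict} (score {overall:.2f})"]
    if strong:
        parts.append("strong agreement on " + ", ".join(strong))
    if weak:
        parts.append("disagreement on " + ", ".join(weak))
    return "; ".join(parts) + "."

print(explain_pair_nl({"name": 0.96, "email": 1.0, "zip": 0.2}))
```

Because it is pure string templating over scores the pipeline already computed, it costs nothing per pair, which is the point of the zero-LLM-cost claim.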
TypeScript / Node.js
GoldenMatch ships an npm package with full feature parity – same scorers, clustering, golden records, and YAML configs.
npm install goldenmatch
import { dedupe } from "goldenmatch";
const rows = [
{ id: 1, name: "John Smith", email: "john@example.com", zip: "12345" },
{ id: 2, name: "Jon Smith", email: "john@example.com", zip: "12345" },
{ id: 3, name: "Jane Doe", email: "jane@example.com", zip: "54321" },
];
const result = dedupe(rows, {
fuzzy: { name: 0.85 },
blocking: ["zip"],
threshold: 0.85,
});
console.log(result.stats); // { totalRecords: 3, totalClusters: 2, ... }
- Edge-safe core – runs in browsers, Vercel Edge Runtime, Cloudflare Workers, Deno
- Feature parity with Python: fuzzy scorers, probabilistic Fellegi-Sunter, PPRL, graph ER, LLM reranking, MCP/REST/A2A servers, 11+ CLI commands, interactive TUI
- 478 tests, strict TypeScript (`noUncheckedIndexedAccess`, `exactOptionalPropertyTypes`)
- Zero-dep install works – optional peer deps unlock native paths (hnswlib-node, @huggingface/transformers for ONNX cross-encoder, piscina for worker threads, pg/duckdb/snowflake for data connectors)
Full docs: benzsevern.github.io/goldenmatch/typescript. See packages/goldenmatch-js/examples/ for 10+ usage examples.
Web UI
pip install 'goldenmatch[web]'
goldenmatch serve-ui # current dir as project
goldenmatch serve-ui packages/python/goldenmatch/web/demo # bundled demo project
Localhost browser workbench. Editorial gold-on-cream design, single process, no auth – for the dev-on-a-laptop case.
It surfaces the engine's full capability stack as 7 pages:
| Page | What you can do |
|---|---|
| Project (`/`) | Browse saved runs, auto-run from data.csv, see GoldenCheck quality findings as a banner |
| Workbench (`/workbench`) | Edit matchkey rules + threshold + standardization + blocking + per-row matchkey type (exact / weighted / probabilistic). Run sampled previews. Save back to goldenmatch.yml (atomic write + .bak). Auto-configure with optional domain-pack pinning (electronics, people, healthcare, …). |
| Inspector (`/runs/{name}`) | Cluster table + member view + pair drilldown with field-level diff + one-line NL prose explanation per pair. Label pairs (mirrors to Learning Memory). Unmerge a record or shatter a cluster. F1/precision/recall vs your labels. |
| Match (`/match`) | One-to-many target × reference workflow. Different output shape from dedupe – flat target → reference mapping + unmatched targets. |
| Compare (`/compare`) | Run A vs B classification (CCMS): unchanged / merged / partitioned / overlapping per cluster, plus the Talburt-Wang Index over the whole transformation. No labels needed. |
| Sensitivity (`/sensitivity`) | Sweep one parameter (threshold / blocking max-block-size / per-matchkey threshold), CCMS-compare each point against the baseline. Cluster-count sparkline + most-stable-value report. |
| Memory (`/memory`) | Browse the Learning Memory store (corrections + sources + trust + matchkey). Trigger a learn pass. Stored adjustments table. |
Workbench
Every change validates through the same Pydantic schema the engine uses; 422
errors render inline next to the offending field. Save writes the canonical
shape (matchkey: singular, the shape goldenmatch dedupe reads) and snapshots
the prior file to goldenmatch.yml.bak before clobbering.
Inspector
Each pair card shows a one-line template explanation above the field
breakdown โ derived from the field scores via
goldenmatch.core.explain.explain_pair_nl, no LLM cost. Labels mirror to
the same MemoryStore the pipeline reads on every run via
apply_corrections, so the loop closes end-to-end.
Compare runs (CCMS)
CCMS classification (Talburt et al., arXiv:2601.02824v1, 2026): every cluster from run A is mapped to one of unchanged / merged / partitioned / overlapping with respect to run B. Mismatched row-ID coverage between the two runs surfaces as a clean 400 with the engine's diagnostic intact.
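A simplified reading of the four CCMS cases, assuming clusters are sets of record ids (an illustration, not the paper's exact algorithm):

```python
def ccms_classify(cluster_a, clusters_b):
    """Classify one run-A cluster against run B's clustering (simplified CCMS)."""
    overlapping_b = [b for b in clusters_b if cluster_a & b]
    if len(overlapping_b) == 1:
        b = overlapping_b[0]
        if b == cluster_a:
            return "unchanged"
        if cluster_a < b:
            return "merged"       # absorbed into a bigger B cluster
    if len(overlapping_b) > 1 and all(b <= cluster_a for b in overlapping_b):
        return "partitioned"      # split into several B clusters
    return "overlapping"          # partial overlap with B clusters

b_runs = [{1, 2}, {3}, {4, 5, 6}]
print(ccms_classify({1, 2}, b_runs))     # unchanged
print(ccms_classify({4, 5}, b_runs))     # merged
print(ccms_classify({1, 2, 3}, b_runs))  # partitioned
```

Every run-A cluster falls into exactly one case, which is why the per-cluster breakdown needs no ground-truth labels.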
Sensitivity sweep
Re-runs the pipeline at each sweep value on a sampled slice (default 500 rows, configurable per-request up to 10K), CCMS-compares each point against the baseline, and surfaces the most-stable value alongside the per-point TWI / cluster-count / case breakdown.
Match (target × reference)
Different output shape from dedupe – match has no clusters. Both target and reference paths are resolved under the project root with a path-traversal guard. Auto-configure mode skips the workbench rules and profiles both files together.
Memory store browser
Every label you save in the inspector mirrors into the engine's Learning Memory store. The pipeline reads it on every run, so the next dedupe picks up the decision automatically. Threshold tuning fires at ≥ 10 corrections; weight learning at ≥ 50.
Build / dev
# Backend tests
pytest packages/python/goldenmatch/tests/web -q # 100+ tests
# Frontend build (TypeScript + Vite)
pnpm -C packages/python/goldenmatch/web/frontend install
pnpm -C packages/python/goldenmatch/web/frontend test
pnpm -C packages/python/goldenmatch/web/frontend build
# Stage build output into the wheel-included static dir
python packages/python/goldenmatch/scripts/build_web.py
Frontend source lives outside the package at web/frontend/; build
output lands inside the package at goldenmatch/web/static/ (gitignored
except for a .gitkeep, included in the wheel via force-include). The
dev server (pnpm dev) proxies /api/v1/* to http://localhost:5050.
Installation
pip install goldenmatch # core (files only)
pip install goldenmatch[embeddings] # + sentence-transformers, FAISS
pip install goldenmatch[llm] # + Claude/OpenAI for LLM boost
pip install goldenmatch[postgres] # + Postgres database sync
pip install goldenmatch[snowflake] # + Snowflake connector
pip install goldenmatch[bigquery] # + BigQuery connector
pip install goldenmatch[databricks] # + Databricks connector
pip install goldenmatch[salesforce] # + Salesforce connector
pip install goldenmatch[duckdb] # + DuckDB backend
pip install goldenmatch[quality] # + GoldenCheck data quality scanning
pip install goldenmatch[web] # + localhost browser workbench (FastAPI + React)
# Run the setup wizard to configure GPU, API keys, and database:
goldenmatch setup
Python API
GoldenMatch exposes 95 functions and classes from a single import. See examples/ for complete runnable scripts.
import goldenmatch as gm
Quick Start
import goldenmatch as gm
# Deduplicate a CSV (zero-config)
result = gm.dedupe("customers.csv")
# Exact + fuzzy matching
result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85, "zip": 0.95})
result.golden.write_csv("deduped.csv")
print(result) # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
# Match across files
result = gm.match("new_customers.csv", "master.csv", fuzzy={"name": 0.85})
result.to_csv("matches.csv")
# With YAML config
result = gm.dedupe("data.csv", config="config.yaml")
# With LLM scorer for product matching
result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)
# With Ray backend for large datasets
result = gm.dedupe("huge.parquet", exact=["email"], backend="ray")
Learning Memory (v1.6.0)
GoldenMatch can remember past steward decisions and apply them automatically on every subsequent run. Reject a pair once -- it stays rejected. Approve a borderline pair once -- it stays approved. After 10+ corrections accumulate against a matchkey, the learner adjusts its threshold so the system stops needing the same correction twice. Off by default; enable via config.memory.enabled = True or a memory: block in YAML. Full guide: Learning Memory docs.
goldenmatch.yml:
matchkeys:
- name: identity
type: weighted
threshold: 0.85
fields:
- field: name
scorer: jaro_winkler
transforms: [lowercase, strip]
weight: 1.0
- field: email
scorer: exact
weight: 1.0
blocking:
strategy: static
keys:
- fields: [zip]
transforms: [lowercase]
memory:
enabled: true
backend: sqlite
path: .goldenmatch/memory.db
reanchor: true
dataset: customers
learning:
threshold_min_corrections: 10
weights_min_corrections: 50
Three commands users actually run:
# 1. First run -- produces the review queue
goldenmatch dedupe customers.csv --config goldenmatch.yml
# 2. Steward decides borderline pairs (writes to .goldenmatch/memory.db)
goldenmatch review --config goldenmatch.yml # interactive TUI
# 3. Re-run -- corrections apply automatically; postflight reports impact
goldenmatch dedupe customers.csv --config goldenmatch.yml
# > Memory: 12 corrections applied, 0 stale, 0 stale-ambiguous, 0 unanchorable
Python API equivalent:
import goldenmatch
# Programmatically register a correction
goldenmatch.add_correction(
id_a=42, id_b=87, decision="reject", source="steward",
reason="Different EIN despite name match", dataset="customers",
)
# Force a learning pass (otherwise auto-runs at next pipeline call)
adjustments = goldenmatch.learn()
print(f"Adjusted {len(adjustments)} matchkey thresholds")
# Inspect what's stored
print(goldenmatch.memory_stats())
MCP equivalent (from Claude Desktop / Code):
"Show me uncertain pairs from the last goldenmatch run on customers.csv, then mark rows 17 and 23 as not-a-match because they have different EINs."
The host LLM calls list_corrections -> add_correction -> learn_thresholds.
Auto-Config Verification (v1.5.0)
Zero-config used to crash on bibliographic and domain-extracted schemas – auto-config would emit a matchkey referencing `__title_key__` without enabling `config.domain`, and the pipeline would raise `ValueError: Missing required columns`. v1.5.0 closes the gap with a preflight + postflight verification layer that runs automatically around `auto_configure_df`.
Preflight (`gm.preflight`) runs 6 checks at the end of `auto_configure_df`:
- column resolution (auto-repairs missing domain-extracted columns by enabling `config.domain`)
- cardinality bounds on exact matchkeys (drops near-unique and near-constant keys)
- block-size sanity (flags blocks that would stall the scorer)
- remote-asset demotion (any `embedding`, `record_embedding`, or cross-encoder rerank is demoted unless you pass `allow_remote_assets=True`)
- confidence-gated weight capping (low-confidence fields cap at weight 0.3)
Unrepairable issues raise ConfigValidationError with the full PreflightReport attached as err.report. Repaired issues stay on the report as findings with repaired=True.
Postflight (`gm.postflight`) runs 4 signals after scoring, before clustering:
- score-distribution histogram + bimodality detection (auto-nudges threshold on clear bimodality)
- blocking-recall estimate (gated at 10K+ rows)
- preliminary cluster sizes + oversized-cluster bottleneck pair
- threshold-band overlap percentage (advises `--llm-auto` when overlap > 20% and LLM is off)
The report attaches to DedupeResult.postflight_report / MatchResult.postflight_report.
import goldenmatch as gm
import polars as pl
df = pl.read_csv("bibliography.csv")
# Zero-config -- preflight + postflight run automatically
result = gm.dedupe_df(df)
# Inspect the preflight report (private-by-convention underscore)
for finding in result.config._preflight_report.findings:
print(f"[{finding.severity}] {finding.check}: {finding.message}")
# Inspect postflight signals (public)
sig = result.postflight_report.signals
print(f"Scored {sig['total_pairs_scored']} pairs")
print(f"Threshold overlap: {sig['threshold_overlap_pct']:.1%}")
print(f"Oversized clusters: {len(sig['oversized_clusters'])}")
Offline by default. Remote-asset scorers are demoted unless you opt in:
cfg = gm.auto_configure_df(df, allow_remote_assets=True) # loads cross-encoder etc.
Strict mode for parity runs. `strict=True` still computes postflight signals and emits advisories, but skips threshold adjustments – use it for DQBench, regression suites, and any reproducible output:
cfg = gm.auto_configure_df(df, strict=True)
New classifier smarts in v1.5.0:
- Columns with cardinality ≥ 0.95 are classified as `identifier`, not `phone`/`zip`/`numeric`.
- New `year` col_type routes to blocking, not scoring.
- New `multi_name` col_type handles comma/semicolon-delimited author-style fields.
- Low-confidence fields (< 0.5) cap at weight 0.3.
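In the same spirit, the classifier rules above can be sketched as a toy heuristic (illustrative thresholds; `classify_column` is not GoldenMatch's API):

```python
def classify_column(values):
    """Rough column-role heuristic in the spirit of the v1.5.0 rules (illustrative).

    Near-unique columns are identifiers; a 4-digit 1800-2100 range looks like a
    year; delimiter-heavy text looks like a multi-name author field.
    """
    non_null = [v for v in values if v not in (None, "")]
    cardinality = len(set(non_null)) / len(non_null)
    if cardinality >= 0.95:
        return "identifier"   # routed out of fuzzy scoring
    if all(str(v).isdigit() and 1800 <= int(v) <= 2100 for v in non_null):
        return "year"         # routed to blocking, not scoring
    delim_frac = sum("," in str(v) or ";" in str(v) for v in non_null) / len(non_null)
    if delim_frac > 0.5:
        return "multi_name"   # comma/semicolon-delimited author-style field
    return "text"

print(classify_column(["A1", "B2", "C3", "D4"]))          # identifier
print(classify_column(["1998", "2004", "1998", "2011"]))  # year
```

The ordering matters: the cardinality gate runs first, so a column of unique values never reaches the fuzzier checks.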
See examples/verification_inspection.py and examples/strict_mode_parity.py for runnable walkthroughs.
Privacy-Preserving Linkage
import goldenmatch as gm
# Auto-configured PPRL (picks fields and threshold automatically)
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches across {len(result['clusters'])} clusters")
# Manual field selection
result = gm.pprl_link("party_a.csv", "party_b.csv",
fields=["first_name", "last_name", "dob", "zip"],
threshold=0.85, security_level="high")
# Auto-config analysis
config = gm.pprl_auto_config(df)
print(config.recommended_fields) # ['first_name', 'last_name', 'zip_code', 'birth_year']
Evaluate Accuracy
import goldenmatch as gm
# Measure precision/recall/F1 against ground truth
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
# Evaluate programmatically
result = gm.evaluate_pairs(predicted_pairs, ground_truth_set)
print(result.f1)
Build Configs Programmatically
import goldenmatch as gm
# Auto-generate config from data
config = gm.auto_configure([("data.csv", "source")])
# Or build manually
config = gm.GoldenMatchConfig(
matchkeys=[
gm.MatchkeyConfig(name="exact_email", type="exact",
fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
fields=[
gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
]),
],
blocking=gm.BlockingConfig(strategy="learned"),
llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
backend="ray",
)
Streaming / Incremental
import goldenmatch as gm
# Match a single new record against existing data
matches = gm.match_one(new_record, existing_df, matchkey)
# Stream processor for continuous matching
processor = gm.StreamProcessor(df, config)
matches = processor.process_record(new_record)
Advanced Features
import goldenmatch as gm
# Domain extraction
rulebooks = gm.discover_rulebooks() # 7 built-in packs
enhanced_df, low_conf = gm.extract_with_rulebook(df, "title", rulebooks["electronics"])
# Fellegi-Sunter probabilistic
em_result = gm.train_em(df, matchkey, n_sample_pairs=10000)
pairs = gm.score_probabilistic(block_df, matchkey, em_result)
# Explain a match decision
explanation = gm.explain_pair(record_a, record_b, matchkey)
# Cluster operations
gm.unmerge_record(record_id, clusters) # Remove from cluster
gm.unmerge_cluster(cluster_id, clusters) # Shatter to singletons
# Data quality
df, fixes = gm.auto_fix_dataframe(df)
anomalies = gm.detect_anomalies(df)
column_map = gm.auto_map_columns(df_a, df_b) # Schema matching
# Graph ER (multi-table)
clusters = gm.run_graph_er(entities, relationships)
Setup Wizard
Run goldenmatch setup for an interactive walkthrough:
Guides you through GPU mode selection, Vertex AI / Colab / local GPU configuration, LLM boost API keys, and database sync โ with copy-paste commands at every step.
Why GoldenMatch?
| | GoldenMatch | dedupe | recordlinkage | Zingg | Splink |
|---|---|---|---|---|---|
| Zero-config mode | Yes | No (requires training) | No (manual config) | No (Spark required) | No (SQL config) |
| Fuzzy + probabilistic + LLM | All three | Probabilistic only | Probabilistic only | ML-based | Probabilistic only |
| Privacy-preserving (PPRL) | Built-in (92.4% F1) | No | No | No | No |
| Interactive TUI | Yes | No | No | No | No |
| Golden record synthesis | 5 strategies | No | No | No | No |
| MCP server (AI integration) | Yes (35 tools) | No | No | No | No |
| Database sync | Postgres + DuckDB | No | No | No | Spark/DuckDB |
| Single pip install | Yes | Yes | Yes | No (Java/Spark) | Yes |
| Polars-native | Yes | No (pandas) | No (pandas) | No (Spark) | Yes (DuckDB) |
GoldenMatch is the only tool that combines zero-config operation, probabilistic matching, LLM scoring, privacy-preserving linkage, and golden record synthesis in a single Python package.
Quick Start
Zero-Config (no YAML needed)
goldenmatch dedupe customers.csv
Auto-detects column types (name, email, phone, zip, address, description), assigns appropriate scorers, picks blocking strategy, and launches the TUI for review.
With Config
goldenmatch dedupe customers.csv --config config.yaml --output-all --output-dir results/
Match Mode
goldenmatch match targets.csv --against reference.csv --config config.yaml --output-all
Database Sync
# First run: full scan, create metadata tables
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml
# Subsequent runs: incremental (only new records)
goldenmatch sync --table customers --connection-string "$DATABASE_URL"
How It Works
Files/DB → Ingest → Standardize → Block → Score → Cluster → Golden Records → Output
                                    ↓       ↓
                          SQL blocking    10 scorers
                          ANN blocking    ensemble
                          7 strategies    embeddings
                          parallel blocks
Pipeline:
- Ingest – CSV, Excel, Parquet, or Postgres table
- Standardize – configurable per-column transforms
- Block – reduce comparison space (multi-pass, ANN, canopy, etc.)
- Score – compare record pairs with appropriate scorer
- Cluster – group matches via Union-Find; auto-split oversized clusters via MST; assign quality labels (`strong`/`weak`/`split`)
- Golden – merge each cluster into one canonical record using quality-weighted survivorship (5 strategies); track field-level provenance
- Output – files (CSV/Parquet) or database tables + lineage JSON sidecar with provenance
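The Cluster step can be sketched with a classic Union-Find over matched pairs (an illustration of the technique, not GoldenMatch's internals):

```python
from collections import defaultdict

def cluster_pairs(record_ids, matched_pairs):
    """Group records into clusters: any chain of matched pairs shares a cluster."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)          # union the two components

    clusters = defaultdict(list)
    for r in record_ids:
        clusters[find(r)].append(r)
    return sorted(sorted(members) for members in clusters.values())

print(cluster_pairs([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# [[1, 2, 3], [4, 5]]
```

Note the transitivity: records 1 and 3 never matched each other directly, but land in one cluster through record 2 – which is exactly why oversized clusters can form and need the MST auto-split.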
Config Reference
matchkeys:
- name: exact_email
type: exact
fields:
- field: email
transforms: [lowercase, strip]
- name: fuzzy_name_zip
type: weighted
threshold: 0.85
rerank: true # re-score borderline pairs with cross-encoder
rerank_band: 0.1 # pairs within threshold +/- 0.1 get reranked
fields:
- field: first_name
scorer: jaro_winkler
weight: 0.4
transforms: [lowercase, strip]
- field: last_name
scorer: jaro_winkler
weight: 0.4
transforms: [lowercase, strip]
- field: zip
scorer: exact
weight: 0.2
- name: semantic
type: weighted
threshold: 0.80
fields:
- columns: [title, authors, venue]
scorer: record_embedding
weight: 1.0
column_weights: {title: 2.0, authors: 1.0, venue: 0.5} # bias embedding toward title
llm_scorer:
enabled: true # score borderline pairs with GPT/Claude
auto_threshold: 0.95 # auto-accept pairs above this
candidate_lo: 0.75 # LLM scores pairs in [0.75, 0.95]
# provider: openai # auto-detected from OPENAI_API_KEY
# model: gpt-4o-mini # default, cheapest option
blocking:
strategy: adaptive # static | adaptive | sorted_neighborhood | multi_pass | ann | ann_pairs | canopy
auto_select: true # auto-pick best key by histogram analysis
keys:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
golden_rules:
default_strategy: most_complete
auto_split: true # Auto-split oversized clusters via MST
quality_weighting: true # Use GoldenCheck quality scores in survivorship
weak_cluster_threshold: 0.3 # Edge gap threshold for confidence downgrade
field_rules:
email: { strategy: majority_vote }
first_name: { strategy: source_priority, source_priority: [crm, marketing] }
output:
directory: ./output
format: csv
Scorers
| Scorer | Description | Best For |
|---|---|---|
| `exact` | Binary match | Email, phone, ID |
| `jaro_winkler` | Edit distance similarity | Names |
| `levenshtein` | Normalized Levenshtein | General strings |
| `token_sort` | Order-invariant token matching | Names, addresses |
| `soundex_match` | Phonetic match | Names |
| `ensemble` | max(jaro_winkler, token_sort, soundex) | Names with reordering |
| `embedding` | Cosine similarity of sentence embeddings | Semantic matching |
| `record_embedding` | Embed concatenated fields | Cross-field semantic matching |
| `dice` | Dice coefficient on bloom filters | Privacy-preserving matching |
| `jaccard` | Jaccard similarity on bloom filters | Privacy-preserving matching |
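The `dice` and `jaccard` scorers compare bloom-filter bit sets rather than raw strings, which is what makes them usable for privacy-preserving linkage. A toy bigram encoding for illustration (not GoldenMatch's actual bloom-filter parameters):

```python
import hashlib

def bloom_bits(value, num_bits=128, num_hashes=4):
    """Encode a string's character bigrams into a bloom-filter bit set."""
    bigrams = [value[i:i + 2] for i in range(len(value) - 1)]
    bits = set()
    for gram in bigrams:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % num_bits)
    return bits

def dice(bits_a, bits_b):
    """Dice coefficient: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))

def jaccard(bits_a, bits_b):
    """Jaccard similarity: |A∩B| / |A∪B|."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0

a = bloom_bits("john smith")
b = bloom_bits("jon smith")
print(f"dice={dice(a, b):.2f} jaccard={jaccard(a, b):.2f}")
```

Each party hashes its own values locally and exchanges only bit sets, so similar names still score highly while the raw strings never leave either organization.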
Blocking Strategies
| Strategy | Description |
|---|---|
| `static` | Group by blocking key (default) |
| `adaptive` | Static + recursive sub-blocking for oversized blocks |
| `sorted_neighborhood` | Sliding window over sorted records |
| `multi_pass` | Union of blocks from multiple passes (best for noisy data) |
| `ann` | ANN via FAISS on sentence-transformer embeddings |
| `ann_pairs` | Direct-pair ANN scoring (50-100x faster than `ann`) |
| `canopy` | TF-IDF canopy clustering |
| `learned` | Data-driven predicate selection (auto-discovers blocking rules) |
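For intuition, `sorted_neighborhood` reduces to: sort records by a key, then compare only records that fall within a sliding window. A toy sketch of the strategy (not the engine's implementation):

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs: sort by key, pair each record with the next window-1 records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((rec["id"], ordered[j]["id"]))
    return pairs

rows = [
    {"id": 1, "name": "smith, john"},
    {"id": 2, "name": "smyth, john"},
    {"id": 3, "name": "doe, jane"},
    {"id": 4, "name": "doe, janet"},
]
print(sorted_neighborhood_pairs(rows, key=lambda r: r["name"], window=2))
# [(3, 4), (4, 1), (1, 2)]
```

Instead of the full O(n²) cross product, the scorer sees O(n × window) candidates; the window size trades recall against speed.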
Database Integration
GoldenMatch can sync against live Postgres databases with incremental matching:
pip install goldenmatch[postgres]
goldenmatch sync \
--table customers \
--connection-string "postgresql://user:pass@localhost/mydb" \
--config config.yaml
Features:
- Incremental sync – only processes records added since last run
- Hybrid blocking – SQL WHERE clauses for exact fields + FAISS ANN for semantic fields, results unioned
- Persistent ANN index – disk cache + DB source of truth, progressive embedding across runs
- Golden record versioning – append-only with `is_current` flag, full audit trail
- Cluster management – persistent clusters with merge, conflict detection, max size safety cap
Metadata tables (auto-created):
| Table | Purpose |
|---|---|
| `gm_state` | Processing state, watermarks |
| `gm_clusters` | Persistent cluster membership |
| `gm_golden_records` | Versioned golden records |
| `gm_embeddings` | Cached embeddings for ANN |
| `gm_match_log` | Audit trail of all match decisions |
SQL Extensions
Use GoldenMatch directly from PostgreSQL or DuckDB:
-- PostgreSQL
CREATE EXTENSION goldenmatch_pg;
SELECT goldenmatch.goldenmatch_dedupe_table('customers', '{"exact": ["email"]}');
SELECT goldenmatch.goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');
# DuckDB
pip install goldenmatch-duckdb
import duckdb, goldenmatch_duckdb
con = duckdb.connect()
goldenmatch_duckdb.register(con)
con.sql("SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler')")
See goldenmatch-extensions for installation and full documentation.
LLM Boost (Optional)
For harder datasets where zero-shot scoring isn't enough:
pip install goldenmatch[llm]
# First run: LLM labels ~300 pairs (~$0.30), fine-tunes embedding model
goldenmatch dedupe products.csv --llm-boost
# Subsequent runs: uses saved model ($0)
goldenmatch dedupe products.csv --llm-boost
Tiered auto-escalation:
- Level 1 – zero-shot (free, instant)
- Level 2 – bi-encoder fine-tuning (~$0.20, ~2 min CPU)
- Level 3 – Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)
Active sampling selects the most informative pairs for the LLM to label (uncertainty, disagreement, boundary, diversity), reducing label cost by ~45% compared to random sampling.
Iterative calibration: When many borderline pairs exist, iterative calibration samples ~100 pairs per round, learns the optimal threshold via grid search, and applies it to all candidates โ typically converging in 2-3 rounds.
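At its core, that calibration step is a grid search for the F1-maximizing threshold over the labeled sample. A minimal sketch, assuming plain unweighted labels (the real learner is trust-weighted):

```python
def learn_threshold(scored_labels, grid=None):
    """Pick the score threshold that maximizes F1 on labeled pairs.

    scored_labels: list of (score, is_match) tuples from steward/LLM labels.
    Returns (best_threshold, best_f1).
    """
    grid = grid or [round(0.50 + 0.01 * i, 2) for i in range(50)]
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        tp = sum(s >= t and m for s, m in scored_labels)       # predicted match, is match
        fp = sum(s >= t and not m for s, m in scored_labels)   # predicted match, isn't
        fn = sum(s < t and m for s, m in scored_labels)        # missed match
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

labels = [(0.95, True), (0.91, True), (0.88, True), (0.86, False),
          (0.84, True), (0.79, False), (0.72, False), (0.60, False)]
t, f1 = learn_threshold(labels)
print(t, round(f1, 3))  # 0.8 0.889
```

With ~100 labels per round this loop is effectively free, which is why running it for 2-3 rounds costs far less than LLM-scoring every borderline pair.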
Note: LLM boost is most valuable for product matching with local models (MiniLM) where it improved Abt-Buy from 44.5% to 59.5% F1. For structured data (names, addresses, bibliographic), fuzzy matching alone achieves 97%+ F1.
Benchmarks
Leipzig Entity Resolution Benchmarks
| Dataset | Best Strategy | F1 | Cost |
|---|---|---|---|
| DBLP-ACM (2.6K vs 2.3K) | multi-pass + fuzzy | 97.2% | $0 |
| DBLP-Scholar (2.6K vs 64K) | multi-pass + fuzzy | 74.7% | $0 |
| Abt-Buy (1K vs 1K) | Vertex AI + GPT-4o-mini scorer | 81.7% | ~$0.74 |
| Abt-Buy (zero-shot) | Vertex AI embeddings | 62.8% | ~$0.05 |
| Amazon-Google (1.4K vs 3.2K) | Vertex AI + reranking | 44.0% | ~$0.10 |
Structured data (names, addresses, bibliographic): RapidFuzz multi-pass fuzzy matching at 97.2% – zero cost, zero labels. Product matching: Vertex AI embeddings for candidate generation + GPT-4o-mini scorer for borderline pairs achieves 81.7% at ~$0.74 total cost.
Throughput (Scale Curve)
Measured on a laptop (17GB RAM) with exact + fuzzy matching, blocking, clustering, and golden record generation:
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.2s | 5,500 rec/s | 210 | 101 MB |
| 10,000 | 1.4s | 7,300 rec/s | 7,000 | 123 MB |
| 100,000 | 12s | 8,200 rec/s | 571,000 | 544 MB |
Fuzzy matching speedup: Parallel block scoring + intra-field early termination reduced 100K fuzzy matching from ~100s to ~39s (2.5x) through the pipeline. The 1M exact-only benchmark runs in 7.8s.
Equipment data (401K rows): 27,937 clusters, 384,650 matched, 323s. LLM calibration learned threshold from 200 pairs (~$0.01). ANN fallback created 363 sub-blocks from 15 oversized blocks.
For datasets over 1M records, use goldenmatch sync (database mode) with incremental matching and persistent ANN indexing. See Large Dataset Mode.
How GoldenMatch Compares
| | GoldenMatch | dedupe | Splink | Zingg | Ditto |
|---|---|---|---|---|---|
| Abt-Buy F1 | 81.7% | ~75% | ~70% | ~80% | 89.3% |
| DBLP-ACM F1 | 97.2% | ~96% | ~95% | ~96% | 99.0% |
| Training required | No | Yes | Yes | Yes | Yes (1000+) |
| Zero-config | Yes | No | No | No | No |
| Interactive TUI | Yes | No | No | No | No |
| Database sync | Postgres | Cloud (paid) | No | No | No |
| REST API / MCP | Both | Cloud only | No | No | No |
| GPU required | No | No | No | Spark | Yes |
GoldenMatch's sweet spot is ease of use + competitive accuracy. On bibliographic matching (DBLP-ACM), GoldenMatch hits 97.2% with zero config. On product matching (Abt-Buy), the LLM scorer reaches 81.7% -- within 8pts of Ditto's 89.3%, but with zero training labels and no GPU. Ditto requires 1000+ hand-labeled pairs and a GPU.
Library Comparison (v1.2.7)
Head-to-head against Splink, Dedupe, and RecordLinkage on two datasets. GoldenMatch uses explicit config, zero training data.
Febrl (5,000 synthetic PII records, 6,538 true pairs):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| Splink | 1.000 | 0.995 | 0.998 | 2.0s |
| GoldenMatch | 1.000 | 0.943 | 0.971 | 6.8s |
| Dedupe | 1.000 | 0.865 | 0.928 | 7.2s |
| RecordLinkage | 0.999 | 0.733 | 0.845 | 2.2s |
DBLP-ACM (4,910 bibliographic records, 2,224 true matches):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| RecordLinkage | 0.888 | 0.961 | 0.923 | 13.0s |
| GoldenMatch | 0.891 | 0.945 | 0.918 | 6.2s |
| Dedupe | 0.604 | 0.936 | 0.734 | 10.5s |
| Splink | 0.646 | 0.834 | 0.728 | 3.4s |
Key takeaway: GoldenMatch is the most consistent performer โ top-2 F1 on both datasets with zero training data. Splink dominates structured PII but struggles on non-PII. RecordLinkage wins DBLP-ACM but lags on PII.
Febrl explicit config example

```python
# Imports assume the documented top-level public API
# (95 public exports from `import goldenmatch as gm`).
import goldenmatch
from goldenmatch import (
    GoldenMatchConfig, BlockingConfig, BlockingKeyConfig,
    MatchkeyConfig, MatchkeyField,
)

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["surname"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["given_name"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["postcode"], transforms=[]),
            BlockingKeyConfig(fields=["date_of_birth"], transforms=[]),
        ],
        max_block_size=500, skip_oversized=True,
    ),
    matchkeys=[MatchkeyConfig(
        name="person", type="weighted", threshold=0.7,
        fields=[
            MatchkeyField(field="given_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="surname", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="date_of_birth", scorer="exact", weight=1.5),
            MatchkeyField(field="address_1", scorer="token_sort", weight=1.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="postcode", scorer="exact", weight=0.5),
        ],
    )],
)
result = goldenmatch.dedupe_df(df, config=config)
```
Large Dataset Mode
For datasets over 1M records, use database sync mode. GoldenMatch processes records in chunks, maintains a persistent ANN index, and matches incrementally:
```shell
# Load into Postgres, then sync
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Watch for new records continuously
goldenmatch watch --table customers --connection-string "$DATABASE_URL" --interval 30
```
How it works:
- Reads in configurable chunks (default 10K) โ never loads entire table into memory
- Hybrid blocking: SQL WHERE for exact fields + persistent FAISS ANN for semantic fields
- Progressive embedding: computes 100K embeddings per run, ANN improves over time
- Persistent clusters with golden record versioning
Scale: Tested to 10M+ records in Postgres. For 100M+, use larger chunk sizes and dedicated Postgres infrastructure.
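The chunked-read plus exact-key blocking portion of this design can be sketched with stdlib `sqlite3` (an illustrative sketch of the technique, not GoldenMatch's sync code; the table and columns are invented):

```python
import sqlite3
from collections import defaultdict

def iter_chunks(conn, table, chunk_size=10_000):
    """Yield rows in fixed-size chunks via rowid keyset pagination,
    so the full table never has to fit in memory."""
    last_rowid = 0
    while True:
        rows = conn.execute(
            f"SELECT rowid, name, postcode FROM {table} "
            "WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_rowid = rows[-1][0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, postcode TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann Lee", "1001"), ("A. Lee", "1001"), ("Bo Kim", "2002")],
)

# Exact-field blocking: candidate pairs only form within a block,
# so only records sharing a postcode are ever compared.
blocks = defaultdict(list)
for chunk in iter_chunks(conn, "customers", chunk_size=2):
    for rowid, name, postcode in chunk:
        blocks[postcode].append(rowid)

print(dict(blocks))  # {'1001': [1, 2], '2002': [3]}
```

The real pipeline layers a persistent FAISS ANN index on top of this for semantic fields, but the memory-bounding idea is the same: paginate, block, then score only within blocks.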
Interactive TUI
GoldenMatch includes a gold-themed interactive terminal UI:
- Auto-config summary -- first screen shows detected columns, scorers, and blocking strategy with Run/Edit/Save options
- Pipeline progress -- full-screen progress with a per-stage tracker on first run, footer bar on re-runs
- Split-view matches -- cluster list on the left, golden record + member details on the right
- Live threshold slider -- arrow keys adjust the threshold in 0.05 increments with an instant cluster count preview
- Keyboard shortcuts -- `1-6` jump to tabs (Data, Config, Matches, Golden, Boost, Export), `F5` run, `?` show all shortcuts, `Ctrl+E` export
Data profiling:
Match results with cluster detail:
Golden records:
Settings Persistence
GoldenMatch saves preferences across sessions:
- Global: `~/.goldenmatch/settings.yaml` -- output mode, default model, API keys
- Project: `.goldenmatch.yaml` -- column mappings, thresholds, blocking config
Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.
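For illustration only, a minimal project file might look like the sketch below. The key names here are assumptions, not a verified schema; only the two file locations above are documented.

```yaml
# .goldenmatch.yaml -- illustrative sketch; key names are assumptions,
# not a verified schema
columns:
  name: customer_name
  email: email
matchkeys:
  person:
    threshold: 0.72
blocking:
  strategy: multi_pass
```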
CLI Reference
| Command | Description |
|---|---|
| `goldenmatch demo` | Built-in demo with sample data |
| `goldenmatch setup` | Interactive setup wizard (GPU, API keys, database) |
| `goldenmatch dedupe FILE [...]` | Deduplicate one or more files |
| `goldenmatch match TARGET --against REF` | Match target against reference |
| `goldenmatch sync --table TABLE` | Sync against Postgres database |
| `goldenmatch watch --table TABLE` | Live stream mode (continuous polling, `--daemon` for service mode) |
| `goldenmatch schedule --every 1h FILE` | Run deduplication on a schedule |
| `goldenmatch serve FILE [...]` | Start REST API server |
| `goldenmatch mcp-serve FILE [...]` | Start MCP server (Claude Desktop) |
| `goldenmatch rollback RUN_ID` | Undo a previous merge run |
| `goldenmatch unmerge RECORD_ID` | Remove a record from its cluster |
| `goldenmatch runs` | List previous runs for rollback |
| `goldenmatch init` | Interactive config wizard |
| `goldenmatch interactive FILE [...]` | Launch TUI |
| `goldenmatch profile FILE` | Profile data quality |
| `goldenmatch evaluate FILE --gt GT.csv` | Evaluate matching against ground truth |
| `goldenmatch incremental BASE --new NEW` | Match new records against existing base |
| `goldenmatch analyze-blocking FILE` | Analyze data and suggest blocking strategies |
| `goldenmatch label FILE --config --gt` | Interactively label pairs to build ground truth CSV |
| `goldenmatch config save/load/list/show` | Manage config presets |
| `goldenmatch memory stats/learn/export/import/show` | Manage Learning Memory store (v1.6.0) |
Key dedupe flags:
| Flag | Description |
|---|---|
| `--anomalies` | Detect fake emails, placeholder data, suspicious records |
| `--preview` | Show what will change before writing (merge preview) |
| `--diff` / `--diff-html` | Generate before/after change report |
| `--dashboard` | Before/after data quality dashboard (HTML) |
| `--html-report` | Detailed match report with charts |
| `--chunked` | Large dataset mode (process in chunks) |
| `--llm-boost` | Improve accuracy with LLM-labeled training |
| `--daemon` | Run watch mode as a background service with health endpoint |
| `s3://` / `gs://` / `az://` | Read directly from cloud storage |
Remote MCP Server
GoldenMatch is available as a hosted MCP server on Smithery โ connect from any MCP client without installing anything.
Claude Desktop / Claude Code:
```json
{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}
```
Local server (if you prefer to run locally):
```shell
pip install goldenmatch[mcp]
goldenmatch mcp-serve data.csv
```
35 tools available: deduplicate files, match records, explain decisions, review borderline pairs, privacy-preserving linkage, configure rules, scan data quality, run transforms, synthesize golden records, and manage Learning Memory (list_corrections, add_correction, learn_thresholds, memory_stats, memory_export).
Architecture
```text
goldenmatch/
├── cli/         # 21 CLI commands (Typer)
│                # Python API: 95 public exports from `import goldenmatch as gm`
│                # -- every feature accessible without knowing internal module structure
├── config/      # Pydantic schemas, YAML loader, settings
├── core/        # Pipeline: ingest, block, score, cluster, golden, explainer,
│                #   report, dashboard, graph, anomaly, diff, rollback,
│                #   schema_match, chunked, cloud_ingest, api_connector, scheduler,
│                #   llm_scorer, lineage, match_one, evaluate, gpu, vertex_embedder,
│                #   probabilistic, learned_blocking, streaming, graph_er, domain
├── domains/     # 7 built-in YAML domain packs (electronics, software, healthcare, ...)
├── plugins/     # Plugin system (scorers, transforms, connectors, golden strategies)
├── connectors/  # Enterprise connectors (Snowflake, Databricks, BigQuery, HubSpot, Salesforce)
├── backends/    # DuckDB backend for out-of-core processing
├── db/          # Postgres: connector, sync, reconcile, clusters, ANN index
├── api/         # REST API server
├── mcp/         # MCP server for Claude Desktop
├── tui/         # Gold-themed Textual TUI + setup wizard
└── utils/       # Transforms, helpers
```
Run tests: pytest (924 tests)
Part of the Golden Suite
| Tool | Purpose | Install |
|---|---|---|
| GoldenCheck | Validate & profile data quality | pip install goldencheck |
| GoldenFlow | Transform & standardize data | pip install goldenflow |
| GoldenMatch | Deduplicate & match records | pip install goldenmatch |
| GoldenPipe | Orchestrate the full pipeline | pip install goldenpipe |
What's New in v1.4.0
- Scoring & survivorship quality -- MST-based cluster auto-splitting at weakest edges, cluster quality labels (strong/weak/split), quality-weighted survivorship strategies using GoldenCheck scores, field-level provenance tracking.
- Smart auto-config -- auto-config now profiles cleaned data (after GoldenCheck/GoldenFlow), detects data domains and extracts identifiers, selects learned blocking for large datasets, enables reranking for multi-field matchkeys, adjusts thresholds from data quality.
- GoldenFlow integration -- optional data transformation step in the pipeline. Phone normalization, date standardization, categorical correction. `pip install goldenmatch[transform]`.
- `llm_auto` flag -- `dedupe_df(df, llm_auto=True)` auto-enables the LLM scorer ($0.05 budget cap) and memory store when an API key is detected.
What's New in v1.3.0
- CCMS cluster comparison -- compare two clustering outcomes without ground truth using the Case Count Metric System (Talburt et al.). Classifies each cluster as unchanged, merged, partitioned, or overlapping. Includes Talburt-Wang Index (TWI) for normalized similarity.
- Parameter sensitivity analysis -- sweep threshold, blocking, or matchkey parameters across a range and compare each run against a baseline. `stability_report()` identifies optimal value ranges. Failed sweep points are logged and skipped, preserving partial results.
- New CLI commands -- `goldenmatch compare-clusters` for ad-hoc comparison, `goldenmatch sensitivity` for automated parameter tuning.
- New Python API -- `compare_clusters()`, `CompareResult`, `run_sensitivity()`, `SensitivityResult`, `SweepParam` exported from `goldenmatch`.
What's New in v1.2.7
- Auto-config cardinality guards -- three new guards prevent auto-config failures on edge-case data:
- Blocking: excludes near-unique columns (cardinality_ratio >= 0.95)
- Matchkeys: skips exact matchkeys for low-cardinality columns (cardinality_ratio < 0.01)
- Description columns: routes long text to fuzzy matching (token_sort) alongside embedding
- Library comparison benchmarks -- head-to-head results against Splink, Dedupe, and RecordLinkage on Febrl (0.971 F1) and DBLP-ACM (0.918 F1). GoldenMatch is the most consistent performer across data types.
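The cardinality-ratio guard is easy to sketch. The thresholds below come from the bullets above; the function itself is an illustrative sketch, not GoldenMatch's internal code:

```python
def cardinality_ratio(values):
    """Distinct values divided by non-null count."""
    non_null = [v for v in values if v not in (None, "")]
    return len(set(non_null)) / len(non_null) if non_null else 0.0

def guard_column(name, values):
    """Classify a column the way the auto-config guards do:
    near-unique columns are excluded from blocking, and
    near-constant columns get no exact matchkey."""
    r = cardinality_ratio(values)
    if r >= 0.95:
        return "exclude-from-blocking"  # near-unique (e.g. an ID column)
    if r < 0.01:
        return "skip-exact-matchkey"    # almost constant, no signal
    return "ok"

print(guard_column("id", [1, 2, 3, 4, 5]))                     # exclude-from-blocking
print(guard_column("country", ["US"] * 500))                   # skip-exact-matchkey
print(guard_column("surname", ["lee", "kim", "lee", "park"]))  # ok
```

An ID column blocks nothing (every block has one record), and a near-constant column blocks everything into one giant block; both failure modes are caught before the pipeline runs.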
What's New in v1.2.6
- Iterative LLM calibration -- instead of scoring all candidates, calibrates the decision threshold from ~200 sampled pairs. Typically converges in 2-3 rounds at negligible cost ($0.01 on a 401K-row equipment dataset).
- ANN hybrid blocking -- oversized blocks that exceed the max block size now fall back to embedding-based ANN sub-blocking automatically, keeping blocks tractable without manual tuning.
- Auto-config classification fixes -- improved heuristics for ID and price fields, utility-based field ranking to select better blocking keys, and LLM-assisted classification for ambiguous column names.
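The core of threshold calibration from a labeled sample can be sketched as a simple sweep (an illustrative sketch, not GoldenMatch's exact algorithm): given similarity scores and LLM labels for a sample of pairs, pick the threshold that maximizes F1 on that sample.

```python
def calibrate_threshold(scores, labels, steps=101):
    """Pick the threshold in [0, 1] maximizing F1 on a labeled sample.

    scores: pair similarity scores in [0, 1]
    labels: True for 'same entity', False otherwise
    """
    best_t, best_f1 = 0.5, -1.0
    for i in range(steps):
        t = i / (steps - 1)
        tp = sum(s >= t and l for s, l in zip(scores, labels))
        fp = sum(s >= t and not l for s, l in zip(scores, labels))
        fn = sum(s < t and l for s, l in zip(scores, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Toy sample: LLM labeled three matches and three non-matches.
scores = [0.95, 0.9, 0.85, 0.6, 0.4, 0.3]
labels = [True, True, True, False, False, False]
print(calibrate_threshold(scores, labels))  # a threshold between 0.6 and 0.85
```

Because only the sample is LLM-labeled (not every candidate pair), the cost stays proportional to the sample size rather than the dataset size.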
Author
Ben Severn
License
MIT