
Entity resolution toolkit -- deduplicate records, match across sources, and maintain golden records


🟡 GoldenMatch

Find duplicate records in 30 seconds. No rules to write, no models to train.

Zero-config entity resolution for Python & TypeScript -- with a self-verifying auto-config that tells you when it's unsure.


PyPI npm Python Node License: MIT

CI codecov DQBench ER F1: 97.2%

PyPI downloads npm downloads GitHub stars

Docs Smithery MCP MCP Marketplace Open In Colab

GoldenMatch web workbench -- pair drilldown with NL prose

Pair drilldown in the web workbench: cluster members, field-level diff, and a one-line NL explanation per pair. pip install goldenmatch[web], then goldenmatch serve-ui <project>. More screenshots →

# Python
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Node.js
npm install goldenmatch

🆕 v1.6.0 (Python) + v0.4.0 (npm) -- cross-language Learning Memory parity -- a correction written by Python applies identically in TypeScript and vice versa: byte-identical SHA-256 hashes, the same SQLite schema, and the same collision-safe re-anchor algorithm, verified on every CI run by JSON + SQLite + apply-outcome parity tests on both sides. Steward decisions, unmerges, LLM votes, and agent approvals persist to a local SQLite store, re-anchor across row reorders via record-hash, and apply automatically on the next run. The pipeline reports Memory: N applied, M stale, K stale-ambiguous, J unanchorable in postflight. New CLI subgroup goldenmatch memory (goldenmatch-js memory in TS), five new MCP tools per runtime, and goldenmatch.add_correction() / learn() / memory_stats(). Off by default. See Learning Memory.

v1.5.0 -- Auto-config preflight + postflight verification layer (still on by default). See Auto-Config Verification. Built by Ben Severn.


Why GoldenMatch?

  • Zero-config -- auto-detects columns, picks scorers, and runs. No training data needed
  • 97.2% F1 on DBLP-ACM out of the box. DQBench ER score: 95.30
  • Learning Memory -- corrections from stewards, unmerges, and LLM votes persist to disk and apply automatically on the next run; survives row reorders via record-hash re-anchoring (v1.6.0)
  • Privacy-preserving -- match across organizations without sharing raw data (PPRL, 92.4% F1)
  • 35 MCP tools -- use from Claude Desktop, Claude Code, or any AI assistant (Smithery)
  • Production-ready -- Postgres sync, daemon mode, lineage tracking, review queues

Choose your path

I want to... Go here
Deduplicate a CSV right now Quick Start
Use from Claude Desktop / AI assistant MCP Server
Build AI agents that deduplicate ER Agent (A2A)
Write Python code Python API
Write TypeScript / Node.js TypeScript API
Deploy to Vercel Edge / Cloudflare Workers TypeScript API
Use the interactive TUI TUI Guide
Train the system on my corrections Learning Memory

All features (click to expand)

Matching

  • 10+ scoring methods -- exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding, dice, jaccard + plugin extensible
  • 8+ blocking strategies -- static, adaptive, sorted neighborhood, multi-pass, ANN, ann_pairs, canopy, learned (data-driven predicate selection)
  • Fellegi-Sunter probabilistic matching -- EM-trained m/u probabilities, automatic threshold estimation
  • LLM scorer with budget controls -- GPT-4o-mini scores borderline pairs for just $0.04. Budget caps, model tiering, graceful degradation
  • Cross-encoder reranking -- re-score borderline pairs with a pre-trained cross-encoder for higher precision
  • Schema-free matching -- auto-maps columns between different schemas (full_name -> first_name + last_name)
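For intuition, the ensemble scorer takes the max() of several string similarities so that token reordering, spelling drift, and phonetic variants each have a chance to match. A self-contained sketch with stand-in metrics (difflib's ratio in place of real Jaro-Winkler, plus a simplified Soundex) -- not the library's implementation, which the benchmarks say is RapidFuzz-based:

```python
import difflib

def soundex(word: str) -> str:
    """Simplified 4-character Soundex code (sketch; real Soundex handles h/w specially)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def edit_sim(a: str, b: str) -> float:
    """Stand-in for an edit-distance similarity: difflib ratio on lowercased strings."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_sort_sim(a: str, b: str) -> float:
    """Order-invariant: sort tokens before comparing."""
    return edit_sim(" ".join(sorted(a.lower().split())),
                    " ".join(sorted(b.lower().split())))

def ensemble(a: str, b: str) -> float:
    """max() over the individual similarities, as the ensemble scorer does."""
    phonetic = 1.0 if soundex(a.split()[0]) == soundex(b.split()[0]) else 0.0
    return max(edit_sim(a, b), token_sort_sim(a, b), phonetic)
```

Reordered names score 1.0 via token sort, and "Smyth"/"Smith" survive via the phonetic leg -- which is why max() is robust where any single metric fails.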

Data Quality

  • GoldenCheck integration -- pip install goldenmatch[quality] adds data quality scanning (encoding, Unicode, format validation)
  • GoldenFlow transforms -- pip install goldenmatch[transform] normalizes phone numbers, dates, categorical spelling
  • Anomaly detection -- flag fake emails, placeholder data, suspicious records

Golden Records

  • 5 merge strategies -- most_complete, majority_vote, source_priority, most_recent, first_non_null
  • Quality-weighted survivorship -- fields scored by source quality from GoldenCheck
  • Field-level provenance -- tracks which source row contributed each field
  • Cluster quality scoring -- clusters labeled strong/weak/split; oversized clusters auto-split via MST
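For intuition, two of the five survivorship strategies can be sketched in plain Python (illustrative only; the engine adds quality weighting and field-level provenance on top):

```python
from collections import Counter

def most_complete(cluster: list[dict]) -> dict:
    """Start from the member with the fewest empty fields, then fill gaps
    from the remaining members in completeness order."""
    ranked = sorted(cluster, key=lambda r: sum(v in (None, "") for v in r.values()))
    golden: dict = {}
    for record in ranked:
        for field, value in record.items():
            if golden.get(field) in (None, "") and value not in (None, ""):
                golden[field] = value
    return golden

def majority_vote(cluster: list[dict], field: str):
    """Most common non-empty value for one field wins."""
    votes = Counter(r[field] for r in cluster if r.get(field) not in (None, ""))
    return votes.most_common(1)[0][0] if votes else None
```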

Privacy

  • PPRL multi-party linkage -- match across organizations without sharing raw data (92.4% F1 on FEBRL4)
  • PPRL auto-configuration -- profiles your data and picks optimal fields, bloom filter parameters, and threshold
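PPRL compares Bloom-filter encodings of field values instead of the raw values; similarity comes from set overlap of the encoded bits. A minimal sketch of the idea (bigram encoding + Dice coefficient; filter size and hash count are illustrative, not the library's defaults):

```python
import hashlib

def bloom_encode(value: str, size: int = 256, hashes: int = 4) -> set[int]:
    """Map a string's character bigrams to Bloom-filter bit positions.
    Only these positions leave the organization -- never the raw value."""
    padded = f"_{value.lower()}_"
    bits = set()
    for bigram in (padded[i:i + 2] for i in range(len(padded) - 1)):
        for seed in range(hashes):
            digest = hashlib.sha256(f"{seed}:{bigram}".encode()).hexdigest()
            bits.add(int(digest, 16) % size)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient over set bits: 2|A∩B| / (|A|+|B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0
```

Similar names share most bigrams and therefore most bit positions, so their Dice score stays high even though neither party ever sees the other's plaintext.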

Integration

  • REST API + MCP Server -- 30 tools for matching, explaining, reviewing, data quality, and transforms
  • A2A Agent -- 10 skills for AI-to-AI autonomous entity resolution
  • Database sync -- incremental Postgres matching with persistent ANN index
  • Enterprise connectors -- Snowflake, Databricks, BigQuery, HubSpot, Salesforce
  • DuckDB backend -- out-of-core processing for 10M+ records without Spark
  • Ray distributed backend -- scale to 50M+ records with pip install goldenmatch[ray]
  • dbt integration -- dbt-goldenmatch package for DuckDB-based ER in dbt pipelines

Learning Memory (v1.6.0)

  • Persistent corrections -- every steward decision, unmerge, boost-tab y/n, LLM vote, and agent approve/reject writes to a local SQLite (or Postgres) store
  • Re-anchor via record_hash -- corrections survive row reordering and refresh; ambiguous re-anchors report as stale_ambiguous rather than misapplying
  • Automatic application -- dedupe_df and match_df overlay learned thresholds before scoring and apply hard 1.0/0.0 overrides after; postflight reports impact
  • Threshold learner -- trust-weighted grid search auto-tunes matchkey thresholds once 10+ corrections accumulate
  • CLI / Python / MCP triad -- goldenmatch memory stats|learn|export|import|show, goldenmatch.add_correction() / learn() / memory_stats(), and 5 new MCP tools (list_corrections, add_correction, learn_thresholds, memory_stats, memory_export)
  • Off by default -- zero-config posture preserved; opt in via config.memory.enabled = True
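The re-anchor idea in a nutshell: corrections are keyed by a hash of the record's normalized content rather than its row index, so they survive reorders; when two rows hash identically, the anchor is ambiguous and is reported instead of guessed at. A hedged sketch, not the shipped collision-safe algorithm:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable SHA-256 over sorted, normalized field values (row position irrelevant)."""
    canon = json.dumps({k: str(v).strip().lower() for k, v in record.items()},
                       sort_keys=True)
    return hashlib.sha256(canon.encode()).hexdigest()

def reanchor(target_hash: str, rows: list[dict]) -> tuple:
    """Find the row a stored correction refers to after a reorder or refresh."""
    hits = [i for i, r in enumerate(rows) if record_hash(r) == target_hash]
    if len(hits) == 1:
        return ("anchored", hits[0])
    if len(hits) > 1:
        return ("stale_ambiguous", None)   # duplicate content: refuse to guess
    return ("unanchorable", None)          # record no longer present
```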

Developer Experience

  • Gold-themed TUI -- interactive interface with keyboard shortcuts, live threshold tuning
  • Active learning boost -- label 10 borderline pairs in the TUI, retrain a classifier for 99% accuracy
  • Review queue -- REST endpoint surfaces borderline pairs for data steward approval
  • Merge preview + undo -- rollback any run or unmerge individual records
  • Lineage tracking -- every merge decision saved with per-field score breakdown
  • Natural language explainability -- template-based per-pair and per-cluster explanations at zero LLM cost
  • Evaluation CLI -- goldenmatch evaluate reports precision/recall/F1 against ground truth
  • 7 domain packs -- electronics, software, healthcare, financial, real estate, people, retail
  • Plugin architecture -- extend with custom scorers, transforms, connectors via pip
  • Streaming / CDC mode -- incremental record matching with micro-batch or immediate processing
  • GitHub Actions "Try It" -- zero-install demo via workflow_dispatch
  • Codespaces ready -- one-click dev environment

TypeScript / Node.js

GoldenMatch ships an npm package with full feature parity -- same scorers, clustering, golden records, and YAML configs.

npm install goldenmatch
import { dedupe } from "goldenmatch";

const rows = [
  { id: 1, name: "John Smith", email: "john@example.com", zip: "12345" },
  { id: 2, name: "Jon Smith",  email: "john@example.com", zip: "12345" },
  { id: 3, name: "Jane Doe",   email: "jane@example.com", zip: "54321" },
];

const result = dedupe(rows, {
  fuzzy: { name: 0.85 },
  blocking: ["zip"],
  threshold: 0.85,
});

console.log(result.stats);  // { totalRecords: 3, totalClusters: 2, ... }
  • Edge-safe core -- runs in browsers, Vercel Edge Runtime, Cloudflare Workers, Deno
  • Feature parity with Python: fuzzy scorers, probabilistic Fellegi-Sunter, PPRL, graph ER, LLM reranking, MCP/REST/A2A servers, 11+ CLI commands, interactive TUI
  • 478 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes)
  • Zero-dependency install -- optional peer deps unlock native paths (hnswlib-node, @huggingface/transformers for ONNX cross-encoder, piscina for worker threads, pg/duckdb/snowflake for data connectors)

Full docs: benzsevern.github.io/goldenmatch/typescript. See packages/goldenmatch-js/examples/ for 10+ usage examples.

Web UI

pip install 'goldenmatch[web]'
goldenmatch serve-ui                                         # current dir as project
goldenmatch serve-ui packages/python/goldenmatch/web/demo    # bundled demo project

Localhost browser workbench. Editorial gold-on-cream design, single process, no auth -- for the dev-on-a-laptop case.


It surfaces the engine's full capability stack as 7 pages:

Page What you can do
Project (/) Browse saved runs, auto-run from data.csv, see GoldenCheck quality findings as a banner
Workbench (/workbench) Edit matchkey rules + threshold + standardization + blocking + per-row matchkey type (exact / weighted / probabilistic). Run sampled previews. Save back to goldenmatch.yml (atomic write + .bak). Auto-configure with optional domain-pack pinning (electronics, people, healthcare, …).
Inspector (/runs/{name}) Cluster table + member view + pair drilldown with field-level diff + one-line NL prose explanation per pair. Label pairs (mirrors to Learning Memory). Unmerge a record or shatter a cluster. F1/precision/recall vs your labels.
Match (/match) One-to-many target × reference workflow. Different output shape from dedupe -- flat target → reference mapping + unmatched targets.
Compare (/compare) Run A vs B classification (CCMS): unchanged / merged / partitioned / overlapping per cluster, plus the Talburt-Wang Index over the whole transformation. No labels needed.
Sensitivity (/sensitivity) Sweep one parameter (threshold / blocking max-block-size / per-matchkey threshold), CCMS-compare each point against the baseline. Cluster-count sparkline + most-stable-value report.
Memory (/memory) Browse the Learning Memory store (corrections + sources + trust + matchkey). Trigger a learn pass. Stored adjustments table.

Workbench


Every change validates through the same Pydantic schema the engine uses; 422 errors render inline next to the offending field. Save writes the canonical shape (matchkey: singular, the shape goldenmatch dedupe reads) and snapshots the prior file to goldenmatch.yml.bak before clobbering.
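The snapshot-then-atomic-replace pattern described above can be sketched with stdlib calls (a generic sketch of the pattern, not the project's code):

```python
import os
import shutil
import tempfile

def atomic_save(path: str, content: str) -> None:
    """Snapshot the old file to <path>.bak, then replace atomically so a
    crash mid-write can never leave a half-written config behind."""
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")      # prior file survives as .bak
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, path)                      # atomic on POSIX and Windows
```

Writing the temp file in the destination directory matters: os.replace is only atomic within one filesystem.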

Inspector


Each pair card shows a one-line template explanation above the field breakdown -- derived from the field scores via goldenmatch.core.explain.explain_pair_nl, no LLM cost. Labels mirror to the same MemoryStore the pipeline reads on every run via apply_corrections, so the loop closes end-to-end.

Compare runs (CCMS)


CCMS classification (Talburt et al., arXiv:2601.02824v1, 2026): every cluster from run A is mapped to one of unchanged / merged / partitioned / overlapping with respect to run B. Mismatched row-ID coverage between the two runs surfaces as a clean 400 with the engine's diagnostic intact.
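For intuition, a toy version of the CCMS classification and of the Talburt-Wang Index as commonly defined (square root of the product of the two cluster counts, divided by the number of nonempty pairwise intersections). This is a sketch of the idea, not the paper's exact algorithm:

```python
def classify_cluster(cluster_a: set, run_b: list) -> str:
    """Label one run-A cluster by how its members land in run B's clusters."""
    overlaps = [b for b in run_b if cluster_a & b]
    if len(overlaps) == 1:
        if overlaps[0] == cluster_a:
            return "unchanged"
        if cluster_a < overlaps[0]:
            return "merged"            # absorbed into a larger B cluster
    elif overlaps and all(b <= cluster_a for b in overlaps):
        return "partitioned"           # split cleanly into several B clusters
    return "overlapping"               # partial overlap in both directions

def talburt_wang(run_a: list, run_b: list) -> float:
    """TWI = sqrt(|A| * |B|) / (number of nonempty pairwise intersections).
    Identical clusterings score 1.0; more disagreement pushes it toward 0."""
    v = sum(1 for a in run_a for b in run_b if a & b)
    return (len(run_a) * len(run_b)) ** 0.5 / v
```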

Sensitivity sweep


Re-runs the pipeline at each sweep value on a sampled slice (default 500 rows, configurable per-request up to 10K), CCMS-compares each point against the baseline, and surfaces the most-stable value alongside the per-point TWI / cluster-count / case breakdown.

Match (target × reference)


Different output shape from dedupe -- match has no clusters. Both target and reference paths are resolved under the project root with a path-traversal guard. Auto-configure mode skips the workbench rules and profiles both files together.

Memory store browser


Every label you save in the inspector mirrors into the engine's Learning Memory store. The pipeline reads it on every run, so the next dedupe picks up the decision automatically. Threshold tuning fires at ≥10 corrections; weight learning at ≥50.

Build / dev

# Backend tests
pytest packages/python/goldenmatch/tests/web -q     # 100+ tests

# Frontend build (TypeScript + Vite)
pnpm -C packages/python/goldenmatch/web/frontend install
pnpm -C packages/python/goldenmatch/web/frontend test
pnpm -C packages/python/goldenmatch/web/frontend build

# Stage build output into the wheel-included static dir
python packages/python/goldenmatch/scripts/build_web.py

Frontend source lives outside the package at web/frontend/; build output lands inside the package at goldenmatch/web/static/ (gitignored except for a .gitkeep, included in the wheel via force-include). The dev server (pnpm dev) proxies /api/v1/* to http://localhost:5050.

Installation

pip install goldenmatch                   # core (files only)
pip install goldenmatch[embeddings]       # + sentence-transformers, FAISS
pip install goldenmatch[llm]              # + Claude/OpenAI for LLM boost
pip install goldenmatch[postgres]         # + Postgres database sync
pip install goldenmatch[snowflake]        # + Snowflake connector
pip install goldenmatch[bigquery]         # + BigQuery connector
pip install goldenmatch[databricks]       # + Databricks connector
pip install goldenmatch[salesforce]       # + Salesforce connector
pip install goldenmatch[duckdb]           # + DuckDB backend
pip install goldenmatch[quality]          # + GoldenCheck data quality scanning
pip install goldenmatch[web]              # + localhost browser workbench (FastAPI + React)

# Run the setup wizard to configure GPU, API keys, and database:
goldenmatch setup

Python API

GoldenMatch exposes 95 functions and classes from a single import. See examples/ for complete runnable scripts.

import goldenmatch as gm

Quick Start

import goldenmatch as gm

# Deduplicate a CSV (zero-config)
result = gm.dedupe("customers.csv")

# Exact + fuzzy matching
result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85, "zip": 0.95})
result.golden.write_csv("deduped.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)

# Match across files
result = gm.match("new_customers.csv", "master.csv", fuzzy={"name": 0.85})
result.to_csv("matches.csv")

# With YAML config
result = gm.dedupe("data.csv", config="config.yaml")

# With LLM scorer for product matching
result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)

# With Ray backend for large datasets
result = gm.dedupe("huge.parquet", exact=["email"], backend="ray")

Learning Memory (v1.6.0)

GoldenMatch can remember past steward decisions and apply them automatically on every subsequent run. Reject a pair once -- it stays rejected. Approve a borderline pair once -- it stays approved. After 10+ corrections accumulate against a matchkey, the learner adjusts its threshold so the system stops needing the same correction twice. Off by default; enable via config.memory.enabled = True or a memory: block in YAML. Full guide: Learning Memory docs.

goldenmatch.yml:

matchkeys:
  - name: identity
    type: weighted
    threshold: 0.85
    fields:
      - field: name
        scorer: jaro_winkler
        transforms: [lowercase, strip]
        weight: 1.0
      - field: email
        scorer: exact
        weight: 1.0

blocking:
  strategy: static
  keys:
    - fields: [zip]
      transforms: [lowercase]

memory:
  enabled: true
  backend: sqlite
  path: .goldenmatch/memory.db
  reanchor: true
  dataset: customers
  learning:
    threshold_min_corrections: 10
    weights_min_corrections: 50

Three commands users actually run:

# 1. First run -- produces the review queue
goldenmatch dedupe customers.csv --config goldenmatch.yml

# 2. Steward decides borderline pairs (writes to .goldenmatch/memory.db)
goldenmatch review --config goldenmatch.yml      # interactive TUI

# 3. Re-run -- corrections apply automatically; postflight reports impact
goldenmatch dedupe customers.csv --config goldenmatch.yml
# > Memory: 12 corrections applied, 0 stale, 0 stale-ambiguous, 0 unanchorable

Python API equivalent:

import goldenmatch

# Programmatically register a correction
goldenmatch.add_correction(
    id_a=42, id_b=87, decision="reject", source="steward",
    reason="Different EIN despite name match", dataset="customers",
)

# Force a learning pass (otherwise auto-runs at next pipeline call)
adjustments = goldenmatch.learn()
print(f"Adjusted {len(adjustments)} matchkey thresholds")

# Inspect what's stored
print(goldenmatch.memory_stats())

MCP equivalent (from Claude Desktop / Code):

"Show me uncertain pairs from the last goldenmatch run on customers.csv, then mark rows 17 and 23 as not-a-match because they have different EINs."

The host LLM calls list_corrections -> add_correction -> learn_thresholds.

Auto-Config Verification (v1.5.0)

Zero-config used to crash on bibliographic and domain-extracted schemas -- auto-config would emit a matchkey referencing __title_key__ without enabling config.domain, and the pipeline would raise ValueError: Missing required columns. v1.5.0 closes the gap with a preflight + postflight verification layer that runs automatically around auto_configure_df.

Preflight (gm.preflight) runs 6 checks at the end of auto_configure_df:

  • column resolution (auto-repairs missing domain-extracted columns by enabling config.domain)
  • cardinality bounds on exact matchkeys (drops near-unique and near-constant keys)
  • block-size sanity (flags blocks that would stall the scorer)
  • remote-asset demotion (any embedding, record_embedding, or cross-encoder rerank is demoted unless you pass allow_remote_assets=True)
  • confidence-gated weight capping (low-confidence fields cap at weight 0.3)

Unrepairable issues raise ConfigValidationError with the full PreflightReport attached as err.report. Repaired issues stay on the report as findings with repaired=True.

Postflight (gm.postflight) runs 4 signals after scoring, before clustering:

  • score-distribution histogram + bimodality detection (auto-nudges threshold on clear bimodality)
  • blocking-recall estimate (gated at 10K+ rows)
  • preliminary cluster sizes + oversized-cluster bottleneck pair
  • threshold-band overlap percentage (advises --llm-auto when overlap > 20% and LLM is off)

The report attaches to DedupeResult.postflight_report / MatchResult.postflight_report.

import goldenmatch as gm
import polars as pl

df = pl.read_csv("bibliography.csv")

# Zero-config -- preflight + postflight run automatically
result = gm.dedupe_df(df)

# Inspect the preflight report (private-by-convention underscore)
for finding in result.config._preflight_report.findings:
    print(f"[{finding.severity}] {finding.check}: {finding.message}")

# Inspect postflight signals (public)
sig = result.postflight_report.signals
print(f"Scored {sig['total_pairs_scored']} pairs")
print(f"Threshold overlap: {sig['threshold_overlap_pct']:.1%}")
print(f"Oversized clusters: {len(sig['oversized_clusters'])}")
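The bimodality signal feeding the threshold auto-nudge can be imagined as a valley search over the score histogram: two modes (non-matches low, matches high) separated by a clear dip suggest a natural threshold. A hedged sketch with illustrative bin count and dip criterion, not the library's detector:

```python
def find_valley(scores: list, bins: int = 20):
    """Return a threshold at the deepest interior valley of the score
    histogram, or None when no clear two-mode shape exists."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    best_gap, best_bin = 0, None
    for i in range(1, bins - 1):
        left = max(counts[:i])           # tallest bar to the left
        right = max(counts[i + 1:])      # tallest bar to the right
        gap = min(left, right) - counts[i]
        if gap > best_gap:
            best_gap, best_bin = gap, i
    if best_bin is None or best_gap < len(scores) * 0.05:  # demand a real dip
        return None
    return (best_bin + 0.5) / bins       # bin center as candidate threshold
```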

Offline by default. Remote-asset scorers are demoted unless you opt in:

cfg = gm.auto_configure_df(df, allow_remote_assets=True)  # loads cross-encoder etc.

Strict mode for parity runs. strict=True still computes postflight signals and emits advisories, but skips threshold adjustments -- use it for DQBench, regression suites, and any reproducible output:

cfg = gm.auto_configure_df(df, strict=True)

New classifier smarts in v1.5.0:

  • Columns with cardinality ≥ 0.95 are classified as identifier, not phone / zip / numeric.
  • New year col_type routes to blocking, not scoring.
  • New multi_name col_type handles comma/semicolon-delimited author-style fields.
  • Low-confidence fields (< 0.5) cap at weight 0.3.
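The first two rules above can be sketched as a toy column classifier (the 0.95 cutoff is from the list; everything else -- the year heuristic, the category names -- is simplified illustration):

```python
def classify_column(values: list) -> str:
    """Toy version of the v1.5.0 cardinality and year rules."""
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "empty"
    cardinality = len(set(non_null)) / len(non_null)
    if cardinality >= 0.95:
        return "identifier"   # near-unique: join/block on it, don't fuzzy-score
    if all(v.isdigit() and len(v) == 4 and 1500 <= int(v) <= 2100 for v in non_null):
        return "year"         # routes to blocking, not scoring
    return "text"
```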

See examples/verification_inspection.py and examples/strict_mode_parity.py for runnable walkthroughs.

Privacy-Preserving Linkage

import goldenmatch as gm

# Auto-configured PPRL (picks fields and threshold automatically)
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches across {len(result['clusters'])} clusters")

# Manual field selection
result = gm.pprl_link("party_a.csv", "party_b.csv",
    fields=["first_name", "last_name", "dob", "zip"],
    threshold=0.85, security_level="high")

# Auto-config analysis
config = gm.pprl_auto_config(df)
print(config.recommended_fields)  # ['first_name', 'last_name', 'zip_code', 'birth_year']

Evaluate Accuracy

import goldenmatch as gm

# Measure precision/recall/F1 against ground truth
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")

# Evaluate programmatically
result = gm.evaluate_pairs(predicted_pairs, ground_truth_set)
print(result.f1)

Build Configs Programmatically

import goldenmatch as gm

# Auto-generate config from data
config = gm.auto_configure([("data.csv", "source")])

# Or build manually
config = gm.GoldenMatchConfig(
    matchkeys=[
        gm.MatchkeyConfig(name="exact_email", type="exact",
            fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
        gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
            fields=[
                gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
                gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
            ]),
    ],
    blocking=gm.BlockingConfig(strategy="learned"),
    llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
    backend="ray",
)

Streaming / Incremental

import goldenmatch as gm

# Match a single new record against existing data
matches = gm.match_one(new_record, existing_df, matchkey)

# Stream processor for continuous matching
processor = gm.StreamProcessor(df, config)
matches = processor.process_record(new_record)

Advanced Features

import goldenmatch as gm

# Domain extraction
rulebooks = gm.discover_rulebooks()  # 7 built-in packs
enhanced_df, low_conf = gm.extract_with_rulebook(df, "title", rulebooks["electronics"])

# Fellegi-Sunter probabilistic
em_result = gm.train_em(df, matchkey, n_sample_pairs=10000)
pairs = gm.score_probabilistic(block_df, matchkey, em_result)

# Explain a match decision
explanation = gm.explain_pair(record_a, record_b, matchkey)

# Cluster operations
gm.unmerge_record(record_id, clusters)  # Remove from cluster
gm.unmerge_cluster(cluster_id, clusters)  # Shatter to singletons

# Data quality
df, fixes = gm.auto_fix_dataframe(df)
anomalies = gm.detect_anomalies(df)
column_map = gm.auto_map_columns(df_a, df_b)  # Schema matching

# Graph ER (multi-table)
clusters = gm.run_graph_er(entities, relationships)

Setup Wizard

Run goldenmatch setup for an interactive walkthrough:


Guides you through GPU mode selection, Vertex AI / Colab / local GPU configuration, LLM boost API keys, and database sync -- with copy-paste commands at every step.


Why GoldenMatch?

GoldenMatch dedupe recordlinkage Zingg Splink
Zero-config mode Yes No (requires training) No (manual config) No (Spark required) No (SQL config)
Fuzzy + probabilistic + LLM All three Probabilistic only Probabilistic only ML-based Probabilistic only
Privacy-preserving (PPRL) Built-in (92.4% F1) No No No No
Interactive TUI Yes No No No No
Golden record synthesis 5 strategies No No No No
MCP server (AI integration) Yes (35 tools) No No No No
Database sync Postgres + DuckDB No No No Spark/DuckDB
Single pip install Yes Yes Yes No (Java/Spark) Yes
Polars-native Yes No (pandas) No (pandas) No (Spark) Yes (DuckDB)

GoldenMatch is the only tool that combines zero-config operation, probabilistic matching, LLM scoring, privacy-preserving linkage, and golden record synthesis in a single Python package.

Quick Start

Zero-Config (no YAML needed)

goldenmatch dedupe customers.csv

Auto-detects column types (name, email, phone, zip, address, description), assigns appropriate scorers, picks blocking strategy, and launches the TUI for review.

With Config

goldenmatch dedupe customers.csv --config config.yaml --output-all --output-dir results/

Match Mode

goldenmatch match targets.csv --against reference.csv --config config.yaml --output-all

Database Sync

# First run: full scan, create metadata tables
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Subsequent runs: incremental (only new records)
goldenmatch sync --table customers --connection-string "$DATABASE_URL"

How It Works

Files/DB → Ingest → Standardize → Block → Score → Cluster → Golden Records → Output
                                    ↑       ↑
                          SQL blocking    10 scorers
                          ANN blocking    ensemble
                          7 strategies    embeddings
                                          parallel blocks

Pipeline:

  1. Ingest -- CSV, Excel, Parquet, or Postgres table
  2. Standardize -- configurable per-column transforms
  3. Block -- reduce comparison space (multi-pass, ANN, canopy, etc.)
  4. Score -- compare record pairs with appropriate scorer
  5. Cluster -- group matches via Union-Find; auto-split oversized clusters via MST; assign quality labels (strong/weak/split)
  6. Golden -- merge each cluster into one canonical record using quality-weighted survivorship (5 strategies); track field-level provenance
  7. Output -- files (CSV/Parquet) or database tables + lineage JSON sidecar with provenance
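Step 5's Union-Find grouping is the classic disjoint-set construction; a compact sketch of how matched pairs become clusters:

```python
def cluster_pairs(pairs: list) -> list:
    """Group matched record IDs into clusters via Union-Find with path compression."""
    parent: dict = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two components

    clusters: dict = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

Transitivity falls out for free: (1,2) and (2,3) as separate matched pairs still land 1, 2, and 3 in one cluster.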

Config Reference

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true             # re-score borderline pairs with cross-encoder
    rerank_band: 0.1         # pairs within threshold +/- 0.1 get reranked
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5}  # bias embedding toward title

llm_scorer:
  enabled: true              # score borderline pairs with GPT/Claude
  auto_threshold: 0.95       # auto-accept pairs above this
  candidate_lo: 0.75         # LLM scores pairs in [0.75, 0.95]
  # provider: openai         # auto-detected from OPENAI_API_KEY
  # model: gpt-4o-mini       # default, cheapest option

blocking:
  strategy: adaptive         # static | adaptive | sorted_neighborhood | multi_pass | ann | ann_pairs | canopy | learned
  auto_select: true          # auto-pick best key by histogram analysis
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]

golden_rules:
  default_strategy: most_complete
  auto_split: true                  # Auto-split oversized clusters via MST
  quality_weighting: true           # Use GoldenCheck quality scores in survivorship
  weak_cluster_threshold: 0.3       # Edge gap threshold for confidence downgrade
  field_rules:
    email: { strategy: majority_vote }
    first_name: { strategy: source_priority, source_priority: [crm, marketing] }

output:
  directory: ./output
  format: csv

Scorers

Scorer Description Best For
exact Binary match Email, phone, ID
jaro_winkler Edit distance similarity Names
levenshtein Normalized Levenshtein General strings
token_sort Order-invariant token matching Names, addresses
soundex_match Phonetic match Names
ensemble max(jaro_winkler, token_sort, soundex) Names with reordering
embedding Cosine similarity of sentence embeddings Semantic matching
record_embedding Embed concatenated fields Cross-field semantic matching
dice Dice coefficient on bloom filters Privacy-preserving matching
jaccard Jaccard similarity on bloom filters Privacy-preserving matching

Blocking Strategies

Strategy Description
static Group by blocking key (default)
adaptive Static + recursive sub-blocking for oversized blocks
sorted_neighborhood Sliding window over sorted records
multi_pass Union of blocks from multiple passes (best for noisy data)
ann ANN via FAISS on sentence-transformer embeddings
ann_pairs Direct-pair ANN scoring (50-100x faster than ann)
canopy TF-IDF canopy clustering
learned Data-driven predicate selection (auto-discovers blocking rules)
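For intuition, sorted neighborhood reduces to: sort records on a blocking key, then compare only records within a sliding window. A minimal sketch (window size illustrative):

```python
def sorted_neighborhood_pairs(records: list, key: str, window: int = 3) -> set:
    """Candidate pairs from a sliding window over key-sorted records.
    Returns (i, j) index pairs with i < j."""
    ordered = sorted(range(len(records)), key=lambda i: str(records[i].get(key, "")))
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1 : pos + window]:   # only near neighbors in sort order
            pairs.add((min(i, j), max(i, j)))
    return pairs
```

Near-duplicates sort adjacently, so they stay inside the window while the comparison count drops from quadratic to roughly linear in the record count.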

Database Integration

GoldenMatch can sync against live Postgres databases with incremental matching:

pip install goldenmatch[postgres]

goldenmatch sync \
  --table customers \
  --connection-string "postgresql://user:pass@localhost/mydb" \
  --config config.yaml

Features:

  • Incremental sync -- only processes records added since last run
  • Hybrid blocking -- SQL WHERE clauses for exact fields + FAISS ANN for semantic fields, results unioned
  • Persistent ANN index -- disk cache + DB source of truth, progressive embedding across runs
  • Golden record versioning -- append-only with is_current flag, full audit trail
  • Cluster management -- persistent clusters with merge, conflict detection, max size safety cap

Metadata tables (auto-created):

Table Purpose
gm_state Processing state, watermarks
gm_clusters Persistent cluster membership
gm_golden_records Versioned golden records
gm_embeddings Cached embeddings for ANN
gm_match_log Audit trail of all match decisions

SQL Extensions

Use GoldenMatch directly from PostgreSQL or DuckDB:

-- PostgreSQL
CREATE EXTENSION goldenmatch_pg;
SELECT goldenmatch.goldenmatch_dedupe_table('customers', '{"exact": ["email"]}');
SELECT goldenmatch.goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');

# DuckDB
pip install goldenmatch-duckdb
import duckdb, goldenmatch_duckdb
con = duckdb.connect()
goldenmatch_duckdb.register(con)
con.sql("SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler')")

See goldenmatch-extensions for installation and full documentation.

LLM Boost (Optional)

For harder datasets where zero-shot scoring isn't enough:

pip install goldenmatch[llm]

# First run: LLM labels ~300 pairs (~$0.30), fine-tunes embedding model
goldenmatch dedupe products.csv --llm-boost

# Subsequent runs: uses saved model ($0)
goldenmatch dedupe products.csv --llm-boost

Tiered auto-escalation:

  • Level 1 -- zero-shot (free, instant)
  • Level 2 -- bi-encoder fine-tuning (~$0.20, ~2 min CPU)
  • Level 3 -- Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)

Active sampling selects the most informative pairs for the LLM to label (uncertainty, disagreement, boundary, diversity), reducing label cost by ~45% compared to random sampling.

Iterative calibration: When many borderline pairs exist, iterative calibration samples ~100 pairs per round, learns the optimal threshold via grid search, and applies it to all candidates -- typically converging in 2-3 rounds.
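The grid-search step can be sketched as: score the labeled sample at each candidate threshold and keep the F1-maximizing value (trust weighting, sampling, and round structure omitted; step size illustrative):

```python
def learn_threshold(labeled: list, step: float = 0.01) -> float:
    """Pick the threshold maximizing F1 over (score, is_match) labeled pairs."""
    best_t, best_f1 = 0.5, -1.0
    t = 0.0
    while t <= 1.0:
        tp = sum(1 for s, y in labeled if s >= t and y)        # true positives
        fp = sum(1 for s, y in labeled if s >= t and not y)    # false positives
        fn = sum(1 for s, y in labeled if s < t and y)         # false negatives
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
        t = round(t + step, 10)
    return best_t
```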

Note: LLM boost is most valuable for product matching with local models (MiniLM) where it improved Abt-Buy from 44.5% to 59.5% F1. For structured data (names, addresses, bibliographic), fuzzy matching alone achieves 97%+ F1.

Benchmarks

Leipzig Entity Resolution Benchmarks

Dataset Best Strategy F1 Cost
DBLP-ACM (2.6K vs 2.3K) multi-pass + fuzzy 97.2% $0
DBLP-Scholar (2.6K vs 64K) multi-pass + fuzzy 74.7% $0
Abt-Buy (1K vs 1K) Vertex AI + GPT-4o-mini scorer 81.7% ~$0.74
Abt-Buy (zero-shot) Vertex AI embeddings 62.8% ~$0.05
Amazon-Google (1.4K vs 3.2K) Vertex AI + reranking 44.0% ~$0.10

Structured data (names, addresses, bibliographic): RapidFuzz multi-pass fuzzy matching at 97.2% -- zero cost, zero labels. Product matching: Vertex AI embeddings for candidate generation + GPT-4o-mini scorer for borderline pairs achieves 81.7% at ~$0.74 total cost.

Throughput (Scale Curve)

Measured on a laptop (17GB RAM) with exact + fuzzy matching, blocking, clustering, and golden record generation:

| Records | Time | Throughput | Pairs Found | Memory |
| --- | --- | --- | --- | --- |
| 1,000 | 0.2s | 5,500 rec/s | 210 | 101 MB |
| 10,000 | 1.4s | 7,300 rec/s | 7,000 | 123 MB |
| 100,000 | 12s | 8,200 rec/s | 571,000 | 544 MB |

Fuzzy matching speedup: parallel block scoring plus intra-field early termination cut the 100K-record fuzzy run from ~100s to ~39s (2.5x) through the full pipeline. The 1M-record exact-only benchmark completes in 7.8s.
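The early-termination idea can be sketched as a weighted scorer that abandons a pair as soon as its best achievable score can no longer reach the threshold (illustrative, not the internal implementation):

```python
# Illustrative sketch of early termination in weighted matchkey scoring:
# track the weight still unscored; if even a perfect score on the remaining
# fields cannot lift the pair above the threshold, stop scoring it.
def weighted_score(field_scores, weights, threshold):
    """field_scores/weights: parallel lists, scores in [0, 1]. Returns
    (score, n_fields_evaluated); score is None when terminated early."""
    total_w = sum(weights)
    acc, remaining = 0.0, total_w
    for i, (s, w) in enumerate(zip(field_scores, weights), 1):
        acc += s * w
        remaining -= w
        if (acc + remaining) / total_w < threshold:  # best case falls short
            return None, i
    return acc / total_w, len(weights)

# A hopeless pair is abandoned after the first field:
weighted_score([0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 1.0, 1.0], threshold=0.7)  # -> (None, 1)
```

The saving compounds because expensive scorers (token_sort, embeddings) tend to sit on the later, lower-weight fields.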

Equipment data (401K rows): 27,937 clusters, 384,650 matched records, 323s total. LLM calibration learned the threshold from 200 pairs (~$0.01). ANN fallback split 15 oversized blocks into 363 sub-blocks.

For datasets over 1M records, use goldenmatch sync (database mode) with incremental matching and persistent ANN indexing. See Large Dataset Mode.

How GoldenMatch Compares

| | GoldenMatch | dedupe | Splink | Zingg | Ditto |
| --- | --- | --- | --- | --- | --- |
| Abt-Buy F1 | 81.7% | ~75% | ~70% | ~80% | 89.3% |
| DBLP-ACM F1 | 97.2% | ~96% | ~95% | ~96% | 99.0% |
| Training required | No | Yes | Yes | Yes | Yes (1000+) |
| Zero-config | Yes | No | No | No | No |
| Interactive TUI | Yes | No | No | No | No |
| Database sync | Postgres | Cloud (paid) | No | No | No |
| REST API / MCP | Both | Cloud only | No | No | No |
| GPU required | No | No | No | Spark | Yes |

GoldenMatch's sweet spot is ease of use + competitive accuracy. On bibliographic matching (DBLP-ACM), GoldenMatch hits 97.2% with zero config. On product matching (Abt-Buy), the LLM scorer reaches 81.7% โ€” within 8pts of Ditto's 89.3%, but with zero training labels and no GPU. Ditto requires 1000+ hand-labeled pairs and a GPU.

Library Comparison (v1.2.7)

Head-to-head against Splink, Dedupe, and RecordLinkage on two datasets. GoldenMatch uses explicit config, zero training data.

Febrl (5,000 synthetic PII records, 6,538 true pairs):

| Library | Precision | Recall | F1 | Time |
| --- | --- | --- | --- | --- |
| Splink | 1.000 | 0.995 | 0.998 | 2.0s |
| GoldenMatch | 1.000 | 0.943 | 0.971 | 6.8s |
| Dedupe | 1.000 | 0.865 | 0.928 | 7.2s |
| RecordLinkage | 0.999 | 0.733 | 0.845 | 2.2s |

DBLP-ACM (4,910 bibliographic records, 2,224 true matches):

| Library | Precision | Recall | F1 | Time |
| --- | --- | --- | --- | --- |
| RecordLinkage | 0.888 | 0.961 | 0.923 | 13.0s |
| GoldenMatch | 0.891 | 0.945 | 0.918 | 6.2s |
| Dedupe | 0.604 | 0.936 | 0.734 | 10.5s |
| Splink | 0.646 | 0.834 | 0.728 | 3.4s |

Key takeaway: GoldenMatch is the most consistent performer โ€” top-2 F1 on both datasets with zero training data. Splink dominates structured PII but struggles on non-PII. RecordLinkage wins DBLP-ACM but lags on PII.

Febrl explicit config example

# Config classes are assumed to be exported from the top-level package
import goldenmatch
from goldenmatch import (
    GoldenMatchConfig, BlockingConfig, BlockingKeyConfig,
    MatchkeyConfig, MatchkeyField,
)

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["surname"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["given_name"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["postcode"], transforms=[]),
            BlockingKeyConfig(fields=["date_of_birth"], transforms=[]),
        ],
        max_block_size=500, skip_oversized=True,
    ),
    matchkeys=[MatchkeyConfig(
        name="person", type="weighted", threshold=0.7,
        fields=[
            MatchkeyField(field="given_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="surname", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="date_of_birth", scorer="exact", weight=1.5),
            MatchkeyField(field="address_1", scorer="token_sort", weight=1.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="postcode", scorer="exact", weight=0.5),
        ],
    )],
)
result = goldenmatch.dedupe_df(df, config=config)

Large Dataset Mode

For datasets over 1M records, use database sync mode. GoldenMatch processes records in chunks, maintains a persistent ANN index, and matches incrementally:

# Load into Postgres, then sync
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Watch for new records continuously
goldenmatch watch --table customers --connection-string "$DATABASE_URL" --interval 30

How it works:

  • Reads in configurable chunks (default 10K) โ€” never loads entire table into memory
  • Hybrid blocking: SQL WHERE for exact fields + persistent FAISS ANN for semantic fields
  • Progressive embedding: computes 100K embeddings per run, ANN improves over time
  • Persistent clusters with golden record versioning

Scale: Tested to 10M+ records in Postgres. For 100M+, use larger chunk sizes and dedicated Postgres infrastructure.
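The chunked-sync idea above can be sketched with a SQLite key index standing in for the real persistent store (illustrative only, not the sync implementation): each new record is matched only against prior records that share its blocking key, and the index survives across chunks.

```python
# Minimal sketch of chunked incremental matching: stream records in
# fixed-size chunks, persist blocking keys to SQLite, and match each new
# record against prior records with the same key -- the full table is
# never held in memory.
import sqlite3

def sync_chunks(chunks, db_path=":memory:"):
    con = sqlite3.connect(db_path)  # persists across runs with a real path
    con.execute("CREATE TABLE IF NOT EXISTS keys (rec_id TEXT, block_key TEXT)")
    con.execute("CREATE INDEX IF NOT EXISTS ix ON keys(block_key)")
    matches = []
    for chunk in chunks:                  # one chunk in memory at a time
        for rec_id, key in chunk:
            for (other,) in con.execute(
                    "SELECT rec_id FROM keys WHERE block_key = ?", (key,)):
                matches.append((other, rec_id))
            con.execute("INSERT INTO keys VALUES (?, ?)", (rec_id, key))
        con.commit()
    return matches

chunks = [[("r1", "smith"), ("r2", "jones")], [("r3", "smith")]]
sync_chunks(chunks)  # -> [("r1", "r3")]
```

The real pipeline layers fuzzy scoring and the FAISS ANN index on top of this exact-key lookup, but the incremental control flow is the same.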

Interactive TUI

GoldenMatch includes a gold-themed interactive terminal UI:

  • Auto-config summary โ€” first screen shows detected columns, scorers, and blocking strategy with Run/Edit/Save options
  • Pipeline progress โ€” full-screen progress with stage tracker (โœ“/โ—/โ—‹) on first run, footer bar on re-runs
  • Split-view matches โ€” cluster list on the left, golden record + member details on the right
  • Live threshold slider โ€” arrow keys adjust threshold in 0.05 increments with instant cluster count preview
  • Keyboard shortcuts โ€” 1-6 jump to tabs (Data, Config, Matches, Golden, Boost, Export), F5 run, ? show all shortcuts, Ctrl+E export

Data profiling:

Data Tab

Match results with cluster detail:

Matches Tab

Golden records:

Golden Tab

Settings Persistence

GoldenMatch saves preferences across sessions:

  • Global: ~/.goldenmatch/settings.yaml โ€” output mode, default model, API keys
  • Project: .goldenmatch.yaml โ€” column mappings, thresholds, blocking config

Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.

CLI Reference

| Command | Description |
| --- | --- |
| goldenmatch demo | Built-in demo with sample data |
| goldenmatch setup | Interactive setup wizard (GPU, API keys, database) |
| goldenmatch dedupe FILE [...] | Deduplicate one or more files |
| goldenmatch match TARGET --against REF | Match target against reference |
| goldenmatch sync --table TABLE | Sync against Postgres database |
| goldenmatch watch --table TABLE | Live stream mode (continuous polling, --daemon for service mode) |
| goldenmatch schedule --every 1h FILE | Run deduplication on a schedule |
| goldenmatch serve FILE [...] | Start REST API server |
| goldenmatch mcp-serve FILE [...] | Start MCP server (Claude Desktop) |
| goldenmatch rollback RUN_ID | Undo a previous merge run |
| goldenmatch unmerge RECORD_ID | Remove a record from its cluster |
| goldenmatch runs | List previous runs for rollback |
| goldenmatch init | Interactive config wizard |
| goldenmatch interactive FILE [...] | Launch TUI |
| goldenmatch profile FILE | Profile data quality |
| goldenmatch evaluate FILE --gt GT.csv | Evaluate matching against ground truth |
| goldenmatch incremental BASE --new NEW | Match new records against existing base |
| goldenmatch analyze-blocking FILE | Analyze data and suggest blocking strategies |
| goldenmatch label FILE --config --gt | Interactively label pairs to build ground truth CSV |
| goldenmatch config save/load/list/show | Manage config presets |
| goldenmatch memory stats/learn/export/import/show | Manage Learning Memory store (v1.6.0) |

Key dedupe flags:

| Flag | Description |
| --- | --- |
| --anomalies | Detect fake emails, placeholder data, suspicious records |
| --preview | Show what will change before writing (merge preview) |
| --diff / --diff-html | Generate before/after change report |
| --dashboard | Before/after data quality dashboard (HTML) |
| --html-report | Detailed match report with charts |
| --chunked | Large dataset mode (process in chunks) |
| --llm-boost | Improve accuracy with LLM-labeled training |
| --daemon | Run watch mode as a background service with health endpoint |
| s3:// / gs:// / az:// | Read directly from cloud storage |

Remote MCP Server

GoldenMatch is available as a hosted MCP server on Smithery โ€” connect from any MCP client without installing anything.

Claude Desktop / Claude Code:

{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}

Local server (if you prefer to run locally):

pip install goldenmatch[mcp]
goldenmatch mcp-serve data.csv

35 tools available: deduplicate files, match records, explain decisions, review borderline pairs, privacy-preserving linkage, configure rules, scan data quality, run transforms, synthesize golden records, and manage Learning Memory (list_corrections, add_correction, learn_thresholds, memory_stats, memory_export).

Architecture

goldenmatch/
โ”œโ”€โ”€ cli/            # 21 CLI commands (Typer)
โ”‚                   #   Python API: 95 public exports from `import goldenmatch as gm`
โ”‚                   #   -- every feature accessible without knowing internal module structure
โ”œโ”€โ”€ config/         # Pydantic schemas, YAML loader, settings
โ”œโ”€โ”€ core/           # Pipeline: ingest, block, score, cluster, golden, explainer,
โ”‚                   #   report, dashboard, graph, anomaly, diff, rollback,
โ”‚                   #   schema_match, chunked, cloud_ingest, api_connector, scheduler,
โ”‚                   #   llm_scorer, lineage, match_one, evaluate, gpu, vertex_embedder,
โ”‚                   #   probabilistic, learned_blocking, streaming, graph_er, domain
โ”œโ”€โ”€ domains/        # 7 built-in YAML domain packs (electronics, software, healthcare, ...)
โ”œโ”€โ”€ plugins/        # Plugin system (scorers, transforms, connectors, golden strategies)
โ”œโ”€โ”€ connectors/     # Enterprise connectors (Snowflake, Databricks, BigQuery, HubSpot, Salesforce)
โ”œโ”€โ”€ backends/       # DuckDB backend for out-of-core processing
โ”œโ”€โ”€ db/             # Postgres: connector, sync, reconcile, clusters, ANN index
โ”œโ”€โ”€ api/            # REST API server
โ”œโ”€โ”€ mcp/            # MCP server for Claude Desktop
โ”œโ”€โ”€ tui/            # Gold-themed Textual TUI + setup wizard
โ””โ”€โ”€ utils/          # Transforms, helpers

Run tests: pytest (924 tests)

Part of the Golden Suite

| Tool | Purpose | Install |
| --- | --- | --- |
| GoldenCheck | Validate & profile data quality | pip install goldencheck |
| GoldenFlow | Transform & standardize data | pip install goldenflow |
| GoldenMatch | Deduplicate & match records | pip install goldenmatch |
| GoldenPipe | Orchestrate the full pipeline | pip install goldenpipe |

What's New in v1.4.0

  • Scoring & survivorship quality — MST-based cluster auto-splitting at weakest edges, cluster quality labels (strong/weak/split), quality-weighted survivorship strategies using GoldenCheck scores, field-level provenance tracking.
  • Smart auto-config — auto-config now profiles cleaned data (after GoldenCheck/GoldenFlow), detects data domains and extracts identifiers, selects learned blocking for large datasets, enables reranking for multi-field matchkeys, adjusts thresholds from data quality.
  • GoldenFlow integration — optional data transformation step in the pipeline. Phone normalization, date standardization, categorical correction. pip install goldenmatch[transform].
  • llm_auto flag — dedupe_df(df, llm_auto=True) auto-enables LLM scorer ($0.05 budget cap) and memory store when API key detected.
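The MST-based auto-split can be sketched with Kruskal's algorithm (illustrative names, not the library API): build a maximum spanning tree over a cluster's pairwise similarities, then cut tree edges weaker than a floor, so the cluster breaks at its weakest links.

```python
# Hedged sketch of MST-based cluster auto-splitting: Kruskal's algorithm on
# the strongest edges first yields a maximum spanning tree; refusing edges
# below `min_edge` cuts that tree at its weakest links.
def mst_split(nodes, edges, min_edge):
    """edges: list of (a, b, similarity). Returns a list of sub-clusters (sets)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b, sim in sorted(edges, key=lambda e: -e[2]):
        if sim >= min_edge and find(a) != find(b):
            parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

edges = [("a", "b", 0.95), ("b", "c", 0.92), ("c", "d", 0.55)]
mst_split(["a", "b", "c", "d"], edges, min_edge=0.7)  # {a, b, c} and {d}
```

Splitting on the MST rather than the raw graph means one weak transitive link cannot hold an otherwise-unrelated record inside a strong cluster.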

What's New in v1.3.0

  • CCMS cluster comparison -- compare two clustering outcomes without ground truth using the Case Count Metric System (Talburt et al.). Classifies each cluster as unchanged, merged, partitioned, or overlapping. Includes Talburt-Wang Index (TWI) for normalized similarity.
  • Parameter sensitivity analysis -- sweep threshold, blocking, or matchkey parameters across a range and compare each run against a baseline. stability_report() identifies optimal value ranges. Failed sweep points are logged and skipped, preserving partial results.
  • New CLI commands -- goldenmatch compare-clusters for ad-hoc comparison, goldenmatch sensitivity for automated parameter tuning.
  • New Python API -- compare_clusters(), CompareResult, run_sensitivity(), SensitivityResult, SweepParam exported from goldenmatch.
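The Talburt-Wang Index admits a compact sketch. As defined in the ER literature, TWI(A, B) = sqrt(|A| * |B|) / |V|, where A and B are two clusterings of the same records and V is the set of non-empty intersections between their clusters. A hypothetical implementation (not the compare_clusters API):

```python
# Hedged sketch of the Talburt-Wang Index: 1.0 means the two clusterings
# are identical; the value falls toward 0 as they diverge.
from math import sqrt

def twi(a, b):
    """a, b: lists of clusters, each cluster a set of record ids."""
    overlaps = sum(1 for ca in a for cb in b if ca & cb)
    return sqrt(len(a) * len(b)) / overlaps

baseline = [{"r1", "r2"}, {"r3", "r4"}]
partitioned = [{"r1", "r2"}, {"r3"}, {"r4"}]
twi(baseline, baseline)     # identical -> 1.0
twi(baseline, partitioned)  # sqrt(6)/3 ~= 0.816
```

Because it needs only the two clusterings, TWI works without ground truth, which is exactly the CCMS use case above.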

What's New in v1.2.7

  • Auto-config cardinality guards โ€” three new guards prevent auto-config failures on edge-case data:
    • Blocking: excludes near-unique columns (cardinality_ratio >= 0.95)
    • Matchkeys: skips exact matchkeys for low-cardinality columns (cardinality_ratio < 0.01)
    • Description columns: routes long text to fuzzy matching (token_sort) alongside embedding
  • Library comparison benchmarks โ€” head-to-head results against Splink, Dedupe, and RecordLinkage on Febrl (0.971 F1) and DBLP-ACM (0.918 F1). GoldenMatch is the most consistent performer across data types.
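The first two cardinality guards reduce to a simple distinct-value ratio check; a sketch with hypothetical function names, using the thresholds from the bullets above:

```python
# Illustrative sketch of the cardinality guards (function names are
# hypothetical, not the auto-config API).
def blocking_eligible(values):
    """Exclude near-unique columns from blocking (cardinality_ratio >= 0.95)."""
    return len(set(values)) / len(values) < 0.95

def exact_matchkey_eligible(values):
    """Skip exact matchkeys on low-cardinality columns (cardinality_ratio < 0.01)."""
    return len(set(values)) / len(values) >= 0.01

ids = [f"id{i}" for i in range(1000)]    # near-unique: useless as a blocking key
states = ["CA", "NY"] * 500              # low-cardinality: useless as an exact key
blocking_eligible(ids), exact_matchkey_eligible(states)  # -> (False, False)
```

A near-unique column produces one record per block (no candidate pairs), while a low-cardinality exact matchkey matches almost everything, which is why both ends of the ratio are guarded.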

What's New in v1.2.6

  • Iterative LLM calibration โ€” instead of scoring all candidates, calibrates the decision threshold from 200 sampled pairs. Typically converges in 2-3 rounds at negligible cost ($0.01 on a 401K-row equipment dataset).
  • ANN hybrid blocking โ€” oversized blocks that exceed the max block size now fall back to embedding-based ANN sub-blocking automatically, keeping blocks tractable without manual tuning.
  • Auto-config classification fixes โ€” improved heuristics for ID and price fields, utility-based field ranking to select better blocking keys, and LLM-assisted classification for ambiguous column names.

Author

Ben Severn

License

MIT
