GoldenMatch
Entity resolution toolkit — deduplicate records, match across sources, and maintain golden records. Works on files or live databases.
Built with Polars, RapidFuzz, sentence-transformers, and FAISS. Zero-config mode auto-detects your data; optional LLM boost for harder datasets.
See it in action
pip install goldenmatch
goldenmatch demo
Features
- Zero-config -- `goldenmatch dedupe file.csv` auto-detects columns, picks scorers, runs automatically
- Gold-themed TUI -- interactive interface with keyboard shortcuts, live threshold tuning, setup wizard
- 10+ scoring methods — exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding, dice, jaccard + plugin extensible
- 8+ blocking strategies — static, adaptive, sorted neighborhood, multi-pass, ANN, ann_pairs, canopy, learned (data-driven predicate selection)
- Fellegi-Sunter probabilistic matching — EM-trained m/u probabilities, automatic threshold estimation, comparison vectors with 2/3-level agreement
- Vertex AI embeddings — 85%+ F1 accuracy with no GPU needed (Google Cloud managed API)
- Database sync — incremental Postgres matching with persistent ANN index and golden record versioning
- REST API + MCP Server — real-time matching via HTTP or Claude Desktop (12 tools: match, unmerge, explain, config advisor, etc.)
- Review queue — REST endpoint surfaces borderline pairs for data steward approval/rejection
- Lineage tracking — every merge decision saved to a JSON sidecar with per-field score breakdown
- Daemon mode -- `goldenmatch watch --daemon` runs as a service with health endpoint and PID file
- Anomaly detection -- flag fake emails, placeholder data, suspicious records
- Merge preview + undo — see what will change before writing, rollback any run or unmerge individual records
- Active learning boost — label 10 borderline pairs in the TUI, instantly retrain a classifier for 99% accuracy
- Cluster confidence scoring — weakly-connected clusters flagged with bottleneck pair identification
- Single-record matching -- `match_one` primitive for streaming: embed, query ANN, score, return matches
- Privacy-preserving matching -- bloom filter transforms + Dice/Jaccard scoring for fuzzy matching on encrypted PII
- PPRL multi-party linkage -- match records across organizations without sharing raw data. Auto-configured bloom filters achieve 92.4% F1 on FEBRL4. Trusted third party and SMC modes.
- PPRL auto-configuration -- zero-config PPRL that profiles your data and picks optimal fields, bloom filter parameters, and threshold automatically
- Before/after dashboard — shareable HTML showing data transformation with charts
- Schema-free matching — auto-maps columns between different schemas (full_name -> first_name + last_name)
- Cloud storage — read directly from S3, GCS, or Azure Blob
- API connector — pull from Salesforce, HubSpot, or any REST/GraphQL API
- Scheduled runs — cron-like scheduling with run history
- LLM scorer with budget controls — GPT-4o-mini scores borderline pairs, boosting product matching from 44.5% to 66.3% F1 (precision 35%→95%) for just $0.04. Budget caps, model tiering, graceful degradation
- LLM boost — optional Claude/GPT-4 labeling + fine-tuning for harder datasets
- Golden records — 5 merge strategies (most_complete, majority_vote, source_priority, most_recent, first_non_null)
- Parallel fuzzy scoring — blocks scored concurrently via thread pool with intra-field early termination
- Cross-encoder reranking — re-score borderline pairs with a pre-trained cross-encoder for higher precision
- Auto-select blocking — histogram analysis picks the best blocking key automatically
- Dynamic block splitting — oversized blocks auto-split by highest-cardinality column (zero config)
- Large dataset mode — chunked processing for files that don't fit in memory
- Plugin architecture — extend with custom scorers, transforms, connectors, and golden strategies via pip-installable plugins
- Enterprise connectors — Snowflake, Databricks, BigQuery, HubSpot, Salesforce (optional deps)
- DuckDB backend — out-of-core processing for 10M+ records without Spark
- Natural language explainability — template-based per-pair and per-cluster explanations at zero LLM cost
- Streaming / CDC mode — incremental record matching with micro-batch or immediate processing
- Multi-table graph ER — match across entity types with cross-relationship evidence propagation
- 7 domain packs — pre-built YAML rulebooks for electronics, software, healthcare, financial, real estate, people, retail
- Evaluation CLI -- `goldenmatch evaluate` reports precision/recall/F1 against ground truth CSV
- Incremental matching -- `goldenmatch incremental` matches new CSV records against an existing base dataset
- GitHub Actions "Try It" -- zero-install demo via `workflow_dispatch` (paste a CSV URL, get results)
- Codespaces ready -- one-click dev environment with `.devcontainer` config
- Ray distributed backend -- scale to 10M+ records with `pip install goldenmatch[ray]` and `--backend ray`. Zero config locally, Ray cluster for 50M+
- Ground truth builder -- `goldenmatch label` shows pairs interactively, type y/n/s to build ground truth CSV for accuracy measurement
- dbt integration -- `dbt-goldenmatch` package for DuckDB-based entity resolution in dbt pipelines
Installation
pip install goldenmatch # core (files only)
pip install goldenmatch[embeddings] # + sentence-transformers, FAISS
pip install goldenmatch[llm] # + Claude/OpenAI for LLM boost
pip install goldenmatch[postgres] # + Postgres database sync
pip install goldenmatch[snowflake] # + Snowflake connector
pip install goldenmatch[bigquery] # + BigQuery connector
pip install goldenmatch[databricks] # + Databricks connector
pip install goldenmatch[salesforce] # + Salesforce connector
pip install goldenmatch[duckdb] # + DuckDB backend
# Run the setup wizard to configure GPU, API keys, and database:
goldenmatch setup
Setup Wizard
Run `goldenmatch setup` for an interactive walkthrough. It guides you through GPU mode selection, Vertex AI / Colab / local GPU configuration, LLM boost API keys, and database sync, with copy-paste commands at every step.
Benchmarks (v0.6.0)
Tested on Leipzig benchmark datasets (DBLP-ACM, Abt-Buy, Amazon-Google).
Accuracy
| Dataset | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|
| DBLP-ACM (bibliographic) | Weighted fuzzy | 97.2% | 97.1% | 97.2% | $0 |
| DBLP-ACM | Fellegi-Sunter (opt-in) | 98.8% | 57.6% | 72.8% | $0 |
| DBLP-ACM | Learned blocking | 97.6% | 96.3% | 96.9% | $0 |
| Abt-Buy (product) | Embedding + ANN | 35.5% | 59.4% | 44.5% | $0 |
| Abt-Buy | Model extraction + emb | 39.3% | 71.0% | 50.6% | $0 |
| Abt-Buy | Domain + emb + LLM | 94.8% | 58.3% | 72.2% | $0.04 |
| Amazon-Google (software) | emb+ANN + LLM | 63.3% | 35.2% | 45.3% | $0.02 |
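The precision, recall, and F1 columns above are pairwise metrics. As a minimal sketch (not GoldenMatch's own evaluation code), they can be computed from a set of predicted match pairs and a set of ground-truth pairs. The function names and the sample record IDs below are hypothetical.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pair_metrics(predicted, truth):
    """Return (precision, recall, f1) for predicted vs. ground-truth pairs."""
    predicted, truth = normalize(predicted), normalize(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Made-up record IDs, for illustration only.
predicted = [(1, 2), (3, 2), (4, 5), (6, 7)]
truth = [(2, 1), (2, 3), (8, 9)]
precision, recall, f1 = pair_metrics(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Recompute F1 for the Fellegi-Sunter row (98.8% precision, 57.6% recall).
print(f"{f1_score(0.988, 0.576):.3f}")
```

The last line prints 0.728, which agrees with the 72.8% F1 reported for that row.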
PPRL (Privacy-Preserving Record Linkage)
Benchmarked on FEBRL4 (5K vs 5K synthetic person records) and NCVR (North Carolina Voter Registration):
| Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|
| Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| PPRL auto-config (FEBRL4) | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| PPRL auto-config (NCVR) | 64.0% | 93.8% | 76.1% | Per-field HMAC |
| PPRL paranoid (FEBRL4) | 98.9% | 76.0% | 86.0% | HMAC + balanced |
PPRL with auto-configuration beats manual tuning on both datasets. Zero-config: GoldenMatch profiles your data and picks optimal fields, bloom filter parameters, and threshold automatically.
Speed
| Records | Time | Throughput | Memory |
|---|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s | 101 MB |
| 10,000 | 1.67s | 5,975 rec/s | 123 MB |
| 100,000 | 12.78s | 7,823 rec/s | 546 MB |
Measured on a laptop (Windows 11, Python 3.12, 16GB RAM) with fuzzy + exact + golden record pipeline.
Quick Start
Zero-Config (no YAML needed)
goldenmatch dedupe customers.csv
Auto-detects column types (name, email, phone, zip, address, description), assigns appropriate scorers, picks blocking strategy, and launches the TUI for review.
With Config
goldenmatch dedupe customers.csv --config config.yaml --output-all --output-dir results/
Match Mode
goldenmatch match targets.csv --against reference.csv --config config.yaml --output-all
Database Sync
# First run: full scan, create metadata tables
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml
# Subsequent runs: incremental (only new records)
goldenmatch sync --table customers --connection-string "$DATABASE_URL"
How It Works
Files/DB → Ingest → Standardize → Block → Score → Cluster → Golden Records → Output
Block: SQL blocking, ANN blocking, 7 strategies
Score: 10 scorers, ensemble, embeddings, parallel blocks
Pipeline:
- Ingest — CSV, Excel, Parquet, or Postgres table
- Standardize — configurable per-column transforms
- Block — reduce comparison space (multi-pass, ANN, canopy, etc.)
- Score — compare record pairs with appropriate scorer
- Cluster — group matches via Union-Find
- Golden — merge each cluster into one canonical record
- Output — files (CSV/Parquet) or database tables
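The Cluster and Golden steps can be sketched as follows. This is an illustrative sketch, not GoldenMatch's implementation: the `UnionFind` class, `cluster`, and `majority_vote` helpers, the sample scores, and the record contents are all assumptions made for the example. Only the use of Union-Find and the `majority_vote` strategy name come from this README.

```python
from collections import Counter, defaultdict


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster(scored_pairs, threshold):
    """Group record IDs connected by pairs scoring at or above threshold."""
    uf = UnionFind()
    for a, b, score in scored_pairs:
        uf.find(a)  # register both IDs so singletons are reported too
        uf.find(b)
        if score >= threshold:
            uf.union(a, b)
    groups = defaultdict(list)
    for record_id in list(uf.parent):
        groups[uf.find(record_id)].append(record_id)
    return sorted(sorted(members) for members in groups.values())


def majority_vote(records, field):
    """Most common non-empty value for a field across a cluster."""
    values = [r[field] for r in records if r.get(field)]
    return Counter(values).most_common(1)[0][0] if values else None


# Hypothetical scored pairs: (left_id, right_id, score).
pairs = [("a", "b", 0.92), ("b", "c", 0.88), ("c", "d", 0.40), ("e", "f", 0.97)]
clusters = cluster(pairs, threshold=0.85)
print(clusters)

records = {
    "a": {"email": "ann@example.com"},
    "b": {"email": "ann@example.com"},
    "c": {"email": "a.lee@example.com"},
}
golden_email = majority_vote([records[i] for i in clusters[0]], "email")
print(golden_email)
```

Because Union-Find merges transitively, `a` and `c` end up in one cluster even though they were never compared directly; that is why weakly connected clusters are worth flagging.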
Config Reference
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]
  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true # re-score borderline pairs with cross-encoder
    rerank_band: 0.1 # pairs within threshold +/- 0.1 get reranked
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2
  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5} # bias embedding toward title
llm_scorer:
  enabled: true # score borderline pairs with GPT/Claude
  auto_threshold: 0.95 # auto-accept pairs above this
  candidate_lo: 0.75 # LLM scores pairs in [0.75, 0.95]
  # provider: openai # auto-detected from OPENAI_API_KEY
  # model: gpt-4o-mini # default, cheapest option
blocking:
  strategy: adaptive # static | adaptive | sorted_neighborhood | multi_pass | ann | ann_pairs | canopy
  auto_select: true # auto-pick best key by histogram analysis
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    first_name: { strategy: source_priority, source_priority: [crm, marketing] }
output:
  directory: ./output
  format: csv
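To make the `weighted` matchkey and the `llm_scorer` band more concrete, here is a rough sketch of how a weighted score might be combined and routed. Several things are assumptions rather than documented behavior: that field scores are combined as a weight-normalized sum, that pairs below `candidate_lo` are rejected, and every function name. The similarity function uses `difflib` from the standard library as a stand-in; it is not Jaro-Winkler.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Similarity in [0, 1] from difflib; a stand-in, NOT Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def exact(a, b):
    """Binary match: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def clean(value):
    """Mirror of `transforms: [lowercase, strip]`."""
    return value.strip().lower()


# (field, scorer, weight, transform), mirroring the fuzzy_name_zip matchkey.
FUZZY_NAME_ZIP = [
    ("first_name", stand_in_similarity, 0.4, clean),
    ("last_name", stand_in_similarity, 0.4, clean),
    ("zip", exact, 0.2, str),
]


def weighted_score(left, right, fields=FUZZY_NAME_ZIP):
    """Weight-normalized sum of per-field scores (assumed combination rule)."""
    total_weight = sum(weight for _, _, weight, _ in fields)
    total = sum(
        weight * scorer(transform(left[name]), transform(right[name]))
        for name, scorer, weight, transform in fields
    )
    return total / total_weight


def route(score, auto_threshold=0.95, candidate_lo=0.75):
    """Decide what happens to a pair given the llm_scorer band."""
    if score >= auto_threshold:
        return "auto_accept"
    if score >= candidate_lo:
        return "llm_review"
    return "reject"  # assumption: pairs below candidate_lo are dropped


# Hypothetical records.
a = {"first_name": " Ann ", "last_name": "Lee", "zip": "10001"}
b = {"first_name": "ann", "last_name": "LEE", "zip": "10001"}
c = {"first_name": "ann", "last_name": "lee", "zip": "94105"}
print(route(weighted_score(a, b)))
print(route(weighted_score(a, c)))
```

With these inputs, the pair that agrees on every field is auto-accepted, while the pair that differs only on `zip` falls into the band that would be sent to the LLM.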
Scorers
| Scorer | Description | Best For |
|---|---|---|
| `exact` | Binary match | Email, phone, ID |
| `jaro_winkler` | Edit distance similarity | Names |
| `levenshtein` | Normalized Levenshtein | General strings |
| `token_sort` | Order-invariant token matching | Names, addresses |
| `soundex_match` | Phonetic match | Names |
| `ensemble` | max(jaro_winkler, token_sort, soundex) | Names with reordering |
| `embedding` | Cosine similarity of sentence embeddings | Semantic matching |
| `record_embedding` | Embed concatenated fields | Cross-field semantic matching |
| `dice` | Dice coefficient on bloom filters | Privacy-preserving matching |
| `jaccard` | Jaccard similarity on bloom filters | Privacy-preserving matching |
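The `dice` and `jaccard` rows refer to set-overlap scores on bloom-filter encodings. The sketch below shows the general technique using only the standard library. The bigram tokenization, filter size, number of hashes, and secret key are assumptions chosen for illustration, not GoldenMatch's actual parameters.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lowercased string."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, key, size=1024, num_hashes=4):
    """Return the set of bit positions a value sets in a keyed bloom filter."""
    bits = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


KEY = b"shared-secret"  # hypothetical key agreed on by both parties
smith = bloom_encode("Smith", KEY)
smyth = bloom_encode("Smyth", KEY)
jones = bloom_encode("Jones", KEY)

print(f"smith vs smyth: dice={dice(smith, smyth):.2f}")
print(f"smith vs jones: dice={dice(smith, jones):.2f}")
```

Both parties only exchange bit positions, never the raw strings, yet similar names still share most of their bits because they share most of their bigrams.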
Blocking Strategies
| Strategy | Description |
|---|---|
| `static` | Group by blocking key (default) |
| `adaptive` | Static + recursive sub-blocking for oversized blocks |
| `sorted_neighborhood` | Sliding window over sorted records |
| `multi_pass` | Union of blocks from multiple passes (best for noisy data) |
| `ann` | ANN via FAISS on sentence-transformer embeddings |
| `ann_pairs` | Direct-pair ANN scoring (50-100x faster than ann) |
| `canopy` | TF-IDF canopy clustering |
| `learned` | Data-driven predicate selection (auto-discovers blocking rules) |
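As an illustration of the `sorted_neighborhood` idea (a sliding window over sorted records), the sketch below generates candidate pairs by comparing each record only with its next `window - 1` neighbors in sort order. The function name, record layout, and window size are hypothetical; this is not GoldenMatch's code.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs from a sliding window over records sorted by key."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        # Compare each record with the next (window - 1) records only.
        for right in ordered[i + 1:i + window]:
            pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical records.
records = [
    {"id": 1, "last_name": "Smith"},
    {"id": 2, "last_name": "Smyth"},
    {"id": 3, "last_name": "Jones"},
    {"id": 4, "last_name": "Smithe"},
    {"id": 5, "last_name": "Johnson"},
]


def by_last_name(record):
    return record["last_name"].lower()


narrow = sorted_neighborhood_pairs(records, by_last_name, window=2)
wide = sorted_neighborhood_pairs(records, by_last_name, window=3)
print(len(narrow), len(wide))
```

For five records, a full comparison would need 10 pairs; a window of 2 yields 4 and a window of 3 yields 7, which is the trade-off between recall and comparison cost that the window size controls.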
Database Integration
GoldenMatch can sync against live Postgres databases with incremental matching:
pip install goldenmatch[postgres]
goldenmatch sync \
--table customers \
--connection-string "postgresql://user:pass@localhost/mydb" \
--config config.yaml
Features:
- Incremental sync — only processes records added since last run
- Hybrid blocking — SQL WHERE clauses for exact fields + FAISS ANN for semantic fields, results unioned
- Persistent ANN index — disk cache + DB source of truth, progressive embedding across runs
- Golden record versioning -- append-only with `is_current` flag, full audit trail
- Cluster management -- persistent clusters with merge, conflict detection, max size safety cap
Metadata tables (auto-created):
| Table | Purpose |
|---|---|
| `gm_state` | Processing state, watermarks |
| `gm_clusters` | Persistent cluster membership |
| `gm_golden_records` | Versioned golden records |
| `gm_embeddings` | Cached embeddings for ANN |
| `gm_match_log` | Audit trail of all match decisions |
LLM Boost (Optional)
For harder datasets where zero-shot scoring isn't enough:
pip install goldenmatch[llm]
# First run: LLM labels ~300 pairs (~$0.30), fine-tunes embedding model
goldenmatch dedupe products.csv --llm-boost
# Subsequent runs: uses saved model ($0)
goldenmatch dedupe products.csv --llm-boost
Tiered auto-escalation:
- Level 1 — zero-shot (free, instant)
- Level 2 — bi-encoder fine-tuning (~$0.20, ~2 min CPU)
- Level 3 — Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)
Active sampling selects the most informative pairs for the LLM to label (uncertainty, disagreement, boundary, diversity), reducing label cost by ~45% compared to random sampling.
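Of the sampling criteria listed above, uncertainty sampling is the simplest to illustrate: send the LLM the pairs whose scores sit closest to the decision threshold. The sketch below covers only that one criterion and is an assumption about the general technique, not GoldenMatch's sampler. The function name, threshold, and scores are hypothetical.

```python
def select_for_labeling(scored_pairs, threshold=0.85, budget=10):
    """Pick the `budget` pairs whose scores lie closest to the threshold."""
    ranked = sorted(scored_pairs, key=lambda pair: abs(pair[2] - threshold))
    return ranked[:budget]


# Hypothetical scored pairs: (left_id, right_id, score).
scored = [
    ("a", "b", 0.99),
    ("c", "d", 0.86),
    ("e", "f", 0.40),
    ("g", "h", 0.83),
    ("i", "j", 0.70),
]
to_label = select_for_labeling(scored, threshold=0.85, budget=2)
print(to_label)
```

Pairs scoring 0.99 or 0.40 are already confidently classified, so labeling them would teach a model little; the two pairs near 0.85 are where a label changes the outcome.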
Note: LLM boost is most valuable for product matching with local models (MiniLM) where it improved Abt-Buy from 44.5% to 59.5% F1. For structured data (names, addresses, bibliographic), fuzzy matching alone achieves 97%+ F1.
Benchmarks
Leipzig Entity Resolution Benchmarks
| Dataset | Best Strategy | F1 | Cost |
|---|---|---|---|
| DBLP-ACM (2.6K vs 2.3K) | multi-pass + fuzzy | 97.2% | $0 |
| DBLP-Scholar (2.6K vs 64K) | multi-pass + fuzzy | 74.7% | $0 |
| Abt-Buy (1K vs 1K) | Vertex AI + GPT-4o-mini scorer | 81.7% | ~$0.74 |
| Abt-Buy (zero-shot) | Vertex AI embeddings | 62.8% | ~$0.05 |
| Amazon-Google (1.4K vs 3.2K) | Vertex AI + reranking | 44.0% | ~$0.10 |
Structured data (names, addresses, bibliographic): RapidFuzz multi-pass fuzzy matching at 97.2% — zero cost, zero labels. Product matching: Vertex AI embeddings for candidate generation + GPT-4o-mini scorer for borderline pairs achieves 81.7% at ~$0.74 total cost.
Throughput (Scale Curve)
Measured on a laptop (17GB RAM) with exact + fuzzy matching, blocking, clustering, and golden record generation:
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.2s | 5,500 rec/s | 210 | 101 MB |
| 10,000 | 1.4s | 7,300 rec/s | 7,000 | 123 MB |
| 100,000 | 12s | 8,200 rec/s | 571,000 | 544 MB |
Fuzzy matching speedup: Parallel block scoring and intra-field early termination cut 100K-record fuzzy matching from ~100s to ~39s (about 2.5x) end to end. The 1M-record exact-only benchmark runs in 7.8s.
For datasets over 1M records, use goldenmatch sync (database mode) with incremental matching and persistent ANN indexing. See Large Dataset Mode.
How GoldenMatch Compares
| | GoldenMatch | dedupe | Splink | Zingg | Ditto |
|---|---|---|---|---|---|
| Abt-Buy F1 | 81.7% | ~75% | ~70% | ~80% | 89.3% |
| DBLP-ACM F1 | 97.2% | ~96% | ~95% | ~96% | 99.0% |
| Training required | No | Yes | Yes | Yes | Yes (1000+) |
| Zero-config | Yes | No | No | No | No |
| Interactive TUI | Yes | No | No | No | No |
| Database sync | Postgres | Cloud (paid) | No | No | No |
| REST API / MCP | Both | Cloud only | No | No | No |
| GPU required | No | No | No | Spark | Yes |
GoldenMatch's sweet spot is ease of use + competitive accuracy. On bibliographic matching (DBLP-ACM), GoldenMatch hits 97.2% with zero config. On product matching (Abt-Buy), the LLM scorer reaches 81.7% — within 8pts of Ditto's 89.3%, but with zero training labels and no GPU. Ditto requires 1000+ hand-labeled pairs and a GPU.
Large Dataset Mode
For datasets over 1M records, use database sync mode. GoldenMatch processes records in chunks, maintains a persistent ANN index, and matches incrementally:
# Load into Postgres, then sync
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml
# Watch for new records continuously
goldenmatch watch --table customers --connection-string "$DATABASE_URL" --interval 30
How it works:
- Reads in configurable chunks (default 10K) — never loads entire table into memory
- Hybrid blocking: SQL WHERE for exact fields + persistent FAISS ANN for semantic fields
- Progressive embedding: computes 100K embeddings per run, ANN improves over time
- Persistent clusters with golden record versioning
Scale: Tested to 10M+ records in Postgres. For 100M+, use larger chunk sizes and dedicated Postgres infrastructure.
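The chunked reading described above can be sketched with a small generator that never materializes more than one chunk at a time. This is a generic illustration, not GoldenMatch's reader: the `iter_chunks` name is hypothetical, and an in-memory CSV stands in for a Postgres table or a large file.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size=10_000):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


# 25 synthetic rows stand in for a large table or file.
data = "id,name\n" + "".join(f"{i},name{i}\n" for i in range(25))
reader = csv.DictReader(io.StringIO(data))
chunk_sizes = [len(chunk) for chunk in iter_chunks(reader, chunk_size=10)]
print(chunk_sizes)
```

Each chunk can be blocked, scored, and merged into the persistent clusters before the next one is read, so peak memory depends on the chunk size rather than the table size.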
Interactive TUI
GoldenMatch includes a gold-themed interactive terminal UI:
- Auto-config summary — first screen shows detected columns, scorers, and blocking strategy with Run/Edit/Save options
- Pipeline progress — full-screen progress with stage tracker (✓/●/○) on first run, footer bar on re-runs
- Split-view matches — cluster list on the left, golden record + member details on the right
- Live threshold slider — arrow keys adjust threshold in 0.05 increments with instant cluster count preview
- Keyboard shortcuts -- `1`-`6` jump to tabs (Data, Config, Matches, Golden, Boost, Export), `F5` run, `?` show all shortcuts, `Ctrl+E` export
Screenshots (not reproduced here): data profiling, match results with cluster detail, golden records.
Settings Persistence
GoldenMatch saves preferences across sessions:
- Global: `~/.goldenmatch/settings.yaml` -- output mode, default model, API keys
- Project: `.goldenmatch.yaml` -- column mappings, thresholds, blocking config
Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.
CLI Reference
| Command | Description |
|---|---|
| `goldenmatch demo` | Built-in demo with sample data |
| `goldenmatch setup` | Interactive setup wizard (GPU, API keys, database) |
| `goldenmatch dedupe FILE [...]` | Deduplicate one or more files |
| `goldenmatch match TARGET --against REF` | Match target against reference |
| `goldenmatch sync --table TABLE` | Sync against Postgres database |
| `goldenmatch watch --table TABLE` | Live stream mode (continuous polling, `--daemon` for service mode) |
| `goldenmatch schedule --every 1h FILE` | Run deduplication on a schedule |
| `goldenmatch serve FILE [...]` | Start REST API server |
| `goldenmatch mcp-serve FILE [...]` | Start MCP server (Claude Desktop) |
| `goldenmatch rollback RUN_ID` | Undo a previous merge run |
| `goldenmatch unmerge RECORD_ID` | Remove a record from its cluster |
| `goldenmatch runs` | List previous runs for rollback |
| `goldenmatch init` | Interactive config wizard |
| `goldenmatch interactive FILE [...]` | Launch TUI |
| `goldenmatch profile FILE` | Profile data quality |
| `goldenmatch evaluate FILE --gt GT.csv` | Evaluate matching against ground truth |
| `goldenmatch incremental BASE --new NEW` | Match new records against existing base |
| `goldenmatch analyze-blocking FILE` | Analyze data and suggest blocking strategies |
| `goldenmatch label FILE --config --gt` | Interactively label pairs to build ground truth CSV |
| `goldenmatch config save/load/list/show` | Manage config presets |
Key dedupe flags:
| Flag | Description |
|---|---|
| `--anomalies` | Detect fake emails, placeholder data, suspicious records |
| `--preview` | Show what will change before writing (merge preview) |
| `--diff` / `--diff-html` | Generate before/after change report |
| `--dashboard` | Before/after data quality dashboard (HTML) |
| `--html-report` | Detailed match report with charts |
| `--chunked` | Large dataset mode (process in chunks) |
| `--llm-boost` | Improve accuracy with LLM-labeled training |
| `--daemon` | Run watch mode as a background service with health endpoint |
| `s3://` / `gs://` / `az://` | Read directly from cloud storage |
Architecture
goldenmatch/
├── cli/ # 21 CLI commands (Typer)
├── config/ # Pydantic schemas, YAML loader, settings
├── core/ # Pipeline: ingest, block, score, cluster, golden, explainer,
│ # report, dashboard, graph, anomaly, diff, rollback,
│ # schema_match, chunked, cloud_ingest, api_connector, scheduler,
│ # llm_scorer, lineage, match_one, evaluate, gpu, vertex_embedder,
│ # probabilistic, learned_blocking, streaming, graph_er, domain
├── domains/ # 7 built-in YAML domain packs (electronics, software, healthcare, ...)
├── plugins/ # Plugin system (scorers, transforms, connectors, golden strategies)
├── connectors/ # Enterprise connectors (Snowflake, Databricks, BigQuery, HubSpot, Salesforce)
├── backends/ # DuckDB backend for out-of-core processing
├── db/ # Postgres: connector, sync, reconcile, clusters, ANN index
├── api/ # REST API server
├── mcp/ # MCP server for Claude Desktop
├── tui/ # Gold-themed Textual TUI + setup wizard
└── utils/ # Transforms, helpers
Run tests: pytest (911 tests)
License
MIT