# GoldenMatch

Entity resolution toolkit — deduplicate records, match across sources, and maintain golden records. Works on files or live databases.

Built with Polars, RapidFuzz, sentence-transformers, and FAISS. Zero-config mode auto-detects your data; optional LLM boost for harder datasets.

## See it in action

```bash
pip install goldenmatch
goldenmatch demo
```
## Features

- Zero-config — `goldenmatch dedupe file.csv` auto-detects columns, picks scorers, and shows an auto-config summary
- Gold-themed TUI — professional interactive interface with keyboard shortcuts, live threshold tuning, split-view results
- 8 scoring methods — exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding
- 7 blocking strategies — static, adaptive, sorted neighborhood, multi-pass, ANN, ann_pairs, canopy
- Database sync — incremental matching against Postgres with a persistent ANN index and golden record versioning
- LLM boost — optional Claude/GPT-4 labeling plus sentence-transformer fine-tuning for harder datasets
- Golden records — 5 merge strategies (most_complete, majority_vote, source_priority, most_recent, first_non_null)
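As an illustration of golden-record merging, here is one plausible reading of a completeness-ranked merge — a blend of the `most_complete` and `first_non_null` strategies named above, not goldenmatch's actual code:

```python
def most_complete(cluster):
    """Rank cluster members by how many non-empty fields they carry,
    then take each field from the most complete member that has a value."""
    def completeness(rec):
        return sum(v not in (None, "") for v in rec.values())
    ranked = sorted(cluster, key=completeness, reverse=True)
    golden = {}
    for field in ranked[0]:
        # first non-empty value walking down the completeness ranking
        golden[field] = next((r[field] for r in ranked if r.get(field)), None)
    return golden

cluster = [
    {"name": "Ann Lee", "email": None, "phone": "555-0100"},
    {"name": "A. Lee", "email": "ann@x.com", "phone": None},
]
golden = most_complete(cluster)
```

Field-level rules (see the config reference below) would simply dispatch a different merge function per field.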
## Installation

```bash
pip install goldenmatch             # core (files only)
pip install goldenmatch[embeddings] # + sentence-transformers, FAISS
pip install goldenmatch[llm]        # + Claude/OpenAI for LLM boost
pip install goldenmatch[postgres]   # + Postgres database sync

# Run the setup wizard to configure GPU, API keys, and database:
goldenmatch setup
```
## Setup Wizard

Run `goldenmatch setup` for an interactive walkthrough. It guides you through GPU mode selection, Vertex AI / Colab / local GPU configuration, LLM boost API keys, and database sync — with copy-paste commands at every step.
## Quick Start

### Zero-Config (no YAML needed)

```bash
goldenmatch dedupe customers.csv
```
Auto-detects column types (name, email, phone, zip, address, description), assigns appropriate scorers, picks blocking strategy, and launches the TUI for review.
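Column-type detection can be imagined as heuristics over sampled values. This is a hedged sketch with made-up regex patterns, not goldenmatch's actual detector:

```python
import re

# Illustrative patterns only — goldenmatch's real detector is not documented here.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\-\s()]{7,}$"),
    "zip":   re.compile(r"^\d{5}(-\d{4})?$"),
}

def detect_type(samples):
    """Return the first pattern that every non-empty sample matches."""
    for name, pat in PATTERNS.items():
        if all(pat.match(v) for v in samples if v):
            return name
    return "name"  # fallback: treat as free-text

print(detect_type(["a@b.com", "c@d.org"]))   # email
print(detect_type(["94110", "02139"]))       # zip
print(detect_type(["555-123-4567"]))         # phone
```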
### With Config

```bash
goldenmatch dedupe customers.csv --config config.yaml --output-all --output-dir results/
```

### Match Mode

```bash
goldenmatch match targets.csv --against reference.csv --config config.yaml --output-all
```

### Database Sync

```bash
# First run: full scan, create metadata tables
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Subsequent runs: incremental (only new records)
goldenmatch sync --table customers --connection-string "$DATABASE_URL"
```
## How It Works

```text
Files/DB → Ingest → Standardize → Block → Score → Cluster → Golden Records → Output
                                    ↑       ↑
                        SQL blocking        8 scorers
                        ANN blocking        ensemble
                        7 strategies        embeddings
```
Pipeline:
- Ingest — CSV, Excel, Parquet, or Postgres table
- Standardize — configurable per-column transforms
- Block — reduce comparison space (multi-pass, ANN, canopy, etc.)
- Score — compare record pairs with appropriate scorer
- Cluster — group matches via Union-Find
- Golden — merge each cluster into one canonical record
- Output — files (CSV/Parquet) or database tables
## Config Reference

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]
  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: multi_pass  # static | adaptive | sorted_neighborhood | multi_pass | ann | ann_pairs | canopy
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]

golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    first_name: { strategy: source_priority, source_priority: [crm, marketing] }

output:
  directory: ./output
  format: csv
```
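A weighted matchkey like `fuzzy_name_zip` above combines per-field similarities into one score that is compared against the key's threshold. A sketch of that arithmetic:

```python
def weighted_score(field_scores, weights):
    """Weighted sum of per-field similarity scores (each in [0, 1])."""
    return sum(s * w for s, w in zip(field_scores, weights))

# first_name, last_name, zip similarities from the configured scorers
scores = [0.95, 0.90, 1.0]
weights = [0.4, 0.4, 0.2]      # from the config above

total = weighted_score(scores, weights)   # 0.94
print(total >= 0.85)                      # above threshold: a match
```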
## Scorers

| Scorer | Description | Best For |
|---|---|---|
| `exact` | Binary match | Email, phone, ID |
| `jaro_winkler` | Edit-distance similarity | Names |
| `levenshtein` | Normalized Levenshtein | General strings |
| `token_sort` | Order-invariant token matching | Names, addresses |
| `soundex_match` | Phonetic match | Names |
| `ensemble` | max(jaro_winkler, token_sort, soundex) | Names with reordering |
| `embedding` | Cosine similarity of sentence embeddings | Semantic matching |
| `record_embedding` | Embed concatenated fields | Cross-field semantic matching |
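The `soundex_match` scorer compares phonetic codes. Here is a minimal stdlib sketch of the classic American Soundex algorithm (goldenmatch's own implementation may differ):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter + up to 3 consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":  # h/w do not separate letters with the same code
            prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]

print(soundex("Robert"))                      # R163
print(soundex("Smith") == soundex("Smyth"))   # True — phonetic match
```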
## Blocking Strategies

| Strategy | Description |
|---|---|
| `static` | Group by blocking key (default) |
| `adaptive` | Static + recursive sub-blocking for oversized blocks |
| `sorted_neighborhood` | Sliding window over sorted records |
| `multi_pass` | Union of blocks from multiple passes (best for noisy data) |
| `ann` | ANN via FAISS on sentence-transformer embeddings |
| `ann_pairs` | Direct-pair ANN scoring (50-100x faster than ann) |
| `canopy` | TF-IDF canopy clustering |
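As an illustration of how blocking cuts the comparison space, here is a minimal sorted-neighborhood sketch (function name and window default are illustrative, not goldenmatch's API):

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by a blocking key, then compare only records that fall
    within a sliding window of each other."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))
    return pairs

records = ["ann", "anna", "bob", "bobby", "cyd", "dan"]
cand = sorted_neighborhood_pairs(records, key=lambda r: r, window=3)
print(len(cand))   # 9 candidate pairs instead of 15 exhaustive ones
```

Similar records sort near each other, so near-duplicates like "ann"/"anna" land in the same window while distant pairs are never scored.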
## Database Integration

GoldenMatch can sync against live Postgres databases with incremental matching:

```bash
pip install goldenmatch[postgres]

goldenmatch sync \
  --table customers \
  --connection-string "postgresql://user:pass@localhost/mydb" \
  --config config.yaml
```

Features:
- Incremental sync — only processes records added since the last run
- Hybrid blocking — SQL WHERE clauses for exact fields plus FAISS ANN for semantic fields, results unioned
- Persistent ANN index — disk cache with the database as source of truth, progressive embedding across runs
- Golden record versioning — append-only with an `is_current` flag and a full audit trail
- Cluster management — persistent clusters with merge, conflict detection, and a max-size safety cap
Metadata tables (auto-created):

| Table | Purpose |
|---|---|
| `gm_state` | Processing state, watermarks |
| `gm_clusters` | Persistent cluster membership |
| `gm_golden_records` | Versioned golden records |
| `gm_embeddings` | Cached embeddings for ANN |
| `gm_match_log` | Audit trail of all match decisions |
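The watermark in `gm_state` is what makes sync incremental: each run reads only rows past the stored watermark, then advances it. A hedged sketch using stdlib `sqlite3` in place of Postgres, with deliberately simplified table shapes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE gm_state (watermark INTEGER)")  # simplified schema
con.execute("INSERT INTO gm_state VALUES (0)")
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

def sync_once(con):
    """Read only rows past the watermark, then advance it."""
    (wm,) = con.execute("SELECT watermark FROM gm_state").fetchone()
    new = con.execute(
        "SELECT id, name FROM customers WHERE id > ? ORDER BY id", (wm,)
    ).fetchall()
    if new:
        con.execute("UPDATE gm_state SET watermark = ?", (new[-1][0],))
    return new

first = sync_once(con)                           # full scan: both rows
con.execute("INSERT INTO customers VALUES (3, 'Cyd')")
second = sync_once(con)                          # incremental: only the new row
```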
## LLM Boost (Optional)

For harder datasets where zero-shot scoring isn't enough:

```bash
pip install goldenmatch[llm]

# First run: LLM labels ~300 pairs (~$0.30), fine-tunes embedding model
goldenmatch dedupe products.csv --llm-boost

# Subsequent runs: uses saved model ($0)
goldenmatch dedupe products.csv --llm-boost
```
Tiered auto-escalation:
- Level 1 — zero-shot (free, instant)
- Level 2 — bi-encoder fine-tuning (~$0.20, ~2 min CPU)
- Level 3 — Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)
Best result: Abt-Buy 59.5% F1 (up from 44.5% zero-shot) with 300 LLM labels and optimal train/score split.
## Benchmarks

### Leipzig Entity Resolution Benchmarks
| Dataset | Best Strategy | F1 | Time |
|---|---|---|---|
| DBLP-ACM (2.6K vs 2.3K) | Vertex AI embeddings | 97.4% | 119s |
| DBLP-Scholar (2.6K vs 64K) | multi-pass + fuzzy | 74.7% | 83.9s |
| Abt-Buy (1K vs 1K) | Vertex AI embeddings | 84.7% | 53s |
| Amazon-Google (1.4K vs 3.2K) | Vertex AI embeddings | 58.6% | 110s |
Previous best without Vertex AI: Abt-Buy 59.5% (LLM boost), Amazon-Google 40.5% (rec_emb). Vertex AI's text-embedding-004 model provides dramatically better embeddings with no local GPU needed.
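The F1 figures above are pairwise: the harmonic mean of precision and recall over predicted versus true matching pairs. A small sketch of the computation:

```python
def f1(predicted, truth):
    """Pairwise F1: predicted and truth are sets of matching record-ID pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

pred = {(1, 2), (3, 4), (5, 6)}
true = {(1, 2), (3, 4), (7, 8)}
print(round(f1(pred, true), 3))   # 0.667
```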
### Throughput (Scale Curve)

Measured on a laptop (17 GB RAM) with exact + fuzzy matching, blocking, clustering, and golden record generation:
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.2s | 5,500 rec/s | 210 | 101 MB |
| 10,000 | 1.4s | 7,300 rec/s | 7,000 | 123 MB |
| 100,000 | 12s | 8,200 rec/s | 571,000 | 544 MB |
For datasets over 1M records, use `goldenmatch sync` (database mode) with incremental matching and persistent ANN indexing. See Large Dataset Mode.
## How GoldenMatch Compares

| | GoldenMatch | dedupe | Splink | Zingg | Ditto |
|---|---|---|---|---|---|
| Abt-Buy F1 | 84.7% | ~75% | ~70% | ~80% | 89.3% |
| DBLP-ACM F1 | 97.4% | ~96% | ~95% | ~96% | 99.0% |
| Training required | No | Yes | Yes | Yes | Yes (1000+) |
| Zero-config | Yes | No | No | No | No |
| Interactive TUI | Yes | No | No | No | No |
| Database sync | Postgres | Cloud (paid) | No | No | No |
| REST API / MCP | Both | Cloud only | No | No | No |
| GPU required | No | No | No | Spark | Yes |
GoldenMatch's sweet spot is ease of use plus competitive accuracy. Ditto has higher F1 but requires 1000+ manual labels and a GPU. Splink scales to billions on Spark but needs label training. GoldenMatch auto-configures from your data and reaches ~85% F1 on Abt-Buy with zero labels.
## Large Dataset Mode

For datasets over 1M records, use database sync mode. GoldenMatch processes records in chunks, maintains a persistent ANN index, and matches incrementally:

```bash
# Load into Postgres, then sync
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Watch for new records continuously
goldenmatch watch --table customers --connection-string "$DATABASE_URL" --interval 30
```
How it works:
- Reads in configurable chunks (default 10K) — never loads entire table into memory
- Hybrid blocking: SQL WHERE for exact fields + persistent FAISS ANN for semantic fields
- Progressive embedding: computes 100K embeddings per run, ANN improves over time
- Persistent clusters with golden record versioning
Scale: Tested to 10M+ records in Postgres. For 100M+, use larger chunk sizes and dedicated Postgres infrastructure.
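Keyset-paginated chunked reads are one way to implement the "never loads the entire table" behavior described above. A hedged sketch — function names and the in-memory stand-in for the database are illustrative:

```python
def chunks(fetch_after, chunk_size=10_000):
    """Yield batches via keyset pagination: each call asks for rows whose
    id is greater than the last id seen, so memory stays bounded."""
    last_id = 0
    while True:
        batch = fetch_after(last_id, chunk_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1][0]

# Stand-in for a database table of (id, payload) rows
rows = [(i, f"rec{i}") for i in range(1, 25_001)]

def fetch_after(last_id, n):
    return [r for r in rows if r[0] > last_id][:n]

batches = list(chunks(fetch_after))
print(len(batches))   # 3 batches: 10,000 + 10,000 + 5,000 rows
```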
## Interactive TUI

GoldenMatch includes a gold-themed interactive terminal UI:

- Auto-config summary — the first screen shows detected columns, scorers, and blocking strategy with Run/Edit/Save options
- Pipeline progress — full-screen progress with a stage tracker (✓/●/○) on first run, footer bar on re-runs
- Split-view matches — cluster list on the left, golden record + member details on the right
- Live threshold slider — arrow keys adjust the threshold in 0.05 increments with an instant cluster count preview
- Keyboard shortcuts — `1`-`5` jump to tabs, `F5` runs, `?` shows all shortcuts, `Ctrl+E` exports

Screenshots (not reproduced here) cover data profiling, match results with cluster detail, and golden records.
## Settings Persistence

GoldenMatch saves preferences across sessions:

- Global: `~/.goldenmatch/settings.yaml` — output mode, default model, API keys
- Project: `.goldenmatch.yaml` — column mappings, thresholds, blocking config

Settings tuned in the TUI can be saved to the project file. The next run picks them up automatically.
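The precedence implied above — project file overrides global — can be sketched as a shallow dictionary merge. File names come from this doc; the keys and values are illustrative:

```python
def merge_settings(global_cfg, project_cfg):
    """Shallow merge: project-level keys win over global defaults."""
    merged = dict(global_cfg)
    merged.update(project_cfg)
    return merged

# Illustrative contents of ~/.goldenmatch/settings.yaml and .goldenmatch.yaml
global_cfg = {"output_mode": "csv", "default_model": "all-MiniLM-L6-v2"}
project_cfg = {"output_mode": "parquet", "threshold": 0.85}

print(merge_settings(global_cfg, project_cfg))
```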
## CLI Reference

| Command | Description |
|---|---|
| `goldenmatch dedupe FILE [...]` | Deduplicate one or more files |
| `goldenmatch match TARGET --against REF` | Match target against reference |
| `goldenmatch sync --table TABLE --connection-string URL` | Sync against database |
| `goldenmatch init` | Interactive config wizard |
| `goldenmatch config save/load/list/delete/show` | Manage config presets |
| `goldenmatch profile FILE` | Profile data quality |
| `goldenmatch interactive FILE [...]` | Launch TUI |
## Architecture

```text
goldenmatch/
├── cli/      # Typer CLI commands (dedupe, match, sync)
├── config/   # Pydantic schemas, YAML loader, settings persistence
├── core/     # Pipeline modules (ingest, block, score, cluster, golden)
├── db/       # Database integration (connector, blocking, sync, reconcile)
├── tui/      # Textual TUI + MatchEngine
└── utils/    # Transforms, helpers
```

605+ tests covering all modules. Run with `pytest`.
## License

MIT