GoldenMatch

Entity resolution toolkit — deduplicate records, match across sources, and maintain golden records. Works on files or live databases.

Built with Polars, RapidFuzz, sentence-transformers, and FAISS. Zero-config mode auto-detects your data; optional LLM boost for harder datasets.

Python 3.11+ | License: MIT

See it in action

pip install goldenmatch
goldenmatch demo

Features

  • Zero-config — goldenmatch dedupe file.csv auto-detects columns, picks scorers, shows auto-config summary
  • Gold-themed TUI — professional interactive interface with keyboard shortcuts, live threshold tuning, split-view results
  • 8 scoring methods — exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding
  • 7 blocking strategies — static, adaptive, sorted neighborhood, multi-pass, ANN, ann_pairs, canopy
  • Database sync — incremental matching against Postgres with persistent ANN index and golden record versioning
  • LLM boost — optional Claude/GPT-4 labeling + sentence-transformer fine-tuning for harder datasets
  • Golden records — 5 merge strategies (most_complete, majority_vote, source_priority, most_recent, first_non_null)
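
The merge strategies named above can be illustrated with a small standalone sketch. This is not GoldenMatch's implementation: the function bodies, the `source` key on each record, and the reading of `most_complete` as "longest non-empty value" are all assumptions made for illustration.

```python
from collections import Counter

def _present(values):
    """Drop None and empty strings."""
    return [v for v in values if v not in (None, "")]

def most_complete(values):
    # Assumption: "most complete" means the longest non-empty value.
    present = _present(values)
    return max(present, key=lambda v: len(str(v))) if present else None

def majority_vote(values):
    present = _present(values)
    return Counter(present).most_common(1)[0][0] if present else None

def first_non_null(values):
    return next(iter(_present(values)), None)

def source_priority(records, field, priority):
    # Walk sources in priority order and take the first non-empty value.
    for source in priority:
        for rec in records:
            if rec["source"] == source and rec.get(field) not in (None, ""):
                return rec[field]
    return None

SIMPLE = {"most_complete": most_complete,
          "majority_vote": majority_vote,
          "first_non_null": first_non_null}

def merge_cluster(records, fields, field_rules, default="most_complete"):
    """Build one golden record from the records in a cluster."""
    golden = {}
    for field in fields:
        rule = field_rules.get(field, {"strategy": default})
        if rule["strategy"] == "source_priority":
            golden[field] = source_priority(records, field, rule["source_priority"])
        else:
            values = [rec.get(field) for rec in records]
            golden[field] = SIMPLE[rule["strategy"]](values)
    return golden

cluster = [
    {"source": "marketing", "first_name": "Jon", "email": "jon@example.com", "phone": ""},
    {"source": "crm", "first_name": "Jonathan", "email": "jon@example.com", "phone": "555-0100"},
    {"source": "web", "first_name": "", "email": "j.smith@example.com", "phone": None},
]
rules = {
    "email": {"strategy": "majority_vote"},
    "first_name": {"strategy": "source_priority", "source_priority": ["crm", "marketing"]},
}
golden = merge_cluster(cluster, ["first_name", "email", "phone"], rules)
print(golden)
```

The `most_recent` strategy is left out because it needs a timestamp column, which this toy data does not have.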

Installation

pip install goldenmatch                    # core (files only)
pip install goldenmatch[embeddings]        # + sentence-transformers, FAISS
pip install goldenmatch[llm]               # + Claude/OpenAI for LLM boost
pip install goldenmatch[postgres]          # + Postgres database sync

# Run the setup wizard to configure GPU, API keys, and database:
goldenmatch setup

Setup Wizard

Run goldenmatch setup for an interactive walkthrough. It guides you through GPU mode selection, Vertex AI / Colab / local GPU configuration, LLM boost API keys, and database sync, with copy-paste commands at every step.

Quick Start

Zero-Config (no YAML needed)

goldenmatch dedupe customers.csv

Auto-detects column types (name, email, phone, zip, address, description), assigns an appropriate scorer to each, picks a blocking strategy, and launches the TUI for review.
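
How column auto-detection might work can be sketched with a few regular expressions. The patterns, the 80% agreement cutoff, and the `detect_column_type` name are hypothetical; the README does not describe GoldenMatch's actual heuristics, and this sketch only covers the types that are easy to recognise by shape.

```python
import re

# Hypothetical patterns; zip is checked before phone because ZIP+4 values
# would also satisfy the looser phone pattern.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "phone": re.compile(r"^\+?[\d\s().-]{7,}$"),
}

def detect_column_type(values, min_share=0.8):
    """Return the first type matched by at least min_share of non-empty values."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "unknown"
    for col_type, pattern in PATTERNS.items():
        hits = sum(1 for v in present if pattern.match(v))
        if hits / len(present) >= min_share:
            return col_type
    return "unknown"

columns = {
    "contact": ["a@x.com", "b@y.org", ""],
    "postcode": ["10001", "94105-1234"],
    "tel": ["(555) 010-0123", "555-010-0199"],
    "full_name": ["Ann Lee", "Bob Ray"],
}
for name, values in columns.items():
    print(name, "->", detect_column_type(values))
```

Free-text types such as name, address, and description cannot be told apart by shape alone, so a real detector would presumably also look at column headers or value statistics.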

With Config

goldenmatch dedupe customers.csv --config config.yaml --output-all --output-dir results/

Match Mode

goldenmatch match targets.csv --against reference.csv --config config.yaml --output-all

Database Sync

# First run: full scan, create metadata tables
goldenmatch sync --table customers --connection-string "$DATABASE_URL" --config config.yaml

# Subsequent runs: incremental (only new records)
goldenmatch sync --table customers --connection-string "$DATABASE_URL"

How It Works

Files/DB → Ingest → Standardize → Block → Score → Cluster → Golden Records → Output
                                     ↑        ↑
                              SQL blocking   8 scorers
                              ANN blocking   ensemble
                              7 strategies   embeddings

Pipeline:

  1. Ingest — CSV, Excel, Parquet, or Postgres table
  2. Standardize — configurable per-column transforms
  3. Block — reduce comparison space (multi-pass, ANN, canopy, etc.)
  4. Score — compare record pairs with appropriate scorer
  5. Cluster — group matches via Union-Find
  6. Golden — merge each cluster into one canonical record
  7. Output — files (CSV/Parquet) or database tables
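
The clustering step can be sketched as follows: every scored pair at or above the threshold joins two records, and Union-Find collects the connected groups. The `UnionFind` class and `cluster_pairs` function are illustrative names, not GoldenMatch's internal API.

```python
class UnionFind:
    """Minimal disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def cluster_pairs(scored_pairs, threshold):
    """Group record ids into clusters from (id_a, id_b, score) triples."""
    uf = UnionFind()
    for a, b, score in scored_pairs:
        uf.find(a)  # register both ids so unmatched records become singletons
        uf.find(b)
        if score >= threshold:
            uf.union(a, b)
    groups = {}
    for rec in list(uf.parent):
        groups.setdefault(uf.find(rec), []).append(rec)
    return sorted(sorted(members) for members in groups.values())

pairs = [(1, 2, 0.90), (2, 3, 0.88), (4, 5, 0.50)]
clusters = cluster_pairs(pairs, threshold=0.85)
print(clusters)
```

Records 1 and 3 are never compared directly, yet they land in the same cluster because both match record 2. This transitive behaviour is why a maximum cluster size cap is useful in practice.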

Config Reference

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: multi_pass  # static | adaptive | sorted_neighborhood | multi_pass | ann | ann_pairs | canopy
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]

golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    first_name: { strategy: source_priority, source_priority: [crm, marketing] }

output:
  directory: ./output
  format: csv
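
To make the weighted matchkey concrete, here is a minimal sketch of how the fuzzy_name_zip score could be computed. It assumes the score is the weight-normalised sum of per-field scores, which the README does not state, and it uses difflib's ratio as a stand-in for jaro_winkler so the example needs only the standard library.

```python
from difflib import SequenceMatcher

def stand_in_similarity(a, b):
    # Stand-in for jaro_winkler: difflib's ratio, NOT the same metric.
    return SequenceMatcher(None, a, b).ratio()

def exact(a, b):
    return 1.0 if a == b else 0.0

TRANSFORMS = {"lowercase": str.lower, "strip": str.strip}

# Mirrors the fuzzy_name_zip matchkey from the config above.
FIELDS = [
    {"field": "first_name", "scorer": stand_in_similarity, "weight": 0.4,
     "transforms": ["lowercase", "strip"]},
    {"field": "last_name", "scorer": stand_in_similarity, "weight": 0.4,
     "transforms": ["lowercase", "strip"]},
    {"field": "zip", "scorer": exact, "weight": 0.2, "transforms": []},
]

def weighted_score(rec_a, rec_b, fields):
    """Weight-normalised sum of per-field scores (an assumed combination rule)."""
    total_weight = sum(f["weight"] for f in fields)
    score = 0.0
    for f in fields:
        a, b = rec_a[f["field"]], rec_b[f["field"]]
        for name in f["transforms"]:
            a, b = TRANSFORMS[name](a), TRANSFORMS[name](b)
        score += f["weight"] * f["scorer"](a, b)
    return score / total_weight

a = {"first_name": " Ann ", "last_name": "LEE", "zip": "10001"}
b = {"first_name": "ann", "last_name": "lee", "zip": "10001"}
c = {"first_name": "ann", "last_name": "lee", "zip": "94105"}

THRESHOLD = 0.85
print(weighted_score(a, b, FIELDS) >= THRESHOLD)  # same name and zip
print(weighted_score(a, c, FIELDS) >= THRESHOLD)  # same name, different zip
```

With these weights, two records with identical names but different zips score 0.8 and fall just below the 0.85 threshold, which shows how the zip weight acts as a tie-breaker.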

Scorers

Scorer            Description                                   Best For
exact             Binary match                                  Email, phone, ID
jaro_winkler      Character-match similarity with prefix bonus  Names
levenshtein       Normalized Levenshtein                        General strings
token_sort        Order-invariant token matching                Names, addresses
soundex_match     Phonetic match                                Names
ensemble          max(jaro_winkler, token_sort, soundex)        Names with reordering
embedding         Cosine similarity of sentence embeddings      Semantic matching
record_embedding  Embed concatenated fields                     Cross-field semantic matching
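
A few of these scorers are simple enough to sketch in pure Python. The code below implements American Soundex, an order-invariant token comparison, and an ensemble that takes the maximum. GoldenMatch itself is built on RapidFuzz; here difflib's ratio is swapped in as the character-level scorer, so the numbers will differ from the real jaro_winkler and token_sort scorers.

```python
from difflib import SequenceMatcher

_CODES = {ch: digit
          for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
          for ch in letters}

def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]

def soundex_match(a, b):
    return 1.0 if soundex(a) and soundex(a) == soundex(b) else 0.0

def char_similarity(a, b):
    # Stand-in for jaro_winkler (difflib ratio, a different metric).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_sort(a, b):
    """Compare strings after sorting their whitespace-separated tokens."""
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return char_similarity(normalise(a), normalise(b))

def ensemble(a, b):
    return max(char_similarity(a, b), token_sort(a, b), soundex_match(a, b))

print(soundex("Robert"), soundex("Rupert"))
print(token_sort("Smith John", "John Smith"))
print(ensemble("Smith John", "John Smith"))
```

Taking the maximum means the ensemble accepts a pair as soon as any one view of the names agrees, which helps with reordered names at the cost of more false positives on short strings.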

Blocking Strategies

Strategy             Description
static               Group by blocking key (default)
adaptive             Static + recursive sub-blocking for oversized blocks
sorted_neighborhood  Sliding window over sorted records
multi_pass           Union of blocks from multiple passes (best for noisy data)
ann                  ANN via FAISS on sentence-transformer embeddings
ann_pairs            Direct-pair ANN scoring (50-100x faster than ann)
canopy               TF-IDF canopy clustering
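
The idea behind static and multi_pass blocking can be sketched in a few lines: each pass groups records by one key and emits candidate pairs within each group, and multi-pass takes the union of those pairs. The function names and the toy records are illustrative, and the second pass uses a lowercased last name instead of the soundex transform from the config to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key_fn):
    """One static blocking pass: candidate pairs share the same key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key:  # records with an empty key are not blocked together
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def multi_pass(records, key_fns):
    """Union of candidate pairs over several blocking passes."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= block_pairs(records, key_fn)
    return pairs

records = {
    1: {"zip": "10001", "last_name": "Smith"},
    2: {"zip": "10001", "last_name": "Jones"},
    3: {"zip": "94105", "last_name": "smith "},
    4: {"zip": "60601", "last_name": "Brown"},
}
passes = [
    lambda r: r["zip"],
    lambda r: r["last_name"].strip().lower(),
]
candidates = multi_pass(records, passes)
print(sorted(candidates))
```

Only two of the six possible pairs survive blocking here. Pair (1, 3) would be missed by a zip-only pass, which is the kind of recall gain that makes multiple passes attractive for noisy data.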

Database Integration

GoldenMatch can sync against live Postgres databases with incremental matching:

pip install goldenmatch[postgres]

goldenmatch sync \
  --table customers \
  --connection-string "postgresql://user:pass@localhost/mydb" \
  --config config.yaml

Features:

  • Incremental sync — only processes records added since last run
  • Hybrid blocking — SQL WHERE clauses for exact fields + FAISS ANN for semantic fields, results unioned
  • Persistent ANN index — disk cache + DB source of truth, progressive embedding across runs
  • Golden record versioning — append-only with is_current flag, full audit trail
  • Cluster management — persistent clusters with merge, conflict detection, max size safety cap
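
Incremental sync is commonly built on a watermark: store the highest id processed so far and read only rows beyond it on the next run. The sketch below shows that pattern with an in-memory SQLite database standing in for Postgres. The `sync_state` table, its columns, and the use of an integer id as the watermark are assumptions, not GoldenMatch's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE sync_state (table_name TEXT PRIMARY KEY, last_id INTEGER);
""")

def fetch_new_customers(conn):
    """Return rows added since the last call and advance the watermark."""
    row = conn.execute(
        "SELECT last_id FROM sync_state WHERE table_name = 'customers'"
    ).fetchone()
    watermark = row[0] if row else 0
    rows = conn.execute(
        "SELECT id, email FROM customers WHERE id > ? ORDER BY id", (watermark,)
    ).fetchall()
    if rows:
        with conn:  # commit the new watermark atomically
            conn.execute(
                "INSERT OR REPLACE INTO sync_state VALUES ('customers', ?)",
                (rows[-1][0],),
            )
    return rows

with conn:
    conn.executemany("INSERT INTO customers (email) VALUES (?)",
                     [("a@example.com",), ("b@example.com",)])
first_run = fetch_new_customers(conn)   # full scan on the first run

second_run = fetch_new_customers(conn)  # nothing new yet

with conn:
    conn.execute("INSERT INTO customers (email) VALUES (?)", ("c@example.com",))
third_run = fetch_new_customers(conn)   # only the newly added row

print(len(first_run), len(second_run), len(third_run))
```

An id-based watermark only catches inserted rows. Detecting updates to existing rows would need a different signal, such as a last-modified timestamp.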

Metadata tables (auto-created):

Table              Purpose
gm_state           Processing state, watermarks
gm_clusters        Persistent cluster membership
gm_golden_records  Versioned golden records
gm_embeddings      Cached embeddings for ANN
gm_match_log       Audit trail of all match decisions
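
Versioned golden records with an is_current flag can be sketched as follows, again using in-memory SQLite as a stand-in for Postgres. The column layout of `golden_records` here is hypothetical; the README names the gm_golden_records table but does not document its schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE golden_records (
        cluster_id INTEGER,
        version    INTEGER,
        email      TEXT,
        is_current INTEGER
    )
""")

def publish(conn, cluster_id, email):
    """Add a new version; old versions are kept and only lose the current flag."""
    with conn:  # one transaction: flag flip and insert succeed or fail together
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM golden_records WHERE cluster_id = ?",
            (cluster_id,),
        ).fetchone()
        conn.execute(
            "UPDATE golden_records SET is_current = 0 "
            "WHERE cluster_id = ? AND is_current = 1",
            (cluster_id,),
        )
        conn.execute(
            "INSERT INTO golden_records VALUES (?, ?, ?, 1)",
            (cluster_id, latest + 1, email),
        )

publish(conn, 1, "jon@old.example")
publish(conn, 1, "jon@new.example")

history = conn.execute(
    "SELECT version, email, is_current FROM golden_records ORDER BY version"
).fetchall()
current = conn.execute(
    "SELECT email FROM golden_records WHERE cluster_id = 1 AND is_current = 1"
).fetchone()
print(history)
print(current)
```

Because earlier rows are never deleted or rewritten apart from the flag, the full history of a cluster's golden record stays queryable.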

LLM Boost (Optional)

For harder datasets where zero-shot scoring isn't enough:

pip install goldenmatch[llm]

# First run: LLM labels ~300 pairs (~$0.30), fine-tunes embedding model
goldenmatch dedupe products.csv --llm-boost

# Subsequent runs: uses saved model ($0)
goldenmatch dedupe products.csv --llm-boost

Tiered auto-escalation:

  • Level 1 — zero-shot (free, instant)
  • Level 2 — bi-encoder fine-tuning (~$0.20, ~2 min CPU)
  • Level 3 — Ditto-style cross-encoder with data augmentation (~$0.50, ~5 min CPU)

Best LLM-boost result: Abt-Buy 59.5% F1 (up from 44.5% zero-shot) with 300 LLM labels and an optimal train/score split.

Benchmarks

Leipzig Entity Resolution Benchmarks

Dataset                       Best Strategy         F1     Time
DBLP-ACM (2.6K vs 2.3K)       Vertex AI embeddings  97.4%  119s
DBLP-Scholar (2.6K vs 64K)    multi-pass + fuzzy    74.7%  83.9s
Abt-Buy (1K vs 1K)            Vertex AI embeddings  84.7%  53s
Amazon-Google (1.4K vs 3.2K)  Vertex AI embeddings  58.6%  110s

Previous best without Vertex AI: Abt-Buy 59.5% (LLM boost), Amazon-Google 40.5% (record_embedding). Vertex AI's text-embedding-004 model raises these to 84.7% and 58.6% respectively, with no local GPU needed.
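
For readers unfamiliar with the metric, F1 in entity resolution is usually computed over matched pairs. The sketch below assumes that pairwise definition; the README does not say exactly how the benchmark scores were calculated, and the record ids are made up.

```python
def pairwise_f1(predicted_pairs, true_pairs):
    """F1 over unordered record pairs: harmonic mean of precision and recall."""
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

predicted = [("a1", "b1"), ("b2", "a2"), ("a3", "b9")]
truth = [("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]

# 2 of 3 predictions are right (precision 2/3), 2 of 4 true matches found (recall 1/2)
f1 = pairwise_f1(predicted, truth)
print(round(f1, 4))
```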

1M Record Benchmark

1 million records deduplicated in ~15 seconds on a laptop (exact matching, full pipeline).

How GoldenMatch Compares

Metric / feature   GoldenMatch  dedupe        Splink  Zingg  Ditto
Abt-Buy F1         84.7%        ~75%          ~70%    ~80%   89.3%
DBLP-ACM F1        97.4%        ~96%          ~95%    ~96%   99.0%
Training required  No           Yes           Yes     Yes    Yes (1000+)
Zero-config        Yes          No            No      No     No
Interactive TUI    Yes          No            No      No     No
Database sync      Postgres     Cloud (paid)  No      No     No
REST API / MCP     Both         Cloud only    No      No     No
GPU required       No           No            No      Spark  Yes

GoldenMatch's sweet spot is ease of use + competitive accuracy. Ditto has higher F1 but requires 1000+ manual labels and a GPU. Splink scales to billions on Spark but needs label training. GoldenMatch auto-configures from your data and reaches 84.7% F1 on Abt-Buy and 97.4% on DBLP-ACM with zero labels.

Interactive TUI

GoldenMatch includes a gold-themed interactive terminal UI:

  • Auto-config summary — first screen shows detected columns, scorers, and blocking strategy with Run/Edit/Save options
  • Pipeline progress — full-screen progress with stage tracker (✓/●/○) on first run, footer bar on re-runs
  • Split-view matches — cluster list on the left, golden record + member details on the right
  • Live threshold slider — arrow keys adjust threshold in 0.05 increments with instant cluster count preview
  • Keyboard shortcuts — 1-5 jump to tabs, F5 run, ? show all shortcuts, Ctrl+E export
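
The cluster count preview behind a threshold slider can be approximated by re-clustering cached pair scores at each step. This is a guess at the general approach, not a description of the TUI's MatchEngine; the function name and the sample scores are invented for illustration.

```python
def count_clusters(n_records, scored_pairs, threshold):
    """Number of clusters when pairs scoring >= threshold are merged."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = n_records
    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a
                clusters -= 1  # each successful merge removes one cluster
    return clusters

# Cached pair scores for five records, indexed 0..4.
scores = [(0, 1, 0.95), (1, 2, 0.88), (3, 4, 0.81), (0, 2, 0.90)]

# Step in 0.05 increments; integer steps avoid floating-point drift.
preview = {step / 100: count_clusters(5, scores, step / 100)
           for step in range(80, 100, 5)}
print(preview)
```

Since the pair scores do not change when the threshold moves, only the cheap clustering step has to be repeated, which is what would make an instant preview feasible.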

[Screenshot: Data tab, data profiling]

[Screenshot: Matches tab, match results with cluster detail]

[Screenshot: Golden tab, golden records]

Settings Persistence

GoldenMatch saves preferences across sessions:

  • Global: ~/.goldenmatch/settings.yaml — output mode, default model, API keys
  • Project: .goldenmatch.yaml — column mappings, thresholds, blocking config

Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.

CLI Reference

Command                                                 Description
goldenmatch dedupe FILE [...]                           Deduplicate one or more files
goldenmatch match TARGET --against REF                  Match target against reference
goldenmatch sync --table TABLE --connection-string URL  Sync against database
goldenmatch init                                        Interactive config wizard
goldenmatch config save/load/list/delete/show           Manage config presets
goldenmatch profile FILE                                Profile data quality
goldenmatch interactive FILE [...]                      Launch TUI

Architecture

goldenmatch/
├── cli/            # Typer CLI commands (dedupe, match, sync)
├── config/         # Pydantic schemas, YAML loader, settings persistence
├── core/           # Pipeline modules (ingest, block, score, cluster, golden)
├── db/             # Database integration (connector, blocking, sync, reconcile)
├── tui/            # Textual TUI + MatchEngine
└── utils/          # Transforms, helpers

605+ tests covering all modules. Run with pytest.

License

MIT
