GoldenMatch entity resolution functions for DuckDB
goldenmatch-duckdb
pip install goldenmatch-duckdb
Usage
import duckdb
import goldenmatch_duckdb
con = duckdb.connect()
goldenmatch_duckdb.register(con)
# Score two strings
con.sql("SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler')").show()
# Deduplicate a table
con.sql("""
CREATE TABLE customers AS SELECT * FROM (VALUES
('John', 'john@x.com'),
('JOHN', 'john@x.com'),
('Jane', 'jane@y.com')
) AS t(name, email)
""")
con.sql("SELECT goldenmatch_dedupe_table('customers', '{\"exact\": [\"email\"]}')").show()
# Match two tables
con.sql("SELECT goldenmatch_match_tables('prospects', 'reference', '{\"fuzzy\": {\"name\": 0.85}}')").show()
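To make the `{"exact": ["email"]}` rule above concrete, here is a minimal pure-Python sketch of exact-key grouping. It is an illustration only, not GoldenMatch's implementation: the `dedupe_exact` helper and the list-of-dicts input are hypothetical, and the real function's output format is not shown here.

```python
from collections import defaultdict


def dedupe_exact(rows, keys):
    """Group row indices whose values on every column in `keys` are identical."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        # The tuple of key values acts as the blocking/match key.
        groups[tuple(row[k] for k in keys)].append(idx)
    # Dicts preserve insertion order, so clusters appear in first-seen order.
    return list(groups.values())


customers = [
    {"name": "John", "email": "john@x.com"},
    {"name": "JOHN", "email": "john@x.com"},
    {"name": "Jane", "email": "jane@y.com"},
]

clusters = dedupe_exact(customers, ["email"])
print(clusters)  # [[0, 1], [2]]
```

Rows 0 and 1 share an email and land in one cluster even though their names differ in case, which is the behavior an exact rule on `email` is meant to capture.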
Functions
| Function | Description |
|---|---|
| goldenmatch_score(a, b, scorer) | Score two strings |
| goldenmatch_score_pair(rec_a, rec_b, config) | Score two JSON records |
| goldenmatch_explain(rec_a, rec_b, config) | Explain a match |
| goldenmatch_dedupe_table(table, config) | Deduplicate a DuckDB table |
| goldenmatch_match_tables(target, ref, config) | Match two DuckDB tables |
| goldenmatch_dedupe(json, config) | Deduplicate JSON records |
| goldenmatch_match(target_json, ref_json, config) | Match JSON records |
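
For readers unfamiliar with the `jaro_winkler` scorer name used with goldenmatch_score, the following is a textbook sketch of Jaro-Winkler similarity in plain Python. It is not taken from GoldenMatch, whose scorer may differ in details such as case handling or boost thresholds. The function names and the `prefix_scale` parameter are chosen here for illustration.

```python
def jaro(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaro_winkler("John Smith", "Jon Smyth"), 4))
```

Some implementations apply the prefix boost only when the Jaro score exceeds a cutoff such as 0.7; this sketch omits that check for brevity.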
Core-API functions
Thin wrappers over goldenmatch's public core APIs. All return JSON strings unless the description says otherwise (goldenmatch_suggest_threshold, for example, returns a DOUBLE); table-input functions read the named DuckDB table directly.
| Function | Wraps | Description |
|---|---|---|
| goldenmatch_profile_table(table) | profile_dataframe | Full profile report for a table (JSON) |
| goldenmatch_suggest_threshold(scores_json) | suggest_threshold | Otsu threshold over a JSON score list (DOUBLE; NULL when unimodal) |
| goldenmatch_detect_domain(columns_json) | detect_domain | Detect data domain from a JSON column-name list |
| goldenmatch_extract_features(text, kind) | extract_product_features / extract_software_features / extract_biblio_features | Extract structured features; kind = product/electronics, software, or biblio |
| goldenmatch_evaluate(pairs_json, ground_truth_json) | evaluate_pairs / evaluate_clusters | Precision/recall/F1 vs. ground truth (auto-selects by shape) |
| goldenmatch_compare_clusters(a_json, b_json) | compare_clusters | CCMS / TWI comparison of two clusterings |
| goldenmatch_validate_table(table, rules_json) | validate_dataframe | Apply validation rules; returns report + quarantined rows |
| goldenmatch_autofix_table(table) | auto_fix_dataframe | Apply common data fixes; returns fixes + fixed rows |
| goldenmatch_detect_anomalies(table, sensitivity) | detect_anomalies | Flag suspicious records (low/medium/high) |
| goldenmatch_preflight(table, config_json) | preflight | Pre-run config validation findings |
| goldenmatch_postflight(table, config_json) | postflight | Post-run signal report (runs dedupe to derive pair scores) |
| goldenmatch_train_em(rows_json, matchkey_json, params_json) | train_em | Train Fellegi-Sunter m/u probabilities; returns EMResult JSON |
| goldenmatch_score_probabilistic(rows_json, matchkey_json, em_result_json) | score_probabilistic | Score pairs with a trained EMResult |
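
As background for goldenmatch_suggest_threshold, the sketch below shows the idea behind an Otsu-style threshold: pick the cut that maximizes the between-class variance of the two resulting score groups. This is a simplified variant that works on raw values rather than a histogram, and it is not GoldenMatch's code. In particular, returning `None` when all scores are equal is only an assumption standing in for the library's unimodality check, which is not described here.

```python
def otsu_threshold(scores):
    """Return the cut that maximizes between-class variance, or None."""
    xs = sorted(scores)
    n = len(xs)
    if n < 2 or xs[0] == xs[-1]:
        return None  # assumption: no meaningful split exists
    total = sum(xs)
    best_var, best_cut = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # cannot cut between two equal values
        w0, w1 = i / n, (n - i) / n          # class weights
        mu0 = left_sum / i                   # mean of the low group
        mu1 = (total - left_sum) / (n - i)   # mean of the high group
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            # Place the threshold midway between the two neighbors.
            best_var, best_cut = var_between, (xs[i - 1] + xs[i]) / 2
    return best_cut


threshold = otsu_threshold([0.1, 0.12, 0.9, 0.92])
print(threshold)
```

For the four scores above, the best split separates the two low scores from the two high ones, so the threshold falls midway between 0.12 and 0.9 (about 0.51).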
# Otsu threshold suggestion
con.sql("SELECT goldenmatch_suggest_threshold('[0.1,0.12,0.9,0.92]')").show()
# Detect domain from columns
con.sql("SELECT goldenmatch_detect_domain('[\"product_title\",\"brand\",\"sku\"]')").show()
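# Profile or auto-fix a table (validation additionally takes a rules JSON argument)
con.sql("SELECT goldenmatch_profile_table('customers')").show()
con.sql("SELECT goldenmatch_autofix_table('customers')").show()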
# Fellegi-Sunter: train, then score
con.sql("""
SELECT goldenmatch_score_probabilistic(
:rows, :mk,
goldenmatch_train_em(:rows, :mk, '{}')
)
""")
GoldenFlow transforms
Eight scalar VARCHAR -> VARCHAR UDFs wrap GoldenFlow's transform registry to
normalize or canonicalize a column before matching. They are byte-equivalent to
the Postgres goldenflow_* functions, so a transform written for one backend
ports to the other. They fail open (pass the input through unchanged) when
goldenflow isn't installed; run pip install goldenflow to enable them.
| Function | GoldenFlow transform | Description |
|---|---|---|
| goldenflow_normalize_email(value) | email_normalize | Normalize an email address |
| goldenflow_normalize_phone(value) | phone_e164 | Normalize a phone number to E.164 |
| goldenflow_normalize_date(value) | date_iso8601 | Normalize a date to ISO-8601 |
| goldenflow_normalize_name_proper(value) | name_proper | Proper-case a personal name |
| goldenflow_canonicalize_url(value) | url_normalize | Canonicalize a URL |
| goldenflow_canonicalize_address(value) | address_standardize | Standardize a postal address |
| goldenflow_strip(value) | strip | Strip leading/trailing whitespace |
| goldenflow_whitespace_normalize(value) | collapse_whitespace | Collapse internal whitespace runs |
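
The fail-open behavior described above can be pictured with a small sketch. This is an illustration of the pattern only: `fail_open` is a hypothetical helper, and `collapse_whitespace` here is a regex stand-in written for this example, not GoldenFlow's transform.

```python
import re
from typing import Callable, Optional


def fail_open(transform: Optional[Callable[[str], str]]) -> Callable[[str], str]:
    """Return `transform`, or an identity function when no backend is available."""
    if transform is None:
        # Backend missing: pass every value through unchanged.
        return lambda value: value
    return transform


def collapse_whitespace(value: str) -> str:
    # Stand-in transform: replace each run of whitespace with one space.
    return re.sub(r"\s+", " ", value)


enabled = fail_open(collapse_whitespace)  # backend present
disabled = fail_open(None)                # backend missing

print(repr(enabled("Jane   Doe")))   # 'Jane Doe'
print(repr(disabled("Jane   Doe")))  # 'Jane   Doe'
```

One practical consequence of failing open is that a missing dependency produces no error: if a goldenflow_* call returns its input unchanged on values that clearly need normalizing, it is worth checking that goldenflow is installed.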
# Normalize columns inline before matching
con.sql("""
SELECT
goldenflow_normalize_email(email) AS email_norm,
goldenflow_normalize_phone(phone) AS phone_e164
FROM customers
""").show()
Requirements
- Python 3.11+
- DuckDB 1.0+
- goldenmatch >= 1.1.0