A package for matching UK addresses using a pretrained Splink model
High performance UK addresses matcher (geocoder)
Extremely fast address matching using a pre-trained Splink model.
Full time taken: 11.05 seconds
to match 176,640 messy addresses to 273,832 canonical addresses
at a rate of 15,008 addresses per second
(On a MacBook M4 Max)
Installation
At the moment this uses a branch of Splink that is only available on GitHub.
pip install --pre uk_address_matcher
Usage
High performance address matching using a pre-trained Splink model.
Assuming you have two DuckDB tables (relations), one of messy addresses to match and one of canonical addresses to search within, in this format:
| unique_id | address_concat | postcode |
|---|---|---|
| 1 | 123 Fake Street, Faketown | FA1 2KE |
| 2 | 456 Other Road, Otherville | NO1 3WY |
| ... | ... | ... |
Basic Matching
Match them with:
import duckdb
from uk_address_matcher import (
    run_deterministic_match_pass,
    get_linker,
    best_matches_with_distinguishability,
    improve_predictions_using_distinguishing_tokens,
)
from uk_address_matcher.cleaning.chunking_strategies import clean_data_with_term_frequencies
from uk_address_matcher.post_linkage.match_candidate_selection import select_top_match_candidates
p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"
con = duckdb.connect(database=":memory:")
df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")
df_ch_clean = clean_data_with_term_frequencies(df_ch, con=con)
df_fhrs_clean = clean_data_with_term_frequencies(df_fhrs, con=con)
df_fhrs_exact_matches = run_deterministic_match_pass(
    con=con,
    df_addresses_to_match=df_fhrs_clean,
    df_addresses_to_search_within=df_ch_clean,
)
linker = get_linker(
    df_addresses_to_match=df_fhrs_exact_matches,
    df_addresses_to_search_within=df_ch_clean,
    con=con,
    include_full_postcode_block=True,
    additional_columns_to_retain=["original_address_concat"],
)
# First pass - standard probabilistic linkage
df_predict = linker.inference.predict(
    threshold_match_weight=-50
)
df_predict_ddb = df_predict.as_duckdbpyrelation()
# Second pass - improve predictions using distinguishing tokens
df_predict_improved = improve_predictions_using_distinguishing_tokens(
    df_predict=df_predict_ddb,
    con=con,
    match_weight_threshold=-20,
)
# Find best matches within group and compute distinguishability
best_matches = best_matches_with_distinguishability(
    df_predict=df_predict_improved,
    df_addresses_to_match=df_fhrs_exact_matches,
    con=con,
)
# Find top matches in system
match_candidates = select_top_match_candidates(
    con=con,
    df_exact_matches=df_fhrs_exact_matches,
    df_splink_matches=best_matches,
    df_canonical=df_ch_clean,
    match_weight_threshold=15,
    distinguishability_threshold=None,
    include_unmatched=True,
)
match_candidates.show(max_width=500, max_rows=20)
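To give a feel for what "distinguishability" measures, here is a minimal pure-Python sketch. It assumes distinguishability is the gap between the best and second-best match weight for each messy address; this is an assumption for illustration, not necessarily how `best_matches_with_distinguishability` computes it. The function name `best_with_distinguishability` and the sample weights are hypothetical.

```python
from collections import defaultdict


def best_with_distinguishability(predictions):
    """Pick the top candidate per messy address and the gap to the runner-up.

    predictions: iterable of (messy_id, canonical_id, match_weight) tuples.
    Assumption: distinguishability = best weight minus second-best weight,
    or None when there is only one candidate.
    """
    by_messy = defaultdict(list)
    for messy_id, canonical_id, weight in predictions:
        by_messy[messy_id].append((weight, canonical_id))

    results = {}
    for messy_id, scored in by_messy.items():
        # Highest match weight first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        best_weight, best_id = scored[0]
        gap = best_weight - scored[1][0] if len(scored) > 1 else None
        results[messy_id] = {
            "canonical_id": best_id,
            "match_weight": best_weight,
            "distinguishability": gap,
        }
    return results


# Hypothetical scored pairs: (messy_id, canonical_id, match_weight)
predictions = [
    (1, "a", 12.0),
    (1, "b", 3.5),
    (2, "c", 8.0),
]
best = best_with_distinguishability(predictions)
```

A large gap suggests the top candidate clearly beats the alternatives; a small gap suggests the match is ambiguous even when the match weight itself is high.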
Two-Pass Matching Approach
The package uses a two-pass approach to achieve high accuracy matching:
- First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.
- Second Pass: Within each candidate group, the model analyzes distinguishing tokens to refine matches:
  - Identifies tokens that uniquely distinguish addresses within a candidate group
  - Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
  - Uses this contextual information to improve match scores
This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
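The idea behind the second pass can be sketched with plain Python sets. This is a simplified illustration, not the package's implementation: the whitespace tokeniser, the function names `tokenise` and `token_evidence`, and the sample addresses are all hypothetical, and the real model turns this kind of evidence into match weight adjustments rather than returning token sets.

```python
def tokenise(address):
    """Very rough tokeniser: upper-case, drop commas, split on whitespace."""
    return set(address.upper().replace(",", " ").split())


def token_evidence(messy, candidates):
    """For each candidate, find distinguishing and punishment tokens.

    distinguishing: tokens shared by the messy address and this candidate
                    that appear in no other candidate in the group.
    punishment:     tokens in the messy address that are missing from this
                    candidate but present in at least one other candidate.
    """
    messy_tokens = tokenise(messy)
    cand_tokens = {cid: tokenise(addr) for cid, addr in candidates.items()}

    evidence = {}
    for cid, tokens in cand_tokens.items():
        # Union of tokens from every other candidate in the group
        others = set().union(
            *(t for other_id, t in cand_tokens.items() if other_id != cid)
        )
        evidence[cid] = {
            "distinguishing": (messy_tokens & tokens) - others,
            "punishment": (messy_tokens - tokens) & others,
        }
    return evidence


messy = "Flat 2, 10 High Street"
candidates = {
    "a": "Flat 1, 10 High Street",
    "b": "Flat 2, 10 High Street",
    "c": "10 High Street",
}
evidence = token_evidence(messy, candidates)
```

Here candidate `b` is the only one containing the token `2`, so it gains a distinguishing token, while `a` and `c` are each penalised for lacking tokens that another candidate does contain.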
Refer to the example, which has detailed comments, for how to match your data.
See an example of comparing two addresses to get a sense of what the model does and how it scores.
Run an interactive example in your browser:
- Match 5,000 FHRS records to 21,952 Companies House records in < 10 seconds.
- Investigate and understand how the model works
Development
The scripts and tests will run better if you create .vscode/settings.json with the following:
{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}