A package for matching UK addresses using a pretrained Splink model
Matching UK addresses using Splink
High performance address matching using a pre-trained Splink model.
Assuming you have two DuckDB dataframes, each in this format:
| unique_id | address_concat | postcode |
|---|---|---|
| 1 | 123 Fake Street, Faketown | FA1 2KE |
| 2 | 456 Other Road, Otherville | NO1 3WY |
| ... | ... | ... |
Match them with:
from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# df_1 and df_2 are the two input dataframes described above;
# con is the DuckDB connection they belong to.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

# Score candidate pairs between the two cleaned datasets.
linker, predictions = _performance_predict(
    [df_1_c, df_2_c],
    con=con,
    match_weight_threshold=-10,
    output_all_cols=True,
    include_full_postcode_block=True,
)
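To get a feel for what `match_weight_threshold=-10` means, recall that a Splink match weight is the base-2 logarithm of the odds of a match. The helpers below are hypothetical and not part of `uk_address_matcher`; they only illustrate the conversion between match weight and match probability. That `_performance_predict` drops pairs scoring below the threshold is an assumption based on the parameter name.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of the match odds) to a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Inverse conversion: probability (strictly between 0 and 1) to match weight."""
    return math.log2(probability / (1.0 - probability))


# A weight of 0 is even odds; -10 corresponds to odds of 1 to 1024.
for weight in (-10, 0, 10):
    print(weight, match_weight_to_probability(weight))
```

Under this reading, a threshold of -10 is very permissive: it keeps pairs whose match probability is as low as about 0.1%, which suits workflows that inspect or re-rank low-scoring candidates afterwards.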
Initial tests suggest you can match roughly 500 addresses per second on a laptop.
Refer to the example, which has detailed comments, for how to match your data.
See an example of comparing two addresses to get a sense of what it does and how it scores.
Run an interactive example in your browser:
Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.