A package for matching UK addresses using a pretrained Splink model
Matching UK addresses using Splink
High performance address matching using a pre-trained Splink model.
Assuming you have two DuckDB dataframes in this format:
unique_id | address_concat | postcode |
---|---|---|
1 | 123 Fake Street, Faketown | FA1 2KE |
2 | 456 Other Road, Otherville | NO1 3WY |
... | ... | ... |
Match them with:
from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# df_1 and df_2 are the two input tables in the format shown above;
# con is assumed to be the DuckDB connection they were created on.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

# Score the addresses in df_1_c against candidates in df_2_c
linker, predictions = _performance_predict(
    df_addresses_to_match=df_1_c,
    df_addresses_to_search_within=df_2_c,
    con=con,
    match_weight_threshold=-10,
    output_all_cols=True,
    include_full_postcode_block=True,
)
Initial tests suggest you can match roughly 500 addresses per second on a laptop.
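As a rough planning aid, the throughput figure above can be turned into a runtime estimate. The helper below is hypothetical and not part of the package; it assumes throughput stays constant, which will vary with hardware and with the size of the dataset being searched.

```python
def estimated_seconds(n_addresses: int, addresses_per_second: float = 500.0) -> float:
    """Rough wall-clock estimate for matching n_addresses at a fixed throughput."""
    if addresses_per_second <= 0:
        raise ValueError("addresses_per_second must be positive")
    return n_addresses / addresses_per_second


print(estimated_seconds(5_000))      # 10.0
print(estimated_seconds(1_000_000))  # 2000.0
```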
Refer to the example, which has detailed comments, for how to match your data.
See an example of comparing two addresses to get a sense of what the model does and how it scores them.
Run an interactive example in your browser:
Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.