A package for matching UK addresses using a pretrained Splink model
Matching UK addresses using Splink
High performance address matching using a pre-trained Splink model.
Assuming you have two DuckDB dataframes in this format:
unique_id | address_concat | postcode |
---|---|---|
1 | 123 Fake Street, Faketown | FA1 2KE |
2 | 456 Other Road, Otherville | NO1 3WY |
... | ... | ... |
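If your source records hold the address in separate fields, they need reshaping into the three columns above first. The following is a minimal sketch and not part of uk_address_matcher: the helper name `to_matcher_row` and the input field names (`id`, `line_1`, `line_2`, `town`) are assumptions made purely for illustration.

```python
def to_matcher_row(record: dict) -> dict:
    """Reshape one raw record into unique_id / address_concat / postcode."""
    # Hypothetical input fields; adjust these to your own schema.
    parts = [record.get("line_1"), record.get("line_2"), record.get("town")]
    # Skip missing or blank components so no stray commas appear.
    address_concat = ", ".join(p.strip() for p in parts if p and p.strip())
    return {
        "unique_id": record["id"],
        "address_concat": address_concat,
        # Postcode stays in its own column, trimmed and upper-cased.
        "postcode": record["postcode"].strip().upper(),
    }


raw = {
    "id": 1,
    "line_1": "123 Fake Street",
    "line_2": "",
    "town": "Faketown",
    "postcode": "fa1 2ke",
}
row = to_matcher_row(raw)
print(row)
```

Rows shaped this way can then be loaded into DuckDB by whatever route you already use.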
Match them with:
from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# df_1 and df_2 are the two input tables; con is your DuckDB connection.
# Clean both tables using the precomputed relative token frequencies.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

# Score the addresses in df_1_c against those in df_2_c.
linker, predictions = _performance_predict(
    df_addresses_to_match=df_1_c,
    df_addresses_to_search_within=df_2_c,
    con=con,
    match_weight_threshold=-10,
    output_all_cols=True,
    include_full_postcode_block=True,
)
Initial tests suggest you can match roughly 500 addresses per second on a laptop.
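As a rough planning aid, that figure can be turned into a run-time estimate. This is only back-of-the-envelope arithmetic: the function name `estimated_seconds` is hypothetical, and the rate is the approximate laptop figure quoted above, so real throughput will vary with hardware and data.

```python
ADDRESSES_PER_SECOND = 500  # approximate figure quoted above; an assumption for your setup


def estimated_seconds(n_addresses: int, rate: float = ADDRESSES_PER_SECOND) -> float:
    """Estimate matching time in seconds at a constant rate."""
    return n_addresses / rate


print(estimated_seconds(5_000))      # 10.0
print(estimated_seconds(1_000_000))  # 2000.0
```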
Refer to the example, which has detailed comments, for how to match your data.
See an example of comparing two addresses to get a sense of what it does and how it scores.
Run an interactive example in your browser: match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.