A package for matching UK addresses using a pretrained Splink model

Matching UK addresses using Splink

High-performance address matching using a pre-trained Splink model.

Assuming you have two DuckDB dataframes in this format:

| unique_id | address_concat             | postcode |
|-----------|----------------------------|----------|
| 1         | 123 Fake Street, Faketown  | FA1 2KE  |
| 2         | 456 Other Road, Otherville | NO1 3WY  |
| ...       | ...                        | ...      |
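If your raw data holds each address as a single string that ends with the postcode, the two columns above have to be separated first. The following is a minimal sketch of one way to do that in plain Python. The helper `split_address` and the regular expression are hypothetical and not part of uk_address_matcher, and the pattern is a simplification that covers common postcode shapes rather than every valid UK postcode.

```python
import re

# Simplified pattern for a UK postcode at the end of a string.
# Assumption: it handles common shapes such as "FA1 2KE" or "SW1A 1AA",
# not every edge case in the official postcode specification.
POSTCODE_AT_END = re.compile(
    r"[,\s]*([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$",
    re.IGNORECASE,
)


def split_address(unique_id, full_address):
    """Return a row dict with unique_id, address_concat and postcode keys."""
    match = POSTCODE_AT_END.search(full_address)
    if match is None:
        # No recognisable postcode: keep the whole string, leave postcode empty.
        return {
            "unique_id": unique_id,
            "address_concat": full_address.strip(),
            "postcode": None,
        }
    outward, inward = match.group(1).upper(), match.group(2).upper()
    return {
        "unique_id": unique_id,
        # Everything before the postcode (and its leading comma or spaces).
        "address_concat": full_address[: match.start()].strip(),
        "postcode": f"{outward} {inward}",
    }


rows = [
    split_address(1, "123 Fake Street, Faketown, FA1 2KE"),
    split_address(2, "456 Other Road, Otherville NO1 3WY"),
]
```

The resulting list of dicts can then be loaded into DuckDB by whatever route you normally use, for example through an intermediate dataframe.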

Match them with:

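import duckdb

from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# In-memory DuckDB connection shared by the cleaning and prediction steps.
con = duckdb.connect(database=":memory:")

# df_1 and df_2 are the two input tables in the format shown above;
# load them on this connection before running the steps below.

# Clean each input using the precomputed relative token frequencies.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

# Score candidate pairs between the two cleaned inputs with the pre-trained model.
linker, predictions = _performance_predict(
    [df_1_c, df_2_c],
    con=con,
    match_weight_threshold=-10,
    output_all_cols=True,
    include_full_postcode_block=True,
)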
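To get a feel for what `match_weight_threshold=-10` means, it can help to convert a match weight into a probability. The sketch below assumes the standard Splink relationship, in which the match weight is the base-2 logarithm of the odds of a match. The helper name `match_weight_to_probability` is hypothetical and not part of this package. Under that assumption, a weight of -10 corresponds to a match probability of about 0.001, so this threshold keeps almost every scored pair.

```python
def match_weight_to_probability(match_weight):
    """Convert a match weight (log2 of the match odds) into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Print the probability implied by a few illustrative match weights.
for weight in (-10, 0, 10):
    print(weight, match_weight_to_probability(weight))
```

A weight of 0 means even odds, and each additional point doubles the odds of a match.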

Initial tests suggest a throughput of roughly 500 addresses per second on a laptop.

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores a pair.

Run an interactive example in your browser:

Open in Colab: Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.

Open in Colab: Investigate and understand how the model works.

