A package for matching UK addresses using a pretrained Splink model

Matching UK addresses using Splink

High performance address matching using a pre-trained Splink model.

Assuming you have two DuckDB dataframes in this format:

unique_id  address_concat              postcode
1          123 Fake Street, Faketown   FA1 2KE
2          456 Other Road, Otherville  NO1 3WY
...        ...                         ...
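
If your source data holds each address as a single free-text string, the rows first need splitting into the three columns shown above. The following is a minimal sketch of one way to do that, not part of this package: the helper name `to_input_row` and the simplified postcode pattern are assumptions made for illustration, and the pattern only recognises a postcode at the very end of the string.

```python
import re

# Simplified UK postcode pattern anchored at the end of the string.
# Assumption: the postcode, if present, is the last thing in the address.
POSTCODE_AT_END = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$",
    re.IGNORECASE,
)


def to_input_row(unique_id, full_address):
    """Split a free-text address into unique_id, address_concat and postcode."""
    match = POSTCODE_AT_END.search(full_address)
    if match is None:
        # No recognisable postcode: keep the text and leave postcode empty.
        return {
            "unique_id": unique_id,
            "address_concat": full_address.strip(" ,"),
            "postcode": None,
        }
    outward, inward = match.group(1).upper(), match.group(2).upper()
    return {
        "unique_id": unique_id,
        # Everything before the postcode, minus trailing commas and spaces.
        "address_concat": full_address[: match.start()].strip(" ,"),
        "postcode": f"{outward} {inward}",
    }


rows = [
    to_input_row(1, "123 Fake Street, Faketown, FA1 2KE"),
    to_input_row(2, "456 Other Road, Otherville NO1 3WY"),
]
for row in rows:
    print(row)
```

The resulting rows can then be loaded into DuckDB by whatever route you already use (for example via a pandas dataframe or a CSV file).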

Match them with:

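from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# Assumes `con` is an open DuckDB connection and `df_1`, `df_2` are the
# two input dataframes in the format described above.

# Clean both datasets using the precomputed relative token frequencies.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

# Score each address in df_1_c against candidate addresses in df_2_c.
linker, predictions = _performance_predict(
    df_addresses_to_match=df_1_c,
    df_addresses_to_search_within=df_2_c,
    con=con,
    match_weight_threshold=-10,  # discard pairs scoring below this match weight
    output_all_cols=True,
    include_full_postcode_block=True,
)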
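
To make the `match_weight_threshold` value easier to interpret: in Splink, a match weight is the base-2 logarithm of the match odds, so it converts to a probability as 2^w / (1 + 2^w). The sketch below shows that conversion and one way to keep only the top-scoring candidate per address. It works on plain dictionaries rather than on the real `predictions` object, and the column names `unique_id_l`, `unique_id_r` and `match_weight` follow Splink's usual output naming; whether they appear in exactly this form here, and which side holds the addresses being matched, is an assumption to check against your own output.

```python
def weight_to_probability(match_weight):
    """Convert a Splink match weight (log2 odds) to a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def best_match_per_address(rows, key="unique_id_r"):
    """Keep only the highest-weighted candidate row for each value of `key`."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or row["match_weight"] > current["match_weight"]:
            best[row[key]] = row
    return best


# Hypothetical prediction rows, for illustration only.
example_rows = [
    {"unique_id_r": 1, "unique_id_l": 101, "match_weight": 12.5},
    {"unique_id_r": 1, "unique_id_l": 102, "match_weight": -3.0},
    {"unique_id_r": 2, "unique_id_l": 205, "match_weight": 4.0},
]

best = best_match_per_address(example_rows)
for address_id, row in best.items():
    probability = weight_to_probability(row["match_weight"])
    print(address_id, row["unique_id_l"], round(probability, 4))
```

Under this conversion a threshold of -10 corresponds to a match probability of about 0.001, so it retains almost every scored candidate and leaves the final choice of cut-off to you.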

Initial tests suggest you can match roughly 1,000 addresses per second against a list of 30 million addresses on a laptop.

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores pairs.

Run an interactive example in your browser:

Open in Colab: match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.

Open in Colab: investigate and understand how the model works.
