A package for matching UK addresses using a pretrained Splink model

High-performance UK address matcher (geocoder)

Installation

At the moment, this package depends on a branch of Splink that is only available on GitHub.

pip install --pre uk_address_matcher

Usage

High-performance address matching using a pre-trained Splink model.

Assuming you have two DuckDB dataframes in this format:

unique_id   address_concat               postcode
1           123 Fake Street, Faketown    FA1 2KE
2           456 Other Road, Otherville   NO1 3WY
...         ...                          ...
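
If your source data holds each address as a single string, it needs splitting into the columns above first. The following is a minimal sketch using only the standard library. `split_address` and `POSTCODE_RE` are hypothetical names, not part of uk_address_matcher, and the regular expression is a loose approximation of the UK postcode shape rather than a full validator. It also assumes that address_concat should exclude the postcode, as the sample rows suggest.

```python
import re

# Loose UK postcode shape at the end of a string: outward code, optional
# whitespace, inward code. An approximation only, not a full validator.
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$", re.IGNORECASE
)


def split_address(unique_id, full_address):
    """Split one address string into unique_id, address_concat and postcode.

    Hypothetical helper for illustration; not part of uk_address_matcher.
    """
    match = POSTCODE_RE.search(full_address)
    if match is None:
        # No trailing postcode found: keep the whole string as the address.
        return {
            "unique_id": unique_id,
            "address_concat": full_address.strip(" ,"),
            "postcode": None,
        }
    # Normalise the postcode to upper case with a single space.
    postcode = f"{match.group(1)} {match.group(2)}".upper()
    # Everything before the postcode, minus trailing commas and spaces.
    address_concat = full_address[: match.start()].strip(" ,")
    return {
        "unique_id": unique_id,
        "address_concat": address_concat,
        "postcode": postcode,
    }


row = split_address(1, "123 Fake Street, Faketown, FA1 2KE")
print(row)
```

The resulting rows can then be loaded into DuckDB by whatever route suits your pipeline, for example via a Parquet file.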

Basic Matching

Match them with:

import duckdb

from uk_address_matcher import clean_data_using_precomputed_rel_tok_freq, get_linker

p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"

con = duckdb.connect(database=":memory:")

df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")

df_ch_clean = clean_data_using_precomputed_rel_tok_freq(df_ch, con=con)
df_fhrs_clean = clean_data_using_precomputed_rel_tok_freq(df_fhrs, con=con)

linker = get_linker(
    df_addresses_to_match=df_fhrs_clean,
    df_addresses_to_search_within=df_ch_clean,
    con=con,
    include_full_postcode_block=True,
    additional_columns_to_retain=["original_address_concat"],
)

# First pass - standard probabilistic linkage
df_predict = linker.inference.predict(
    threshold_match_weight=-50, experimental_optimisation=True
)
df_predict_ddb = df_predict.as_duckdbpyrelation()

# Second pass - improve predictions using distinguishing tokens
from uk_address_matcher.post_linkage.identify_distinguishing_tokens import improve_predictions_using_distinguishing_tokens

df_predict_improved = improve_predictions_using_distinguishing_tokens(
    df_predict=df_predict_ddb,
    con=con,
    match_weight_threshold=-20,
)
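
The thresholds above are expressed as Splink match weights. In Splink, a match weight is the base-2 logarithm of a Bayes factor, so it maps to a match probability as sketched below. `match_weight_to_probability` is a hypothetical helper written for illustration, not a function exported by this package.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2 Bayes factor (match weight) to a probability.

    Hypothetical helper for illustration; not part of uk_address_matcher.
    """
    if match_weight >= 0:
        # Equivalent form that avoids overflow for large positive weights.
        return 1.0 / (1.0 + 2.0 ** (-match_weight))
    bayes_factor = 2.0 ** match_weight
    return bayes_factor / (1.0 + bayes_factor)


# A weight of 0 means even odds; strongly negative weights correspond to
# probabilities very close to zero, so such a threshold filters out little.
for weight in (-50, -20, 0, 10):
    print(weight, match_weight_to_probability(weight))
```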

Two-Pass Matching Approach

The package uses a two-pass approach to achieve high-accuracy matching:

  1. First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.

  2. Second Pass: Within each candidate group, the model analyzes distinguishing tokens to refine matches:

    • Identifies tokens that uniquely distinguish addresses within a candidate group
    • Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
    • Uses this contextual information to improve match scores

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
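
The second-pass idea can be illustrated with a small pure-Python sketch. This is not the package's implementation (which runs inside DuckDB); `tokenise` and `analyse_candidates` are hypothetical names, tokenisation is a naive whitespace split, and a "distinguishing token" is assumed here to mean a token of the messy address that appears in exactly one candidate of the group.

```python
from collections import Counter


def tokenise(address):
    # Naive tokenisation: upper-case, drop commas, split on whitespace.
    return address.upper().replace(",", " ").split()


def analyse_candidates(messy_address, candidates):
    """For each candidate, find distinguishing and punishment tokens.

    candidates maps a candidate id to its address string.
    Hypothetical helper for illustration only.
    """
    messy_tokens = set(tokenise(messy_address))
    cand_tokens = {cid: set(tokenise(a)) for cid, a in candidates.items()}
    # How many candidates in the group contain each token.
    token_counts = Counter(t for toks in cand_tokens.values() for t in toks)

    results = {}
    for cid, toks in cand_tokens.items():
        # Tokens shared with the messy address that no other candidate has.
        distinguishing = {t for t in toks & messy_tokens if token_counts[t] == 1}
        # Tokens held by the other candidates in the group.
        others = set().union(
            *(t for oid, t in cand_tokens.items() if oid != cid)
        )
        # Messy tokens missing from this candidate but present elsewhere.
        punishment = (messy_tokens - toks) & others
        results[cid] = {
            "distinguishing": distinguishing,
            "punishment": punishment,
        }
    return results


result = analyse_candidates(
    "Flat 2, 10 High Street",
    {"a": "Flat 1, 10 High Street", "b": "Flat 2, 10 High Street"},
)
print(result)
```

In this example the token "2" distinguishes candidate "b" and counts against candidate "a", which is the kind of contextual signal the second pass uses to separate very similar addresses.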

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores a pair.

Run an interactive example in your browser:

    • Open in Colab: match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.
    • Open in Colab: investigate and understand how the model works.
