
A package for matching UK addresses using a pretrained Splink model


High-performance UK address matcher (geocoder)

Extremely fast address matching using a pre-trained Splink model.

Full time taken: 11.05 seconds
to match 176,640 messy addresses to 273,832 canonical addresses
at a rate of 15,008 addresses per second

(On a MacBook with an M4 Max chip)

Installation

At the moment this uses a branch of Splink that is only available on GitHub.

pip install --pre uk_address_matcher

Usage

High performance address matching using a pre-trained Splink model.

Start with two DuckDB dataframes (the addresses to match and the addresses to search within) in this format:

| unique_id | address_concat             | postcode |
|-----------|----------------------------|----------|
| 1         | 123 Fake Street, Faketown  | FA1 2KE  |
| 2         | 456 Other Road, Otherville | NO1 3WY  |
| ...       | ...                        | ...      |
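
If your source data holds each address as a single string with the postcode at the end, it needs splitting into the address_concat and postcode columns before matching. The following is a minimal sketch of one way to do that. The helper name `split_postcode` and the simplified postcode pattern are assumptions for illustration and are not part of uk_address_matcher.

```python
import re
from typing import Optional

# Simplified UK postcode pattern (an assumption for illustration; it does not
# cover every special case, e.g. GIR 0AA).
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\s*$", re.IGNORECASE
)


def split_postcode(full_address: str) -> tuple[str, Optional[str]]:
    """Split a trailing postcode off an address string.

    Returns (address_concat, postcode); postcode is None if none is found.
    """
    match = POSTCODE_RE.search(full_address)
    if match is None:
        return full_address.strip(" ,"), None
    # Everything before the postcode, minus trailing commas and spaces
    address_part = full_address[: match.start()].strip(" ,")
    postcode = f"{match.group(1)} {match.group(2)}".upper()
    return address_part, postcode


raw = [
    "123 Fake Street, Faketown, FA1 2KE",
    "456 Other Road, Otherville NO1 3WY",
]
# Build rows with the three expected columns
rows = [
    {"unique_id": i, "address_concat": addr, "postcode": pc}
    for i, (addr, pc) in enumerate(map(split_postcode, raw), start=1)
]
```

The resulting rows can then be loaded into DuckDB by whichever route suits your pipeline.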

Basic Matching

Match them with:

import duckdb

from uk_address_matcher import (
    run_deterministic_match_pass,
    get_linker,
    best_matches_with_distinguishability,
    improve_predictions_using_distinguishing_tokens,
)
from uk_address_matcher.cleaning.chunking_strategies import clean_data_with_term_frequencies
from uk_address_matcher.post_linkage.match_candidate_selection import select_top_match_candidates

p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"

con = duckdb.connect(database=":memory:")

# Load both datasets, ordered by postcode
df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")

# Clean each dataset
df_ch_clean = clean_data_with_term_frequencies(df_ch, con=con)
df_fhrs_clean = clean_data_with_term_frequencies(df_fhrs, con=con)

# Deterministic pass, run before the probabilistic linkage
df_fhrs_exact_matches = run_deterministic_match_pass(
    con=con,
    df_addresses_to_match=df_fhrs_clean,
    df_addresses_to_search_within=df_ch_clean,
)

linker = get_linker(
    df_addresses_to_match=df_fhrs_exact_matches,
    df_addresses_to_search_within=df_ch_clean,
    con=con,
    include_full_postcode_block=True,
    additional_columns_to_retain=["original_address_concat"],
)

# First pass - standard probabilistic linkage
df_predict = linker.inference.predict(
    threshold_match_weight=-50
)
df_predict_ddb = df_predict.as_duckdbpyrelation()

# Second pass - improve predictions using distinguishing tokens

df_predict_improved = improve_predictions_using_distinguishing_tokens(
    df_predict=df_predict_ddb,
    con=con,
    match_weight_threshold=-20,
)

# Find best matches within group and compute distinguishability

best_matches = best_matches_with_distinguishability(
    df_predict=df_predict_improved,
    df_addresses_to_match=df_fhrs_exact_matches,
    con=con,
)

# Find top matches in system
match_candidates = select_top_match_candidates(
    con=con,
    df_exact_matches=df_fhrs_exact_matches,
    df_splink_matches=best_matches,
    df_canonical=df_ch_clean,
    match_weight_threshold=15,
    distinguishability_threshold=None,
    include_unmatched=True,
)

match_candidates.show(max_width=500, max_rows=20)
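
The thresholds in the code above are expressed as match weights. In Splink, a match weight is the base-2 logarithm of the odds of a match, so a weight w converts to a probability as 2^w / (1 + 2^w). The sketch below shows roughly what values such as -50 and 15 correspond to. The helper name is hypothetical, and scores adjusted by the second pass may not map onto probabilities this cleanly.

```python
def match_weight_to_probability(match_weight: float) -> float:
    """Convert a log2-odds match weight into a match probability."""
    if match_weight >= 0:
        # 1 / (1 + 2^-w) is algebraically the same as 2^w / (1 + 2^w),
        # but avoids overflow for large positive weights.
        return 1.0 / (1.0 + 2.0 ** (-match_weight))
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Print the probability implied by a few example weights
for weight in (-50, -20, 0, 15):
    print(f"{weight:>4}: {match_weight_to_probability(weight):.6g}")
```

On this scale, a very low threshold such as -50 keeps practically every candidate pair, while a threshold of 15 keeps only pairs the model scores as near-certain matches.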

Two-Pass Matching Approach

The package uses a two-pass approach to achieve high-accuracy matching:

  1. First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.

  2. Second Pass: Within each candidate group, the model analyzes distinguishing tokens to refine matches:

    • Identifies tokens that uniquely distinguish addresses within a candidate group
    • Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
    • Uses this contextual information to improve match scores

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
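
The idea behind the second pass can be illustrated with a small, self-contained sketch. This is a simplified illustration under stated assumptions, not the package's implementation: the tokenizer, the function name `analyse_candidate_group`, and the exact definitions used below are hypothetical.

```python
from collections import Counter


def tokenize(address: str) -> set[str]:
    """Upper-case an address and split it into a set of tokens."""
    return set(address.upper().replace(",", " ").split())


def analyse_candidate_group(
    messy: str, candidates: dict[str, str]
) -> dict[str, dict[str, set[str]]]:
    """For each candidate, find distinguishing and punishment tokens.

    distinguishing: tokens shared with the messy address that appear in no
        other candidate in the group.
    punishment: tokens in the messy address that this candidate lacks but at
        least one other candidate has.
    """
    messy_tokens = tokenize(messy)
    candidate_tokens = {cid: tokenize(addr) for cid, addr in candidates.items()}
    # Number of candidates in the group that contain each token
    token_counts = Counter(
        tok for toks in candidate_tokens.values() for tok in toks
    )

    result = {}
    for cid, toks in candidate_tokens.items():
        distinguishing = {t for t in messy_tokens & toks if token_counts[t] == 1}
        punishment = {t for t in messy_tokens - toks if token_counts[t] >= 1}
        result[cid] = {"distinguishing": distinguishing, "punishment": punishment}
    return result


messy_address = "Flat 2, 10 High Street, Faketown"
candidate_group = {
    "a": "FLAT 1 10 HIGH STREET FAKETOWN",
    "b": "FLAT 2 10 HIGH STREET FAKETOWN",
}
analysis = analyse_candidate_group(messy_address, candidate_group)
```

Here candidate b gains the distinguishing token 2, while candidate a picks up 2 as a punishment token because it lacks a token that another candidate matches. How the package converts such signals into score adjustments is not shown here.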

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores matches.

Run an interactive example in your browser:

  • Open in Colab: match 5,000 FHRS records to 21,952 Companies House records in < 10 seconds.
  • Open in Colab: investigate and understand how the model works.

Development

The scripts and tests will run better if you create .vscode/settings.json with the following:

{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}
