A package for matching UK addresses using a pretrained Splink model

Project description

High performance UK addresses matcher (geocoder)

Extremely fast address matching using a pre-trained Splink model.

Full time taken: 11.05 seconds
to match 176,640 messy addresses to 273,832 canonical addresses
at a rate of 15,008 addresses per second

(Measured on a MacBook with an M4 Max chip.)

Installation

At the moment this depends on a branch of Splink that is only available on GitHub.

pip install --pre uk_address_matcher

Usage

Assuming you have two DuckDB relations (DuckDBPyRelation objects) in this format:

unique_id	address_concat	postcode
1	123 Fake Street, Faketown	FA1 2KE
2	456 Other Road, Otherville	NO1 3WY
...	...	...
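If your source data holds the address in several fields, the three-column layout above can be produced with a small helper. The following is a minimal sketch using only the standard library. The function name `build_address_row` and the postcode tidy-up (upper-casing and collapsing whitespace) are assumptions for illustration, not part of the package, and the package's own cleaning step may make the tidy-up unnecessary.

```python
def build_address_row(unique_id, address_parts, postcode):
    """Hypothetical helper: build one row in the unique_id / address_concat / postcode layout."""
    # Drop empty fields and trim whitespace before joining with ", "
    cleaned_parts = [part.strip() for part in address_parts if part and part.strip()]
    return {
        "unique_id": unique_id,
        # The postcode is kept in its own column, as in the table above
        "address_concat": ", ".join(cleaned_parts),
        # Assumption: upper-case the postcode and collapse repeated spaces
        "postcode": " ".join(postcode.upper().split()),
    }


rows = [
    build_address_row(1, ["123 Fake Street", "", "Faketown"], "fa1  2ke"),
    build_address_row(2, ["456 Other Road", "Otherville"], "NO1 3WY"),
]

for row in rows:
    print(row)
```

Rows shaped like this can then be loaded into DuckDB by whatever route suits your pipeline (for example via a Parquet file, as in the example that follows).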

Basic Matching

Match them with:

import duckdb

from uk_address_matcher import (
    run_deterministic_match_pass,
    get_linker,
    best_matches_with_distinguishability,
    improve_predictions_using_distinguishing_tokens,
)
from uk_address_matcher.cleaning.chunking_strategies import clean_data_with_term_frequencies
from uk_address_matcher.post_linkage.match_candidate_selection import select_top_match_candidates

p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"

con = duckdb.connect(database=":memory:")

df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")

# Clean both datasets and attach term frequencies
df_ch_clean = clean_data_with_term_frequencies(df_ch, con=con)
df_fhrs_clean = clean_data_with_term_frequencies(df_fhrs, con=con)

# Deterministic pass - resolve exact matches before probabilistic linkage
df_fhrs_exact_matches = run_deterministic_match_pass(
    con=con,
    df_addresses_to_match=df_fhrs_clean,
    df_addresses_to_search_within=df_ch_clean,
)

linker = get_linker(
    df_addresses_to_match=df_fhrs_exact_matches,
    df_addresses_to_search_within=df_ch_clean,
    con=con,
    include_full_postcode_block=True,
    additional_columns_to_retain=["original_address_concat"],
)

# First pass - standard probabilistic linkage
df_predict = linker.inference.predict(
    threshold_match_weight=-50
)
df_predict_ddb = df_predict.as_duckdbpyrelation()

# Second pass - improve predictions using distinguishing tokens

df_predict_improved = improve_predictions_using_distinguishing_tokens(
    df_predict=df_predict_ddb,
    con=con,
    match_weight_threshold=-20,
)

# Find best matches within group and compute distinguishability

best_matches = best_matches_with_distinguishability(
    df_predict=df_predict_improved,
    df_addresses_to_match=df_fhrs_exact_matches,
    con=con,
)

# Find top matches in system
match_candidates = select_top_match_candidates(
    con=con,
    df_exact_matches=df_fhrs_exact_matches,
    df_splink_matches=best_matches,
    df_canonical=df_ch_clean,
    match_weight_threshold=15,
    distinguishability_threshold=None,
    include_unmatched=True,
)

match_candidates.show(max_width=500, max_rows=20)
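The thresholds in the example are expressed as Splink match weights, which are log2 odds, so a weight w corresponds to a probability of 2^w / (1 + 2^w). The sketch below illustrates that conversion and one plausible reading of "distinguishability": the gap in match weight between the best candidate and the runner-up for the same messy address. The function names and the acceptance rule are assumptions made for illustration. The package's own definitions may differ.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def best_candidate(candidates):
    """candidates: list of (canonical_id, match_weight) for ONE messy address.

    Returns (canonical_id, match_weight, distinguishability), where
    distinguishability is assumed to be the gap to the runner-up
    (None when there is only one candidate).
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    best_id, best_weight = ranked[0]
    gap = best_weight - ranked[1][1] if len(ranked) > 1 else None
    return best_id, best_weight, gap


def accept(best_weight, gap, match_weight_threshold, distinguishability_threshold=None):
    """Hypothetical acceptance rule mirroring the threshold names used above."""
    if best_weight < match_weight_threshold:
        return False
    if distinguishability_threshold is None or gap is None:
        return True
    return gap >= distinguishability_threshold


candidates = [("A", 18.5), ("B", 6.0), ("C", -3.5)]
best_id, best_weight, gap = best_candidate(candidates)
print(best_id, best_weight, gap)
print(round(match_weight_to_probability(15), 5))
print(accept(best_weight, gap, match_weight_threshold=15))
```

Under this reading, a high match weight with a small gap signals that two canonical addresses are nearly equally plausible, which is the situation a distinguishability threshold is meant to catch.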

Two-Pass Matching Approach

The package uses a two-pass approach to achieve high-accuracy matching:

First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.
Second Pass: Within each candidate group, the model analyzes distinguishing tokens to refine matches:
- Identifies tokens that uniquely distinguish addresses within a candidate group
- Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
- Uses this contextual information to improve match scores

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
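The second-pass idea can be illustrated with plain set operations. This is a minimal sketch, assuming simple whitespace tokenisation and the definitions given in the list above. The helper names `tokenise` and `token_evidence` are hypothetical, and the package's actual implementation (and how it turns this evidence into adjusted scores) is not shown here.

```python
def tokenise(address):
    """Assumption: upper-case, treat commas as spaces, split on whitespace."""
    return set(address.upper().replace(",", " ").split())


def token_evidence(messy_address, candidates):
    """candidates: dict of canonical_id -> canonical address (one candidate group)."""
    messy_tokens = tokenise(messy_address)
    candidate_tokens = {cid: tokenise(addr) for cid, addr in candidates.items()}

    evidence = {}
    for cid, tokens in candidate_tokens.items():
        # Tokens that appear in any OTHER candidate of the same group
        other_tokens = set().union(
            *(t for other_id, t in candidate_tokens.items() if other_id != cid)
        )
        # Distinguishing: unique to this candidate within the group, and in the messy address
        distinguishing = (tokens - other_tokens) & messy_tokens
        # Punishment: in the messy address, missing from this candidate, present in another
        punishment = (messy_tokens - tokens) & other_tokens
        evidence[cid] = {
            "distinguishing": sorted(distinguishing),
            "punishment": sorted(punishment),
        }
    return evidence


messy = "Flat 2, 10 High Street, Faketown"
group = {
    "A": "Flat 1 10 High Street Faketown",
    "B": "Flat 2 10 High Street Faketown",
}

for cid, tokens in token_evidence(messy, group).items():
    print(cid, tokens)
```

Here the token "2" supports candidate B and counts against candidate A, even though both candidates share every other token with the messy address.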

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores.

Run an interactive example in your browser:

Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.

Investigate and understand how the model works.

Development

The scripts and tests will run better if you create .vscode/settings.json with the following:

{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}

Download files

Source Distribution

uk_address_matcher-1.0.0.dev21.tar.gz (1.8 MB)

Uploaded Dec 2, 2025 Source

Built Distribution

uk_address_matcher-1.0.0.dev21-py3-none-any.whl (1.8 MB)

Uploaded Dec 2, 2025 Python 3

File details

Details for the file uk_address_matcher-1.0.0.dev21.tar.gz.

File metadata

Download URL: uk_address_matcher-1.0.0.dev21.tar.gz
Upload date: Dec 2, 2025
Size: 1.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.14 {"installer":{"name":"uv","version":"0.9.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for uk_address_matcher-1.0.0.dev21.tar.gz
Algorithm	Hash digest
SHA256	`a5f35f57051c01d0d60d35a22a5ccfec584207ca882f2d856d99a04cc2c8ab29`
MD5	`b4e0df191f4710061b4d1fcaaf7e114a`
BLAKE2b-256	`43a9330a2e42f2f8bafe2fb6dd4e73dde34ba137b5f4f125164dd56c7ad2a98a`

File details

Details for the file uk_address_matcher-1.0.0.dev21-py3-none-any.whl.

File metadata

Download URL: uk_address_matcher-1.0.0.dev21-py3-none-any.whl
Upload date: Dec 2, 2025
Size: 1.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.14 {"installer":{"name":"uv","version":"0.9.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for uk_address_matcher-1.0.0.dev21-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7fb3d7f3a1f5d2f6d2f04a789877a49f76369ec9f48549757a8586578fb9996`
MD5	`2e7dbc622700a6430f8af53fa0557019`
BLAKE2b-256	`73bd65353633299c2d23e530cd79d210b1eaa2893e6c226d1faf75f0b921e9f7`
