A package for matching UK addresses using a pretrained Splink model

Project description

High-performance UK address matcher (geocoder)

Fast, simple address matching (geocoding) in Python.

The key features are:

Simple: Python only, set up in seconds on any laptop, no infrastructure needed
Fast: Match 100,000 addresses in around 30 seconds*
Reproducible benchmarks: High accuracy, demonstrated with reproducible examples

* Timings measured on a MacBook with an M4 Max chip.

Installation

pip install --pre uk_address_matcher

Usage

uk_address_matcher assumes you have two tables in the following format:

unique_id	address_concat
1	123 Fake Street, Faketown, FA1 2KE
2	456 Other Road, Otherville, NO1 3WY
...	...

Typically, one table holds the 'messy' addresses that need matching, and the other is a 'canonical' dataset of addresses to match against, such as Ordnance Survey AddressBase or NGD.
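As a rough illustration of the expected shape, the sketch below checks that a list of rows has the two required columns and that every unique_id is distinct. The helper name check_input_rows is hypothetical and not part of uk_address_matcher; it only mirrors the table format described above.

```python
from collections import Counter

REQUIRED_COLUMNS = {"unique_id", "address_concat"}


def check_input_rows(rows):
    """Return a list of problems found in rows shaped like the table above.

    Hypothetical helper for illustration; not part of uk_address_matcher.
    """
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
    # Every unique_id should identify exactly one row.
    id_counts = Counter(row.get("unique_id") for row in rows)
    for uid, n in id_counts.items():
        if n > 1:
            problems.append(f"unique_id {uid!r} appears {n} times")
    return problems


rows = [
    {"unique_id": 1, "address_concat": "123 Fake Street, Faketown, FA1 2KE"},
    {"unique_id": 2, "address_concat": "456 Other Road, Otherville, NO1 3WY"},
]
print(check_input_rows(rows))  # prints []
```

Running a check like this before building the DuckDB relation can surface missing columns or duplicate identifiers early.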

Basic Matching

Note: Two runnable examples with live sample data are included for experimentation:

examples/example_matching.py: End-to-end matching example, including loading data, running the matcher, and previewing results.

examples/example_prepare_canonical.py: Example of preparing a canonical dataset for repeated use, demonstrating how to persist prepared data to disk and load it for matching.

Both read Parquet files from example_data/. To run them as-is, download the example data from the releases page; alternatively, adapt the code to use your own data.

import duckdb

from uk_address_matcher import AddressMatcher

con = duckdb.connect()

df_canonical = con.read_parquet("your_canonical_addresses.parquet")
df_messy = con.read_parquet("your_messy_addresses.parquet")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()  # returns a DuckDBPyRelation
result.limit(10).show(max_width=500)

The default stages are ExactMatchStage followed by SplinkStage. You can customise them by passing your own stages list:

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    SplinkStage,
    UniqueTrigramStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)

result = matcher.match()
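One way to think about a stages list is as a cascade in which each stage works on whatever earlier stages left unmatched. The sketch below illustrates that idea in plain Python; it is not the library's implementation. The stage functions, their normalisation rules, and the threshold are assumptions made for illustration, and the token-overlap stage is a simple stand-in rather than a reimplementation of UniqueTrigramStage or SplinkStage.

```python
def normalise(address):
    """Lower-case and drop punctuation so trivially different strings compare equal."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in address.lower())
    return " ".join(cleaned.split())


def exact_stage(messy, canonical):
    """Match records whose normalised address is identical to a canonical one."""
    lookup = {normalise(addr): cid for cid, addr in canonical.items()}
    return {
        mid: lookup[normalise(addr)]
        for mid, addr in messy.items()
        if normalise(addr) in lookup
    }


def token_overlap_stage(messy, canonical, threshold=0.6):
    """Stand-in fuzzy stage: pick the canonical address with the best Jaccard score."""
    matches = {}
    for mid, addr in messy.items():
        tokens = set(normalise(addr).split())
        best_id, best_score = None, 0.0
        for cid, caddr in canonical.items():
            ctokens = set(normalise(caddr).split())
            score = len(tokens & ctokens) / len(tokens | ctokens)
            if score > best_score:
                best_id, best_score = cid, score
        if best_score >= threshold:
            matches[mid] = best_id
    return matches


def run_stages(messy, canonical, stages):
    """Run stages in order, passing only still-unmatched records to the next stage."""
    matched, remaining = {}, dict(messy)
    for stage in stages:
        found = stage(remaining, canonical)
        matched.update(found)
        remaining = {k: v for k, v in remaining.items() if k not in found}
    return matched, remaining


canonical = {
    "c1": "10 Downing Street, London, SW1A 2AA",
    "c2": "11 Downing Street, London, SW1A 2AB",
}
messy = {
    "m1": "10 downing street london sw1a 2aa",
    "m2": "11 Downing St London SW1A 2AB",
    "m3": "1 Nowhere Lane",
}
matched, unmatched = run_stages(messy, canonical, [exact_stage, token_overlap_stage])
print(matched)
print(sorted(unmatched))
```

In this toy data, m1 is resolved by the exact stage, m2 falls through to the fuzzy stand-in, and m3 stays unmatched. Putting cheap, high-precision stages first keeps the more expensive stages focused on the harder records.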

Additional columns

You may also provide a separate column called postcode. If provided, it takes precedence over any postcode information in address_concat.
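The precedence rule can be pictured with the following sketch. It is an illustration only: the helper name effective_postcode and the simplified postcode pattern are assumptions, and the library's own parsing may differ.

```python
import re

# Simplified UK postcode pattern (assumption); it does not cover every edge case.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")


def effective_postcode(record):
    """Prefer an explicit postcode column, else fall back to address_concat."""
    explicit = (record.get("postcode") or "").strip().upper()
    if explicit:
        return explicit
    found = POSTCODE_RE.search(record["address_concat"].upper())
    return f"{found.group(1)} {found.group(2)}" if found else None


with_column = {"address_concat": "10 Downing Street SW1A 2AB", "postcode": "SW1A 2AA"}
without_column = {"address_concat": "10 Downing Street, London sw1a2aa"}

print(effective_postcode(with_column))     # SW1A 2AA
print(effective_postcode(without_column))  # SW1A 2AA
```

The first record shows the explicit column winning over a conflicting postcode embedded in the address string.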

If you have labelled data (that is, you know the ground truth), you may provide a column called ukam_label. If provided, it is propagated through to your results for accuracy analysis.
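With labels present, accuracy can be computed by comparing each predicted match with ukam_label. The sketch below assumes the results have been fetched as a list of dicts and that the predicted canonical identifier lives in a column called matched_id. That column name is hypothetical, so check the actual output schema of matcher.match() before adapting it.

```python
def accuracy(rows, predicted_key="matched_id", label_key="ukam_label"):
    """Share of labelled rows whose predicted match equals the ground truth.

    predicted_key is a hypothetical column name, used here for illustration.
    """
    labelled = [r for r in rows if r.get(label_key) is not None]
    if not labelled:
        return None
    correct = sum(1 for r in labelled if r.get(predicted_key) == r[label_key])
    return correct / len(labelled)


rows = [
    {"unique_id": "m_1", "matched_id": "c_10", "ukam_label": "c_10"},
    {"unique_id": "m_2", "matched_id": "c_11", "ukam_label": "c_12"},
    {"unique_id": "m_3", "matched_id": None, "ukam_label": "c_13"},
    {"unique_id": "m_4", "matched_id": "c_14", "ukam_label": None},
]
print(accuracy(rows))
```

Rows without a label are excluded from the denominator, while labelled rows with no predicted match count as incorrect.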

Pre-preparing canonical data

Cleaning a large canonical dataset (e.g. AddressBase) is expensive. Use prepare_canonical_folder to do it once and write the artefacts to disk. Subsequent runs load the prepared folder directly, skipping cleaning entirely.

from uk_address_matcher import AddressMatcher, prepare_canonical_folder

# One-time preparation
prepare_canonical_folder(
    df_canonical,
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)

print("Prepared canonical data written to ./ukam_prepared_canonical/")

# Fast matching — pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()
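The prepare-once, load-later pattern is a form of on-disk caching. The sketch below shows the general idea with standard-library tools only; it is not how prepare_canonical_folder stores its artefacts. The functions clean, prepare_folder, and load_prepared, and the JSON file layout, are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def clean(address):
    """Stand-in for an expensive cleaning step."""
    return " ".join(address.lower().replace(",", " ").split())


def prepare_folder(rows, folder, overwrite=False):
    """Clean rows once and persist them so later runs can skip cleaning."""
    folder = Path(folder)
    target = folder / "cleaned.json"
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} exists; pass overwrite=True to replace it")
    folder.mkdir(parents=True, exist_ok=True)
    cleaned = [{**row, "address_clean": clean(row["address_concat"])} for row in rows]
    target.write_text(json.dumps(cleaned))
    return target


def load_prepared(folder):
    """Load previously prepared rows without re-running the cleaning step."""
    return json.loads((Path(folder) / "cleaned.json").read_text())


rows = [{"unique_id": 1, "address_concat": "123 Fake Street, Faketown, FA1 2KE"}]
with tempfile.TemporaryDirectory() as tmp:
    prepare_folder(rows, tmp, overwrite=True)
    prepared = load_prepared(tmp)

print(prepared[0]["address_clean"])  # 123 fake street faketown fa1 2ke
```

The overwrite flag guards against silently replacing an existing prepared folder, which matters when preparation takes a long time.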

Matching one or more AddressRecord entries

To match a small number of addresses, or addresses already held in memory, pass them directly as addresses_to_match without first creating a DuckDB relation.

addresses_to_match accepts a list of AddressRecord entries, a list of dicts with the keys address_concat, postcode, and unique_id, or a DuckDB relation.

import duckdb

from uk_address_matcher import AddressMatcher, AddressRecord

con = duckdb.connect()

df_canonical = con.read_parquet("your_canonical_addresses.parquet")

records = [
    AddressRecord(
        unique_id="m_1",
        address_concat="10 downing street westminster london",
        postcode="SW1A 2AA",
    ),
    AddressRecord(
        unique_id="m_2",
        address_concat="11 downing street westminster london",
        postcode="SW1A 2AB",
    ),
]

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=records,
    con=con,
)

result = matcher.match()
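Because a list of dicts is accepted, rows read with the standard library's csv.DictReader are already close to the required shape. The following sketch parses a small in-memory CSV into dicts with the three keys named above. The sample rows are made up, and passing the resulting list to the matcher is left as a comment because it needs a canonical dataset.

```python
import csv
import io

CSV_TEXT = """unique_id,address_concat,postcode
m_1,10 downing street westminster london,SW1A 2AA
m_2,11 downing street westminster london,SW1A 2AB
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
records = [
    {
        "unique_id": row["unique_id"],
        "address_concat": row["address_concat"],
        "postcode": row["postcode"],
    }
    for row in reader
]

print(records[0])
# The list could then be passed as addresses_to_match=records,
# as described above for lists of dicts.
```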

Methodology

The Splink phase uses a two-pass approach to achieve high accuracy matching:

First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.
Second Pass: Within each candidate group, the model analyses distinguishing tokens to refine matches:
- Identifies tokens that uniquely distinguish addresses within a candidate group
- Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
- Uses this contextual information to improve match scores
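The second-pass ideas can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: tokens are whitespace-separated words, and the function names distinguishing_tokens and punishment_tokens are hypothetical.

```python
def tokenise(address):
    """Split an address into a set of lower-case tokens."""
    return set(address.lower().replace(",", " ").split())


def distinguishing_tokens(candidates):
    """For each candidate, tokens that no other candidate in the group contains."""
    token_sets = {cid: tokenise(addr) for cid, addr in candidates.items()}
    result = {}
    for cid, tokens in token_sets.items():
        others = set().union(*(t for oid, t in token_sets.items() if oid != cid))
        result[cid] = tokens - others
    return result


def punishment_tokens(messy, candidates):
    """For each candidate, messy tokens it lacks but another candidate has."""
    messy_tokens = tokenise(messy)
    token_sets = {cid: tokenise(addr) for cid, addr in candidates.items()}
    result = {}
    for cid, tokens in token_sets.items():
        others = set().union(*(t for oid, t in token_sets.items() if oid != cid))
        result[cid] = (messy_tokens - tokens) & others
    return result


candidates = {
    "c1": "Flat 1, 10 High Street, Faketown",
    "c2": "Flat 2, 10 High Street, Faketown",
    "c3": "10 High Street, Faketown",
}
messy = "flat 2 10 high st faketown"

print(distinguishing_tokens(candidates))
print(punishment_tokens(messy, candidates))
```

In this toy group, c2 is the only candidate with no punishment tokens, while c1 is penalised for lacking "2" and c3 for lacking both "flat" and "2". A real scorer would turn these sets into score adjustments; how the library does so is not shown here.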

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
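For background on the first pass: Splink is built on the Fellegi-Sunter model, in which each comparison contributes a match weight of log2(m/u), where m and u are the probabilities of observing that comparison outcome among true matches and true non-matches respectively. The sketch below combines such weights into a match probability. The m, u, and prior values are invented for illustration and are not taken from the pretrained model.

```python
import math


def match_probability(prior, comparisons):
    """Combine a prior match probability with per-comparison (m, u) pairs."""
    weight = math.log2(prior / (1 - prior))  # prior odds expressed as a match weight
    for m, u in comparisons:
        weight += math.log2(m / u)  # partial match weight for one comparison
    odds = 2 ** weight
    return odds / (1 + odds)


# Invented values: a postcode agreement (strong evidence) and a
# building-number agreement (weaker evidence).
comparisons = [(0.95, 0.001), (0.9, 0.05)]
print(round(match_probability(prior=0.0001, comparisons=comparisons), 3))
```

Summing weights in log space is equivalent to multiplying the prior odds by each comparison's Bayes factor, which is why a single rare agreement such as a postcode can move the probability substantially.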

Development

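The scripts and tests are easier to run from VS Code if you create .vscode/settings.json with the following contents, which point notebooks and code analysis at the repository root and enable pytest: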

{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}

uk_address_matcher 1.0.0.dev27
