
A package for matching UK addresses using a pretrained Splink model


High performance UK address matcher (geocoder)

Extremely fast address matching using a pre-trained Splink model.

Full time taken: 11.05 seconds to match 176,640 messy addresses to 273,832 canonical addresses, at a rate of 15,008 addresses per second (on a MacBook M4 Max).

Installation

pip install --pre uk_address_matcher

Usage


The matcher links two datasets provided in this format:

| unique_id | address_concat |
| --- | --- |
| 1 | 123 Fake Street, Faketown, FA1 2KE |
| 2 | 456 Other Road, Otherville, NO1 3WY |
| ... | ... |
  • You may also provide a separate column called postcode; if present, it takes precedence over any postcode information in address_concat.
  • If you have labelled data (i.e. you know the ground truth), you may provide a column called ukam_label; if present, it will propagate through to your results for accuracy analysis.

Postcode handling rules:

  • If you provide a separate postcode column, address_concat should ideally not include the postcode.
  • If you do not provide postcode, the matcher will attempt to extract it during cleaning.

Generally, one dataset will be a set of 'messy addresses' which need matching, and the other a 'canonical dataset' of addresses to match against.

Preparing AddressBase for use in uk_address_matcher

uk_address_matcher can be used with any canonical list of addresses provided in the format above.

Many users will wish to link to Ordnance Survey address products.

Simplest route (lower accuracy)

The simplest Ordnance Survey product to use for this purpose is NGD Built Address.

You can use this 'out of the box' as your canonical list of addresses by selecting data from BuiltAddress as follows:

select uprn as unique_id, fulladdress as address_concat
from builtaddress
where {your_filter_here}

Provide the resulting output to uk_address_matcher. You will generally improve accuracy if you filter the data down to the geographical region of interest, and narrow the addresses as much as possible to only those of interest (e.g. residential only, if you're matching residential addresses).

Full prep (higher accuracy)

Higher accuracy can be achieved by processing Ordnance Survey data in a more sophisticated way.

For instance, Ordnance Survey provides multiple representations of a single address in both AddressBase Premium and NGD Address.

By providing multiple address representations of each canonical address to uk_address_matcher, you improve the chance of high-precision matching.

We provide recommended automated build scripts for constructing such a file from AddressBase Premium and the NGD datasets here:

Basic Matching

[!NOTE] Two runnable examples with live sample data are included for experimentation:

Both use parquet files in example_data/ so you can run and adapt them immediately. You will need to download the example data from the releases page to run them, or you can adapt the code to use your own data.

import duckdb

from uk_address_matcher import AddressMatcher, ExactMatchStage, SplinkStage

con = duckdb.connect()

df_canonical = con.read_parquet("your_canonical_addresses.parquet")
df_messy = con.read_parquet("your_messy_addresses.parquet")

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()  # returns a DuckDBPyRelation
result.limit(10).show(max_width=500)

The default stages are ExactMatchStage followed by SplinkStage. You can customise them by passing your own stages list:

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    SplinkStage,
    UniqueTrigramStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(
            final_match_weight_threshold=20.0,
            final_distinguishability_threshold=5.0,
        ),
    ],
)

result = matcher.match()

Pre-preparing canonical data

Cleaning a large canonical dataset (e.g. AddressBase) is expensive. Use prepare_canonical_folder to do it once and write the artefacts to disk. Subsequent runs load the prepared folder directly, skipping cleaning entirely.

from uk_address_matcher import AddressMatcher, prepare_canonical_folder

# One-time preparation
prepare_canonical_folder(
    df_canonical,
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)

print("Prepared canonical data written to ./ukam_prepared_canonical/")

# Fast matching — pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()

Matching one or more AddressRecord entries

If you want to match a small number of addresses, or you have them in memory as Python objects, you can pass them directly as addresses_to_match without creating a DuckDB relation first. The matcher accepts a list of AddressRecord entries, a list of dicts with address_concat, postcode, and unique_id keys, or a DuckDB relation.

import duckdb

from uk_address_matcher import AddressMatcher, AddressRecord

con = duckdb.connect()

df_canonical = con.read_parquet("your_canonical_addresses.parquet")

records = [
    AddressRecord(
        unique_id="m_1",
        address_concat="10 downing street westminster london",
        postcode="SW1A 2AA",
    ),
    AddressRecord(
        unique_id="m_2",
        address_concat="11 downing street westminster london",
        postcode="SW1A 2AB",
    ),
]

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=records,
    con=con,
)

result = matcher.match()
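For the plain-dict variant, the equivalent input looks like this. The keys mirror the accepted column names; the values are the same illustrative addresses as above:

```python
# Plain-dict equivalent of the AddressRecord input: keys are the accepted
# column names (unique_id, address_concat, and the optional postcode).
records = [
    {
        "unique_id": "m_1",
        "address_concat": "10 downing street westminster london",
        "postcode": "SW1A 2AA",
    },
    {
        "unique_id": "m_2",
        "address_concat": "11 downing street westminster london",
        "postcode": "SW1A 2AB",
    },
]

# This list can be passed as addresses_to_match in place of the
# AddressRecord list shown above.
```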

Two-Pass Matching Approach

The Splink phase uses a two-pass approach to achieve high accuracy matching:

  1. First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.

  2. Second Pass: Within each candidate group, the model analyses distinguishing tokens to refine matches:

    • Identifies tokens that uniquely distinguish addresses within a candidate group
    • Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
    • Uses this contextual information to improve match scores

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
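The token logic of the second pass can be sketched in plain Python. This is an illustrative toy, not the library's actual implementation; the function and variable names are invented:

```python
def second_pass_token_analysis(messy: str, candidates: list[str]) -> dict:
    """For each candidate in a group, find its distinguishing tokens that the
    messy address matches, and its 'punishment tokens' (messy-address tokens
    that miss this candidate but hit a rival candidate in the same group)."""
    messy_tokens = set(messy.lower().split())
    cand_tokens = {c: set(c.lower().split()) for c in candidates}

    results = {}
    for cand, toks in cand_tokens.items():
        # Tokens appearing in any *other* candidate in the group.
        other_toks = set().union(*(t for c, t in cand_tokens.items() if c != cand))
        # Tokens unique to this candidate, and whether the messy address has them.
        distinguishing = toks - other_toks
        # Messy tokens that don't match this candidate but do match a rival.
        punishment = {t for t in messy_tokens if t not in toks and t in other_toks}
        results[cand] = {
            "distinguishing_matched": distinguishing & messy_tokens,
            "punishment": punishment,
        }
    return results


analysis = second_pass_token_analysis(
    "flat b 10 downing street london",
    ["flat a 10 downing street london", "flat b 10 downing street london"],
)
```

In this example, "flat b ..." matches the distinguishing token "b" of the second candidate, while the first candidate picks up "b" as a punishment token, pushing its score down.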

Development

The scripts and tests will run better if you create .vscode/settings.json with the following:

{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}
