
A package for matching UK addresses using a pretrained Splink model

High-performance UK address matcher (geocoder)

Fast, simple address matching (geocoding) in Python.

The key features are:

  • Simple: Python only, set up in seconds on any laptop, no infrastructure needed
  • Fast: Match 100,000 addresses in around 30 seconds*
  • Reproducible benchmarks: High accuracy, demonstrated with reproducible examples

* Timings measured on a MacBook with an M4 Max chip.

Installation

pip install --pre uk_address_matcher

Usage

uk_address_matcher assumes you have two tables in the following format:

unique_id address_concat
1 123 Fake Street, Faketown, FA1 2KE
2 456 Other Road, Otherville, NO1 3WY
... ...
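If your source data stores addresses as separate fields, the two-column layout above can be built by joining the parts into a single string. The following is a minimal sketch in plain Python; the input field names (id, line_1, town, postcode) and the helper to_matcher_row are hypothetical and not part of uk_address_matcher.

```python
# Hypothetical split-field input; these field names are illustrative only.
raw_rows = [
    {"id": 1, "line_1": "123 Fake Street", "town": "Faketown", "postcode": "FA1 2KE"},
    {"id": 2, "line_1": "456 Other Road", "town": "Otherville", "postcode": "NO1 3WY"},
]


def to_matcher_row(row):
    """Build a unique_id / address_concat row, skipping empty parts."""
    parts = [row.get("line_1"), row.get("town"), row.get("postcode")]
    address_concat = ", ".join(p.strip() for p in parts if p and p.strip())
    return {"unique_id": row["id"], "address_concat": address_concat}


rows = [to_matcher_row(r) for r in raw_rows]
for r in rows:
    print(r["unique_id"], r["address_concat"])
```

Skipping empty parts avoids stray separators such as ", ," when a field is missing.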

Generally, one table holds the 'messy' addresses that need matching, and the other is a 'canonical' dataset of addresses to match against, such as Ordnance Survey AddressBase or NGD.

Basic Matching

Note: two runnable examples with live sample data are included for experimentation.

The package also provides downloadable demo datasets via ukam_datasets.

import duckdb

from uk_address_matcher import AddressMatcher, ukam_datasets

con = duckdb.connect()

# Download + load fictional London dummy datasets (cached locally)
df_messy, df_canonical = ukam_datasets.fictional_london

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()
result.matches().show(max_width=500)

The default stages are ExactMatchStage followed by SplinkStage. You can customise them by passing your own stages list:

from uk_address_matcher import (
    AddressMatcher,
    ExactMatchStage,
    SplinkStage,
    UniqueTrigramStage,
)

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=df_messy,
    con=con,
    stages=[
        ExactMatchStage(),
        UniqueTrigramStage(),
        SplinkStage(),
    ],
)

result = matcher.match()
result.matches().show(max_width=500)
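To give a feel for what a staged pipeline does, here is a toy illustration in plain Python. It is not the package's implementation: exact_stage and fuzzy_stage are hypothetical stand-ins, the fuzzy stage uses simple token overlap rather than Splink, and the sketch assumes that each stage only sees records left unmatched by earlier stages, which the description above does not state explicitly.

```python
import re


def exact_stage(messy, canonical):
    """Match on case-insensitive, whitespace-normalised string equality."""
    lookup = {" ".join(a.lower().split()): cid for cid, a in canonical.items()}
    found = {}
    for mid, addr in messy.items():
        key = " ".join(addr.lower().split())
        if key in lookup:
            found[mid] = lookup[key]
    return found


def fuzzy_stage(messy, canonical, threshold=0.5):
    """Pick the canonical address with the highest token overlap (Jaccard)."""
    def tokens(s):
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    found = {}
    for mid, addr in messy.items():
        m = tokens(addr)
        scored = [(len(m & tokens(a)) / len(m | tokens(a)), cid) for cid, a in canonical.items()]
        score, best = max(scored)
        if score >= threshold:
            found[mid] = best
    return found


def run_stages(messy, canonical, stages):
    """Run stages in order; each stage only sees still-unmatched records."""
    remaining, matches = dict(messy), {}
    for stage in stages:
        found = stage(remaining, canonical)
        matches.update(found)
        remaining = {k: v for k, v in remaining.items() if k not in found}
    return matches, remaining


canonical = {"c1": "10 Downing Lane, Faketown", "c2": "12 Downing Lane, Faketown"}
messy = {"m1": "10 downing lane,  faketown", "m2": "Flat at 12 Downing Lane Faketown"}

matches, remaining = run_stages(messy, canonical, [exact_stage, fuzzy_stage])
print(matches, remaining)
```

Under that assumption, placing cheap, precise stages first means the more expensive stage has fewer records to process.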

Additional columns

You may also provide a separate column called postcode. If provided, it takes precedence over any postcode information in address_concat.
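The precedence rule can be pictured with a small stand-alone sketch. This is an illustration only, not the package's own logic: effective_postcode is a hypothetical helper, and the regular expression is a simplified UK postcode pattern that does not cover every valid format.

```python
import re

# Simplified UK postcode pattern for illustration; not exhaustive.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)


def effective_postcode(address_concat, postcode=None):
    """Return the explicit postcode if given, else one found in address_concat."""
    if postcode and postcode.strip():
        return postcode.strip().upper()
    found = POSTCODE_RE.search(address_concat)
    if found is None:
        return None
    return f"{found.group(1)} {found.group(2)}".upper()


# No postcode column: fall back to the text of address_concat.
print(effective_postcode("123 Fake Street, Faketown, FA1 2KE"))
# Postcode column supplied: it wins over the embedded one.
print(effective_postcode("123 Fake Street, Faketown, FA1 2KE", postcode="NO1 3WY"))
```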

If you have labelled data (you know the ground truth), you may provide a column called ukam_label. If provided, it is propagated through to your results for accuracy analysis.
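Once the label has been carried through to the results, accuracy can be summarised by comparing it with the matched identifier. The sketch below works on plain dictionaries and makes two assumptions: that ukam_label holds the unique_id of the true canonical address, and that the matched canonical identifier is available under a key here called matched_id (a hypothetical name; check the actual output columns).

```python
# Hypothetical result rows; "matched_id" is an assumed column name.
results = [
    {"unique_id": "m_1", "matched_id": "c_10", "ukam_label": "c_10"},
    {"unique_id": "m_2", "matched_id": "c_21", "ukam_label": "c_20"},
    {"unique_id": "m_3", "matched_id": None, "ukam_label": "c_30"},
    {"unique_id": "m_4", "matched_id": "c_40", "ukam_label": "c_40"},
]


def accuracy_summary(rows):
    """Count correct, wrong, and unmatched rows among those with a label."""
    labelled = [r for r in rows if r["ukam_label"] is not None]
    correct = sum(r["matched_id"] == r["ukam_label"] for r in labelled)
    unmatched = sum(r["matched_id"] is None for r in labelled)
    wrong = len(labelled) - correct - unmatched
    return {
        "labelled": len(labelled),
        "correct": correct,
        "wrong": wrong,
        "unmatched": unmatched,
        "accuracy": correct / len(labelled) if labelled else None,
    }


summary = accuracy_summary(results)
print(summary)
```

Reporting wrong and unmatched rows separately helps distinguish false matches from addresses that were simply not linked.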

Pre-preparing canonical data

Cleaning a large canonical dataset (e.g. AddressBase) is expensive. Use prepare_canonical_folder to do it once and write the artefacts to disk. Subsequent runs load the prepared folder directly, skipping cleaning entirely.

from uk_address_matcher import AddressMatcher, prepare_canonical_folder

# One-time preparation
prepare_canonical_folder(
    df_canonical,
    output_folder="./ukam_prepared_canonical",
    con=con,
    overwrite=True,
)

print("Prepared canonical data written to ./ukam_prepared_canonical/")

# Fast matching — pass the folder path instead of a relation
matcher = AddressMatcher(
    canonical_addresses="./ukam_prepared_canonical",
    addresses_to_match=df_messy,
    con=con,
)

result = matcher.match()
result.matches().show(max_width=500)
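In a script that runs repeatedly, you may prefer to prepare the canonical data only when the output folder does not already exist. The following is a minimal sketch of such a guard; ensure_prepared is a hypothetical helper, not part of uk_address_matcher, and treating a non-empty folder as "already prepared" is an assumption about what a completed preparation looks like.

```python
from pathlib import Path


def ensure_prepared(folder, prepare):
    """Call prepare(path) only if the folder is missing or empty.

    Returns True if preparation ran, False if it was skipped.
    """
    path = Path(folder)
    if path.is_dir() and any(path.iterdir()):
        return False  # assumed to be already prepared; skip the expensive step
    prepare(path)
    return True


# Example wiring, using the call shown above:
# ensure_prepared(
#     "./ukam_prepared_canonical",
#     lambda p: prepare_canonical_folder(
#         df_canonical, output_folder=str(p), con=con, overwrite=True
#     ),
# )
```

If the canonical source data changes, delete the folder (or call prepare_canonical_folder directly with overwrite=True) so that stale artefacts are not reused.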

Matching one or more AddressRecord entries

If you only need to match a small number of addresses, or already hold them in memory, you can pass them directly as addresses_to_match without first creating a DuckDB relation.

The matcher accepts a list of AddressRecord entries, a list of dicts with the keys address_concat, postcode, and unique_id, or a DuckDB relation.

import duckdb

from uk_address_matcher import AddressMatcher, AddressRecord, ukam_datasets

con = duckdb.connect()

df_canonical = ukam_datasets.as_relation("fictional_london_canonical", con=con)

records = [
    AddressRecord(
        unique_id="m_1",
        address_concat="96 Marlowhill Street, Kingsford, London",
        postcode="NW24 2CW",
    ),
    AddressRecord(
        unique_id="m_2",
        address_concat="46 Vespergate Road, Maple Green",
        postcode="NW26 6MU",
    ),
]

matcher = AddressMatcher(
    canonical_addresses=df_canonical,
    addresses_to_match=records,
    con=con,
)

result = matcher.match()
result.matches().show(max_width=500)

Methodology

The Splink phase uses a two-pass approach to achieve high-accuracy matching:

  1. First Pass: A standard probabilistic linkage model using Splink generates candidate matches for each input address.

  2. Second Pass: Within each candidate group, the model analyzes distinguishing tokens to refine matches:

    • Identifies tokens that uniquely distinguish addresses within a candidate group
    • Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
    • Uses this contextual information to improve match scores

This approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
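The second-pass ideas can be illustrated with a toy example in plain Python. This is a sketch of the concepts only, not the package's implementation: tokenize and analyse_candidates are hypothetical helpers, tokens are treated as a simple set (so repeated tokens and word order are ignored), and how the real model turns these signals into score adjustments is not shown.

```python
import re


def tokenize(address):
    """Lower-case the address and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", address.lower()))


def analyse_candidates(messy, candidates):
    """For each candidate, list distinguishing hits and punishment tokens."""
    messy_tokens = tokenize(messy)
    cand_tokens = {cid: tokenize(addr) for cid, addr in candidates.items()}
    report = {}
    for cid, tokens in cand_tokens.items():
        # Tokens appearing in any other candidate of the group.
        others = set().union(*(t for oid, t in cand_tokens.items() if oid != cid))
        report[cid] = {
            # Tokens unique to this candidate that the messy address contains.
            "distinguishing_hits": sorted((tokens - others) & messy_tokens),
            # Messy tokens missing from this candidate but present in others.
            "punishment_tokens": sorted((messy_tokens - tokens) & others),
        }
    return report


candidates = {
    "c1": "Flat 1, 10 Fake Street",
    "c2": "Flat 2, 10 Fake Street",
    "c3": "10 Fake Street",
}
report = analyse_candidates("Flat 2, 10 Fake Street", candidates)
for cid, info in report.items():
    print(cid, info)
```

In this example only c2 has a distinguishing hit ("2"), while c1 and c3 both collect punishment tokens, which is the kind of contextual evidence that separates otherwise near-identical candidates.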

Development

The scripts and tests will run better if you create .vscode/settings.json with the following:

{
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "python.analysis.extraPaths": [
        "${workspaceFolder}"
    ],
    "python.testing.pytestEnabled": true,
    "python.testing.unittestEnabled": false,
    "python.testing.pytestArgs": [
        "-v",
        "--capture=tee-sys"
    ]
}
