
A package for matching UK addresses using a pretrained Splink model


Matching UK addresses using Splink

Installation

At present this package relies on a branch of Splink that is only available on GitHub.

pip install --pre uk_address_matcher
pip install git+https://github.com/moj-analytical-services/splink.git@2580-improve-runtimes-but-pushing-up-common-case-statements-into-precomputed-values

Usage

High-performance address matching using a pre-trained Splink model.

Assuming you have two DuckDB relations in this format:

unique_id | address_concat             | postcode
1         | 123 Fake Street, Faketown  | FA1 2KE
2         | 456 Other Road, Otherville | NO1 3WY
...       | ...                        | ...
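The format above keeps the postcode in its own column, separate from the rest of the address. If your source data stores each address as a single string, a sketch like the following could split it into the two fields. The helper name `split_address` and the simplified postcode pattern are assumptions made for illustration; they are not part of uk_address_matcher, and the pattern does not cover every valid UK postcode.

```python
import re
from typing import Optional, Tuple

# Simplified UK postcode pattern anchored at the end of the string.
# Assumption for illustration only: special cases (e.g. GIR 0AA) are not handled.
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\s*$", re.IGNORECASE
)


def split_address(full_address: str) -> Tuple[str, Optional[str]]:
    """Split a raw address into (address_concat, postcode).

    Hypothetical helper, not part of uk_address_matcher.
    Returns (address, None) when no trailing postcode is found.
    """
    text = full_address.strip()
    match = POSTCODE_RE.search(text)
    if match is None:
        return text, None
    # Normalise to upper case with a single space between the two halves.
    postcode = f"{match.group(1).upper()} {match.group(2).upper()}"
    # Drop the postcode and any trailing separator from the address part.
    address_concat = text[: match.start()].rstrip(" ,")
    return address_concat, postcode


raw_rows = [
    (1, "123 Fake Street, Faketown, FA1 2KE"),
    (2, "456 Other Road, Otherville NO1 3WY"),
]
# Build (unique_id, address_concat, postcode) tuples ready to load into DuckDB.
records = [(uid, *split_address(addr)) for uid, addr in raw_rows]
```

The resulting tuples have the same three fields as the table above and can be loaded into DuckDB by whatever route you already use.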

Match the two datasets with:

import duckdb

from uk_address_matcher import clean_data_using_precomputed_rel_tok_freq, get_linker

p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"

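# Use an in-memory DuckDB database for this session
con = duckdb.connect(database=":memory:")

# Load both address datasets as DuckDB relations
df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")

# Clean each dataset using the precomputed relative token frequencies
df_ch_clean = clean_data_using_precomputed_rel_tok_freq(df_ch, con=con)
df_fhrs_clean = clean_data_using_precomputed_rel_tok_freq(df_fhrs, con=con)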

linker = get_linker(
    df_addresses_to_match=df_fhrs_clean,
    df_addresses_to_search_within=df_ch_clean,
    con=con,
    include_full_postcode_block=True,
    additional_columns_to_retain=["original_address_concat"],
)

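# Score candidate pairs; the very low threshold keeps low-scoring pairs too
df_predict = linker.inference.predict(
    threshold_match_weight=-50, experimental_optimisation=True
)
# Convert the Splink result to a DuckDB relation for further querying
df_predict_ddb = df_predict.as_duckdbpyrelation()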
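With a threshold as low as -50, the prediction output is likely to contain several candidate pairs for each address, so a common next step is to keep only the highest-scoring candidate. The sketch below shows that step in plain Python on made-up rows. It assumes the pairwise output uses Splink's usual `unique_id_l`, `unique_id_r` and `match_weight` columns, and that the addresses being matched sit on the `_r` side; both are assumptions to check against your own output. `best_match_per_record` is a hypothetical helper, not part of the package.

```python
from typing import Dict, Iterable, List


def best_match_per_record(
    rows: Iterable[dict], id_key: str = "unique_id_r"
) -> List[dict]:
    """Keep the candidate with the highest match_weight for each id_key value.

    Hypothetical helper for illustration; not part of uk_address_matcher.
    """
    best: Dict[object, dict] = {}
    for row in rows:
        key = row[id_key]
        # Replace the stored candidate only when this one scores strictly higher.
        if key not in best or row["match_weight"] > best[key]["match_weight"]:
            best[key] = row
    return list(best.values())


# Made-up rows shaped like pairwise predictions (assumed column names).
candidates = [
    {"unique_id_l": "ch_1", "unique_id_r": "fhrs_1", "match_weight": 12.5},
    {"unique_id_l": "ch_2", "unique_id_r": "fhrs_1", "match_weight": -3.0},
    {"unique_id_l": "ch_3", "unique_id_r": "fhrs_2", "match_weight": 4.2},
]
top = best_match_per_record(candidates)
```

One way to obtain rows in this shape, assuming pandas is installed, is `df_predict_ddb.df().to_dict("records")`. For large outputs the same selection is better expressed as a SQL window query in DuckDB.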

Initial tests suggest a throughput of roughly 1,000 addresses per second when matching against a list of 30 million addresses on a laptop.
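As a rough planning aid, the quoted throughput can be turned into a runtime estimate. This is a back-of-the-envelope sketch only: it assumes the rate stays constant regardless of batch size or hardware, which the tests above do not establish, and `estimated_minutes` is a hypothetical helper.

```python
def estimated_minutes(
    n_addresses: int, addresses_per_second: float = 1_000.0
) -> float:
    """Estimate matching time in minutes at a constant assumed throughput."""
    return n_addresses / addresses_per_second / 60.0


# One million addresses at roughly 1,000 per second is 1,000 seconds.
estimate = estimated_minutes(1_000_000)  # about 16.7 minutes
```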

Refer to the example, which has detailed comments, for how to match your data.

See an example of comparing two addresses to get a sense of what the model does and how it scores a pair.

Run an interactive example in your browser:

Open in Colab: Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.

Open in Colab: Investigate and understand how the model works.


