Implementation in Apache Spark of the EM algorithm to estimate parameters of Fellegi-Sunter's canonical model of record linkage.
splink: Probabilistic record linkage and deduplication at scale
splink implements Fellegi-Sunter's canonical model of record linkage in Apache Spark, including the EM algorithm used to estimate the parameters of the model.
The aims of splink are to:
- Work at much greater scale than current open source implementations (100 million records +).
- Get results faster than current open source implementations, with runtimes of less than an hour.
- Have a highly transparent methodology, so the match scores can be easily explained both graphically and in words.
- Have accuracy similar to some of the best alternatives.
Installation
splink is a Python package. It uses the Spark Python API to execute data linking jobs in a Spark cluster. It has been tested in Apache Spark 2.3 and 2.4.
Install splink using:
    pip install splink
Interactive demo
You can run demos of splink in an interactive Jupyter notebook; the demonstration notebooks are in the splink_demos repo.
Documentation
The best documentation is currently a series of demonstration notebooks in the splink_demos repo.
We also provide an interactive splink settings editor and example settings here. A tool to generate custom m and u probabilities can be found here.
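To make the m and u terminology concrete: in the Fellegi-Sunter model, m is the probability that a comparison agrees when the two records are a true match, and u is the probability that it agrees when they are not. The sketch below shows how these combine with a prior into a match probability. It is not splink's API: the function name, column names and probability values are all hypothetical, and it assumes simple agree/disagree comparisons that are conditionally independent given match status.

```python
import math


def match_probability(agreements, m, u, prior):
    """Posterior probability that a record pair is a match.

    agreements: dict of column -> bool (True if the two records agree)
    m: dict of column -> P(agree | match)
    u: dict of column -> P(agree | non-match)
    prior: P(match) for a randomly chosen record pair
    """
    # Start from the prior odds, then add one log Bayes factor per column.
    log_odds = math.log2(prior / (1 - prior))
    for col, agrees in agreements.items():
        if agrees:
            bayes_factor = m[col] / u[col]
        else:
            bayes_factor = (1 - m[col]) / (1 - u[col])
        log_odds += math.log2(bayes_factor)
    odds = 2 ** log_odds
    return odds / (1 + odds)


# Hypothetical m and u values, for illustration only.
m = {"first_name": 0.9, "surname": 0.9, "dob": 0.95}
u = {"first_name": 0.01, "surname": 0.005, "dob": 0.001}
prior = 0.001

p_all_agree = match_probability(
    {"first_name": True, "surname": True, "dob": True}, m, u, prior
)
p_dob_only = match_probability(
    {"first_name": False, "surname": False, "dob": True}, m, u, prior
)
```

With these made-up values, a pair agreeing on every column scores very close to 1, while a pair agreeing only on date of birth stays well below 0.5, because the two name disagreements each reduce the odds by roughly a factor of ten.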
The statistical model behind splink is the same as that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. This is the best place to start for users who want to understand the theory behind splink.
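As a rough illustration of the estimation step, here is a minimal pure-Python sketch of EM for a two-class model with binary agree/disagree comparisons. It is not splink's implementation, which runs in Spark; the function names, starting values and toy parameters are assumptions made for this example. To keep it small, it works on counts of agreement patterns rather than individual record pairs, and the counts are generated from known made-up parameters so that the estimates can be checked against them.

```python
from itertools import product


def pattern_likelihood(pattern, agree_probs):
    """P(pattern | class), assuming columns are conditionally independent."""
    p = 1.0
    for agrees, q in zip(pattern, agree_probs):
        p *= q if agrees else 1.0 - q
    return p


def estimate_em(pattern_counts, n_cols, n_iter=200):
    """Estimate (lam, m, u) from counts of binary agreement patterns."""
    # Arbitrary starting values; starting with m > u fixes which class is 'match'.
    lam, m, u = 0.1, [0.8] * n_cols, [0.2] * n_cols
    total = sum(pattern_counts.values())
    for _ in range(n_iter):
        # E-step: posterior match probability for each agreement pattern.
        resp = {}
        for pat in pattern_counts:
            pm = lam * pattern_likelihood(pat, m)
            pu = (1.0 - lam) * pattern_likelihood(pat, u)
            resp[pat] = pm / (pm + pu)
        # M-step: parameters become responsibility-weighted proportions.
        match_weight = sum(c * resp[p] for p, c in pattern_counts.items())
        non_match_weight = total - match_weight
        lam = match_weight / total
        m = [
            sum(c * resp[p] for p, c in pattern_counts.items() if p[k]) / match_weight
            for k in range(n_cols)
        ]
        u = [
            sum(c * (1.0 - resp[p]) for p, c in pattern_counts.items() if p[k])
            / non_match_weight
            for k in range(n_cols)
        ]
    return lam, m, u


# Toy data: expected pattern counts under made-up 'true' parameters.
true_lam, true_m, true_u = 0.2, [0.9, 0.85, 0.95], [0.1, 0.05, 0.02]
n_pairs = 1_000_000
pattern_counts = {
    pat: round(
        n_pairs
        * (
            true_lam * pattern_likelihood(pat, true_m)
            + (1 - true_lam) * pattern_likelihood(pat, true_u)
        )
    )
    for pat in product([True, False], repeat=3)
}

est_lam, est_m, est_u = estimate_em(pattern_counts, n_cols=3)
```

No labelled matches are needed: the algorithm alternates between scoring each pattern with the current parameters and re-estimating the parameters from those scores.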
You can read a short blog post about splink here.
Videos
You can find a short video introducing splink and running through an introductory demo here.
A 'best practices and performance tuning' tutorial can be found here.
Acknowledgements
We are very grateful to ADR UK (Administrative Data Research UK) for providing funding for this work as part of the Data First project.
We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work.