Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
splink: Probabilistic record linkage and deduplication at scale
splink implements Fellegi-Sunter's canonical model of record linkage in Apache Spark, including the EM algorithm to estimate parameters of the model.
It:
- Works at much greater scale than current open source implementations (100 million records+).
- Runs quickly, with runtimes of less than an hour.
- Has a highly transparent methodology; match scores can be easily explained both graphically and in words.
- Is highly accurate.
It is assumed that users of Splink are familiar with probabilistic record linkage theory, and the Fellegi-Sunter model in particular. A series of interactive articles explores the theory behind Splink.
The statistical model behind splink is the same as that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. This is the best place to start for users wanting to understand the theory behind how splink works.
Data Matching, a book by Peter Christen, is another excellent resource.
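As a brief illustration of the theory referenced above, the sketch below shows how the Fellegi-Sunter model turns m and u probabilities into a match weight and a match probability. It is a minimal pure-Python example, not Splink's API: the function names, the prior and the example m and u values are all assumptions chosen for illustration, and log base 2 is used as a convention.

```python
from math import log2


def bayes_factor(m, u, agrees=True):
    """Bayes factor for one comparison column.

    m = P(column agrees | records match)
    u = P(column agrees | records do not match)
    """
    return m / u if agrees else (1 - m) / (1 - u)


def match_weight(m, u, agrees=True):
    """Partial match weight: the log2 of the Bayes factor."""
    return log2(bayes_factor(m, u, agrees))


def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-column Bayes factors.

    Assumes the columns are conditionally independent given match status.
    """
    odds = prior / (1 - prior)
    for bf in bayes_factors:
        odds *= bf
    return odds / (1 + odds)


# Hypothetical example: first name and date of birth agree, postcode disagrees.
factors = [
    bayes_factor(0.90, 0.01, agrees=True),    # first name
    bayes_factor(0.95, 0.001, agrees=True),   # date of birth
    bayes_factor(0.80, 0.05, agrees=False),   # postcode
]
probability = match_probability(prior=0.001, bayes_factors=factors)
print(f"match probability: {probability:.4f}")
```

Each column contributes its own weight, which is why a match score can be explained column by column.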
Installation
splink is a Python package. It uses the Spark Python API to execute data linking jobs in a Spark cluster. It has been tested in Apache Spark 2.3, 2.4 and 3.1.
Install splink using:
pip install splink
Note that Splink requires pyspark and a working Spark installation. These are not specified as explicit dependencies because it is assumed users have an existing pyspark setup they wish to use.
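Because pyspark is not installed automatically, it can be worth confirming that it is importable before running a linking job. The snippet below is a small sketch using only the Python standard library; the helper name pyspark_available is an assumption and is not part of Splink.

```python
import importlib.util


def pyspark_available():
    """Return True if the pyspark package can be found on the import path."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_available():
    print("pyspark found")
else:
    print("pyspark not found: install it or point Python at your Spark setup")
```

This only checks that the package can be located; it does not verify that a Spark cluster is reachable.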
Interactive demo
You can run demos of splink in an interactive Jupyter notebook by clicking the button below:
Documentation
The best documentation is currently a series of demonstration notebooks in the splink_demos repo.
Other tools in the Splink family
Splink Graph
splink_graph is a graph utility library for use in Apache Spark. It computes graph metrics on the outputs of data linking. The repo is here. These metrics are useful for:
- Quality assurance of linkage results and identifying false positive links
- Computing quality metrics associated with groups (clusters) of linked records
- Automatically identifying possible false positive links in clusters
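To illustrate the kind of cluster metric listed above, the sketch below groups linked records into clusters and computes each cluster's density (the share of possible pairwise links that are actually present). It is plain Python and does not use splink_graph's API; the function names and the toy links are assumptions. A sparse, chain-like cluster is one possible sign of a false positive link.

```python
from collections import defaultdict


def clusters_from_links(links):
    """Group record ids into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


def cluster_density(cluster, links):
    """Fraction of possible record pairs in the cluster that are linked."""
    edges = {frozenset(l) for l in links if l[0] in cluster and l[1] in cluster}
    n = len(cluster)
    return len(edges) / (n * (n - 1) / 2)


# Hypothetical linkage output: pairs of record ids scored as matches.
links = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
clusters = sorted(clusters_from_links(links), key=len, reverse=True)
densities = [cluster_density(c, links) for c in clusters]
for cluster, density in zip(clusters, densities):
    print(sorted(cluster), round(density, 2))
```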
Splink Comparison Viewer
splink_comparison_viewer produces interactive dashboards that help you rapidly understand and quality assure the outputs of record linkage. A tutorial video is available here.
Splink Cluster Studio
splink_cluster_studio creates an interactive HTML dashboard from Splink output that allows you to visualise and analyse a sample of clusters from your record linkage. The repo is here.
Splink Synthetic Data
This code generates realistic test datasets for linkage using the WikiData Query Service.
It has been used to test the accuracy of various Splink models.
Interactive settings editor with autocomplete
We also provide an interactive splink settings editor and example settings here.
Starting parameter generation tools
A tool to generate custom m and u probabilities can be found here.
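Starting values for m and u matter because the EM algorithm refines them iteratively. The following is a minimal pure-Python sketch of EM for a Fellegi-Sunter model with binary agree/disagree comparisons and conditional independence between columns. It is illustrative only and is not Splink's Spark implementation; the function name em_fellegi_sunter, the toy comparison data and the starting values are all assumptions.

```python
from math import prod

EPS = 1e-6  # keeps probabilities away from exactly 0 or 1


def _clamp(p):
    return min(max(p, EPS), 1 - EPS)


def _likelihood(gamma, probs):
    """P(comparison vector | class), assuming independent columns."""
    return prod(p if agrees else 1 - p for agrees, p in zip(gamma, probs))


def em_fellegi_sunter(gammas, m, u, lam, iterations=20):
    """Estimate m, u and the match proportion lam from 0/1 comparison vectors."""
    n = len(gammas)
    for _ in range(iterations):
        # E-step: posterior probability that each record pair is a match.
        weights = []
        for gamma in gammas:
            p_match = lam * _likelihood(gamma, m)
            p_non = (1 - lam) * _likelihood(gamma, u)
            weights.append(p_match / (p_match + p_non))

        # M-step: re-estimate parameters as weighted agreement rates.
        total = sum(weights)
        lam = _clamp(total / n)
        m = [_clamp(sum(w * g[k] for w, g in zip(weights, gammas)) / total)
             for k in range(len(m))]
        u = [_clamp(sum((1 - w) * g[k] for w, g in zip(weights, gammas)) / (n - total))
             for k in range(len(u))]
    return m, u, lam


# Toy data: each tuple records agreement (1) or disagreement (0) on two columns.
pairs = [(1, 1)] * 20 + [(1, 0)] * 3 + [(0, 1)] * 3 + [(0, 0)] * 74
m_est, u_est, lam_est = em_fellegi_sunter(
    pairs, m=[0.9, 0.9], u=[0.1, 0.1], lam=0.2
)
print("m:", m_est, "u:", u_est, "lambda:", lam_est)
```

Different starting values can lead EM to different solutions, which is one reason sensible initial m and u probabilities are useful.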
Blog
You can read a short blog post about splink here.
Videos
You can find a short video introducing splink and running through an introductory demo here.
A 'best practices and performance tuning' tutorial can be found here.
How to make changes to Splink
(Steps 5 onwards for repo admins only)
1. Raise new issue or target existing issue
2. Create new branch (usually off master). Or fork for external contributors.
3. Make changes, commit and push to GitHub
4. Make pull request, referencing the issue
5. Wait for tests to pass
6. Review pull request
7. Bump Splink version in pyproject.toml and update CHANGELOG.md as part of pull request
8. Merge
9. Create tagged release on GitHub. This will trigger autopublish to PyPI
Acknowledgements
We are very grateful to ADR UK (Administrative Data Research UK) for providing funding for this work as part of the Data First project.
We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work.