[Beta]: Implementation in Apache Spark of the EM algorithm to estimate parameters of Fellegi-Sunter's canonical model of record linkage.
splink: Probabilistic record linkage at scale
WARNING: Splink is currently in beta testing. Please feel free to try it, but note that this software is not fully tested, and the interface is likely to continue to change.
splink implements Fellegi-Sunter's canonical model of record linkage in Apache Spark, including the EM algorithm to estimate the parameters of the model.
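To make the EM step concrete, here is a minimal pure-Python sketch of expectation-maximisation for a two-class Fellegi-Sunter model with binary (agree/disagree) comparisons. It is an illustration only and not splink's Spark implementation: the function names `clamp` and `em_fellegi_sunter`, the starting values, and the simulated data are all assumptions made for this example.

```python
import random


def clamp(p, eps=1e-6):
    """Keep a probability away from 0 and 1 so no likelihood collapses to zero."""
    return min(max(p, eps), 1.0 - eps)


def em_fellegi_sunter(gammas, n_iter=50, lam=0.5, m_start=0.9, u_start=0.1):
    """Estimate lam = P(match), m[j] = P(agree on j | match), u[j] = P(agree on j | non-match).

    gammas is a list of tuples of 0/1 agreement indicators, one tuple per record pair.
    Fields are assumed conditionally independent given match status.
    """
    n, k = len(gammas), len(gammas[0])
    m = [m_start] * k
    u = [u_start] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in gammas:
            p_match, p_non = lam, 1.0 - lam
            for j in range(k):
                p_match *= m[j] if g[j] else 1.0 - m[j]
                p_non *= u[j] if g[j] else 1.0 - u[j]
            post.append(p_match / (p_match + p_non))
        # M-step: re-estimate parameters as posterior-weighted proportions.
        total_match = max(sum(post), 1e-12)
        total_non = max(n - sum(post), 1e-12)
        lam = clamp(total_match / n)
        for j in range(k):
            agree_match = sum(w for w, g in zip(post, gammas) if g[j])
            agree_non = sum(1.0 - w for w, g in zip(post, gammas) if g[j])
            m[j] = clamp(agree_match / total_match)
            u[j] = clamp(agree_non / total_non)
    return lam, m, u


# Simulated comparison vectors with known (made-up) parameters.
random.seed(0)
TRUE_LAM = 0.2
TRUE_M = [0.9, 0.9, 0.85, 0.95]
TRUE_U = [0.1, 0.15, 0.2, 0.05]
pairs = []
for _ in range(5000):
    probs = TRUE_M if random.random() < TRUE_LAM else TRUE_U
    pairs.append(tuple(int(random.random() < p) for p in probs))

lam_hat, m_hat, u_hat = em_fellegi_sunter(pairs)
print(lam_hat, m_hat, u_hat)
```

On simulated data like this, the estimates should land close to the parameters used to generate it. In practice the comparison vectors come from real candidate record pairs, and the computation is distributed rather than looped in Python.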
The aims of splink are to:
Work at much greater scale than current open source implementations (100 million records +).
Get results faster than current open source implementations - with runtimes of less than an hour.
Have a highly transparent methodology, so the match scores can be easily explained both graphically and in words.
Have accuracy similar to some of the best alternatives.
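One way to see why Fellegi-Sunter match scores are easy to explain is that each comparison column contributes an additive "match weight" (a log2 Bayes factor), and the weights sum to the overall score. The sketch below assumes binary agreement and made-up parameter values; the column names and the helpers `match_weights` and `match_probability` are hypothetical and are not part of splink's interface.

```python
import math


def match_weights(gamma, m, u):
    """Per-column log2 Bayes factors for one record pair.

    gamma[j] is 1 if the pair agrees on column j, else 0.
    m[j] = P(agree | match), u[j] = P(agree | non-match).
    """
    weights = []
    for gj, mj, uj in zip(gamma, m, u):
        if gj:
            weights.append(math.log2(mj / uj))
        else:
            weights.append(math.log2((1.0 - mj) / (1.0 - uj)))
    return weights


def match_probability(gamma, m, u, lam):
    """Combine the prior odds and per-column weights into P(match | gamma)."""
    prior_weight = math.log2(lam / (1.0 - lam))
    total_weight = prior_weight + sum(match_weights(gamma, m, u))
    odds = 2.0 ** total_weight
    return odds / (1.0 + odds)


# Hypothetical columns and parameters, for illustration only.
columns = ["first_name", "surname", "dob"]
m = [0.9, 0.9, 0.95]
u = [0.1, 0.05, 0.01]
lam = 0.01  # prior probability that a random candidate pair is a match

gamma = (1, 1, 1)  # the pair agrees on every column
for name, w in zip(columns, match_weights(gamma, m, u)):
    print(f"{name}: {w:+.2f}")
p_all_agree = match_probability(gamma, m, u, lam)
p_all_disagree = match_probability((0, 0, 0), m, u, lam)
print(p_all_agree, p_all_disagree)
```

Because the weights are additive, a score can be described in words ("agreement on date of birth added the most evidence") or drawn as a bar per column.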
splink is a Python package. It uses the Spark Python API to execute data linking jobs in a Spark cluster. It has been tested in Apache Spark 2.3 and 2.4.
Install splink using
pip install splink
You can run demos of splink in an interactive Jupyter notebook by clicking the button below:
The best documentation is currently a series of demonstration notebooks in the splink_demos repo.
We also provide an interactive splink settings editor and example settings here.
The statistical model behind splink is the same as that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. This is the best place to start for users wanting to understand the theory about how splink works.
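As a compact reference, the model can be written as a two-component mixture over record pairs. The notation below is an assumption chosen for this summary (it is not copied from the paper or from splink's source): pair $i$ has latent match status $M_i$, comparison column $k$ takes agreement level $\gamma_{ik} \in \{0, \dots, L_k - 1\}$, and columns are taken to be conditionally independent given $M_i$.

```latex
\begin{align*}
\Pr(M_i = 1) &= \lambda, \\
\Pr(\gamma_{ik} = \ell \mid M_i = m) &= \pi_{km\ell}, \\
\Pr(\gamma_i) &= \sum_{m \in \{0,1\}} \lambda^{m} (1-\lambda)^{1-m}
    \prod_{k=1}^{K} \pi_{k m \gamma_{ik}}, \\
\Pr(M_i = 1 \mid \gamma_i) &=
    \frac{\lambda \prod_{k=1}^{K} \pi_{k 1 \gamma_{ik}}}
         {\sum_{m \in \{0,1\}} \lambda^{m} (1-\lambda)^{1-m}
          \prod_{k=1}^{K} \pi_{k m \gamma_{ik}}}.
\end{align*}
```

The EM algorithm alternates between computing the last line for every pair with the current parameters (E-step) and re-estimating $\lambda$ and $\pi_{km\ell}$ as posterior-weighted proportions (M-step).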
You can find a short video introducing splink and running through an introductory demo here.
|Filename|Size|File type|Python version|
|splink-0.1.6-py3-none-any.whl|34.1 kB|Wheel|py3|
|splink-0.1.6.tar.gz|28.9 kB|Source|None|