One of the aims of this project is to make an easily extensible record
linkage framework. It is easy to include your own indexing algorithms,
comparison/similarity measures and classifiers.
Basic linking example
Import the recordlinkage module, which provides the main tools for record
linkage, and the data manipulation library pandas.
import recordlinkage
import pandas
Load your data into pandas DataFrames.
df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
Comparing all records can be computationally intensive. Therefore, we make
a set of candidate links with one of the built-in indexing techniques, such as
blocking. In this example, only pairs of records that agree on the surname
are included.
block_class = recordlinkage.BlockIndex('surname')
candidate_links = block_class.index(df_a, df_b)
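To illustrate what blocking does conceptually, here is a minimal pure-Python sketch. It is not the toolkit's implementation: it assumes each dataset is a dict mapping a record id to a dict of fields, and the function name `block_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict


def block_pairs(records_a, records_b, key):
    """Return sorted (id_a, id_b) pairs whose records share the same value for `key`."""
    # Group the ids of the second dataset by their blocking key value.
    blocks = defaultdict(list)
    for id_b, rec in records_b.items():
        value = rec.get(key)
        if value is not None:  # records without a key value cannot be blocked
            blocks[value].append(id_b)

    # Pair every record in the first dataset with the records in its block.
    pairs = []
    for id_a, rec in records_a.items():
        for id_b in blocks.get(rec.get(key), []):
            pairs.append((id_a, id_b))
    return sorted(pairs)


# Hypothetical sample data
records_a = {
    "a1": {"surname": "smith", "name": "john"},
    "a2": {"surname": "jones", "name": "mary"},
}
records_b = {
    "b1": {"surname": "smith", "name": "jon"},
    "b2": {"surname": "brown", "name": "mary"},
    "b3": {"surname": "smith", "name": "anna"},
}

candidate_pairs = block_pairs(records_a, records_b, "surname")
print(candidate_pairs)
```

A full comparison of these two datasets would produce 2 x 3 = 6 pairs; blocking on the surname keeps only the 2 pairs that share a surname.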
Older versions of Python Record Linkage Toolkit use a different syntax for
indexing. More info about migrating can be found here.
For each candidate link, compare the records with one of the
comparison or similarity algorithms in the Compare class.
c = recordlinkage.Compare(candidate_links, df_a, df_b)
c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)
# The comparison vectors
c.vectors
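The sketch below illustrates, in plain Python, what a thresholded string comparison and a Gaussian-style numeric comparison might compute for a single record pair. It is an illustration under stated assumptions, not the toolkit's code: the string measure is the optimal string alignment variant of Damerau-Levenshtein, the threshold is applied with `>=`, and the numeric decay is assumed to be 1 within `offset` and to fall to 0.5 at a distance of `offset + scale`. All function names are hypothetical.

```python
def osa_distance(s, t):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def string_similarity(s, t):
    """Normalise the edit distance to a similarity between 0 and 1."""
    if not s and not t:
        return 1.0
    return 1.0 - osa_distance(s, t) / max(len(s), len(t))


def thresholded(similarity, threshold):
    """Map a similarity to 1.0 (agree) or 0.0 (disagree)."""
    return 1.0 if similarity >= threshold else 0.0


def gauss_similarity(x, y, offset, scale, missing_value=0.0):
    """Assumed Gaussian-shaped decay: 1 within `offset`, 0.5 at `offset + scale`."""
    if x is None or y is None:
        return missing_value
    distance = abs(x - y)
    if distance <= offset:
        return 1.0
    exponent = ((distance - offset) / scale) ** 2
    return 2.0 ** (-exponent)


# Hypothetical values for one candidate pair
street = thresholded(string_similarity("main street", "mian street"), 0.7)
income = gauss_similarity(30000, 30006, offset=3, scale=3)
income_missing = gauss_similarity(30000, None, offset=3, scale=3, missing_value=0.5)

comparison_vector = [street, income, income_missing]
print(comparison_vector)
```

Each candidate pair ends up with one such vector of similarity values, which is the input to the classification step.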
The Python Record Linkage Toolkit contains multiple classification algorithms.
Many of them need training data (supervised learning), while
others are unsupervised. An example of supervised learning:
logrg = recordlinkage.LogisticRegressionClassifier()
and an example of unsupervised learning (the well-known ECM algorithm):
ecm = recordlinkage.ECMClassifier()
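To make the supervised case more concrete, the following is a small, self-contained sketch of logistic regression trained on binary comparison vectors with plain gradient descent. It does not use or reproduce the toolkit's LogisticRegressionClassifier; the function names, the hyperparameters and the tiny training set are all hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(vector, weights, bias):
    """Estimated probability that a comparison vector belongs to a match."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, vector)))


def fit_logistic(vectors, labels, learning_rate=0.5, iterations=3000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n = len(vectors)
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            error = match_probability(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical labelled comparison vectors: 1 = fields agree, 0 = fields disagree
train_vectors = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = match, 0 = non-match

weights, bias = fit_logistic(train_vectors, train_labels)
predictions = [
    1 if match_probability(v, weights, bias) > 0.5 else 0 for v in train_vectors
]
print(predictions)
```

In this toy data the first comparison decides the label, so the fitted model gives it the largest weight. An unsupervised method such as ECM estimates comparable parameters without the `train_labels` input.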
The main features of the Python Record Linkage Toolkit are:
- Clean and standardise data with easy-to-use tools
- Make pairs of records with smart indexing methods such as
blocking and sorted neighbourhood indexing
- Compare records with a large number of comparison and similarity
measures for different types of variables such as strings, numbers and dates.
- Several classification algorithms, both supervised and unsupervised
- Common record linkage evaluation tools
- Several built-in datasets.
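As an illustration of the evaluation step in the list above, here is a minimal sketch that computes precision, recall and F-score from a set of predicted links and a set of known true links. It is written in plain Python and does not use the toolkit's evaluation functions; the name `evaluate_links` and the sample pairs are hypothetical.

```python
def evaluate_links(predicted_links, true_links):
    """Compare predicted record pairs against the known true pairs."""
    predicted, true = set(predicted_links), set(true_links)
    true_positives = len(predicted & true)
    false_positives = len(predicted - true)
    false_negatives = len(true - predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall > 0:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


# Hypothetical (id_a, id_b) pairs
true_links = {("a1", "b1"), ("a2", "b4"), ("a3", "b3")}
predicted_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b5")}

scores = evaluate_links(predicted_links, true_links)
print(scores)
```

Here two of the four predicted links are correct (precision 0.5) and two of the three true links are found (recall 2/3).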
Dependencies, installation and license
Install the Python Record Linkage Toolkit with pip:
pip install recordlinkage
The toolkit depends on Pandas (>=0.18.0), Numpy, Scikit-learn, Scipy and
Jellyfish. You probably have most of them installed already. The package
jellyfish is used for approximate string comparison and string encoding.
The package Numexpr is an optional dependency to speed up numerical
comparisons.
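If you want to check which of these packages are available in your environment, a small sketch using only the standard library could look like this. The helper `check_modules` is hypothetical; the module names are the usual import names of the packages (for example, Scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec


def check_modules(module_names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in module_names}


REQUIRED = ["pandas", "numpy", "sklearn", "scipy", "jellyfish"]
OPTIONAL = ["numexpr"]

status = check_modules(REQUIRED + OPTIONAL)
missing_required = [name for name in REQUIRED if not status[name]]

for name, available in status.items():
    print(f"{name}: {'found' if available else 'not found'}")
```

This only checks that a module can be located; it does not verify installed versions.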
The license for this record linkage tool is GPLv3.