One of the aims of this project is to make an extensible record linkage
framework. It is easy to include your own indexing algorithms,
comparison/similarity measures and classifiers.
Basic linking example
Import the recordlinkage module with all important tools for record
linkage and import the data manipulation framework pandas.
import recordlinkage
import pandas
For example, suppose you want to link two datasets with personal
information such as name, sex and date of birth. Load each dataset into
a pandas DataFrame.
df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)
Comparing all record pairs can be computationally intensive. Therefore, we
make a smart set of candidate links with one of the built-in indexing
techniques, such as blocking. In this example, only record pairs that
agree on the surname are included.
index = recordlinkage.Pairs(df_a, df_b)
candidate_links = index.block('surname')
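To make the idea of blocking concrete, here is a minimal, library-independent sketch in plain Python. It is not how recordlinkage implements blocking; the helper `block_pairs` and the sample records are made up for illustration only.

```python
from collections import defaultdict
from itertools import product


def block_pairs(records_a, records_b, key):
    """Return index pairs (i, j) whose records agree on `key`.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    # Group the positions of the second dataset by blocking key value.
    buckets = defaultdict(list)
    for j, rec in enumerate(records_b):
        if rec.get(key) is not None:
            buckets[rec[key]].append(j)

    # Pair each record of the first dataset only with records in its bucket.
    pairs = []
    for i, rec in enumerate(records_a):
        for j in buckets.get(rec.get(key), []):
            pairs.append((i, j))
    return pairs


# Made-up sample data.
records_a = [
    {"name": "Anna", "surname": "Smith"},
    {"name": "Bob", "surname": "Jones"},
    {"name": "Carl", "surname": "Smith"},
]
records_b = [
    {"name": "Ana", "surname": "Smith"},
    {"name": "Robert", "surname": "Jones"},
    {"name": "Dora", "surname": "Brown"},
]

candidate_pairs = block_pairs(records_a, records_b, "surname")
full_pairs = list(product(range(len(records_a)), range(len(records_b))))
print(len(candidate_pairs), "candidate pairs instead of", len(full_pairs))
```

The saving grows quickly with dataset size: a full comparison needs `len(a) * len(b)` pairs, while blocking only compares records inside the same bucket. The price is that true matches with a misspelled surname are never compared.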
For each candidate link, compare the pair of records with the Compare
class and the available comparison/similarity functions.
compare = recordlinkage.Compare(candidate_links, df_a, df_b)
compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.string('streetname', 'streetname', method='damerau_levenshtein', threshold=0.7)
compare.exact('haircolor', 'haircolor', missing_value=9)
# The comparison vectors
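As a rough illustration of what a thresholded string comparison does, the sketch below computes a Jaro-Winkler similarity in plain Python and turns it into a 0/1 agreement. The functions `jaro`, `jaro_winkler` and `agrees` are written for this example and are assumptions, not the package's implementation; in particular, treating a score equal to the threshold as an agreement is a choice made here.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


def agrees(s1, s2, threshold=0.85):
    """Return 1 if the similarity reaches the threshold, else 0."""
    return 1 if jaro_winkler(s1, s2) >= threshold else 0


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(agrees("MARTHA", "MARHTA"), agrees("anna", "zoe"))
```

Each comparison contributes one such value per candidate pair, so a set of comparisons yields one comparison vector per pair.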
This record linkage package contains several classification algorithms.
Many of the algorithms need training data (supervised learning) while
others are unsupervised. An example of supervised learning:
logrg = recordlinkage.LogisticRegressionClassifier()
and an example of unsupervised learning (the well-known ECM algorithm):
ecm = recordlinkage.BernoulliEMClassifier()
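To give a feel for how an unsupervised classifier can separate matches from non-matches without labels, here is a small sketch of plain expectation-maximisation for a two-class Bernoulli mixture over binary comparison vectors, assuming the comparisons are conditionally independent. It is a simplified stand-in, not the package's ECM implementation; the function names, starting values and toy vectors are all assumptions made for this example.

```python
def likelihood(vec, probs):
    """P(vec | class) for independent binary comparisons."""
    result = 1.0
    for x, q in zip(vec, probs):
        result *= q if x else (1.0 - q)
    return result


def match_probability(vec, p, m, u):
    """Posterior probability that a comparison vector is a match."""
    lm = p * likelihood(vec, m)
    lu = (1.0 - p) * likelihood(vec, u)
    return lm / (lm + lu)


def fit_bernoulli_em(vectors, n_iter=50, eps=1e-6):
    """Estimate p = P(match), m = P(agree | match), u = P(agree | non-match)."""

    def clip(value):
        # Keep probabilities away from 0 and 1 to avoid degenerate updates.
        return min(max(value, eps), 1.0 - eps)

    n = len(vectors)
    n_features = len(vectors[0])
    # Assumed starting point: matches mostly agree, non-matches mostly do not.
    p = 0.5
    m = [0.9] * n_features
    u = [0.1] * n_features

    for _ in range(n_iter):
        # E-step: posterior match probability of every vector.
        w = [match_probability(v, p, m, u) for v in vectors]
        total_m = max(sum(w), eps)
        total_u = max(n - sum(w), eps)

        # M-step: re-estimate the parameters from the weighted counts.
        p = clip(sum(w) / n)
        m = [
            clip(sum(wi * v[k] for wi, v in zip(w, vectors)) / total_m)
            for k in range(n_features)
        ]
        u = [
            clip(sum((1.0 - wi) * v[k] for wi, v in zip(w, vectors)) / total_u)
            for k in range(n_features)
        ]
    return p, m, u


# Toy comparison vectors (1 = agree, 0 = disagree), made up for illustration.
vectors = (
    [(1, 1, 1)] * 4
    + [(1, 1, 0)] * 1
    + [(0, 0, 0)] * 10
    + [(0, 0, 1)] * 2
    + [(1, 0, 0)] * 3
)

p, m, u = fit_bernoulli_em(vectors)
print("P(match | all agree)    =", round(match_probability((1, 1, 1), p, m, u), 3))
print("P(match | all disagree) =", round(match_probability((0, 0, 0), p, m, u), 3))
```

Pairs whose posterior match probability exceeds a chosen cut-off would then be classified as links. The result depends on the starting values, which is why they are chosen here to reflect the expectation that true matches agree on most fields.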
The main features of the recordlinkage package are:
- Clean and standardise data
- Make pairs of records with several indexing methods such as
blocking and sorted neighbourhood indexing
- Compare records with a large number of comparison and similarity
functions (including the Jaro-Winkler and Levenshtein metrics)
- Several classification algorithms, both supervised and unsupervised
Dependencies, installation and license
The following packages are required. You probably have them already ;)
The following packages are recommended:
- jellyfish: Needed for approximate string comparison. Version 0.5.0
or higher.
Install the package with pip
pip install recordlinkage
The license for this record linkage tool is GPLv3.