
A record linkage toolkit for linking and deduplication

Project Description

The Python Record Linkage Toolkit is a library to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records, and classifiers. The package is developed for research and for linking small or medium-sized files.

This project is inspired by the Freely Extensible Biomedical Record Linkage (FEBRL) project. In contrast with FEBRL, the recordlinkage project uses pandas and numpy for data handling and computations. Using pandas, a flexible and powerful data analysis and manipulation library for Python, makes the record linkage process much easier and faster. It also lets you integrate record linkage directly into existing pandas-based data manipulation projects.

One of the aims of this project is to make an easily extensible record linkage framework. It is easy to include your own indexing algorithms, comparison/similarity measures and classifiers.

Basic linking example

Import the recordlinkage module with all important tools for record linkage and import the data manipulation framework pandas.

import recordlinkage
import pandas

Load your data into pandas DataFrames.

df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)

Comparing all records with each other can be computationally intensive. Therefore, we make a set of candidate links with one of the built-in indexing techniques, such as blocking. In this example, only pairs of records that agree on the surname are returned.

block_class = recordlinkage.BlockIndex('surname')
candidate_links = block_class.index(df_a, df_b)
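To make the idea behind blocking concrete, here is a minimal sketch in plain Python. It is not the toolkit's implementation: the function `block_pairs` and the toy records are hypothetical and only illustrate how agreeing on a key (here the surname) shrinks the set of pairs to compare.

```python
from collections import defaultdict


def block_pairs(records_a, records_b, key):
    """Return (id_a, id_b) pairs whose records share the same value for `key`.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    # Group the identifiers of the second dataset by their blocking key value.
    index_b = defaultdict(list)
    for id_b, rec in records_b.items():
        index_b[rec[key]].append(id_b)

    # Pair every record of the first dataset with the records in its block.
    pairs = []
    for id_a, rec in records_a.items():
        for id_b in index_b.get(rec[key], []):
            pairs.append((id_a, id_b))
    return pairs


# Toy data, made up for this example.
records_a = {
    "a1": {"surname": "smith", "name": "john"},
    "a2": {"surname": "jones", "name": "mary"},
}
records_b = {
    "b1": {"surname": "smith", "name": "jon"},
    "b2": {"surname": "smith", "name": "anna"},
    "b3": {"surname": "brown", "name": "mary"},
}

candidate_pairs = block_pairs(records_a, records_b, "surname")
full_index_size = len(records_a) * len(records_b)  # all possible pairs
```

In this toy case only the pairs that share a surname are kept, so far fewer pairs need to be compared than in the full index of every record against every record.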

Older versions of Python Record Linkage Toolkit use a different syntax for indexing. More info about migrating can be found here.

For each candidate link, compare the records with one of the comparison or similarity algorithms in the Compare class.

c = recordlinkage.Compare(candidate_links, df_a, df_b)

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
c.vectors
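A comparison vector holds one similarity score per compared field for each candidate link. The sketch below builds such a vector in plain Python under stated assumptions: `difflib.SequenceMatcher` stands in for the Jaro-Winkler measure, a threshold is assumed to turn a string similarity into 1 or 0, and the Gaussian measure is assumed to be 1 within `offset` and to halve at a distance of `offset + scale`. The toolkit's exact formulas may differ, and all function names here are hypothetical.

```python
from difflib import SequenceMatcher


def string_agreement(s1, s2, threshold):
    """Return 1.0 if the similarity ratio reaches the threshold, else 0.0.

    SequenceMatcher is a stand-in here, not the Jaro-Winkler measure.
    """
    ratio = SequenceMatcher(None, s1, s2).ratio()
    return 1.0 if ratio >= threshold else 0.0


def exact_agreement(v1, v2):
    """Return 1.0 when both values are equal, else 0.0."""
    return 1.0 if v1 == v2 else 0.0


def gauss_similarity(x1, x2, offset, scale):
    """Assumed Gaussian decay: 1.0 up to `offset`, 0.5 at `offset + scale`."""
    d = abs(x1 - x2)
    if d <= offset:
        return 1.0
    return 2.0 ** (-(((d - offset) / scale) ** 2))


def comparison_vector(rec_a, rec_b):
    """Compare one candidate pair field by field."""
    return [
        string_agreement(rec_a["name_a"], rec_b["name_b"], threshold=0.85),
        exact_agreement(rec_a["sex"], rec_b["gender"]),
        gauss_similarity(rec_a["income"], rec_b["income"], offset=3, scale=3),
    ]


# Toy records, made up for this example.
rec_a = {"name_a": "jonathan", "sex": "m", "income": 40}
rec_b = {"name_b": "jonathon", "gender": "m", "income": 46}

vector = comparison_vector(rec_a, rec_b)
```

Each candidate link yields one such vector, and the vectors of all candidate links together form the input for the classification step.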

The Python Record Linkage Toolkit contains multiple classification algorithms. Many of the algorithms need training data (supervised learning), while others are unsupervised. An example of supervised learning:

logrg = recordlinkage.LogisticRegressionClassifier()
logrg.learn(TRAINING_COMPARISON_VECTORS, TRAINING_CLASSES)

logrg.predict(c.vectors)

An example of unsupervised learning (the well-known ECM algorithm):

ecm = recordlinkage.ECMClassifier()
ecm.learn(c.vectors)
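To give a feel for how an unsupervised classifier can separate matches from non-matches without training labels, here is a minimal sketch of a plain EM procedure on binary comparison vectors. It is a simplified stand-in and not the toolkit's ECM implementation: the function name, the starting values and the toy vectors are all assumptions made for illustration.

```python
def em_match_probabilities(vectors, n_iter=20, eps=1e-6):
    """Estimate P(match | vector) for binary comparison vectors with plain EM.

    Hypothetical sketch; assumes fields agree independently given the class.
    """
    n = len(vectors)
    n_features = len(vectors[0])
    p = 0.5                   # starting guess: share of matches
    m = [0.9] * n_features    # starting guess: P(field agrees | match)
    u = [0.1] * n_features    # starting guess: P(field agrees | non-match)

    def likelihood(vec, probs):
        result = 1.0
        for x, q in zip(vec, probs):
            result *= q if x else (1.0 - q)
        return result

    def clip(q):
        # Keep probabilities away from 0 and 1 to avoid division by zero.
        return min(max(q, eps), 1.0 - eps)

    weights = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for vec in vectors:
            lm = p * likelihood(vec, m)
            lu = (1.0 - p) * likelihood(vec, u)
            weights.append(lm / (lm + lu))

        # M-step: re-estimate the parameters from the weighted pairs.
        total = sum(weights)
        for k in range(n_features):
            agree_m = sum(w * vec[k] for w, vec in zip(weights, vectors))
            agree_u = sum((1.0 - w) * vec[k] for w, vec in zip(weights, vectors))
            m[k] = clip(agree_m / total)
            u[k] = clip(agree_u / (n - total))
        p = clip(total / n)
    return weights


# Toy binary comparison vectors, made up for this example.
vectors = (
    [[1, 1, 1], [1, 1, 1], [1, 1, 0]]
    + [[0, 0, 0]] * 5
    + [[1, 0, 0], [0, 1, 0]]
)

probabilities = em_match_probabilities(vectors)
predicted_matches = [prob > 0.5 for prob in probabilities]
```

With this toy data, the pairs that agree on most fields end up with a high match probability, while the pairs that agree on few or no fields do not.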

Main Features

The main features of the Python Record Linkage Toolkit are:

  • Clean and standardise data with easy-to-use tools
  • Make pairs of records with smart indexing methods such as blocking and sorted neighbourhood indexing
  • Compare records with a large number of comparison and similarity measures for different types of variables such as strings, numbers and dates
  • Classify record pairs with several classification algorithms, both supervised and unsupervised
  • Common record linkage evaluation tools
  • Several built-in datasets

Documentation

The most recent documentation and API reference can be found at recordlinkage.readthedocs.org. The documentation provides some basic usage examples like deduplication and linking census data. More examples are coming soon. If you do have interesting examples to share, let us know.

Dependencies, installation and license

Install the Python Record Linkage Toolkit with pip:

pip install recordlinkage

The toolkit depends on Pandas (>=0.18.0), Numpy, Scikit-learn, Scipy and Jellyfish. You probably have most of them installed already. The package Jellyfish is used for approximate string comparison and string encoding. The package Numexpr is an optional dependency to speed up numerical comparisons.

The license for this record linkage tool is GPLv3.

Need help?

Stuck on your record linkage code or problem? Any other questions? Don’t hesitate to send me an email (jonathandebruinos@gmail.com).

Release History

  • 0.9.0 (this version)
  • 0.8.1
  • 0.8.0
  • 0.7.2
  • 0.7.1
  • 0.7.0
  • 0.6.0
  • 0.5
  • 0.4.0
  • 0.3.1
  • 0.3
  • 0.2

Download Files

  • File name: recordlinkage-0.9.0-py2.py3-none-any.whl (887.3 kB)
  • Version: py2.py3
  • File type: Wheel
  • Upload date: Jun 21, 2017
