This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description

This Record Linkage Toolkit is a library to link records in or between data sources. The package provides most of the tools needed for record linkage. The package contains indexing methods, functions to compare records and classifiers. The package is developed for research and linking of small or medium sized files.

This project is inspired by the Freely Extensible Biomedical Record Linkage (FEBRL) project, which is a great project. This project has one big difference, it uses pandas and numpy for data handling and computations. The use of pandas, a flexible and powerful data analysis and manipulation library for Python, makes the record linkage process much easier and faster. A lot of built-in pandas methods can be used to integrate your record linkage directly into existing data manipulation projects.

One of the aims of this project is to make an extensible record linkage framework. It is easy to include your own indexing algorithms, comparison/similarity measures and classifiers.

Basic linking example

Import the recordlinkage module with all important tools for record linkage and import the data manipulation framework pandas.

import recordlinkage
import pandas

For examples, you try to link two datasets with personal information like name, sex and date of birth. Load these datasets into a pandas DataFrame.

df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)

Comparing all record can be computationally intensive. Therefore, we make smart set of candidate links with one of the built-in indexing techniques like blocking. Only records pairs that agree on the surname are included.

index = recordlinkage.Pairs(df_a, df_b)
candidate_links = index.block('surname')

For each candidate link, compare the pair of records with the Compare class and the available comparison/similarity functions.

compare = recordlinkage.Compare(candidate_links, df_a, df_b)

compare.string('name', 'name', method='jarowinkler', threshold=0.85)
compare.exact('sex', 'gender')
compare.exact('dob', 'date_of_birth')
compare.string('streetname', 'streetname', method='damerau_levenshtein', threshold=0.7)
compare.exact('place', 'placename')
compare.exact('haircolor', 'haircolor', missing_value=9)

# The comparison vectors
compare.vectors

This record linkage package contains several classification alogirthms. Plenty of the algorithms need trainings data (supervised learning) while others are unsupervised. An example of supervised learning:

logrg = recordlinkage.LogisticRegressionClassifier()
logrg.learn(TRAINING_COMPARISON_VECTORS, TRAINING_CLASSES)

logrg.predict(compare.vectors)

and an example of unsupervised learning (the well known ECM-algorithm):

ecm = recordlinkage.BernoulliEMClassifier()
ecm.learn(compare.vectors)

Main Features

The main features of the recordlinkage package are:

  • Clean and standardise data
  • Make pairs of records with several indexing methods such as blocking and sorted neighbourhood indexing
  • Compare records with a large number of comparison and similarity functions (including the jaro-winkler and levenshtein metrics)
  • Several classifications algorithms, both supervised and unsupervised algorithms.

Documentation

The most recent documentation and API reference can be found at recordlinkage.readthedocs.org. The documentation provides some basic usage examples like deduplication and linking census data. More examples are coming soon. If you do have interesting examples to share, let us know.

Dependencies, installation and license

The following packages are required. You probably have it already ;)

The following packages are recommanded

  • jellyfish: Needed for approximate string comparison. Version 0.5.0 or higher.

Install the package with pip

pip install recordlinkage

The license for this record linkage tool is GPLv3.

Need help?

Stuck on your record linkage code or problem? Any other questions? Don’t hestitate to send me an email (jonathandebruinhome@gmail.com).

Release History

Release History

0.7.2

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.7.1

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.7.0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.6.0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.5

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.4.0

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.3.1

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.3

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

0.2

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
recordlinkage-0.7.2-py2.py3-none-any.whl (1.0 MB) Copy SHA256 Checksum SHA256 py2.py3 Wheel Nov 9, 2016

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting