A record linkage toolkit for linking and deduplication

Development Status
- 4 - Beta
License
- OSI Approved :: BSD License

Project description

RecordLinkage: powerful and modular Python record linkage toolkit

RecordLinkage is a powerful and modular record linkage toolkit to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records, and classifiers. It is developed for research and for linking small or medium-sized files.

This project is inspired by the Freely Extensible Biomedical Record Linkage (FEBRL) project, which is a great project. In contrast with FEBRL, the recordlinkage project uses pandas and numpy for data handling and computations. The use of pandas, a flexible and powerful data analysis and manipulation library for Python, makes the record linkage process much easier and faster. The extensive pandas library can be used to integrate your record linkage directly into existing data manipulation projects.

One of the aims of this project is to make an easily extensible record linkage framework. It is easy to include your own indexing algorithms, comparison/similarity measures and classifiers.

Basic linking example

Import the recordlinkage module with all important tools for record linkage and import the data manipulation framework pandas.

import recordlinkage
import pandas

Load your data into pandas DataFrames.

df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
df_b = pandas.DataFrame(YOUR_SECOND_DATASET)

Comparing all records can be computationally intensive. Therefore, we make a set of candidate links with one of the built-in indexing techniques, such as blocking. In this example, only pairs of records that agree on the surname are returned.

indexer = recordlinkage.Index()
indexer.block('surname')
candidate_links = indexer.index(df_a, df_b)
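
To illustrate what blocking does conceptually, here is a minimal pure-Python sketch. It is not the toolkit's implementation: the function name block_pairs and the toy records are hypothetical. It only shows how requiring exact agreement on a key such as the surname shrinks the candidate set relative to the full cross product.

```python
from collections import defaultdict


def block_pairs(records_a, records_b, key):
    """Return (id_a, id_b) pairs whose records agree exactly on `key`."""
    # Group the record ids of the second source by their blocking key value.
    blocks = defaultdict(list)
    for id_b, rec in records_b.items():
        value = rec.get(key)
        if value is not None:  # records without a key value never pair up
            blocks[value].append(id_b)

    # Pair each record of the first source with the records in its block only.
    pairs = []
    for id_a, rec in records_a.items():
        for id_b in blocks.get(rec.get(key), []):
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical toy data, keyed by record id.
records_a = {
    "a1": {"surname": "smith", "name": "john"},
    "a2": {"surname": "jones", "name": "mary"},
    "a3": {"surname": "smith", "name": "anna"},
}
records_b = {
    "b1": {"surname": "smith", "name": "jon"},
    "b2": {"surname": "brown", "name": "mary"},
}

candidate_pairs = block_pairs(records_a, records_b, "surname")
full_count = len(records_a) * len(records_b)

print(candidate_pairs)  # pairs sharing a surname
print(len(candidate_pairs), "of", full_count, "possible pairs")
```

Blocking trades recall for speed: a true match whose surname is misspelled in one source is never generated as a candidate, which is why other indexing methods such as sorted neighbourhood indexing exist.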

For each candidate link, compare the records with one of the comparison or similarity algorithms in the Compare class.

c = recordlinkage.Compare()

c.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)
c.exact('sex', 'gender')
c.date('dob', 'date_of_birth')
c.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)
c.exact('place', 'placename')
c.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)

# The comparison vectors
feature_vectors = c.compute(candidate_links, df_a, df_b)
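
A comparison vector holds one agreement score per compared field for a single candidate pair. The following pure-Python sketch shows the idea under stated assumptions: it uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure (it is not Jaro-Winkler and not the toolkit's code), and the helper names and toy records are hypothetical.

```python
from difflib import SequenceMatcher


def string_agreement(s1, s2, threshold):
    """Return 1 if the similarity ratio reaches the threshold, else 0."""
    if s1 is None or s2 is None:
        return 0  # treat missing values as disagreement in this sketch
    ratio = SequenceMatcher(None, s1, s2).ratio()
    return 1 if ratio >= threshold else 0


def exact_agreement(v1, v2):
    """Return 1 if both values are present and equal, else 0."""
    return 1 if v1 is not None and v1 == v2 else 0


def compare_pair(rec_a, rec_b):
    """Build a comparison vector for one candidate pair."""
    return (
        string_agreement(rec_a["name_a"], rec_b["name_b"], threshold=0.85),
        exact_agreement(rec_a["sex"], rec_b["gender"]),
    )


# Hypothetical records; the field names mirror the example above.
rec_a1 = {"name_a": "jonathan", "sex": "m"}
rec_a2 = {"name_a": "mary", "sex": "f"}
rec_b1 = {"name_b": "jonathon", "gender": "m"}

vector_1 = compare_pair(rec_a1, rec_b1)
vector_2 = compare_pair(rec_a2, rec_b1)
print(vector_1, vector_2)
```

Applying a threshold turns a continuous similarity into a binary agreement indicator, which is the form many classifiers for record linkage expect.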

Classify the candidate links into matching or distinct pairs based on their comparison result with one of the classification algorithms. The following code classifies candidate pairs with a Logistic Regression classifier. This (supervised machine learning) algorithm requires training data.

logrg = recordlinkage.LogisticRegressionClassifier()
logrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)

logrg.predict(feature_vectors)

The following code shows the classification of candidate pairs with the Expectation-Conditional Maximisation (ECM) algorithm. This variant of the Expectation-Maximisation algorithm doesn't require training data (unsupervised machine learning).

ecm = recordlinkage.ECMClassifier()
ecm.fit_predict(feature_vectors)
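
To give an intuition for how an unsupervised classifier can separate matches from non-matches, here is a minimal sketch of plain Expectation-Maximisation for a two-class model with binary, conditionally independent features. It is an assumption-laden illustration, not the package's ECM implementation: the function name, starting values, and toy comparison vectors are all hypothetical.

```python
def em_match_probabilities(vectors, n_iter=50, eps=1e-6):
    """Estimate P(match | vector) for binary comparison vectors with EM."""
    n, k = len(vectors), len(vectors[0])
    p = 0.1          # assumed starting share of matches
    m = [0.9] * k    # P(field agrees | match), starting guess
    u = [0.1] * k    # P(field agrees | non-match), starting guess

    def clamp(x):
        # Keep probabilities away from 0 and 1 to avoid division by zero.
        return min(max(x, eps), 1.0 - eps)

    weights = []
    for _ in range(n_iter):
        # E-step: posterior match probability of every vector.
        weights = []
        for v in vectors:
            like_m, like_u = p, 1.0 - p
            for j, x in enumerate(v):
                like_m *= m[j] if x else 1.0 - m[j]
                like_u *= u[j] if x else 1.0 - u[j]
            weights.append(like_m / (like_m + like_u))

        # M-step: re-estimate the parameters from the weighted vectors.
        total_m = max(sum(weights), eps)
        total_u = max(n - sum(weights), eps)
        p = clamp(total_m / n)
        for j in range(k):
            agree_m = sum(w * v[j] for w, v in zip(weights, vectors))
            agree_u = sum((1.0 - w) * v[j] for w, v in zip(weights, vectors))
            m[j] = clamp(agree_m / total_m)
            u[j] = clamp(agree_u / total_u)

    # Posteriors from the final E-step.
    return weights


# Hypothetical toy comparison vectors: mostly full agreement or none.
vectors = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 3 + [(0, 0, 0)] * 70 + [(1, 0, 0)] * 7
)
probabilities = em_match_probabilities(vectors)
predicted_matches = [i for i, pr in enumerate(probabilities) if pr > 0.5]
print(len(predicted_matches), "pairs classified as matches")
```

No labelled pairs are needed: the algorithm relies on matches and non-matches producing visibly different agreement patterns, so its result depends on the starting values and on how well that assumption holds.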

Main Features

The main features of this Python record linkage toolkit are:

- Clean and standardise data with easy-to-use tools.
- Make pairs of records with smart indexing methods such as blocking and sorted neighbourhood indexing.
- Compare records with a large number of comparison and similarity measures for different types of variables such as strings, numbers and dates.
- Classify record pairs with several classification algorithms, both supervised and unsupervised.
- Evaluate results with common record linkage evaluation tools.
- Use several built-in datasets.
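
As an illustration of what record linkage evaluation typically measures, the sketch below computes precision, recall and the F-score from a set of known true links and a set of predicted links. It is a pure-Python sketch under stated assumptions, not the toolkit's evaluation API: the function name evaluate_links and the toy link sets are hypothetical.

```python
def evaluate_links(true_links, predicted_links):
    """Return (precision, recall, f_score) for predicted record pairs."""
    true_links, predicted_links = set(true_links), set(predicted_links)
    true_positives = len(true_links & predicted_links)

    # Precision: share of predicted links that are real links.
    precision = true_positives / len(predicted_links) if predicted_links else 0.0
    # Recall: share of real links that were found.
    recall = true_positives / len(true_links) if true_links else 0.0
    # F-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical link sets, each link a pair of record ids.
true_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
predicted_links = {("a1", "b1"), ("a2", "b2"), ("a4", "b9"), ("a5", "b7")}

precision, recall, f_score = evaluate_links(true_links, predicted_links)
print(precision, recall, f_score)
```

Accuracy is usually a poor measure here, because the vast majority of candidate pairs are non-matches; precision and recall focus on the links themselves.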

Documentation

The most recent documentation and API reference can be found at recordlinkage.readthedocs.org. The documentation provides some basic usage examples like deduplication and linking census data. More examples are coming soon. If you do have interesting examples to share, let us know.

Installation

The Python Record Linkage Toolkit requires Python 3.8 or higher. Install the package with pip:

pip install recordlinkage

The toolkit depends on popular packages like pandas, NumPy, SciPy, and scikit-learn. The installation manual gives a complete list of required, recommended, and optional dependencies.

License

The license for this record linkage tool is BSD-3-Clause.

Citation

Please cite this package when you use it in an academic context. Ensure that the DOI and version match the installed version. Citation styles can be found on the publisher's website: 10.5281/zenodo.3559042.

@software{de_bruin_j_2019_3559043,
  author       = {De Bruin, J},
  title        = {{Python Record Linkage Toolkit: A toolkit for
                   record linkage and duplicate detection in Python}},
  month        = dec,
  year         = 2019,
  publisher    = {Zenodo},
  version      = {v0.14},
  doi          = {10.5281/zenodo.3559043},
  url          = {https://doi.org/10.5281/zenodo.3559043}
}

Need help?

Stuck on your record linkage code or problem? Any other questions? Don't hesitate to send me an email (jonathandebruinos@gmail.com).

Release history

- 0.16 (Jul 20, 2023, this version)
- 0.15 (Apr 19, 2022)
- 0.14 (Dec 1, 2019)
- 0.13.2 (Mar 27, 2019)
- 0.13 (Mar 15, 2019)
- 0.12 (Jul 26, 2018)
- 0.11.2 (Jan 4, 2018)
- 0.11.1 (Jan 4, 2018)
- 0.11.0 (Dec 22, 2017)
- 0.10.1 (Aug 30, 2017)
- 0.10.0 (Aug 30, 2017)
- 0.9.0 (Jun 21, 2017)
- 0.8.1 (Jan 27, 2017)
- 0.8.0 (Jan 22, 2017)
- 0.7.2 (Nov 9, 2016)
- 0.7.1 (Nov 9, 2016)
- 0.7.0 (Nov 8, 2016)
- 0.6.0 (Oct 12, 2016)
- 0.5 (Sep 9, 2016)
- 0.4.0 (Aug 20, 2016)
- 0.3.1 (Jun 15, 2016)
- 0.3 (Jun 11, 2016)
- 0.2 (Jun 4, 2016)

Download files

Source Distribution

recordlinkage-0.16.tar.gz (1.0 MB)

Uploaded Jul 20, 2023 Source

Built Distribution

recordlinkage-0.16-py3-none-any.whl (926.9 kB)

Uploaded Jul 20, 2023 Python 3

File details

Details for the file recordlinkage-0.16.tar.gz.

File metadata

Download URL: recordlinkage-0.16.tar.gz
Upload date: Jul 20, 2023
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for recordlinkage-0.16.tar.gz
Algorithm	Hash digest
SHA256	`ecda0c10dff138b1706815de332b1285f670ae7e8cce92596213501d589e6aa4`
MD5	`73099e4bda78cd3e75c5e1d2de48bc02`
BLAKE2b-256	`2371df9df311c651e016240ec4a15d6da7b354cddd2172433819e504ee3655bc`

File details

Details for the file recordlinkage-0.16-py3-none-any.whl.

File metadata

Download URL: recordlinkage-0.16-py3-none-any.whl
Upload date: Jul 20, 2023
Size: 926.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for recordlinkage-0.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7ca404ab30435ea4b0ae2eda411f8dcc3c48186e152d3ca91fb525e8f6c0fd63`
MD5	`c0f3c602ed659b48cd985d3fea2bcf29`
BLAKE2b-256	`12fc05c343d0b8e02c1b2f45256202a50f6970dae0bfac791c569a74c779c76d`
