Fast supervised pyspark record linkage software
hlink: Historical Record Linkage
A paper on the creation and applications of this program can be found at https://www.tandfonline.com/doi/full/10.1080/01615440.2021.1985027.
Docs
The documentation site can be found at hlink.docs.ipums.org. This includes information about installation and setting up your configuration files.
An example script and configuration file can be found in the examples directory.
Overview
Hlink is designed to link two datasets. Its primary use case was linking demographic records in the hierarchical Household -> Person structure, but it can also link generic datasets if the household linking tasks are skipped. It supports both probabilistic and deterministic record linkage, and provides functionality for the following tasks:
- Preprocessing: Preprocess each dataset to clean and transform it in preparation for linking.
- Training: Train machine learning models on a set of features and compare results between models.
- Matching: Match two datasets using a model created in training or with deterministic rules.
- Household Training: Train machine learning models on a set of features for households and compare results between models.
- Household Matching: Match households between two datasets.
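To make the distinction between deterministic and probabilistic linkage concrete, here is a minimal sketch in plain Python. It does not use hlink or its API. The records, the field names (name, birthyr), the weights, and the 0.8 threshold are all assumptions chosen for illustration, and the similarity score stands in for the probability a trained model would produce.

```python
from difflib import SequenceMatcher

# Two tiny hypothetical datasets. Field names are illustrative only
# and are not hlink's schema.
dataset_a = [
    {"id": "a1", "name": "john smith", "birthyr": 1880},
    {"id": "a2", "name": "mary jones", "birthyr": 1875},
]
dataset_b = [
    {"id": "b1", "name": "john smith", "birthyr": 1880},
    {"id": "b2", "name": "marie jones", "birthyr": 1876},
]


def deterministic_match(a, b):
    """Link only when every compared field agrees exactly."""
    return a["name"] == b["name"] and a["birthyr"] == b["birthyr"]


def match_score(a, b):
    """Toy similarity in [0, 1]; a stand-in for a trained model's output."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    year_sim = 1.0 if abs(a["birthyr"] - b["birthyr"]) <= 1 else 0.0
    # Assumed weights: names count for more than birth years.
    return 0.7 * name_sim + 0.3 * year_sim


def link(threshold=0.8):
    """Compare every pair of records and return both sets of links."""
    pairs = [(a, b) for a in dataset_a for b in dataset_b]
    deterministic = [(a["id"], b["id"]) for a, b in pairs
                     if deterministic_match(a, b)]
    probabilistic = [(a["id"], b["id"]) for a, b in pairs
                     if match_score(a, b) >= threshold]
    return deterministic, probabilistic


deterministic, probabilistic = link()
print("deterministic:", deterministic)
print("probabilistic:", probabilistic)
```

In this toy data the exact rule links only the identical records, while the scored approach also links the pair whose name and birth year differ slightly. Comparing every pair is fine for a sketch of this size but would not scale to large datasets.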
It also provides functionality for the following research and development tasks:
- Model Exploration and Household Model Exploration: Use a matrix of models and hyper-parameters to evaluate model performance and select a model to be used in the production run. These tasks also generate reports of suspected false positives and false negatives in the specified training data set if the appropriate config flag is set.
- Reporting: Generate reports on the linked data.
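The idea behind model exploration can be sketched as a small grid search in plain Python. This is not hlink's implementation or configuration format. The labeled pairs, the two hyper-parameters (name_weight and threshold), and the scoring rule are hypothetical; the sketch only shows how a matrix of settings might be evaluated and how suspected false positives and false negatives could be collected for review.

```python
from itertools import product

# Hypothetical labeled training pairs:
# (pair id, name similarity, age similarity, is a true match).
training_pairs = [
    ("p1", 0.95, 1.0, True),
    ("p2", 0.90, 0.0, True),
    ("p3", 0.40, 1.0, False),
    ("p4", 0.85, 1.0, False),
    ("p5", 0.30, 0.0, False),
    ("p6", 0.70, 1.0, True),
]

# Hypothetical matrix of hyper-parameters to explore.
model_matrix = {
    "name_weight": [0.5, 0.7, 0.9],
    "threshold": [0.6, 0.8],
}


def evaluate(name_weight, threshold):
    """Score every pair with one setting and summarize the errors."""
    tp = 0
    false_positives, false_negatives = [], []
    for pair_id, name_sim, age_sim, is_match in training_pairs:
        score = name_weight * name_sim + (1 - name_weight) * age_sim
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            false_positives.append(pair_id)
        elif not predicted and is_match:
            false_negatives.append(pair_id)
    denom = 2 * tp + len(false_positives) + len(false_negatives)
    f1 = 2 * tp / denom if denom else 0.0
    return {
        "name_weight": name_weight,
        "threshold": threshold,
        "f1": f1,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Evaluate every combination in the matrix and keep the best by F1.
results = [evaluate(w, t) for w, t in product(model_matrix["name_weight"],
                                              model_matrix["threshold"])]
best = max(results, key=lambda r: r["f1"])

for r in results:
    print(r)
print("selected:", best["name_weight"], best["threshold"])
```

The per-setting lists of false positives and false negatives play the role of the error reports: they point to specific training pairs worth inspecting before a model is chosen for a production run.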