A package for generating controlled errors in record linkage datasets

Error Generator

A Python package to introduce controlled errors into datasets for benchmarking record linkage frameworks. This package is a port of the R package rlErrorGeneratoR.

Overview

The Error Generator package allows you to introduce different types of data heterogeneity commonly found in record linkage projects:

  • Duplicates and twins
  • Name variations (suffixes, nicknames)
  • Character-level errors (insertions, deletions, replacements, transpositions)
  • Date format variations and errors
  • Missing values
  • Field swaps

The system allows you to control the overall rate of heterogeneity in the data, making it easy to run systematic controlled experiments.

Installation

# Run from a checkout of the source tree, where requirements.txt lives
pip install -r requirements.txt

Usage

import pandas as pd
from error_generator import ErrorGenerator, generate_errors

# Load your data
df = pd.read_csv('your_data.csv')

# Create error specifications
error_specs = pd.DataFrame({
    'error': ['indel', 'to_nickname', 'date_month_swap'],
    'amount': [10, 5, 3],
    'columns': ['name', 'first_name', 'birth_date'],
    'additional_args': [None, None, None]
})

# Method 1: Using the ErrorGenerator class
generator = ErrorGenerator(df)
generator.add_errors(error_specs)
modified_df = generator.get_modified()
error_record = generator.get_error_record()

# Method 2: Using the convenience function
modified_df = generate_errors(df, error_specs)
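
To check what a run actually changed, the original and modified frames can be compared directly with pandas. The sketch below is not part of the package: `count_changed_cells` is a hypothetical helper, and it assumes both frames share an `id` column and that added duplicate rows reuse the `id` of their source record.

```python
import pandas as pd


def count_changed_cells(original: pd.DataFrame, modified: pd.DataFrame,
                        key: str = "id") -> pd.Series:
    """Count, per column, how many cells differ between two frames aligned on `key`."""
    # Keep the first row per key so that added duplicates do not break alignment.
    left = original.drop_duplicates(key).set_index(key)
    right = modified.drop_duplicates(key).set_index(key)

    # Compare only the records and columns present in both frames.
    shared_ids = left.index.intersection(right.index)
    shared_cols = left.columns.intersection(right.columns)
    left = left.loc[shared_ids, shared_cols]
    right = right.loc[shared_ids, shared_cols]

    # Two cells count as unchanged if they are equal or both missing.
    same = (left == right) | (left.isna() & right.isna())
    return (~same).sum()


# Toy data standing in for the real input and output frames.
original = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
    "birth_date": ["2000-01-02", "1999-03-04", None],
})
modified = original.copy()
modified.loc[1, "name"] = "Bbo"  # a transposition in one name
modified = pd.concat([modified, modified.iloc[[2]]], ignore_index=True)  # a duplicate

changed = count_changed_cells(original, modified)
print(changed)
```

A per-column count like this is a quick sanity check that the observed changes are in the range requested by the `amount` values.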

Available Error Types

The package supports the following types of errors:

  1. Character-level errors:

    • indel: Insert or delete characters
    • repl: Replace characters
    • tpose: Transpose adjacent characters

  2. Name variations:

    • to_nickname: Convert real names to nicknames
    • to_realname: Convert nicknames to real names
    • invert_nick_realnames: Invert real names and nicknames
    • name_suffix: Add name suffixes (Jr., Sr., etc.)
    • first_letter_abbreviate: Abbreviate to first letter

  3. Format variations:

    • blanks_to_hyphens: Replace spaces with hyphens
    • hyphens_to_blanks: Replace hyphens with spaces
    • missing: Set values to missing
    • swap: Swap values between columns

  4. Record-level variations:

    • married_name_change: Simulate married name changes
    • duplicate: Add duplicate records
    • twins: Generate twin records

  5. Date variations:

    • date_month_swap: Swap day and month in dates
    • date_transpose_year: Transpose year digits in dates
    • date_transpose_day: Transpose day digits in dates
    • date_replace_year: Replace year digits in dates
    • date_replace_month: Replace month in dates
    • date_replace_day: Replace day in dates
Data Requirements

  1. The input DataFrame must have an 'id' column.

  2. For name-related errors, you need to provide name lookup tables in the data directory:

    • first_names_male.csv
    • first_names_female.csv
    • last_names.csv
    • names_lookup.csv
    • nick_real_lookup.csv

  3. For gender-specific errors (e.g., married name changes), the DataFrame must have a sex/gender column with 'm'/'f' values.
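
These requirements can be checked before any errors are generated. The following is a sketch of such a pre-flight check, not something the package provides: `check_inputs` is a hypothetical helper, and the column name `sex` and the directory name `data` are assumptions to adjust for your project.

```python
from pathlib import Path

import pandas as pd

# Lookup tables listed in the requirements above.
LOOKUP_FILES = [
    "first_names_male.csv",
    "first_names_female.csv",
    "last_names.csv",
    "names_lookup.csv",
    "nick_real_lookup.csv",
]


def check_inputs(df: pd.DataFrame, data_dir: str, sex_column=None) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    if "id" not in df.columns:
        problems.append("missing required 'id' column")

    # Only needed for gender-specific errors; pass the column name to enable the check.
    if sex_column is not None:
        if sex_column not in df.columns:
            problems.append(f"missing sex/gender column {sex_column!r}")
        else:
            unexpected = set(df[sex_column].dropna().unique()) - {"m", "f"}
            if unexpected:
                problems.append(f"unexpected values in {sex_column!r}: {sorted(unexpected)}")

    # Only needed for name-related errors.
    missing = [name for name in LOOKUP_FILES if not (Path(data_dir) / name).is_file()]
    if missing:
        problems.append(f"lookup tables not found in {data_dir!r}: {missing}")

    return problems


# Toy frame with no 'id' column and an invalid code in the 'sex' column.
people = pd.DataFrame({"name": ["Ann", "Bob"], "sex": ["f", "x"]})
for problem in check_inputs(people, "data", sex_column="sex"):
    print(problem)
```

Collecting every problem in a list, rather than raising on the first one, lets you fix the input in a single pass.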

Error Specifications

The error specifications DataFrame must have the following columns:

  • error: Type of error to introduce (see available error types above)
  • amount: Number of errors to introduce, given either as an absolute count or as a fraction below 1
  • columns: Comma-separated list of columns to apply errors to
  • additional_args: Optional additional arguments as string (specific to error type)
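
As an illustration of these fields, the sketch below builds a specification that mixes an absolute count with a fractional amount and targets several columns in one row. The column names are hypothetical. The two helpers show one plausible reading of the fields and are not the package's own logic; in particular, treating a fraction as a share of the number of rows is an assumption.

```python
import pandas as pd

# Hypothetical column names; the error types come from the list above.
error_specs = pd.DataFrame({
    "error": ["repl", "missing", "tpose"],
    "amount": [25, 0.1, 5],  # 25 errors, 10% (assumed: of rows), 5 errors
    "columns": ["first_name,last_name", "birth_date", "last_name"],
    "additional_args": [None, None, None],
})


def resolve_amount(amount: float, n_rows: int) -> int:
    """Turn an 'amount' value into an absolute count.

    Assumption: values below 1 are fractions of the number of rows.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if amount < 1:
        return round(amount * n_rows)
    return int(amount)


def split_columns(columns: str) -> list:
    """Split a comma-separated 'columns' value into clean column names."""
    return [name.strip() for name in columns.split(",") if name.strip()]


n_rows = 200  # size of a hypothetical input DataFrame
for spec in error_specs.itertuples(index=False):
    print(spec.error, resolve_amount(spec.amount, n_rows), split_columns(spec.columns))
```

Note that mixing integers and fractions in one `amount` column makes pandas store the whole column as floats, so `25` is read back as `25.0`.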

Citation

If you use this package in your research, please cite:

Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University.
Available electronically from https://hdl.handle.net/1969.1/186390

License

This project is licensed under the MIT License.
