A package for generating controlled errors in record linkage datasets

Error Generator

A Python package to introduce controlled errors into datasets for benchmarking record linkage frameworks. This package is a port of the R package rlErrorGeneratoR.

Overview

The Error Generator package allows you to introduce different types of data heterogeneity commonly found in record linkage projects:

  • Duplicates and twins
  • Name variations (suffixes, nicknames)
  • Character-level errors (insertions, deletions, replacements, transpositions)
  • Date format variations and errors
  • Missing values
  • Field swaps

You can control the overall rate of heterogeneity in the data, which makes it easy to run systematic, controlled experiments.

Installation

pip install -r requirements.txt

Usage

import pandas as pd
from error_generator import ErrorGenerator, generate_errors

# Load your data
df = pd.read_csv('your_data.csv')

# Create error specifications
error_specs = pd.DataFrame({
    'error': ['indel', 'to_nickname', 'date_month_swap'],
    'amount': [10, 5, 3],
    'columns': ['name', 'first_name', 'birth_date'],
    'additional_args': [None, None, None]
})

# Method 1: Using the ErrorGenerator class
generator = ErrorGenerator(df)
generator.add_errors(error_specs)
modified_df = generator.get_modified()
error_record = generator.get_error_record()

# Method 2: Using the convenience function
modified_df = generate_errors(df, error_specs)

Available Error Types

The package supports the following types of errors:

  1. Character-level errors:
    • indel: Insert or delete characters
    • repl: Replace characters
    • tpose: Transpose adjacent characters

  2. Name variations:
    • to_nickname: Convert real names to nicknames
    • to_realname: Convert nicknames to real names
    • invert_nick_realnames: Invert real names and nicknames
    • name_suffix: Add name suffixes (Jr., Sr., etc.)
    • first_letter_abbreviate: Abbreviate to first letter

  3. Format variations:
    • blanks_to_hyphens: Replace spaces with hyphens
    • hyphens_to_blanks: Replace hyphens with spaces
    • missing: Set values to missing
    • swap: Swap values between columns

  4. Record-level variations:
    • married_name_change: Simulate married name changes
    • duplicate: Add duplicate records
    • twins: Generate twin records

  5. Date variations:
    • date_month_swap: Swap day and month in dates
    • date_transpose_year: Transpose year digits in dates
    • date_transpose_day: Transpose day digits in dates
    • date_replace_year: Replace year digits in dates
    • date_replace_month: Replace month in dates
    • date_replace_day: Replace day in dates
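
To make the character-level and date errors above concrete, here is a minimal sketch of what such transformations might look like on single string values. This is not the package's implementation: the function bodies, the lowercase-letter alphabet, and the YYYY-MM-DD date format are all assumptions made for illustration; only the error names come from the list above.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for inserted characters


def indel(value: str, rng: random.Random) -> str:
    """Delete one character or insert one random letter (illustrative only)."""
    if value and rng.random() < 0.5:
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(LETTERS) + value[pos:]


def repl(value: str, rng: random.Random) -> str:
    """Replace one character with a different letter (illustrative only)."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    candidates = [c for c in LETTERS if c != value[pos].lower()]
    return value[:pos] + rng.choice(candidates) + value[pos + 1:]


def tpose(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters; a no-op when they happen to be equal."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def date_month_swap(date: str) -> str:
    """Swap day and month, assuming a YYYY-MM-DD string."""
    year, month, day = date.split("-")
    return f"{year}-{day}-{month}"


rng = random.Random(42)  # seed so that experiments are repeatable
print(indel("smith", rng), repl("smith", rng), tpose("smith", rng))
print(date_month_swap("1990-03-07"))  # 1990-07-03
```

Passing an explicit seeded random.Random instance keeps each corrupted dataset reproducible, which matters when comparing linkage frameworks on the same inputs.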

Data Requirements

  1. The input DataFrame must have an 'id' column.

  2. For name-related errors, you need to provide name lookup tables in the data directory:
    • first_names_male.csv
    • first_names_female.csv
    • last_names.csv
    • names_lookup.csv
    • nick_real_lookup.csv

  3. For gender-specific errors (e.g., married name changes), the DataFrame must have a sex/gender column with 'm'/'f' values.
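
The requirements above can be checked before generating errors. The sketch below is a hypothetical pre-flight check, not part of the package: the helper names check_records and missing_lookup_files, the sex_column parameter, and the data_dir argument are assumptions, and the records are plain dictionaries (with pandas, df.to_dict("records") produces the same shape).

```python
from pathlib import Path

# Lookup file names as listed in the requirements above.
LOOKUP_FILES = [
    "first_names_male.csv",
    "first_names_female.csv",
    "last_names.csv",
    "names_lookup.csv",
    "nick_real_lookup.csv",
]


def check_records(records, sex_column=None):
    """Return a list of problems; an empty list means the checks passed."""
    if not records:
        return ["no records supplied"]
    problems = []
    if any("id" not in row for row in records):
        problems.append("missing 'id' column")
    if sex_column is not None:
        # Gender-specific errors expect only 'm' or 'f' in this column.
        bad = {row.get(sex_column) for row in records} - {"m", "f"}
        if bad:
            shown = sorted(str(v) for v in bad)
            problems.append(f"unexpected values in '{sex_column}': {shown}")
    return problems


def missing_lookup_files(data_dir):
    """Return the lookup files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in LOOKUP_FILES if not (root / name).is_file()]


rows = [{"id": 1, "sex": "m"}, {"id": 2, "sex": "F"}]
print(check_records(rows, sex_column="sex"))
print(missing_lookup_files("data"))
```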

Error Specifications

The error specifications DataFrame must have the following columns:

  • error: Type of error to introduce (see available error types above)
  • amount: Number of errors to introduce, given either as an absolute count or as a fraction below 1
  • columns: Comma-separated list of columns to apply errors to
  • additional_args: Optional additional arguments as string (specific to error type)
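
As an illustration of how the amount and columns fields might be read, the sketch below resolves an amount into a row count and splits a comma-separated columns string. The helper names resolve_amount and split_columns are hypothetical, and treating a fractional amount as a share of the total number of rows is an assumption; the package may interpret it differently.

```python
def resolve_amount(amount, n_rows):
    """Turn an 'amount' value into a number of errors to introduce.

    Assumption: values below 1 are a fraction of n_rows; values of 1 or
    more are absolute counts.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if amount < 1:
        return round(amount * n_rows)
    return int(amount)


def split_columns(columns):
    """Split a comma-separated 'columns' string into clean column names."""
    return [name.strip() for name in columns.split(",") if name.strip()]


print(resolve_amount(0.05, 200))                 # 10
print(resolve_amount(10, 200))                   # 10
print(split_columns("first_name, last_name"))    # ['first_name', 'last_name']
```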

Citation

If you use this package in your research, please cite:

Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University.
Available electronically from https://hdl.handle.net/1969.1/186390

License

This project is licensed under the MIT License.
