A package for generating controlled errors in record linkage datasets

Error Generator

A Python package to introduce controlled errors into datasets for benchmarking record linkage frameworks. This package is a port of the R package rlErrorGeneratoR.

Overview

The Error Generator package allows you to introduce different types of data heterogeneity commonly found in record linkage projects:

  • Duplicates and twins
  • Name variations (suffixes, nicknames)
  • Character-level errors (insertions, deletions, replacements, transpositions)
  • Date format variations and errors
  • Missing values
  • Field swaps

You control the overall rate of heterogeneity introduced into the data, which makes it straightforward to run systematic, controlled experiments.

Installation

pip install -r requirements.txt

Usage

import pandas as pd
from error_generator import ErrorGenerator, generate_errors

# Load your data
df = pd.read_csv('your_data.csv')

# Create error specifications
error_specs = pd.DataFrame({
    'error': ['indel', 'to_nickname', 'date_month_swap'],
    'amount': [10, 5, 3],
    'columns': ['name', 'first_name', 'birth_date'],
    'additional_args': [None, None, None]
})

# Method 1: Using the ErrorGenerator class
generator = ErrorGenerator(df)
generator.add_errors(error_specs)
modified_df = generator.get_modified()
error_record = generator.get_error_record()

# Method 2: Using the convenience function
modified_df = generate_errors(df, error_specs)

Available Error Types

The package supports the following types of errors:

  1. Character-level errors:
    • indel: Insert or delete characters
    • repl: Replace characters
    • tpose: Transpose adjacent characters
  2. Name variations:
    • to_nickname: Convert real names to nicknames
    • to_realname: Convert nicknames to real names
    • invert_nick_realnames: Invert real names and nicknames
    • name_suffix: Add name suffixes (Jr., Sr., etc.)
    • first_letter_abbreviate: Abbreviate to first letter
  3. Format variations:
    • blanks_to_hyphens: Replace spaces with hyphens
    • hyphens_to_blanks: Replace hyphens with spaces
    • missing: Set values to missing
    • swap: Swap values between columns
  4. Record-level variations:
    • married_name_change: Simulate married name changes
    • duplicate: Add duplicate records
    • twins: Generate twin records
  5. Date variations:
    • date_month_swap: Swap day and month in dates
    • date_transpose_year: Transpose year digits in dates
    • date_transpose_day: Transpose day digits in dates
    • date_replace_year: Replace year digits in dates
    • date_replace_month: Replace month in dates
    • date_replace_day: Replace day in dates
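
To make the character-level and date categories more concrete, here is a minimal standalone sketch of what `indel`, `repl`, `tpose`, and `date_month_swap` could do to a single value. It is not the package's implementation. The function signatures, the seeded `random.Random` argument, the lowercase ASCII alphabet, and the ISO `YYYY-MM-DD` date format are all assumptions made for illustration.

```python
import random
import string
from datetime import date

ALPHABET = string.ascii_lowercase  # assumption: new characters come from a-z


def indel(value: str, rng: random.Random) -> str:
    """Insert one random character or delete one existing character."""
    if value and rng.random() < 0.5:
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]  # deletion
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(ALPHABET) + value[pos:]  # insertion


def repl(value: str, rng: random.Random) -> str:
    """Replace one character with a different character from ALPHABET."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    choices = [c for c in ALPHABET if c != value[pos]]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]


def tpose(value: str, rng: random.Random) -> str:
    """Swap one pair of adjacent characters."""
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def date_month_swap(iso_date: str) -> str:
    """Swap day and month when the result is still a valid date."""
    d = date.fromisoformat(iso_date)
    if d.day > 12:
        return iso_date  # the day cannot serve as a month, so leave it unchanged
    return date(d.year, d.day, d.month).isoformat()


rng = random.Random(42)  # a fixed seed keeps an experiment reproducible
print(indel("smith", rng), repl("smith", rng), tpose("smith", rng))
print(date_month_swap("1990-03-07"))
```

Passing an explicit, seeded generator is one way to keep error injection repeatable across benchmark runs; whether the package exposes a seed is not stated here.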

Data Requirements

  1. The input DataFrame must have an 'id' column.
  2. For name-related errors, you need to provide name lookup tables in the data directory:
    • first_names_male.csv
    • first_names_female.csv
    • last_names.csv
    • names_lookup.csv
    • nick_real_lookup.csv
  3. For gender-specific errors (e.g., married name changes), the DataFrame must have a sex/gender column with 'm'/'f' values.
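
These requirements can be checked before any errors are generated. The sketch below uses only the standard library and is not part of the package: `missing_lookup_files` and `check_columns` are hypothetical helpers, and the location of the data directory and the name of the sex column are assumptions you would adapt to your own setup. The file names are the ones listed above.

```python
from pathlib import Path

# File names come from the requirements list; the directory location is an assumption.
LOOKUP_FILES = (
    "first_names_male.csv",
    "first_names_female.csv",
    "last_names.csv",
    "names_lookup.csv",
    "nick_real_lookup.csv",
)


def missing_lookup_files(data_dir):
    """Return the lookup file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in LOOKUP_FILES if not (root / name).is_file()]


def check_columns(columns, sex_column=None, sex_values=()):
    """Return a list of problems with the table's columns and sex codes."""
    problems = []
    if "id" not in columns:
        problems.append("missing 'id' column")
    if sex_column is not None:
        if sex_column not in columns:
            problems.append(f"missing {sex_column!r} column")
        else:
            # Anything other than 'm' or 'f' is reported.
            bad = sorted(set(sex_values) - {"m", "f"}, key=str)
            if bad:
                problems.append(f"unexpected sex values: {bad}")
    return problems


print(check_columns(["id", "first_name", "sex"], "sex", ["m", "f", "x"]))
```

With a pandas DataFrame `df`, the same check could be called as `check_columns(df.columns, "sex", df["sex"].unique())`.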

Error Specifications

The error specifications DataFrame must have the following columns:

  • error: Type of error to introduce (see available error types above)
  • amount: Number of errors to introduce, given either as an absolute count or as a fraction less than 1
  • columns: Comma-separated list of columns to apply errors to
  • additional_args: Optional additional arguments as string (specific to error type)
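
As an illustration of how the `amount` and `columns` fields might be interpreted, the sketch below resolves an amount to a row count and splits a comma-separated column string. It is a guess at the semantics rather than the package's own logic: `resolve_amount` and `split_columns` are hypothetical names, and treating a fraction as a share of the rows, rounding it, and capping the result at the number of rows are all assumptions.

```python
def resolve_amount(amount, n_rows):
    """Turn an `amount` value into a number of rows to modify.

    Assumptions: a value below 1 is a fraction of the rows, anything else
    is an absolute count, and the result never exceeds n_rows.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    count = round(amount * n_rows) if amount < 1 else int(amount)
    return min(count, n_rows)


def split_columns(columns):
    """Split a comma-separated `columns` string into a list of clean names."""
    return [name.strip() for name in columns.split(",") if name.strip()]


print(resolve_amount(10, 200))    # absolute count
print(resolve_amount(0.1, 50))    # fraction of 50 rows
print(split_columns("first_name, last_name"))
```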

Citation

If you use this package in your research, please cite:

Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University.
Available electronically from https://hdl.handle.net/1969.1/186390

License

This project is licensed under the MIT License.
