A package for generating realistic errors in data

Error Generator

A Python package to introduce controlled errors into datasets for benchmarking record linkage frameworks. This package is a port of the R package rlErrorGeneratoR.

Overview

The Error Generator package allows you to introduce different types of data heterogeneity commonly found in record linkage projects:

  • Duplicates and twins
  • Name variations (suffixes, nicknames)
  • Character-level errors (insertions, deletions, replacements, transpositions)
  • Date format variations and errors
  • Missing values
  • Field swaps

You control the overall rate of heterogeneity in the data, which makes it straightforward to run systematic, controlled experiments.

Installation

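pip install error_generator

To work from a checkout of the source repository instead, install its dependencies:

pip install -r requirements.txt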

Usage

import pandas as pd
from error_generator import ErrorGenerator, generate_errors

# Load your data
df = pd.read_csv('your_data.csv')

# Create error specifications
error_specs = pd.DataFrame({
    'error': ['indel', 'to_nickname', 'date_month_swap'],
    'amount': [10, 5, 3],
    'columns': ['name', 'first_name', 'birth_date'],
    'additional_args': [None, None, None]
})

# Method 1: Using the ErrorGenerator class
generator = ErrorGenerator(df)
generator.add_errors(error_specs)
modified_df = generator.get_modified()
error_record = generator.get_error_record()

# Method 2: Using the convenience function
modified_df = generate_errors(df, error_specs)

Available Error Types

The package supports the following types of errors:

  1. Character-level errors:
    • indel: Insert or delete characters
    • repl: Replace characters
    • tpose: Transpose adjacent characters

  2. Name variations:
    • to_nickname: Convert real names to nicknames
    • to_realname: Convert nicknames to real names
    • invert_nick_realnames: Invert real names and nicknames
    • name_suffix: Add name suffixes (Jr., Sr., etc.)
    • first_letter_abbreviate: Abbreviate to first letter

  3. Format variations:
    • blanks_to_hyphens: Replace spaces with hyphens
    • hyphens_to_blanks: Replace hyphens with spaces
    • missing: Set values to missing
    • swap: Swap values between columns

  4. Record-level variations:
    • married_name_change: Simulate married name changes
    • duplicate: Add duplicate records
    • twins: Generate twin records

  5. Date variations:
    • date_month_swap: Swap day and month in dates
    • date_transpose_year: Transpose year digits in dates
    • date_transpose_day: Transpose day digits in dates
    • date_replace_year: Replace year digits in dates
    • date_replace_month: Replace month in dates
    • date_replace_day: Replace day in dates
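
To give a feel for what a few of these error types do to a single value, here is a minimal standalone sketch of indel, repl, tpose, and date_month_swap. It is not the package's implementation: the function bodies, the lowercase-only alphabet, the 50/50 choice between insertion and deletion, and the YYYY-MM-DD date format are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase letters only


def indel(value: str, rng: random.Random) -> str:
    """Delete one character or insert one random letter (assumed 50/50)."""
    if value and rng.random() < 0.5:
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(ALPHABET) + value[pos:]


def repl(value: str, rng: random.Random) -> str:
    """Replace one character with a different letter."""
    if not value:
        return value
    pos = rng.randrange(len(value))
    choices = [c for c in ALPHABET if c != value[pos].lower()]
    return value[:pos] + rng.choice(choices) + value[pos + 1:]


def tpose(value: str, rng: random.Random) -> str:
    """Swap one pair of adjacent characters.

    If the two characters are identical, the value is unchanged.
    """
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


def date_month_swap(value: str) -> str:
    """Swap day and month in a YYYY-MM-DD string (assumed format)."""
    year, month, day = value.split("-")
    return f"{year}-{day}-{month}"


rng = random.Random(0)  # seeded so that runs are repeatable
for corrupt in (indel, repl, tpose):
    print(corrupt.__name__, corrupt("jonathan", rng))
print(date_month_swap("1990-03-07"))  # 1990-07-03
```

Note that this naive month swap can produce an invalid date when the day is greater than 12; how the package treats such cases is not stated here.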

Data Requirements

  1. The input DataFrame must have an 'id' column.

  2. For name-related errors, you need to provide name lookup tables in the data directory:
    • first_names_male.csv
    • first_names_female.csv
    • last_names.csv
    • names_lookup.csv
    • nick_real_lookup.csv

  3. For gender-specific errors (e.g., married name changes), the DataFrame must have a sex/gender column with 'm'/'f' values.
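
The requirements above can be checked before any errors are generated. The sketch below is one possible pre-flight check using only the standard library. The helper name check_requirements and its parameters are hypothetical and not part of the package; only the 'id' column, the 'm'/'f' values, and the lookup file names come from the list above.

```python
from pathlib import Path

# File names come from the requirements list; everything else is hypothetical.
LOOKUP_FILES = [
    "first_names_male.csv",
    "first_names_female.csv",
    "last_names.csv",
    "names_lookup.csv",
    "nick_real_lookup.csv",
]


def check_requirements(columns, sex_values=None, data_dir=None):
    """Return a list of problems found; an empty list means all checks passed.

    columns    -- column names of the input table
    sex_values -- values of the sex/gender column, if gender-specific
                  errors will be used
    data_dir   -- directory expected to hold the name lookup tables, if
                  name-related errors will be used
    """
    problems = []
    if "id" not in columns:
        problems.append("missing required 'id' column")
    if sex_values is not None:
        bad = sorted({v for v in sex_values if v not in ("m", "f")}, key=str)
        if bad:
            problems.append(f"unexpected sex/gender values: {bad}")
    if data_dir is not None:
        for name in LOOKUP_FILES:
            if not (Path(data_dir) / name).is_file():
                problems.append(f"missing lookup table: {name}")
    return problems


print(check_requirements(["id", "first_name", "sex"], sex_values=["m", "f", "F"]))
# ["unexpected sex/gender values: ['F']"]
```

With a pandas DataFrame, the same helper could be called as check_requirements(list(df.columns), df['sex'].tolist(), 'data'), where the column name 'sex' and the directory 'data' are again assumptions.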

Error Specifications

The error specifications DataFrame must have the following columns:

  • error: Type of error to introduce (see available error types above)
  • amount: Number of errors to introduce, given either as an absolute count or as a fraction less than 1
  • columns: Comma-separated list of columns to apply errors to
  • additional_args: Optional additional arguments as a string (specific to error type)
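
As a rough illustration of how the amount and columns fields might be interpreted, the sketch below resolves an amount to a row count and splits a comma-separated columns string. Both helpers are hypothetical. In particular, treating a fraction as a share of the number of rows, and rounding it with Python's round, are assumptions rather than documented behavior of the package.

```python
def resolve_amount(amount, n_rows):
    """Turn an `amount` value into a number of rows to modify.

    ASSUMPTION: values below 1 are fractions of `n_rows`, rounded with
    Python's built-in round(); values of 1 or more are absolute counts.
    The package itself may resolve fractions differently.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if amount < 1:
        return round(amount * n_rows)
    return int(amount)


def parse_columns(columns):
    """Split a comma-separated `columns` string into clean column names."""
    return [name.strip() for name in columns.split(",") if name.strip()]


print(resolve_amount(10, 200))                  # 10
print(resolve_amount(0.05, 200))                # 10
print(parse_columns("first_name, last_name"))   # ['first_name', 'last_name']
```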

Citation

If you use this package in your research, please cite:

Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University.
Available electronically from https://hdl.handle.net/1969.1/186390

License

This project is licensed under the MIT License.
