A package for generating realistic errors in data
Error Generator
A Python package to introduce controlled errors into datasets for benchmarking record linkage frameworks. This package is a port of the R package rlErrorGeneratoR.
Overview
The Error Generator package allows you to introduce different types of data heterogeneity commonly found in record linkage projects:
- Duplicates and twins
- Name variations (suffixes, nicknames)
- Character-level errors (insertions, deletions, replacements, transpositions)
- Date format variations and errors
- Missing values
- Field swaps
The system allows you to control the overall rate of heterogeneity in the data, making it easy to run systematic controlled experiments.
Installation
```
pip install -r requirements.txt
```
Usage
```python
import pandas as pd
from error_generator import ErrorGenerator, generate_errors

# Load your data
df = pd.read_csv('your_data.csv')

# Create error specifications
error_specs = pd.DataFrame({
    'error': ['indel', 'to_nickname', 'date_month_swap'],
    'amount': [10, 5, 3],
    'columns': ['name', 'first_name', 'birth_date'],
    'additional_args': [None, None, None]
})

# Method 1: Using the ErrorGenerator class
generator = ErrorGenerator(df)
generator.add_errors(error_specs)
modified_df = generator.get_modified()
error_record = generator.get_error_record()

# Method 2: Using the convenience function
modified_df = generate_errors(df, error_specs)
```
Available Error Types
The package supports the following types of errors:
- Character-level errors:
  - indel: Insert or delete characters
  - repl: Replace characters
  - tpose: Transpose adjacent characters
- Name variations:
  - to_nickname: Convert real names to nicknames
  - to_realname: Convert nicknames to real names
  - invert_nick_realnames: Invert real names and nicknames
  - name_suffix: Add name suffixes (Jr., Sr., etc.)
  - first_letter_abbreviate: Abbreviate to first letter
- Format variations:
  - blanks_to_hyphens: Replace spaces with hyphens
  - hyphens_to_blanks: Replace hyphens with spaces
  - missing: Set values to missing
  - swap: Swap values between columns
- Record-level variations:
  - married_name_change: Simulate married name changes
  - duplicate: Add duplicate records
  - twins: Generate twin records
- Date variations:
  - date_month_swap: Swap day and month in dates
  - date_transpose_year: Transpose year digits in dates
  - date_transpose_day: Transpose day digits in dates
  - date_replace_year: Replace year digits in dates
  - date_replace_month: Replace month in dates
  - date_replace_day: Replace day in dates
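To make the character-level and date error types more concrete, here is a minimal standalone sketch of what such transformations might look like on a single string. It is not the package's implementation: the function bodies, the lowercase alphabet, the 50/50 choice between insertion and deletion, and the rule of leaving a date unchanged when the day exceeds 12 are all assumptions made for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed replacement/insertion alphabet


def tpose(value: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def repl(value: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a different letter."""
    if not value:
        return value
    i = rng.randrange(len(value))
    choices = [c for c in ALPHABET if c != value[i]]
    return value[:i] + rng.choice(choices) + value[i + 1:]


def indel(value: str, rng: random.Random) -> str:
    """Delete one character or insert one random letter (assumed 50/50)."""
    if value and rng.random() < 0.5:
        i = rng.randrange(len(value))
        return value[:i] + value[i + 1:]
    i = rng.randrange(len(value) + 1)
    return value[:i] + rng.choice(ALPHABET) + value[i:]


def month_swap(date_iso: str) -> str:
    """Swap day and month in a YYYY-MM-DD string.

    Assumption: if the day is greater than 12 the swap would produce an
    invalid month, so the value is returned unchanged.
    """
    year, month, day = date_iso.split("-")
    if int(day) > 12:
        return date_iso
    return f"{year}-{day}-{month}"


rng = random.Random(42)  # fixed seed so repeated runs give the same errors
print(tpose("smith", rng))
print(repl("smith", rng))
print(indel("smith", rng))
print(month_swap("1990-03-07"))  # 1990-07-03
```

Passing an explicit `random.Random` instance keeps each run reproducible, which matters when the injected errors feed a controlled benchmark.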
Data Requirements
- The input DataFrame must have an 'id' column.
- For name-related errors, you need to provide name lookup tables in the data directory:
  - first_names_male.csv
  - first_names_female.csv
  - last_names.csv
  - names_lookup.csv
  - nick_real_lookup.csv
- For gender-specific errors (e.g., married name changes), the DataFrame must have a sex/gender column with 'm'/'f' values.
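Checking these requirements before generating errors can save a confusing failure later. The helper below is a hypothetical pre-flight check written with plain pandas. It is not part of the package, and both the function name `check_input` and the `sex_column` parameter are assumptions, since the exact column name the package expects is not stated here.

```python
from typing import List, Optional

import pandas as pd


def check_input(df: pd.DataFrame, sex_column: Optional[str] = None) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper, not part of error_generator.
    """
    problems = []
    if "id" not in df.columns:
        problems.append("missing required 'id' column")
    if sex_column is not None:
        if sex_column not in df.columns:
            problems.append(f"missing sex/gender column '{sex_column}'")
        else:
            # Missing values are ignored; anything other than 'm'/'f' is flagged.
            seen = set(df[sex_column].dropna().unique())
            unexpected = sorted(str(v) for v in seen - {"m", "f"})
            if unexpected:
                problems.append(
                    f"column '{sex_column}' has values other than 'm'/'f': {unexpected}"
                )
    return problems


sample = pd.DataFrame({"id": [1, 2, 3], "sex": ["m", "f", "F"]})
print(check_input(sample, sex_column="sex"))
```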
Error Specifications
The error specifications DataFrame must have the following columns:
- error: Type of error to introduce (see available error types above)
- amount: Number of errors to introduce (can be absolute number or fraction < 1)
- columns: Comma-separated list of columns to apply errors to
- additional_args: Optional additional arguments as string (specific to error type)
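The amount and columns fields each pack a small convention into a single value. The sketch below shows one plausible reading of them. The helper names `resolve_amount` and `split_columns` are hypothetical, and the rounding of fractional amounts and the cap at the number of rows are assumptions, not documented behavior of the package.

```python
from typing import List


def resolve_amount(amount: float, n_rows: int) -> int:
    """Turn an 'amount' value into a number of rows to modify.

    Values below 1 are read as a fraction of the rows; values of 1 or more
    are read as an absolute count. Rounding and capping are assumptions.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if amount < 1:
        return round(amount * n_rows)
    return min(int(amount), n_rows)


def split_columns(columns: str) -> List[str]:
    """Split a comma-separated 'columns' value into clean column names."""
    return [name.strip() for name in columns.split(",") if name.strip()]


print(resolve_amount(10, 200))    # 10
print(resolve_amount(0.25, 200))  # 50
print(split_columns("first_name, last_name"))  # ['first_name', 'last_name']
```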
Citation
If you use this package in your research, please cite:
Ilangovan, Gurudev (2019). Benchmarking the Effectiveness and Efficiency of Machine Learning Algorithms for Record Linkage. Master's thesis, Texas A&M University.
Available electronically from https://hdl.handle.net/1969.1/186390
License
This project is licensed under the MIT License.