A package for emulating common data entry errors
Mistaker
Mistaker is a Python package designed to emulate common data entry errors that occur in real-world datasets. It's particularly useful for testing data quality tools, generating synthetic training data, and simulating typical mistakes found in OCR output, manual transcription, and legacy data migration projects.
Features
- Simulate common transcription and data entry errors for:
  - Text strings and words
  - Personal names and business names
  - Dates in various formats
  - Numeric data
- Configurable error types and rates
- Support for multiple input formats
- Preserves data structure while introducing realistic errors
- Deterministic error generation available for testing
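The configurable-rate and deterministic-generation features can be illustrated with a minimal, standalone sketch. It does not use Mistaker's API; `drop_random_letter` and `corrupt_records` are hypothetical helpers, and the seeded `random.Random` instance is only one possible way to make errors reproducible.

```python
import random


def drop_random_letter(text, rng):
    """Remove one character at a random position (hypothetical helper)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def corrupt_records(records, error_rate, seed=None):
    """Corrupt each record with probability `error_rate`.

    Passing the same seed reproduces the same mistakes, which keeps
    tests stable. This is an assumption about how determinism could
    work, not a description of Mistaker's internals.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    return [
        drop_random_letter(record, rng) if rng.random() < error_rate else record
        for record in records
    ]


names = ["GRATEFUL", "TESTING", "KIM DEAL", "ROBERT SMITH"]
first = corrupt_records(names, error_rate=0.5, seed=42)
second = corrupt_records(names, error_rate=0.5, seed=42)
print(first == second)  # True: same seed, same mistakes
```

Using a dedicated generator rather than the module-level `random` functions means the corruption step cannot disturb, or be disturbed by, other random calls in a test suite.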
Installation
Install using pip:
pip install mistaker
For development installation:
pip install -e ".[test]"
Quick Start
from mistaker import Word, Name, Date, Number
# Generate word variations
Word.make_mistake("GRATEFUL") # => "GRATEFU"
Word.make_mistake("GRATEFUL") # => "GRATAFUL"
# Generate name variations with common mistakes
Name.make_mistake("KIM DEAL") # => "KIM FEAL"
Name.make_mistake("KIM DEAL") # => "KIM DEL"
Name("KIM DEAL").chaos() # => "DEELLL KIN"
# Generate date formatting errors and typos
Date.make_mistake("09/04/1982") # => "1928-09-04"
Date.make_mistake("09/04/1982") # => "0019-82-09"
# Generate numeric transcription errors
Number.make_mistake("12345") # => "12335"
Number.make_mistake("12345") # => "72345"
Detailed Usage
Word and Text Errors
from mistaker import Word, ErrorType
# Create a word instance
word = Word("TESTING")
# Generate specific error types
word.mistake(ErrorType.DROPPED_LETTER) # => "TESTNG"
word.mistake(ErrorType.DOUBLE_LETTER) # => "TESSTING"
word.mistake(ErrorType.MISREAD_LETTER) # => "TEZTING"
word.mistake(ErrorType.MISTYPED_LETTER) # => "TEDTING"
word.mistake(ErrorType.EXTRA_LETTER) # => "TESTINGS"
word.mistake(ErrorType.MISHEARD_LETTER) # => "TEZDING"
Name Handling
from mistaker import Name
# Create a name instance
name = Name("Robert James Smith")
# Generate name variations
variations = name.get_name_variations()
# Returns variations like:
# - "Smith, Robert"
# - "R James Smith"
# - "Robert Smith"
# - "Smith Robert"
# Generate case variants
cases = name.get_case_variants()
# Returns:
# - "Robert James Smith"
# - "ROBERT JAMES SMITH"
# - "robert james smith"
# Generate multiple errors
name.chaos() # Applies 1-6 random errors
# For example, Name("John Smith").chaos() # => "JAHN SMEH"
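To make the variation and case-variant lists above concrete, here is a standalone sketch that produces the same kinds of output from plain string operations. `name_variations` and `case_variants` are hypothetical functions written for illustration; Mistaker's own `get_name_variations()` and `get_case_variants()` may return different sets or orderings.

```python
def name_variations(full_name):
    """Return common reorderings and abbreviations of a personal name."""
    parts = full_name.split()
    if len(parts) < 2:
        return [full_name]  # nothing to reorder for a single-word name
    first, last = parts[0], parts[-1]
    middle = parts[1:-1]
    variations = [
        f"{last}, {first}",  # directory style
        f"{first} {last}",   # middle names dropped
        f"{last} {first}",   # surname first, no comma
    ]
    if middle:
        # First name reduced to an initial, middle names kept
        variations.append(" ".join([first[0], *middle, last]))
    return variations


def case_variants(full_name):
    """Return title-case, upper-case, and lower-case forms of a name."""
    return [full_name.title(), full_name.upper(), full_name.lower()]


print(name_variations("Robert James Smith"))
# ['Smith, Robert', 'Robert Smith', 'Smith Robert', 'R James Smith']
print(case_variants("Robert James Smith"))
# ['Robert James Smith', 'ROBERT JAMES SMITH', 'robert james smith']
```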
Date Handling
from mistaker import Date, ErrorType
# Create a date instance
date = Date("2023-05-15")
# Supports multiple input formats
date = Date("05/15/2023") # US format
date = Date("15/05/2023") # UK format
# Generate specific error types
date.mistake(ErrorType.MONTH_DAY_SWAP) # => "2023-15-05"
date.mistake(ErrorType.ONE_DECADE_DOWN) # => "2013-05-15"
date.mistake(ErrorType.Y2K) # => "0023-05-15"
Number Handling
from mistaker import Number, ErrorType
# Create a number instance
number = Number("12345")
# Generate specific error types
number.mistake(ErrorType.ONE_DIGIT_UP) # => "12346"
number.mistake(ErrorType.ONE_DIGIT_DOWN) # => "12344"
number.mistake(ErrorType.KEY_SWAP) # => "21345"
number.mistake(ErrorType.DIGIT_SHIFT) # => "01234"
number.mistake(ErrorType.MISREAD) # => "12375"
number.mistake(ErrorType.NUMERIC_KEY_PAD) # => "12348"
Error Types
Text and Name Errors
- Dropped Letters: Missing characters (e.g., "testing" → "testng")
- Double Letters: Repeated characters (e.g., "testing" → "tessting")
- Misread Letters: Similar-looking character substitutions (e.g., "testing" → "tezting")
- Mistyped Letters: Keyboard proximity errors (e.g., "testing" → "tedting")
- Extra Letters: Common suffix additions (e.g., "test" → "tests")
- Misheard Letters: Phonetic errors (e.g., "testing" → "tesding")
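The text error categories above can be sketched as small string transformations. The code below is an independent illustration, not Mistaker's implementation: the function names, the partial QWERTY adjacency map, and the look-alike table are all assumptions chosen to reproduce the examples in the list.

```python
import random

# Partial QWERTY adjacency map (assumption; covers only a few keys).
KEY_NEIGHBORS = {
    "A": "QWSZ", "D": "SERFCX", "E": "WSDR", "G": "FTYHBV",
    "I": "UJKO", "N": "BHJM", "S": "AWEDXZ", "T": "RFGY",
}

# Visually similar characters (assumption; illustrative only).
LOOK_ALIKES = {"S": "Z", "O": "Q", "I": "L", "E": "F"}


def dropped_letter(word, i):
    """Omit the character at index i."""
    return word[:i] + word[i + 1:]


def double_letter(word, i):
    """Repeat the character at index i."""
    return word[:i + 1] + word[i] + word[i + 1:]


def misread_letter(word, i):
    """Replace the character at index i with a look-alike, if one is known."""
    return word[:i] + LOOK_ALIKES.get(word[i].upper(), word[i]) + word[i + 1:]


def mistyped_letter(word, i, rng):
    """Replace the character at index i with a neighboring key, if known."""
    neighbors = KEY_NEIGHBORS.get(word[i].upper())
    if not neighbors:
        return word  # no adjacency data for this key
    return word[:i] + rng.choice(neighbors) + word[i + 1:]


print(dropped_letter("TESTING", 4))  # TESTNG
print(double_letter("TESTING", 2))   # TESSTING
print(misread_letter("TESTING", 2))  # TEZTING
print(mistyped_letter("TESTING", 2, random.Random(0)))  # S becomes a nearby key
```

Taking the position as an explicit argument keeps each transformation deterministic and easy to test; a caller that wants random errors can pick the index with its own seeded generator.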
Number Errors
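- Single Digit Errors: A digit off by one, up or down (e.g., "12345" → "12346")
- Key Swaps: Adjacent digit transposition (e.g., "12345" → "21345")
- Digit Shifts: Decimal/position shifts (e.g., "12345" → "01234")
- Misread Numbers: Similar-looking number substitution (e.g., "12345" → "12375")
- Numeric Keypad Errors: Adjacent-key slips on the number pad (e.g., "12345" → "12348")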
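The number error categories can likewise be sketched on digit strings. This is a standalone illustration under stated assumptions: the function names are hypothetical, inputs are assumed to contain digits only, off-by-one errors wrap around (9 up becomes 0), and the keypad map lists only horizontal and vertical neighbors on a standard 7-8-9 / 4-5-6 / 1-2-3 / 0 layout. Mistaker may handle each of these differently.

```python
import random

# Horizontal and vertical neighbors on a numeric keypad (assumption).
KEYPAD_NEIGHBORS = {
    "7": "48", "8": "759", "9": "86",
    "4": "715", "5": "2468", "6": "953",
    "1": "420", "2": "1530", "3": "26",
    "0": "12",
}


def one_digit_up(number, i):
    """Increase the digit at index i by one, wrapping 9 to 0."""
    return number[:i] + str((int(number[i]) + 1) % 10) + number[i + 1:]


def one_digit_down(number, i):
    """Decrease the digit at index i by one, wrapping 0 to 9."""
    return number[:i] + str((int(number[i]) - 1) % 10) + number[i + 1:]


def key_swap(number, i):
    """Transpose the digits at indexes i and i + 1."""
    return number[:i] + number[i + 1] + number[i] + number[i + 2:]


def digit_shift(number):
    """Shift all digits one place right, padding with a leading zero."""
    return "0" + number[:-1]


def keypad_slip(number, i, rng):
    """Replace the digit at index i with an adjacent keypad key."""
    return number[:i] + rng.choice(KEYPAD_NEIGHBORS[number[i]]) + number[i + 1:]


print(one_digit_up("12345", 4))    # 12346
print(one_digit_down("12345", 4))  # 12344
print(key_swap("12345", 0))        # 21345
print(digit_shift("12345"))        # 01234
print(keypad_slip("12345", 4, random.Random(0)))  # last digit becomes 2, 4, 6, or 8
```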
Date Errors
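- Month/Day Swaps: Common in international formats (e.g., "2023-05-15" → "2023-15-05")
- Decade Shifts: Common in manual entry (e.g., "2023-05-15" → "2013-05-15")
- Y2K Issues: Two-digit year ambiguity (e.g., "2023-05-15" → "0023-05-15")
- All Number-Based Errors: Inherited from number handling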
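The date errors above can be sketched as operations on an ISO-style `YYYY-MM-DD` string. The helpers below are hypothetical and independent of Mistaker; they assume the input has already been normalized, which `to_iso` does with the standard library's `datetime.strptime`. Years below 0010 are not handled by `one_decade_down`.

```python
from datetime import datetime


def to_iso(text, fmt):
    """Parse a date in the given strptime format and return YYYY-MM-DD."""
    return datetime.strptime(text, fmt).strftime("%Y-%m-%d")


def month_day_swap(iso_date):
    """Swap month and day; the result may not be a valid calendar date."""
    year, month, day = iso_date.split("-")
    return f"{year}-{day}-{month}"


def one_decade_down(iso_date):
    """Subtract ten years from the year component."""
    year, month, day = iso_date.split("-")
    return f"{int(year) - 10:04d}-{month}-{day}"


def y2k(iso_date):
    """Keep only the two-digit year, as if the century had been lost."""
    year, month, day = iso_date.split("-")
    return f"00{year[2:]}-{month}-{day}"


iso = to_iso("05/15/2023", "%m/%d/%Y")  # US format
print(iso)                   # 2023-05-15
print(month_day_swap(iso))   # 2023-15-05
print(one_decade_down(iso))  # 2013-05-15
print(y2k(iso))              # 0023-05-15
```

Working on strings rather than `datetime` objects is deliberate here: several of these errors produce values that are not valid dates, which a date type would reject.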
Development
Running Tests
pytest
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by real-world data quality challenges in government and enterprise systems
- Error patterns based on extensive analysis of common transcription mistakes
- Designed to support data quality testing and synthetic data generation
File details
Details for the file mistaker-0.1.0.tar.gz.
File metadata
- Download URL: mistaker-0.1.0.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0a389d5bfb4eb3afc8be34e376f7c122d5ccf891a5b3940f7f7815d24df01322 |
| MD5 | 92f9f15720c51f500cc2f7776459ab9f |
| BLAKE2b-256 | 407a988c3155cead7887dadec5566e402eac807a2f1d34e382e460246e12227b |
File details
Details for the file mistaker-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mistaker-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b49f6ab4b7148edf3b331dfbb27c7f141cb3458ff5242412252cc9159b3bdaf4 |
| MD5 | 09f63081aeb24ee0fef272d82617355c |
| BLAKE2b-256 | 2bfe9775447b9f0869407787cd5bc0d571dd0ed3063468810c825a9856d0f2ca |