A Python package for detecting garbled text using multiple detection strategies with a scikit-learn-like interface

pygarble

Detect gibberish, garbled text, and corrupted content using statistical and rule-based detection strategies.

pygarble is a powerful Python library designed to identify nonsensical, garbled, or corrupted text content that often appears in data processing pipelines, user inputs, or automated systems. Whether you're dealing with random character sequences, encoding errors, keyboard mashing, or corrupted data streams, pygarble provides multiple detection strategies to filter out unwanted content and maintain data quality. The library uses statistical analysis, entropy calculations, pattern matching, and n-gram analysis to distinguish between meaningful text and gibberish with configurable sensitivity levels.

Features

24 Detection Strategies: Choose from multiple garble detection algorithms including Markov chains, n-gram analysis, mojibake detection, and homoglyph detection
Zero Dependencies: Core library works without any external dependencies
Ensemble Detector: Combine multiple strategies for higher accuracy with voting mechanisms
Scikit-learn Interface: Familiar predict() and predict_proba() methods
Configurable Thresholds: Adjust sensitivity for each strategy
Probability Scores: Get confidence scores for garble detection
Input Validation: Built-in validation for thresholds and parameters
Type Hints: Full type annotation support throughout the codebase
Modular Design: Easy to extend with new detection strategies
Smart Edge Cases: Automatically detects extremely long strings without whitespace (like base64 data)

Installation

You can install pygarble using pip:

# Core library (zero dependencies)
pip install pygarble

# With pyspellchecker for legacy word validation (optional)
pip install "pygarble[spellchecker]"  # quotes keep shells such as zsh from expanding the brackets

Quick Start

from pygarble import GarbleDetector, Strategy, EnsembleDetector

# RECOMMENDED: Default ensemble (99.5% precision, majority voting)
ensemble = EnsembleDetector()
print(ensemble.predict("hello world"))      # False
print(ensemble.predict("asdfghjkl"))        # True
print(ensemble.predict("xkjqzpvmw"))        # True (impossible bigrams)

# Individual strategies for specific use cases
detector = GarbleDetector(Strategy.MARKOV_CHAIN)  # Best overall (92% F1)
print(detector.predict("the quick brown fox"))    # False

detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY)  # 100% precision
print(detector.predict("qxjjxz"))                 # True (impossible bigrams)

# Batch processing
texts = ["Hello world", "asdfghjkl", "qwertyuiop"]
results = ensemble.predict(texts)
print(results)  # [False, True, True]

# Get probability scores
probabilities = ensemble.predict_proba(texts)
print(probabilities)  # e.g. [0.0, 0.8, 0.6]

Benchmark Results

Based on 1,644 test cases (420 internal + 1,224 external validation):

Top Strategy Performance

Strategy	Accuracy	Precision	Recall	F1 Score
markov_chain	93.19%	98.80%	86.39%	92.18%
ensemble	89.84%	99.50%	78.53%	87.78%
ngram_frequency	87.41%	96.34%	75.79%	84.84%
pronounceability	81.08%	84.47%	72.64%	78.11%
letter_position	77.86%	99.02%	52.88%	68.94%
bigram_probability	69.16%	100%	33.64%	50.34%
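
The F1 column can be cross-checked from the precision and recall columns. The sketch below uses the standard harmonic-mean definition of F1 (it is not taken from the library's benchmark script), with the values copied from the table as fractions; `f1_score` is a name chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above, expressed as fractions.
rows = {
    "markov_chain": (0.9880, 0.8639),
    "ensemble": (0.9950, 0.7853),
    "bigram_probability": (1.0, 0.3364),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2%}")
```

The computed values agree with the F1 column to two decimal places, which also shows why a 100% precision strategy with low recall ends up with a modest F1.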

High-Precision Strategies (v0.5.0)

New strategies designed for maximum precision (minimize false positives):

Strategy	Precision	Target Use Case
bigram_probability	100%	Impossible letter pairs (qx, jj, xz)
rare_trigram	100%	Impossible trigrams (jjj, qqq, xqz)
vowel_pattern	100%	Invalid vowel sequences (aaaaaaa)
letter_frequency	100%	Abnormal letter distribution
consonant_sequence	99.38%	Impossible consonant runs (6+ consonants)
letter_position	99.02%	Letters in impossible positions

Default Ensemble Configuration

The default EnsembleDetector() uses majority voting with high-precision strategies:

MARKOV_CHAIN (95% precision, 61% recall)
WORD_LOOKUP (89% precision, 51% recall)
NGRAM_FREQUENCY (88% precision, 47% recall)
BIGRAM_PROBABILITY (100% precision, 25% recall)
LETTER_POSITION (93% precision, 35% recall)

Result: 99.5% precision - only 3 false positives out of 1,644 test cases.

Run the benchmark yourself:

python regression/benchmark.py

Detection Strategies

Each strategy implements a different approach to detect garbled text. All strategies return probability scores between 0.0 and 1.0, where higher scores indicate more likely garbled text.

1. Keyboard Pattern (`KEYBOARD_PATTERN`) ⭐ Best F1 Score in v0.2.0

Implementation Logic: Detects keyboard row sequences (qwerty, asdf, zxcv) and analyzes trigram patterns. English text has predictable trigram distributions; garbled text doesn't.

Algorithm:

Extract trigrams from the text
Check for keyboard row sequences (forward and reverse)
Compare against common English trigrams
Detect repeated bigram patterns (ababab)

Parameters:

keyboard_threshold (float, default: 0.3): Threshold for keyboard pattern ratio
common_trigram_threshold (float, default: 0.1): Minimum common trigram ratio

detector = GarbleDetector(Strategy.KEYBOARD_PATTERN, threshold=0.5)

# Examples
detector.predict("asdfghjkl")       # True - keyboard row pattern
detector.predict("qwertyuiop")      # True - keyboard row pattern
detector.predict("Hello world")     # False - normal English text
detector.predict("ababababab")      # True - repeated bigram pattern
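
As a rough illustration of the keyboard-row step only, the following sketch counts how many letter trigrams fall on a single keyboard row, forward or reversed. It is not the library's implementation: the row strings, the function name `keyboard_trigram_ratio`, and the decision to ignore the common-trigram comparison are all assumptions made for this example.

```python
KEYBOARD_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def keyboard_trigram_ratio(text: str) -> float:
    """Fraction of letter trigrams that lie on one keyboard row (either direction)."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    trigrams = [letters[i:i + 3] for i in range(len(letters) - 2)]
    if not trigrams:
        return 0.0
    # Include reversed rows so sequences such as "lkjh" are caught too.
    rows = KEYBOARD_ROWS + tuple(row[::-1] for row in KEYBOARD_ROWS)
    hits = sum(1 for tri in trigrams if any(tri in row for row in rows))
    return hits / len(trigrams)


print(keyboard_trigram_ratio("asdfghjkl"))    # every trigram is on the home row
print(keyboard_trigram_ratio("Hello world"))  # no trigram is on a keyboard row
```

A ratio like this could then be compared with something like keyboard_threshold, though how the library combines it with the trigram and repeated-bigram checks is not shown here.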

2. Vowel Ratio (`VOWEL_RATIO`) ⭐ Best Precision in v0.2.0

Implementation Logic: Analyzes the ratio of vowels to consonants. Natural English has 35-45% vowels. Also detects consonant clusters that are impossible in English.

Algorithm:

Calculate vowel ratio in alphabetic characters
Detect long consonant clusters (4+ consecutive consonants)
Flag text outside normal vowel ratio range (15-65%)

Parameters:

min_vowel_ratio (float, default: 0.15): Minimum allowed vowel ratio
max_vowel_ratio (float, default: 0.65): Maximum allowed vowel ratio
consonant_cluster_len (int, default: 4): Consonant run length at which a cluster counts as long

detector = GarbleDetector(Strategy.VOWEL_RATIO, threshold=0.5)

# Examples
detector.predict("bcdfghjklmnp")    # True - no vowels
detector.predict("aeiouaeiou")      # True - all vowels
detector.predict("Hello world")     # False - normal vowel ratio (~30%)
detector.predict("rhythm")          # False - valid English word
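
The two measurements in the algorithm can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `vowel_stats` is a hypothetical helper, and it treats only a, e, i, o, u as vowels, whereas the README does not say how the library handles 'y'.

```python
import re

VOWELS = set("aeiou")  # assumption: 'y' is counted as a consonant in this sketch


def vowel_stats(text: str, cluster_len: int = 4):
    """Return (vowel ratio among letters, whether a long consonant run exists)."""
    lowered = text.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    if not letters:
        return 0.0, False
    ratio = sum(ch in VOWELS for ch in letters) / len(letters)
    # A run of `cluster_len` or more consecutive consonants counts as a long cluster.
    pattern = rf"[bcdfghjklmnpqrstvwxyz]{{{cluster_len},}}"
    has_cluster = re.search(pattern, lowered) is not None
    return ratio, has_cluster


print(vowel_stats("bcdfghjklmnp"))
print(vowel_stats("Hello world"))
```

With this vowel set, a word like "rhythm" would score a ratio of zero, so the library presumably handles 'y' or such words differently.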

3. Entropy Based (`ENTROPY_BASED`)

Implementation Logic: Uses Shannon entropy combined with bigram frequency analysis. English text has predictable character and bigram distributions.

Algorithm:

Calculate Shannon entropy of alphabetic characters
Analyze common English bigram frequency (th, he, in, er, etc.)
Combine entropy and bigram scores

Parameters:

entropy_threshold (float, default: 2.5): Minimum required entropy
bigram_threshold (float, default: 0.15): Minimum common bigram ratio

detector = GarbleDetector(Strategy.ENTROPY_BASED, threshold=0.5)

# Examples
detector.predict("aaaaaaa")         # True - low entropy (repetitive)
detector.predict("xkjqzpv")         # True - no common bigrams
detector.predict("the weather")     # False - high common bigram ratio
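
For reference, the first step (Shannon entropy over alphabetic characters) can be computed as below. This is a generic textbook calculation in bits per character, not code from the library, and `shannon_entropy` is a name chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the letters in `text`."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    total = len(letters)
    counts = Counter(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for sample in ("aaaaaaa", "abcd", "xkjqzpv"):
    print(sample, round(shannon_entropy(sample), 3))
```

A string of seven distinct letters such as "xkjqzpv" has high entropy even though it is gibberish, which illustrates why entropy is combined with a bigram check rather than used alone.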

4. Pattern Matching (`PATTERN_MATCHING`)

Implementation Logic: Uses regex patterns to detect suspicious sequences including keyboard rows, repeated characters, and consonant clusters.

Default Patterns:

special_chars: 3+ special characters
repeated_chars: 4+ repeated characters
uppercase_sequence: 5+ uppercase letters
long_numbers: 8+ consecutive digits
keyboard_row_qwerty: Keyboard row sequences (qwert, asdf, zxcv)
keyboard_row_reverse: Reverse keyboard sequences
consonant_cluster: 5+ consecutive consonants
alternating_pattern: Alternating character patterns (ababab)

detector = GarbleDetector(Strategy.PATTERN_MATCHING, threshold=0.2)

# Examples
detector.predict("asdfghjkl")       # True - keyboard row
detector.predict("AAAAA")           # True - repeated chars
detector.predict("normal text")     # False - no patterns match

5. Markov Chain (`MARKOV_CHAIN`) ⭐ NEW - Recommended

Implementation Logic: Uses a character-level Markov chain trained on English text. Computes the probability of text based on character transition frequencies. Garbled text has unusual character transitions.

Algorithm:

Train bigram transition probabilities on 300K+ English words
Compute average log-probability of character transitions
Map to garble score using sigmoid function

Parameters:

threshold_per_char (float, default: -3.5): Average log probability threshold

detector = GarbleDetector(Strategy.MARKOV_CHAIN, threshold=0.5)

# Examples
detector.predict("hello world")       # False - common bigrams (he, el, ll, lo, ow, wo, or, rl, ld)
detector.predict("asdfghjkl")         # True - unusual bigrams (sd, df, fg, gh, hj, jk, kl)
detector.predict("xzqkjhf")           # True - rare character transitions
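
The three algorithm steps can be illustrated with a toy model. The sketch below trains on six words instead of 300K+, uses add-one smoothing, and centres the sigmoid on threshold_per_char; the smoothing, the exact sigmoid form, and the names `train_transitions` and `garble_score` are assumptions for illustration rather than the library's implementation.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_transitions(words):
    """Build add-one-smoothed log-probabilities for every letter-to-letter transition."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            counts[a][b] += 1
    log_probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values()) + len(ALPHABET)
        for b in ALPHABET:
            log_probs[a + b] = math.log((counts[a][b] + 1) / total)
    return log_probs


def garble_score(text, log_probs, threshold_per_char=-3.5):
    """Map the average transition log-probability to a score in (0, 1)."""
    letters = "".join(ch for ch in text.lower() if ch in ALPHABET)
    pairs = [letters[i:i + 2] for i in range(len(letters) - 1)]
    if not pairs:
        return 0.0
    avg = sum(log_probs[p] for p in pairs) / len(pairs)
    # Sigmoid centred on the threshold: lower average log-probability gives a higher score.
    return 1.0 / (1.0 + math.exp(avg - threshold_per_char))


toy_model = train_transitions(["hello", "world", "the", "quick", "brown", "fox"])
print(garble_score("hello world", toy_model))
print(garble_score("xzqkjhf", toy_model))
```

With such a small corpus the default threshold is not calibrated, so only the ordering of the two scores is meaningful: the gibberish string scores higher than the English phrase.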

6. N-gram Frequency (`NGRAM_FREQUENCY`) ⭐ NEW

Implementation Logic: Analyzes what proportion of character trigrams appear in common English text. Uses a set of the 2,000 most common English trigrams.

Algorithm:

Extract trigrams from words
Count how many appear in common trigram set
Low ratio = likely garbled

Parameters:

common_ratio_threshold (float, default: 0.3): Minimum ratio of common trigrams

detector = GarbleDetector(Strategy.NGRAM_FREQUENCY, threshold=0.5)

# Examples
detector.predict("the quick brown")   # False - trigrams: the, qui, uic, ick, bro, row, own
detector.predict("xzqkjhf")           # True - no common trigrams
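
A minimal sketch of the ratio calculation follows. The eight-entry `COMMON_TRIGRAMS` set is a tiny stand-in for the embedded trigram data, and `common_trigram_ratio` is a hypothetical helper, so the numbers only illustrate the mechanics.

```python
# Tiny stand-in for the library's embedded set of common English trigrams.
COMMON_TRIGRAMS = {"the", "qui", "ick", "bro", "row", "own", "and", "ing"}


def common_trigram_ratio(text: str) -> float:
    """Share of per-word letter trigrams that appear in COMMON_TRIGRAMS."""
    trigrams = []
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        trigrams.extend(letters[i:i + 3] for i in range(len(letters) - 2))
    if not trigrams:
        return 1.0  # nothing to judge; this sketch treats such input as not garbled
    return sum(tri in COMMON_TRIGRAMS for tri in trigrams) / len(trigrams)


print(common_trigram_ratio("the quick brown"))  # 6 of 7 trigrams are in the set
print(common_trigram_ratio("xzqkjhf"))          # none are in the set
```

A low ratio, compared against something like common_ratio_threshold, would then mark the text as likely garbled.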

7. Word Lookup (`WORD_LOOKUP`) ⭐ NEW - Zero Dependencies

Implementation Logic: Validates words against an embedded dictionary of 50,000 common English words. No external dependencies required.

Algorithm:

Tokenize text into words
Check each word against embedded word set
Return ratio of unknown words

Parameters:

unknown_threshold (float, default: 0.5): Ratio above which text is garbled

detector = GarbleDetector(Strategy.WORD_LOOKUP, threshold=0.5)

# Examples
detector.predict("hello world")       # False - both words in dictionary
detector.predict("xyzzy plugh")       # True - neither word in dictionary
detector.predict("hello xyzzy")       # True (0.5) - half unknown
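
The unknown-word ratio itself is simple to reproduce. In this sketch `KNOWN_WORDS` is a six-word stand-in for the embedded dictionary and the regex tokenizer is an assumption; the library's tokenization rules are not described in this README.

```python
import re

# Stand-in for the embedded word set; the real dictionary is far larger.
KNOWN_WORDS = {"hello", "world", "the", "quick", "brown", "fox"}


def unknown_word_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that are missing from KNOWN_WORDS."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word not in KNOWN_WORDS for word in words) / len(words)


for sample in ("hello world", "xyzzy plugh", "hello xyzzy"):
    print(sample, unknown_word_ratio(sample))
```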

8. Symbol Ratio (`SYMBOL_RATIO`) - NEW

Implementation Logic: Detects text with high proportion of special characters, numbers, or non-alphabetic content. Particularly effective for symbol spam, number sequences, and mixed alphanumeric noise.

Algorithm:

Count non-alphabetic characters (excluding spaces)
Calculate ratio to total characters
High ratio = likely garbled

Parameters:

symbol_threshold (float, default: 0.5): Ratio above which text is garbled
min_length (int, default: 3): Minimum text length to analyze
allow_spaces (bool, default: True): Whether to exclude spaces from ratio

detector = GarbleDetector(Strategy.SYMBOL_RATIO, threshold=0.5)

# Examples
detector.predict("!!!@@@###$$$")     # True - all symbols
detector.predict("abc123def456")     # True - high number ratio
detector.predict("hello world")       # False - mostly alphabetic
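
A sketch of the ratio calculation, assuming that allow_spaces=True means spaces are left out of both the symbol count and the total. `symbol_ratio` is a hypothetical helper, not the library function.

```python
def symbol_ratio(text: str, allow_spaces: bool = True) -> float:
    """Fraction of characters that are not letters, optionally ignoring whitespace."""
    chars = [ch for ch in text if not (allow_spaces and ch.isspace())]
    if not chars:
        return 0.0
    return sum(not ch.isalpha() for ch in chars) / len(chars)


for sample in ("!!!@@@###$$$", "abc123def456", "hello world"):
    print(sample, symbol_ratio(sample))
```

Note that "abc123def456" lands exactly on 0.5, so whether it is flagged depends on whether the comparison with symbol_threshold is inclusive.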

9. Repetition (`REPETITION`) - NEW

Implementation Logic: Detects text with excessive character or pattern repetition. Identifies repeated single characters, bigrams, trigrams, and low character diversity.

Algorithm:

Check for repeated single characters (aaaa)
Check for repeated bigrams (ababab)
Check for repeated trigrams (abcabcabc)
Analyze character diversity

Parameters:

max_char_repeat (int, default: 3): Maximum allowed consecutive repeated characters
max_pattern_repeat (int, default: 3): Maximum allowed pattern repetitions
diversity_threshold (float, default: 0.3): Minimum unique character ratio

detector = GarbleDetector(Strategy.REPETITION, threshold=0.5)

# Examples
detector.predict("aaaaaaaaaa")        # True - repeated character
detector.predict("abababababab")      # True - repeated bigram
detector.predict("hello world")       # False - diverse characters
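
The checks in the algorithm map naturally onto backreference regexes. The patterns below are one possible reading of the defaults (more than 3 consecutive repeats is flagged); they and the helper `repetition_flags` are assumptions for illustration, not the library's compiled patterns.

```python
import re

# One character repeated 4 or more times in a row (exceeds max_char_repeat=3).
REPEATED_CHAR = re.compile(r"(.)\1{3,}")
# A 2- or 3-character unit repeated 4 or more times (exceeds max_pattern_repeat=3).
REPEATED_PATTERN = re.compile(r"(.{2,3})\1{3,}")


def repetition_flags(text: str) -> dict:
    """Report which repetition checks fire, plus the unique-character ratio."""
    return {
        "repeated_char": bool(REPEATED_CHAR.search(text)),
        "repeated_pattern": bool(REPEATED_PATTERN.search(text)),
        "diversity": len(set(text)) / len(text) if text else 1.0,
    }


for sample in ("aaaaaaaaaa", "abababababab", "hello world"):
    print(sample, repetition_flags(sample))
```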

10. Hex String (`HEX_STRING`) - NEW

Implementation Logic: Detects hash strings, UUIDs, base64-like content, and other hexadecimal patterns commonly found in garbled data.

Algorithm:

Check for pure hash patterns (MD5, SHA256)
Detect UUID format (8-4-4-4-12)
Identify long hex sequences
Check for base64-like patterns

Parameters:

min_hex_length (int, default: 16): Minimum hex sequence length to detect
hex_ratio_threshold (float, default: 0.7): Ratio of hex chars above which text is suspicious

detector = GarbleDetector(Strategy.HEX_STRING, threshold=0.5)

# Examples
detector.predict("5d41402abc4b2a76b9719d911017c592")  # True - MD5 hash
detector.predict("550e8400-e29b-41d4-a716-446655440000")  # True - UUID
detector.predict("hello world")                        # False - no hex patterns
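
The first two checks can be approximated with anchored regexes. This sketch covers only fixed-length hex digests (32, 40, or 64 characters, matching MD5, SHA-1, and SHA-256 lengths) and the 8-4-4-4-12 UUID layout; the long-hex and base64 checks are omitted, and `classify_hex` is a hypothetical helper.

```python
import re

# Hex digests of common lengths: 32 (MD5), 40 (SHA-1), 64 (SHA-256).
HASH_RE = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")
# UUID layout: 8-4-4-4-12 hex digits separated by hyphens.
UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def classify_hex(text: str):
    """Return 'uuid', 'hash', or None for strings that match neither layout."""
    candidate = text.strip()
    if UUID_RE.match(candidate):
        return "uuid"
    if HASH_RE.match(candidate):
        return "hash"
    return None


print(classify_hex("5d41402abc4b2a76b9719d911017c592"))
print(classify_hex("550e8400-e29b-41d4-a716-446655440000"))
print(classify_hex("hello world"))
```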

11. Compression Ratio (`COMPRESSION_RATIO`) - NEW v0.4.0

Implementation Logic: Uses zlib compression to detect text with unusual entropy patterns. Natural language has patterns and redundancy that compress well, while random text compresses poorly.

Algorithm:

Compress text using zlib
Calculate compression ratio (compressed/original size)
Compare against thresholds

Parameters:

high_ratio_threshold (float, default: 1.1): Ratio above which text is garbled
low_ratio_threshold (float, default: 0.85): Ratio below which text is valid
min_length (int, default: 100): Minimum text length to analyze

detector = GarbleDetector(Strategy.COMPRESSION_RATIO, threshold=0.5)

# Examples (works best on longer text)
long_random = "xkjhqwerty zxcvbn " * 10
detector.predict(long_random)              # True - random patterns
detector.predict("hello " * 30)            # False - repetitive but valid
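
The ratio in step 2 can be computed directly with zlib. The helper name `compression_ratio` and the choice of UTF-8 encoding and default compression level are assumptions for this sketch.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size, both in bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


print(compression_ratio("hello " * 30))  # highly repetitive, compresses very well
print(compression_ratio("abc"))          # too short: zlib overhead dominates
```

Very short inputs always give a ratio above 1 because of the fixed zlib header and checksum, which is presumably why min_length exists. Any string built by repeating a short phrase compresses well regardless of what the phrase contains.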

12. Mojibake Detection (`MOJIBAKE`) - NEW v0.4.0 ⭐ 100% Precision

Implementation Logic: Detects encoding corruption (mojibake) that occurs when UTF-8 text is incorrectly decoded as Latin-1 or Windows-1252. Identifies patterns like "Ã©" (should be "é") and Unicode replacement characters (�).

Algorithm:

Search for known mojibake byte patterns
Detect Unicode replacement characters (U+FFFD)
Check for high density of Latin-1 supplement characters
Identify double-encoding signatures

Parameters:

pattern_threshold (int, default: 1): Number of patterns to trigger detection
ratio_threshold (float, default: 0.05): High-byte density threshold
check_replacement_char (bool, default: True): Check for � characters

detector = GarbleDetector(Strategy.MOJIBAKE, threshold=0.5)

# Examples
detector.predict("Café au lait")           # False - correct UTF-8
detector.predict("CafÃ© au lait")          # True - mojibake (UTF-8 as Latin-1)
detector.predict("Hello � world")          # True - replacement character
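
To see where strings like "CafÃ©" come from, the sketch below reproduces the corruption and then uses a round-trip test to spot it. The round-trip check is a different technique from the pattern search described above, swapped in here because it needs no pattern table; all three function names are hypothetical.

```python
def make_mojibake(text: str) -> str:
    """Simulate the corruption: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Undo the corruption when possible; otherwise return the input unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def looks_like_mojibake(text: str) -> bool:
    """True if the text holds U+FFFD or changes when the mis-decoding is reversed."""
    return "\ufffd" in text or repair_mojibake(text) != text


corrupted = make_mojibake("Café au lait")
print(corrupted)
print(repair_mojibake(corrupted))
print(looks_like_mojibake("Café au lait"), looks_like_mojibake(corrupted))
```

Correctly decoded text such as "Café au lait" fails the reverse decoding step, so it is left unchanged and not flagged.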

13. Pronounceability (`PRONOUNCEABILITY`) - NEW v0.4.0

Implementation Logic: Analyzes if text follows English phonotactic rules. Detects forbidden consonant clusters, checks vowel distribution, and validates word onset patterns.

Algorithm:

Extract consonant clusters from words
Check against forbidden bigram combinations (e.g., "bk", "zt", "qx")
Verify vowel ratio is within pronounceable range
Validate word-initial consonant clusters

Parameters:

forbidden_cluster_threshold (int, default: 2): Forbidden clusters to trigger
min_word_length (int, default: 3): Minimum word length to analyze
vowel_min_ratio (float, default: 0.1): Minimum vowel ratio

detector = GarbleDetector(Strategy.PRONOUNCEABILITY, threshold=0.5)

# Examples
detector.predict("hello world")            # False - pronounceable
detector.predict("xkcd qwfp zxcv")         # True - unpronounceable clusters
detector.predict("bvnk tspk dkfm")         # True - forbidden consonant pairs
detector.predict("through threshold")       # False - valid English clusters

14. Unicode Script Mixing (`UNICODE_SCRIPT`) - NEW v0.4.0 ⭐ 100% Precision

Implementation Logic: Detects suspicious mixing of Unicode scripts, particularly homoglyph attacks where Cyrillic or Greek characters are disguised as Latin letters. Common in phishing attempts (e.g., "pаypal" with Cyrillic 'а').

Algorithm:

Check for known homoglyph characters (Cyrillic а, о, е, Greek ο, etc.)
Detect words mixing multiple scripts
Count total scripts used in text

Parameters:

homoglyph_threshold (int, default: 1): Homoglyphs to trigger detection
max_scripts (int, default: 2): Maximum allowed scripts
check_homoglyphs (bool, default: True): Enable homoglyph detection

detector = GarbleDetector(Strategy.UNICODE_SCRIPT, threshold=0.5)

# Examples
detector.predict("paypal")                 # False - all Latin
detector.predict("pаypal")                 # True - Cyrillic 'а' (U+0430)
detector.predict("gооgle")                 # True - Cyrillic 'о' (U+043E)
detector.predict("Hello АБВ World")        # True - mixed Latin/Cyrillic
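
Script mixing can be approximated with the standard unicodedata module. This sketch infers a script from the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC"), which is a heuristic and an assumption of this example; it does not use a homoglyph table as the algorithm above describes. Escapes are used so the Cyrillic letters are visible in the source.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Set of script names, taken from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def has_mixed_script_word(text: str) -> bool:
    """True if any whitespace-separated word mixes letters from several scripts."""
    return any(len(scripts_in(word)) > 1 for word in text.split())


print(has_mixed_script_word("paypal"))                # all Latin
print(has_mixed_script_word("p\u0430ypal"))           # contains Cyrillic U+0430
print(scripts_in("Hello \u0410\u0411\u0412 World"))   # Latin and Cyrillic words
```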

15. Bigram Probability (`BIGRAM_PROBABILITY`) - NEW v0.5.0 ⭐ 100% Precision

Implementation Logic: Detects impossible letter pair combinations that never occur in English. Uses phonotactic constraints to identify gibberish.

Algorithm:

Extract all letter bigrams from text
Check against set of impossible bigrams (qx, jj, xz, etc.)
Calculate ratio of impossible to total bigrams

Parameters:

impossible_ratio_threshold (float, default: 0.1): Ratio above which text is garbled

detector = GarbleDetector(Strategy.BIGRAM_PROBABILITY, threshold=0.5)

# Examples
detector.predict("hello world")       # False - valid bigrams
detector.predict("qxjjxz")            # True - impossible bigrams (qx, jj, xz)
detector.predict("bxcxdx")            # True - impossible bigrams

16. Letter Position (`LETTER_POSITION`) - NEW v0.5.0 ⭐ 99% Precision

Implementation Logic: Detects letters appearing in impossible positions within words (start/end constraints).

Algorithm:

Check for letters that never end words (j, q, v)
Check for impossible word-initial letter pairs
Calculate violation ratio

detector = GarbleDetector(Strategy.LETTER_POSITION, threshold=0.5)

# Examples
detector.predict("Strange strings")    # False - valid positions
detector.predict("wordj endq")         # True - j and q can't end words
detector.predict("xjword bwtext")      # True - impossible word starts

17. Consonant Sequence (`CONSONANT_SEQUENCE`) - NEW v0.5.0

Implementation Logic: Detects impossibly long consonant sequences. English words rarely contain more than 5 consecutive consonants (e.g., "ngths" in "strengths").

Algorithm:

Extract consonant sequences from words
Flag sequences of 6+ consonants as impossible
Skip all-caps words (likely acronyms)

detector = GarbleDetector(Strategy.CONSONANT_SEQUENCE, threshold=0.5)

# Examples
detector.predict("strengths")         # False - longest run is 5 ("ngths")
detector.predict("bcdfghjklmn")       # True - 11 consonants impossible
detector.predict("HTTP/HTTPS")        # False - acronyms skipped
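
The run-length measurement can be sketched as follows. `longest_consonant_run` is a hypothetical helper; it treats 'y' as a consonant and skips words written entirely in capitals, both of which are assumptions about details the README leaves open.

```python
import re

CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+")


def longest_consonant_run(text: str, skip_all_caps: bool = True) -> int:
    """Length of the longest consecutive-consonant run in any word of `text`."""
    longest = 0
    for word in re.findall(r"[A-Za-z]+", text):
        # All-caps words are treated as acronyms and ignored.
        if skip_all_caps and len(word) > 1 and word.isupper():
            continue
        for run in CONSONANT_RUN.findall(word.lower()):
            longest = max(longest, len(run))
    return longest


for sample in ("strengths", "bcdfghjklmn", "HTTP/HTTPS"):
    print(sample, longest_consonant_run(sample))
```

A run of 6 or more would be flagged under the rule above; "strengths" stays below that limit.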

18. Vowel Pattern (`VOWEL_PATTERN`) - NEW v0.5.0 ⭐ 100% Precision

Implementation Logic: Detects invalid vowel sequences. English has specific vowel patterns; repeated same vowels (5+) are impossible.

Algorithm:

Extract vowel sequences from text
Check for 5+ repeated same vowels
Allow valid patterns like "eau", "iou", "ueue"

detector = GarbleDetector(Strategy.VOWEL_PATTERN, threshold=0.5)

# Examples
detector.predict("beautiful")         # False - valid vowel pattern
detector.predict("aaaaaaa")           # True - repeated vowels
detector.predict("queue")             # False - valid "ueue" pattern

19. Letter Frequency (`LETTER_FREQUENCY`) - NEW v0.5.0 ⭐ 100% Precision

Implementation Logic: Detects text dominated by rare letters (j, q, x, z). Uses chi-squared analysis against English letter frequency norms.

Algorithm:

Calculate letter frequency distribution
Compare against expected English frequencies
Flag text with excessive rare letters

detector = GarbleDetector(Strategy.LETTER_FREQUENCY, threshold=0.5)

# Examples
detector.predict("The quick brown fox")  # False - normal distribution
detector.predict("jjjj qqqq xxxx zzzz")  # True - dominated by rare letters
detector.predict("xqzjxqzj")             # True - all rare letters

20. Rare Trigram (`RARE_TRIGRAM`) - NEW v0.5.0 ⭐ 100% Precision

Implementation Logic: Detects impossible three-letter combinations that never appear in English.

Algorithm:

Extract trigrams from text
Check against set of impossible trigrams
Calculate ratio of impossible trigrams

detector = GarbleDetector(Strategy.RARE_TRIGRAM, threshold=0.5)

# Examples
detector.predict("The quick brown fox")  # False - valid trigrams
detector.predict("jjjqqq")               # True - impossible trigrams
detector.predict("xqzjxq")               # True - impossible trigrams

21. English Word Validation (`ENGLISH_WORD_VALIDATION`)

Implementation Logic: Validates words against an English dictionary using pyspellchecker.

Note: Requires an optional dependency. Install with pip install "pygarble[spellchecker]". Consider using WORD_LOOKUP instead for zero-dependency operation.

detector = GarbleDetector(Strategy.ENGLISH_WORD_VALIDATION, threshold=0.5)

# Examples
detector.predict("hello world")              # False - valid words
detector.predict("asdfghjkl qwertyuiop")    # True - invalid words

22. Character Frequency (`CHARACTER_FREQUENCY`)

Implementation Logic: Analyzes character frequency distribution. Garbled text often has skewed distributions.

detector = GarbleDetector(Strategy.CHARACTER_FREQUENCY, threshold=0.5)

# Examples
detector.predict("aaaaaaa")         # True - high 'a' frequency
detector.predict("normal text")     # False - balanced distribution

23. Statistical Analysis (`STATISTICAL_ANALYSIS`)

Implementation Logic: Analyzes the ratio of alphabetic to non-alphabetic characters.

detector = GarbleDetector(Strategy.STATISTICAL_ANALYSIS, threshold=0.5)

# Examples
detector.predict("123456789")       # True - no alphabetic chars
detector.predict("normal text")     # False - mostly alphabetic

24. Word Length (`WORD_LENGTH`)

Implementation Logic: Checks average word length against normal English patterns.

detector = GarbleDetector(Strategy.WORD_LENGTH, threshold=0.5)

# Examples
detector.predict("supercalifragilistic")  # True - very long word
detector.predict("short words here")       # False - normal lengths

Ensemble Detector

Combine multiple strategies for better accuracy using voting:

from pygarble import EnsembleDetector, Strategy

# Default ensemble (uses best-performing strategies)
ensemble = EnsembleDetector()
print(ensemble.predict("asdfghjkl"))  # True

# Custom strategies
ensemble = EnsembleDetector(
    strategies=[
        Strategy.KEYBOARD_PATTERN,
        Strategy.VOWEL_RATIO,
        Strategy.ENTROPY_BASED,
    ],
    voting="majority"  # or "average", "weighted", "any", "all"
)

# Weighted voting
ensemble = EnsembleDetector(
    strategies=[Strategy.KEYBOARD_PATTERN, Strategy.VOWEL_RATIO],
    voting="weighted",
    weights=[0.7, 0.3]
)

# Batch processing
texts = ["Hello world", "asdfghjkl", "qwertyuiop"]
results = ensemble.predict(texts)
probas = ensemble.predict_proba(texts)

Voting Modes:

majority: Text is garbled if >50% of strategies agree
average: Average probability across all strategies
weighted: Weighted average using custom weights
any: High recall - text is garbled if ANY strategy flags it (best F1 score)
all: High precision - text is garbled only if ALL strategies agree
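
The five modes can be pictured as different ways of reducing a list of per-strategy probabilities to one decision. The function below is a sketch of that idea, not EnsembleDetector's code: the name `combine_votes`, the inclusive comparison with the threshold, and the normalisation of weights are all assumptions.

```python
def combine_votes(probas, voting="majority", threshold=0.5, weights=None):
    """Reduce per-strategy garble probabilities to a single True/False decision."""
    if not probas:
        raise ValueError("at least one probability is required")
    votes = [p >= threshold for p in probas]
    if voting == "majority":
        return sum(votes) > len(votes) / 2
    if voting == "any":
        return any(votes)
    if voting == "all":
        return all(votes)
    if voting == "average":
        return sum(probas) / len(probas) >= threshold
    if voting == "weighted":
        if weights is None or len(weights) != len(probas):
            raise ValueError("weights must match the number of probabilities")
        weighted = sum(w * p for w, p in zip(weights, probas)) / sum(weights)
        return weighted >= threshold
    raise ValueError(f"unknown voting mode: {voting}")


scores = [0.9, 0.2, 0.6]  # hypothetical outputs of three strategies
for mode in ("majority", "any", "all", "average"):
    print(mode, combine_votes(scores, voting=mode))
```

The same scores can give different answers under different modes, which is the trade-off between recall ("any") and precision ("all") described above.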

Advanced Usage

Input Validation

The library validates inputs automatically:

# Threshold must be between 0 and 1
detector = GarbleDetector(Strategy.KEYBOARD_PATTERN, threshold=1.5)
# Raises: ValueError: threshold must be between 0.0 and 1.0

# Threads must be positive
detector = GarbleDetector(Strategy.KEYBOARD_PATTERN, threads=0)
# Raises: ValueError: threads must be a positive integer

Batch Processing with Threading

detector = GarbleDetector(Strategy.KEYBOARD_PATTERN, threads=4)

# Process 1000 texts in parallel
texts = ["text"] * 1000
results = detector.predict(texts)

Custom Pattern Matching

custom_patterns = {
    'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    'phone': r'\d{3}-\d{3}-\d{4}',
}

detector = GarbleDetector(
    Strategy.PATTERN_MATCHING,
    patterns=custom_patterns,
    override_defaults=True  # Use only custom patterns
)

API Reference

GarbleDetector

GarbleDetector(
    strategy: Strategy,
    threshold: float = 0.5,
    threads: Optional[int] = None,
    **kwargs
)

Parameters:

strategy: Detection strategy to use
threshold: Probability threshold (0.0-1.0) for binary predictions
threads: Number of threads for batch processing
**kwargs: Strategy-specific parameters

Methods:

predict(X): Returns bool or List[bool]
predict_proba(X): Returns float or List[float]

EnsembleDetector

EnsembleDetector(
    strategies: Optional[List[Strategy]] = None,
    threshold: float = 0.5,
    voting: str = "majority",  # "majority", "average", "weighted", "any", "all"
    weights: Optional[List[float]] = None,
    threads: Optional[int] = None,
    **kwargs
)

Strategy Enum

class Strategy(Enum):
    # Core strategies (zero dependencies)
    MARKOV_CHAIN = "markov_chain"              # Recommended - Best F1 score
    NGRAM_FREQUENCY = "ngram_frequency"        # Trigram analysis
    WORD_LOOKUP = "word_lookup"                # Zero dependencies dictionary
    SYMBOL_RATIO = "symbol_ratio"              # Symbol/number detection
    REPETITION = "repetition"                  # Pattern repetition
    HEX_STRING = "hex_string"                  # Hash/UUID detection
    COMPRESSION_RATIO = "compression_ratio"    # v0.4.0 - Compression-based
    MOJIBAKE = "mojibake"                      # v0.4.0 - Encoding corruption
    PRONOUNCEABILITY = "pronounceability"      # v0.4.0 - Phonotactic rules
    UNICODE_SCRIPT = "unicode_script"          # v0.4.0 - Homoglyph detection

    # High-precision strategies (v0.5.0)
    BIGRAM_PROBABILITY = "bigram_probability"  # NEW v0.5.0 - 100% precision
    LETTER_POSITION = "letter_position"        # NEW v0.5.0 - 99% precision
    CONSONANT_SEQUENCE = "consonant_sequence"  # NEW v0.5.0 - Consonant runs
    VOWEL_PATTERN = "vowel_pattern"            # NEW v0.5.0 - 100% precision
    LETTER_FREQUENCY = "letter_frequency"      # NEW v0.5.0 - 100% precision
    RARE_TRIGRAM = "rare_trigram"              # NEW v0.5.0 - 100% precision

    # Legacy strategies
    CHARACTER_FREQUENCY = "character_frequency"
    WORD_LENGTH = "word_length"
    PATTERN_MATCHING = "pattern_matching"
    STATISTICAL_ANALYSIS = "statistical_analysis"
    ENTROPY_BASED = "entropy_based"
    VOWEL_RATIO = "vowel_ratio"
    KEYBOARD_PATTERN = "keyboard_pattern"

    # Strategy with optional dependency
    ENGLISH_WORD_VALIDATION = "english_word_validation"  # Requires: pygarble[spellchecker]

Architecture

pygarble/
├── __init__.py
├── core.py                      # GarbleDetector & EnsembleDetector
├── data/                        # Embedded training data
│   ├── words.py                 # 50K English words
│   ├── bigrams.py               # Character transition probabilities
│   └── trigrams.py              # Common English trigrams
└── strategies/
    ├── base.py                  # BaseStrategy with shared utilities
    ├── markov_chain.py          # Markov chain detection
    ├── ngram_frequency.py       # Trigram frequency analysis
    ├── word_lookup.py           # Dictionary lookup (zero deps)
    ├── symbol_ratio.py          # Symbol/number detection
    ├── repetition.py            # Pattern repetition
    ├── hex_string.py            # Hash/UUID detection
    ├── compression_ratio.py     # Compression-based detection
    ├── mojibake.py              # Encoding corruption detection
    ├── pronounceability.py      # Phonotactic rules
    ├── unicode_script.py        # Homoglyph/script detection
    ├── bigram_probability.py    # NEW v0.5.0: Impossible bigrams
    ├── letter_position.py       # NEW v0.5.0: Position constraints
    ├── consonant_sequence.py    # NEW v0.5.0: Consonant runs
    ├── vowel_pattern.py         # NEW v0.5.0: Vowel sequences
    ├── letter_frequency.py      # NEW v0.5.0: Letter distribution
    ├── rare_trigram.py          # NEW v0.5.0: Impossible trigrams
    ├── character_frequency.py
    ├── word_length.py
    ├── pattern_matching.py      # Regex patterns + keyboard detection
    ├── statistical_analysis.py
    ├── entropy_based.py         # Shannon entropy + bigram analysis
    ├── english_word_validation.py  # pyspellchecker (optional)
    ├── vowel_ratio.py           # Vowel analysis + consonant clusters
    └── keyboard_pattern.py      # N-gram + keyboard row detection

Dependencies

Core library: Zero dependencies - requires only Python 3.8+

Optional dependency:

pygarble[spellchecker]: pyspellchecker for English word validation
- pyspellchecker>=0.7.0

Development

# Clone and setup
git clone https://github.com/brightertiger/pygarble.git
cd pygarble
pip install -r requirements.txt -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Run benchmark
python regression/benchmark.py

# Linting
flake8 pygarble/
black pygarble/
mypy pygarble/

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Adding New Strategies

Create a new file in pygarble/strategies/
Inherit from BaseStrategy
Implement _predict_impl() and _predict_proba_impl()
Add to strategies/__init__.py and core.py
Add tests in tests/

License

MIT License - see LICENSE for details.

Changelog

0.5.0 (Current)

6 New High-Precision Strategies designed to minimize false positives:
- BIGRAM_PROBABILITY: Detects impossible letter pairs (qx, jj, xz) - 100% precision
- LETTER_POSITION: Detects letters in impossible positions - 99% precision
- CONSONANT_SEQUENCE: Detects impossibly long consonant runs (6+)
- VOWEL_PATTERN: Detects invalid vowel sequences - 100% precision
- LETTER_FREQUENCY: Detects text dominated by rare letters - 100% precision
- RARE_TRIGRAM: Detects impossible trigrams - 100% precision
Redesigned Default Ensemble: Uses majority voting with high-precision strategies for 99.5% precision
Expanded Benchmark: 1,644 test cases (420 internal + 1,224 external validation)
External Validation: Added dictionary words and generated gibberish to detect overfitting
Source Attribution: Benchmark now tracks internal vs external data sources
Overfitting Analysis: Benchmark compares internal vs external performance

0.4.0

4 New Specialized Strategies:
- COMPRESSION_RATIO: Detects garbled text using zlib compression analysis (best for long text)
- MOJIBAKE: Detects encoding corruption with 100% precision (UTF-8 decoded as Latin-1, replacement characters)
- PRONOUNCEABILITY: Detects unpronounceable text using English phonotactic rules (forbidden consonant clusters)
- UNICODE_SCRIPT: Detects homoglyph attacks with 100% precision (Cyrillic/Greek chars disguised as Latin)
Expanded Benchmark: 263 test cases across 44 categories (up from 200/34)
New Test Categories: mojibake_encoding, replacement_chars, homoglyph_attacks, mixed_scripts, unpronounceable, long_random_text, legitimate_unicode, spam_patterns
Zero Dependencies: All new strategies use only Python stdlib (zlib, unicodedata, re)

0.3.1

New Strategies: Added SYMBOL_RATIO, REPETITION, and HEX_STRING strategies for specialized detection
New Voting Modes: Added any (high recall) and all (high precision) voting modes for EnsembleDetector
Removed: LANGUAGE_DETECTION strategy (FastText dependency had NumPy 2.0 compatibility issues)
Production-Grade Robustness:
- Thread safety with timeout and exception handling in batch processing
- Division by zero protection across all strategy calculations
- Parameter validation for all strategy-specific parameters
- Pre-compiled regex patterns for improved performance
Comprehensive Edge Case Tests: 77 new tests covering parameter validation, type errors, Unicode handling, and boundary conditions

0.3.0

Zero Dependencies: Core library now works without any external dependencies
New Markov Chain Strategy: Character-level Markov chain trained on 300K+ English words
New N-gram Frequency Strategy: Trigram analysis using 2000 most common English trigrams
New Word Lookup Strategy: 50K embedded English word dictionary (replaces pyspellchecker dependency)
Embedded Training Data: Pre-computed bigrams, trigrams, and word sets included in package
Optional Dependencies: FastText and pyspellchecker moved to optional extras
Lightweight Package: ~190KB wheel size (well under 5MB limit)
Data Source: Training data from Peter Norvig's word frequency list (MIT licensed)
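
To give a feel for how a character-level Markov chain separates words from gibberish, here is a toy first-order model. It is a sketch under stated assumptions: the function names are hypothetical, the add-one smoothing is a choice made for this example, and a real model would be trained on a large word list rather than the handful of words used in the usage note below. The order and smoothing of pygarble's own chain are not described in the changelog.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_bigram_model(words: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Count letter-to-letter transitions over the training words."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for word in words:
        letters = [c for c in word.lower() if c in ALPHABET]
        for prev, nxt in zip(letters, letters[1:]):
            counts[prev][nxt] += 1
    return counts


def mean_log_prob(text: str, counts: Dict[str, Dict[str, int]]) -> float:
    """Average log-probability of each transition, with add-one smoothing.

    Higher (closer to zero) means the text looks more like the training data.
    """
    letters = [c for c in text.lower() if c in ALPHABET]
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return 0.0  # nothing to score
    total = 0.0
    for prev, nxt in pairs:
        row = counts.get(prev, {})
        row_total = sum(row.values())
        total += math.log((row.get(nxt, 0) + 1) / (row_total + len(ALPHABET)))
    return total / len(pairs)
```

Trained on a few words such as "the", "there", and "mother", the model scores "there" higher than "xqzjk", because every transition in the latter is unseen and falls back to the smoothed floor of 1/26.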

0.2.0

New Keyboard Pattern Strategy: Best-performing strategy with 69.9% F1 score
New Vowel Ratio Strategy: Highest precision (95.45%) with consonant cluster detection
EnsembleDetector: Built-in ensemble with majority/average/weighted voting
Enhanced Entropy Strategy: Added bigram frequency analysis using common English bigrams
Enhanced Pattern Matching: Added keyboard row patterns, consonant clusters, alternating patterns
Input Validation: Validates threshold (0-1) and threads parameters
Type Hints: Full type annotation throughout the codebase
Regression Tests: 117 test cases across 20 categories with benchmarking
Performance: Regex patterns compiled once at initialization
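
The entropy and vowel-ratio ideas behind these strategies can be sketched in a few lines. This is a simplified illustration, assuming plain character-level Shannon entropy and the five vowels a, e, i, o, u; it leaves out the bigram frequency analysis and consonant cluster detection mentioned above, and the function names are hypothetical rather than pygarble's own.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def vowel_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are vowels (a, e, i, o, u)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0  # avoid dividing by zero when there are no letters
    return sum(c in "aeiou" for c in letters) / len(letters)
```

A detector built on these would compare each value against a threshold: text with almost no vowels, or with an unusually flat character distribution, is more likely to be keyboard mashing or random data.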

0.1.0

Initial release with 7 detection strategies
Scikit-learn-like interface
Probability scoring
Modular architecture

Release history

0.5.0 (this version): Jan 20, 2026
0.3.2: Jan 19, 2026
0.1.6: Dec 23, 2025
0.1.5: Sep 14, 2025
0.1.0: Sep 14, 2025

Download files

Source Distribution

pygarble-0.5.0.tar.gz (237.7 kB)

Uploaded Jan 20, 2026 Source

Built Distribution

pygarble-0.5.0-py3-none-any.whl (233.4 kB)

Uploaded Jan 20, 2026 Python 3

File details

Details for the file pygarble-0.5.0.tar.gz.

File metadata

Download URL: pygarble-0.5.0.tar.gz
Upload date: Jan 20, 2026
Size: 237.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pygarble-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`9339cb0ba67e1674cf645a295cc28e10aada85964468746c2104cde873d28ec9`
MD5	`af46e4262efe743b390828bb48f9937c`
BLAKE2b-256	`0349e0c8da160ff8fe3906230a49d6abd321fd7b61910efe4c4c826e092deba0`

File details

Details for the file pygarble-0.5.0-py3-none-any.whl.

File metadata

Download URL: pygarble-0.5.0-py3-none-any.whl
Upload date: Jan 20, 2026
Size: 233.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pygarble-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a554454b534463fce3f873335d570f1eee6b2c7c66c97d1b10e04bf9b90a550a`
MD5	`c31bd0ae4a7d070fd577e820349b761f`
BLAKE2b-256	`7ff5a63e689e1131cc25c04297f7b82d25c0a73c08d20b5cb1f318c5e5efc969`
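
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do it; the helper names are hypothetical and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches_published_hash` with the local path of the downloaded archive and the SHA256 value listed for that file returns True only if the bytes on disk match what was uploaded.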
