High-Entropy String Redaction Tool
A configurable Python tool for detecting and redacting sensitive data in text using entropy analysis and pattern matching. Designed for processing log files, user input, and any text containing potentially sensitive information like timestamps, UUIDs, account IDs, IP addresses, and AWS hostnames.
Quick Start
Installation
# Development installation with uv (recommended)
git clone https://github.com/rjury-sumo/python-entrophier.git
cd python-entrophier
uv sync
# Or install locally with uv
uv pip install .
# Or install with pip
pip install .
# Note: Package not yet published to PyPI
Command-Line Usage
# Process a file (redacted output only)
entrophier input.txt
# Comparative mode (see original vs redacted)
entrophier -c input.txt
# Comparative mode with condensed asterisks
entrophier -c test_strings.txt --condense-asterisks
# Process stdin
cat logfile.txt | entrophier
# Save to file
entrophier -c -o output.txt input.txt
Features
Core Capabilities
- Shannon Entropy Analysis: Identifies high-randomness strings using information theory
- Pattern-Based Detection: Recognizes structured data like timestamps, IPs, and AWS hostnames
- Word Preservation: Protects common English words and technical terms from redaction
- Selective Redaction: AWS hostname patterns preserve domain parts while redacting sensitive IDs
- Two Redaction Methods: Token-level (default, more reliable) and sliding window (more granular)
Supported Pattern Types
- Datetime/Timestamps: ISO 8601, human-readable dates, epoch timestamps
- IP Addresses: IPv4, IPv6, with contextual detection
- AWS Resources: EC2 hostnames, RDS endpoints, CloudFront distributions, API Gateway
- UUIDs and Hex Sequences: Full and partial UUID detection
- Account IDs and Random Strings: Long numeric and alphanumeric sequences
Configuration
The tool requires three YAML configuration files:
common_words.yaml
Contains word lists and patterns to preserve during redaction:
common_words:
  - admin, alert, application, auth, backup, cache
  - database, debug, deployment, development, device
  # ... extensive list of technical and common terms
word_patterns:
  prefixes: [pre, post, anti, auto, co, de, dis, ...]
  suffixes: [able, ible, al, ed, er, est, ful, ic, ...]
entropy_settings.yaml
Controls detection behavior and thresholds:
entropy_detection:
  default_threshold: 2.5       # Higher = less sensitive
  word_pattern_bonus: 0.5      # Extra threshold for word-like patterns
  min_length: 4                # Minimum string length to consider
  window_size: 6               # Sliding window size
output_formatting:
  # Note: condense_asterisks is now a command-line option only
pattern_detection:
  detect_timestamps: true      # Enable timestamp detection
  detect_ip_addresses: true    # Enable IP address detection
  detect_aws_hostnames: true   # Enable AWS hostname detection
redaction_patterns.yaml
Defines regex patterns for structured data detection:
datetime_patterns:
  human_readable_datetime: |
    (Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\w+\s+\d{4}
  date_yyyy_mm_dd: '\d{4}/\d{2}/\d{2}'
  # ... more datetime patterns
aws_selective_patterns:
  ec2_public_ipv4_dns:
    pattern: 'ec2-(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})(\..*\.amazonaws\.com)'
    replacement: 'ec2-*\2'
  # ... selective redaction patterns
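To see what a selective pattern does, the EC2 entry above can be tried directly with Python's standard `re` module. This is only a sketch of the substitution itself, not of how entrophier loads or applies its patterns; the function name `redact_ec2_hostname` is made up for illustration.

```python
import re

# Pattern and replacement copied from the ec2_public_ipv4_dns entry above.
EC2_PATTERN = re.compile(
    r"ec2-(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})(\..*\.amazonaws\.com)"
)
EC2_REPLACEMENT = r"ec2-*\2"  # "*" is literal, \2 re-inserts the domain part


def redact_ec2_hostname(text: str) -> str:
    """Hypothetical helper: mask the IP part, keep the domain suffix."""
    return EC2_PATTERN.sub(EC2_REPLACEMENT, text)


print(redact_ec2_hostname("ec2-198-51-100-1.compute-1.amazonaws.com"))
# ec2-*.compute-1.amazonaws.com
```

Group 1 (the dashed IP) is dropped and group 2 (everything from the first dot through `amazonaws.com`) is kept, which is what "selective redaction" means here.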
Command-Line Options
entrophier [options] [input_file]
Options:
-c, --comparative Show original and redacted (default: redacted only)
-o OUTPUT, --output OUTPUT Output file path (default: stdout)
--method {token,sliding} Redaction method (default: token)
--threshold THRESHOLD Override entropy threshold
--min-length MIN_LENGTH Override minimum length
--condense-asterisks Condense consecutive asterisks to single *
--config-dir CONFIG_DIR Directory containing configuration files
Positional:
input_file Input file path (use "-" or omit for stdin)
Library Usage
from entrophier import redact_sensitive_data, load_config
# Load configuration (required - uses default config files from module directory)
load_config()
# Basic usage (recommended)
result = redact_sensitive_data("model-scheduler-667689996-jd4g7")
# Output: "model-scheduler-*********-*****"
# Use custom config directory
load_config(config_dir="/path/to/config")
# Alternative methods
from entrophier import redact_high_entropy_tokens, redact_high_entropy_strings
# Token-level approach (more reliable)
result = redact_high_entropy_tokens("text")
# Sliding window approach (more granular)
result = redact_high_entropy_strings("text")
Examples
Input/Output Examples
# AWS CloudTrail path
Original: s3://aws-cloudtrail-logs-123456789012-us-east-1/CloudTrail/...
Redacted: s3://aws-cloudtrail-logs-************-us-east-1/CloudTrail/...
# EC2 hostname (selective redaction)
Original: ec2-198-51-100-1.compute-1.amazonaws.com
Redacted: ec2-*.compute-1.amazonaws.com
# Human-readable datetime
Original: Wed Sep 24 11:17:52 NZST 2025
Redacted: Wed Sep 24 ******** NZST 2025
# UUID
Original: uuid-550e8400-e29b-41d4-a716-446655440000
Redacted: uuid-********-****-****-****-************
# Mixed content preserving common words
Original: database-connection-string-server-production-guid-550e8400
Redacted: database-connection-string-server-production-guid-********
# With --condense-asterisks flag
Original: user-session-abc123def456-another-xyz789abc123
Redacted: user-session-*-another-*
# Without --condense-asterisks flag (default)
Original: user-session-abc123def456-another-xyz789abc123
Redacted: user-session-************-another-************
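The effect of `--condense-asterisks` on the two examples above can be reproduced with a one-line regex. This is a minimal sketch of the transformation, assuming it simply collapses each run of asterisks; the function name `condense_asterisks` is illustrative and not necessarily part of the package API.

```python
import re


def condense_asterisks(text: str) -> str:
    """Collapse every run of consecutive asterisks into a single '*'."""
    return re.sub(r"\*+", "*", text)


print(condense_asterisks("user-session-************-another-************"))
# user-session-*-another-*
```

Condensing hides the length of the redacted value, which can itself be a small information leak, at the cost of making original and redacted lines harder to align in comparative mode.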
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/rjury-sumo/python-entrophier.git
cd python-entrophier
# Install with uv (creates .venv automatically)
uv sync
# Verify installation
uv run python -c "from entrophier import redact_sensitive_data, load_config; load_config(); print('Setup complete')"
Development Workflow
# Run tests during development
uv run pytest
# Run tests with coverage
uv run pytest --cov=entrophier --cov-report=term-missing
# Run CLI in development mode
uv run entrophier input.txt
# Build package
uv build
# Run specific module
uv run python -m entrophier input.txt
Project Dependencies
- Runtime: pyyaml>=6.0.0
- Development: pytest>=8.0.0, pytest-cov>=4.0.0
- Python: 3.8+ (compatible with Python 3.8 through 3.13)
- Build: hatchling (specified in pyproject.toml)
Testing
Run the comprehensive pytest test suite (40 tests):
# Run all tests
uv run pytest
# With verbose output
uv run pytest -v
# With coverage report
uv run pytest --cov=entrophier
# Run specific test class
uv run pytest tests/test_entropy.py::TestEntropyCalculation -v
The test suite includes:
- Entropy calculation and segment detection
- Word preservation and pattern matching
- AWS S3 CloudTrail and CloudWatch paths
- Container paths (Kubernetes, Docker)
- Windows and Linux file paths
- Database connection strings
- IP addresses (IPv4/IPv6) and network identifiers
- Various timestamp and datetime formats
- Edge cases and boundary conditions
See tests/README.md for detailed test documentation.
How It Works
Entropy Analysis
The tool uses Shannon entropy to measure randomness in character distributions:
- High entropy (>2.5 bits per character): Random-looking strings like abc123def456
- Low entropy (<2.0 bits per character): Structured text like common words
- Pattern boosting: Raises the entropy score for mixed-case and alphanumeric combinations
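For readers unfamiliar with the measure, here is a minimal sketch of character-level Shannon entropy. It covers only the textbook formula; the pattern boosting mentioned above is not modelled, and the function name `shannon_entropy` is an assumption rather than the package's API.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    # Sum p * log2(1/p) over the relative frequency p of each character.
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(text).values()
    )


for sample in ("abc123def456", "aaaa"):
    print(f"{sample!r}: {shannon_entropy(sample):.2f} bits")
```

`abc123def456` consists of twelve distinct characters, so its entropy is log2(12), roughly 3.58 bits, well above the 2.5 default threshold. A string of one repeated character scores 0.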
Two-Phase Detection
- Pattern-based detection: Identifies structured data using regex patterns
- Entropy-based analysis: Catches random strings missed by patterns
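The two phases can be pictured as a small pipeline: patterns run first, then an entropy check on whatever tokens remain. The sketch below is a deliberately simplified illustration under stated assumptions, not the code in `core.py`. The IPv4 regex, the tokenizer, and the function names are all hypothetical, and word preservation is left out.

```python
import math
import re
from collections import Counter

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # illustrative pattern only
TOKEN = re.compile(r"[A-Za-z0-9]+")                # illustrative tokenizer


def entropy(text: str) -> float:
    """Shannon entropy in bits per character (text must be non-empty)."""
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def redact(text: str, threshold: float = 2.5, min_length: int = 4) -> str:
    # Phase 1: pattern-based detection masks known structured data.
    text = IPV4.sub(lambda m: "*" * len(m.group()), text)

    # Phase 2: entropy-based analysis masks remaining random-looking tokens.
    def mask(match: re.Match) -> str:
        token = match.group()
        if len(token) >= min_length and entropy(token) > threshold:
            return "*" * len(token)
        return token

    return TOKEN.sub(mask, text)


print(redact("host 10.0.0.1 key abc123def456"))
# host ******** key ************
```

Running patterns first means structured values such as IP addresses are handled by rules written for them, and the entropy pass only has to judge what is left.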
Word Preservation
- Extensive dictionary of common English and technical terms
- Pattern matching for word prefixes/suffixes
- Contextual analysis to avoid redacting legitimate words
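Word preservation matters because entropy alone misjudges ordinary words: "deployment" has nine distinct characters out of ten and scores about 3.12 bits, above the 2.5 default. The sketch below shows one plausible way a dictionary plus prefix/suffix check could work. The word and affix lists are the short excerpts shown in the configuration section, and `is_preserved` is a hypothetical name; the package's real logic may differ.

```python
# Short excerpts from the common_words.yaml example; the real lists are longer.
COMMON_WORDS = {
    "admin", "alert", "application", "auth", "backup", "cache",
    "database", "debug", "deployment", "development", "device",
}
PREFIXES = ("pre", "post", "anti", "auto", "co", "de", "dis")
SUFFIXES = ("able", "ible", "al", "ed", "er", "est", "ful", "ic")


def is_preserved(token: str) -> bool:
    """Hypothetical check: is this token a known word or a simple derivative?"""
    word = token.lower()
    if word in COMMON_WORDS:
        return True
    if not word.isalpha():
        return False  # digits or symbols: leave it to the entropy check
    # Known word with a recognised prefix, e.g. "predeployment".
    if any(word.startswith(p) and word[len(p):] in COMMON_WORDS for p in PREFIXES):
        return True
    # Known word with a recognised suffix, e.g. "alerted".
    return any(word.endswith(s) and word[: -len(s)] in COMMON_WORDS for s in SUFFIXES)


for token in ("Deployment", "predeployment", "alerted", "jd4g7"):
    print(token, is_preserved(token))
```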
Project Structure
python-entrophier/
├── src/
│   └── entrophier/
│       ├── __init__.py              # Public API exports
│       ├── __main__.py              # Module entry point
│       ├── cli.py                   # Command-line interface
│       ├── config.py                # Configuration loading
│       ├── core.py                  # Core redaction logic
│       ├── common_words.yaml        # Word lists and patterns
│       ├── entropy_settings.yaml    # Detection and output settings
│       └── redaction_patterns.yaml  # Regex patterns for structured data
├── tests/
│   └── test_entropy.py              # Comprehensive test suite
├── pyproject.toml                   # Project metadata and dependencies
├── README.md                        # This documentation
└── .python-version                  # Python version (3.13)
Configuration Requirements
The package will exit with an error if any required configuration files or settings are missing. This ensures consistent behavior and prevents fallback to hardcoded values.
Default Configuration: Config files are bundled with the package in src/entrophier/. For custom configurations, use the --config-dir CLI option or load_config(config_dir="/path/to/config") in Python.
Required sections in YAML files:
- common_words.yaml: common_words, word_patterns
- entropy_settings.yaml: entropy_detection, pattern_detection
- redaction_patterns.yaml: Pattern sections as needed
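When preparing a custom config directory, it can help to check for the required sections before pointing the tool at it. The following is a minimal sketch that works on already-parsed YAML (plain dicts); `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the package performs its own validation independently of this.

```python
# Required top-level sections, as listed above.
REQUIRED_SECTIONS = {
    "common_words.yaml": ("common_words", "word_patterns"),
    "entropy_settings.yaml": ("entropy_detection", "pattern_detection"),
}


def missing_sections(filename: str, data: dict) -> list:
    """Return the required top-level keys absent from a parsed config file."""
    return [key for key in REQUIRED_SECTIONS.get(filename, ()) if key not in data]


# Example: a common_words.yaml that forgot the word_patterns section.
parsed = {"common_words": ["mycompany", "customterm"]}
print(missing_sections("common_words.yaml", parsed))
# ['word_patterns']
```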
Customization
Adding Custom Patterns
Add to redaction_patterns.yaml:
custom_patterns:
  my_identifier: 'pattern-\d{6}-[a-z]{4}'
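Before adding a regex to the config, it is worth confirming that it matches what you expect and nothing more. A quick check of the example pattern above with the standard `re` module might look like this; the helper name `matches` and the sample strings are invented for illustration.

```python
import re

# The my_identifier pattern from the example above.
CUSTOM_PATTERN = re.compile(r"pattern-\d{6}-[a-z]{4}")


def matches(text: str) -> bool:
    """Return True if the custom pattern occurs anywhere in `text`."""
    return CUSTOM_PATTERN.search(text) is not None


for sample in ("pattern-123456-abcd", "pattern-12345-abcd", "id=pattern-000001-wxyz;"):
    print(f"{sample!r}: {matches(sample)}")
```

The second sample has only five digits, so it does not match. Testing near-misses like this helps catch patterns that are too loose or too strict.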
Adjusting Sensitivity
Modify entropy_settings.yaml:
entropy_detection:
  default_threshold: 3.0  # Less sensitive (higher threshold)
  min_length: 6           # Only redact longer strings
Adding Words to Preserve
Add to common_words.yaml:
common_words:
- mycompany
- customterm
- projectname
Building and Distribution
Build Package
# Build wheel and source distribution
uv build
# Output files in dist/:
# - entrophier-0.1.0-py3-none-any.whl
# - entrophier-0.1.0.tar.gz
Install Built Package
# Install wheel locally
uv pip install dist/entrophier-0.1.0-py3-none-any.whl
# Or install from source distribution
uv pip install dist/entrophier-0.1.0.tar.gz
Additional Documentation
- Test Documentation: Comprehensive test suite details
- Changelog: Version history and release notes
Contributing
This is a defensive security tool. Contributions should focus on:
- Improving detection accuracy
- Adding new pattern types
- Enhancing performance
- Expanding test coverage
License
MIT License - See LICENSE file for details
Security Note
This tool provides fully configurable sensitive-data redaction with no hardcoded assumptions about your data formats. However, always verify the redaction results for your specific use case before using it in production.
File details
Details for the file entrophier-0.1.0.tar.gz.
File metadata
- Download URL: entrophier-0.1.0.tar.gz
- Upload date:
- Size: 64.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60b46b3843453730ea114a61e8bcacb021727acb665eaf60f96762683278fdf7 |
| MD5 | 7e37c34de5aaa188672e59912c90387f |
| BLAKE2b-256 | 8efcaa839744b99ac7f77ca2543da1421c6f224bfe2774a5f572f720412f18c0 |
File details
Details for the file entrophier-0.1.0-py3-none-any.whl.
File metadata
- Download URL: entrophier-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a45541c063879648fe3719fa513293701b02c0e4eab9fed70e5d981b47be68d2 |
| MD5 | e45348b62b4b4ff3f3ccb7b9fab3aac0 |
| BLAKE2b-256 | 15bacdc33140222c66adaf386de905497feb9f92eb89dc5c1312c9ee7604d5bb |