High-Entropy String Redaction Tool

A configurable Python tool for detecting and redacting sensitive data in text using entropy analysis and pattern matching. Designed for processing log files, user input, and any text containing potentially sensitive information like timestamps, UUIDs, account IDs, IP addresses, and AWS hostnames.

🚀 Quick Start

Installation

# Development installation with uv (recommended)
git clone https://github.com/rjury-sumo/python-entrophier.git
cd python-entrophier
uv sync

# Or install locally with uv
uv pip install .

# Or install with pip
pip install .

# Note: Package not yet published to PyPI

Command-Line Usage

# Process a file (redacted output only)
entrophier input.txt

# Comparative mode (see original vs redacted)
entrophier -c input.txt

# Comparative mode with condensed asterisks
entrophier -c test_strings.txt --condense-asterisks

# Process stdin
cat logfile.txt | entrophier

# Save to file
entrophier -c -o output.txt input.txt

📋 Features

Core Capabilities

  • Shannon Entropy Analysis: Identifies high-randomness strings using information theory
  • Pattern-Based Detection: Recognizes structured data like timestamps, IPs, and AWS hostnames
  • Word Preservation: Protects common English words and technical terms from redaction
  • Selective Redaction: AWS hostname patterns preserve domain parts while redacting sensitive IDs
  • Two Redaction Methods: Token-level (default, more reliable) and sliding window (more granular)

Supported Pattern Types

  • Datetime/Timestamps: ISO 8601, human-readable dates, epoch timestamps
  • IP Addresses: IPv4, IPv6, with contextual detection
  • AWS Resources: EC2 hostnames, RDS endpoints, CloudFront distributions, API Gateway
  • UUIDs and Hex Sequences: Full and partial UUID detection
  • Account IDs and Random Strings: Long numeric and alphanumeric sequences

⚙️ Configuration

The tool requires three YAML configuration files:

common_words.yaml

Contains word lists and patterns to preserve during redaction:

common_words:
  - admin, alert, application, auth, backup, cache
  - database, debug, deployment, development, device
  # ... extensive list of technical and common terms

word_patterns:
  prefixes: [pre, post, anti, auto, co, de, dis, ...]
  suffixes: [able, ible, al, ed, er, est, ful, ic, ...]

entropy_settings.yaml

Controls detection behavior and thresholds:

entropy_detection:
  default_threshold: 2.5        # Higher = less sensitive
  word_pattern_bonus: 0.5       # Extra threshold for word-like patterns
  min_length: 4                 # Minimum string length to consider
  window_size: 6                # Sliding window size

output_formatting:
  # Note: condense_asterisks is now a command-line option only

pattern_detection:
  detect_timestamps: true       # Enable timestamp detection
  detect_ip_addresses: true     # Enable IP address detection
  detect_aws_hostnames: true    # Enable AWS hostname detection
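
To make `window_size` and `default_threshold` concrete, here is a minimal sketch of how a sliding-window pass might use them. It is an illustration under assumptions, not the package's own code: the names `entropy` and `redact_windows` are hypothetical, and the rule of skipping windows that contain separators is a choice made for this example.

```python
import math
from collections import Counter

WINDOW_SIZE = 6   # mirrors window_size above
THRESHOLD = 2.5   # mirrors default_threshold above


def entropy(text: str) -> float:
    """Shannon entropy of the characters in `text`, in bits."""
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def redact_windows(text: str) -> str:
    """Mask every character covered by a high-entropy alphanumeric window."""
    flagged = [False] * len(text)
    for start in range(len(text) - WINDOW_SIZE + 1):
        window = text[start:start + WINDOW_SIZE]
        # Assumption for this sketch: windows spanning separators are skipped.
        if window.isalnum() and entropy(window) > THRESHOLD:
            for i in range(start, start + WINDOW_SIZE):
                flagged[i] = True
    return "".join("*" if hit else ch for ch, hit in zip(text, flagged))


print(redact_windows("id-abc123def456"))  # id-************
```

A six-character window can carry at most log2(6), about 2.585 bits, so with a threshold of 2.5 only windows made of six distinct characters are flagged in this sketch. Raising the threshold above that ceiling would require a larger window.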

redaction_patterns.yaml

Defines regex patterns for structured data detection:

datetime_patterns:
  human_readable_datetime: |
    (Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\w+\s+\d{4}
  date_yyyy_mm_dd: '\d{4}/\d{2}/\d{2}'
  # ... more datetime patterns

aws_selective_patterns:
  ec2_public_ipv4_dns:
    pattern: 'ec2-(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})(\..*\.amazonaws\.com)'
    replacement: 'ec2-*\2'
  # ... selective redaction patterns
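
The selective pattern above can be tried directly with Python's `re` module. This sketch assumes the tool applies `pattern` and `replacement` with `re.sub` semantics, which the configuration format suggests but does not state; the variable names are illustrative only.

```python
import re

# Pattern and replacement copied from the ec2_public_ipv4_dns entry above.
pattern = r"ec2-(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})(\..*\.amazonaws\.com)"
replacement = r"ec2-*\2"

hostname = "ec2-198-51-100-1.compute-1.amazonaws.com"
# Group 1 (the dashed IP) is dropped; group 2 (the domain part) is kept via \2.
redacted = re.sub(pattern, replacement, hostname)
print(redacted)  # ec2-*.compute-1.amazonaws.com
```

The first capture group holds the sensitive part and is replaced by a literal `*`, while the second group carries the domain suffix through unchanged.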

🛠️ Command-Line Options

entrophier [options] [input_file]

Options:
  -c, --comparative           Show original and redacted (default: redacted only)
  -o OUTPUT, --output OUTPUT  Output file path (default: stdout)
  --method {token,sliding}    Redaction method (default: token)
  --threshold THRESHOLD       Override entropy threshold
  --min-length MIN_LENGTH     Override minimum length
  --condense-asterisks        Condense consecutive asterisks to single *
  --config-dir CONFIG_DIR     Directory containing configuration files

Positional:
  input_file                  Input file path (use "-" or omit for stdin)

📚 Library Usage

from entrophier import redact_sensitive_data, load_config

# Load configuration (required - uses default config files from module directory)
load_config()

# Basic usage (recommended)
result = redact_sensitive_data("model-scheduler-667689996-jd4g7")
# Output: "model-scheduler-*********-*****"

# Use custom config directory
load_config(config_dir="/path/to/config")

# Alternative methods
from entrophier import redact_high_entropy_tokens, redact_high_entropy_strings

# Token-level approach (more reliable)
result = redact_high_entropy_tokens("text")

# Sliding window approach (more granular)
result = redact_high_entropy_strings("text")

🎯 Examples

Input/Output Examples

# AWS CloudTrail path
Original: s3://aws-cloudtrail-logs-123456789012-us-east-1/CloudTrail/...
Redacted: s3://aws-cloudtrail-logs-************-us-east-1/CloudTrail/...

# EC2 hostname (selective redaction)
Original: ec2-198-51-100-1.compute-1.amazonaws.com
Redacted: ec2-*.compute-1.amazonaws.com

# Human-readable datetime
Original: Wed Sep 24 11:17:52 NZST 2025
Redacted: Wed Sep 24 ******** NZST 2025

# UUID
Original: uuid-550e8400-e29b-41d4-a716-446655440000
Redacted: uuid-********-****-****-****-************

# Mixed content preserving common words
Original: database-connection-string-server-production-guid-550e8400
Redacted: database-connection-string-server-production-guid-********

# With --condense-asterisks flag
Original: user-session-abc123def456-another-xyz789abc123
Redacted: user-session-*-another-*

# Without --condense-asterisks flag (default)
Original: user-session-abc123def456-another-xyz789abc123
Redacted: user-session-************-another-************
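
The difference between the last two examples comes down to collapsing runs of asterisks. A minimal sketch of that effect follows; the function name `condense_asterisks` is hypothetical and this is not the CLI's own code.

```python
import re


def condense_asterisks(text: str) -> str:
    """Collapse every run of consecutive asterisks into a single '*'."""
    return re.sub(r"\*+", "*", text)


print(condense_asterisks("user-session-************-another-************"))
# user-session-*-another-*
```

Condensed output hides the length of each redacted value, while the default output keeps it visible.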

🛠️ Development

Setup Development Environment

# Clone the repository
git clone https://github.com/rjury-sumo/python-entrophier.git
cd python-entrophier

# Install with uv (creates .venv automatically)
uv sync

# Verify installation
uv run python -c "from entrophier import redact_sensitive_data, load_config; load_config(); print('✓ Setup complete')"

Development Workflow

# Run tests during development
uv run pytest

# Run tests with coverage
uv run pytest --cov=entrophier --cov-report=term-missing

# Run CLI in development mode
uv run entrophier input.txt

# Build package
uv build

# Run specific module
uv run python -m entrophier input.txt

Project Dependencies

  • Runtime: pyyaml>=6.0.0
  • Development: pytest>=8.0.0, pytest-cov>=4.0.0
  • Python: 3.8+ (compatible with Python 3.8 through 3.13)
  • Build: hatchling (specified in pyproject.toml)

🧪 Testing

Run the comprehensive pytest test suite (40 tests):

# Run all tests
uv run pytest

# With verbose output
uv run pytest -v

# With coverage report
uv run pytest --cov=entrophier

# Run specific test class
uv run pytest tests/test_entropy.py::TestEntropyCalculation -v

The test suite includes:

  • Entropy calculation and segment detection
  • Word preservation and pattern matching
  • AWS S3 CloudTrail and CloudWatch paths
  • Container paths (Kubernetes, Docker)
  • Windows and Linux file paths
  • Database connection strings
  • IP addresses (IPv4/IPv6) and network identifiers
  • Various timestamp and datetime formats
  • Edge cases and boundary conditions

See tests/README.md for detailed test documentation.

🔧 How It Works

Entropy Analysis

The tool uses Shannon entropy to measure randomness in character distributions:

  • High entropy (>2.5 bits): Random-looking strings like abc123def456
  • Low entropy (<2.0 bits): Structured text like common words
  • Pattern boosting: Raises the entropy score for strings that mix upper and lower case or combine letters and digits
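
The entropy measure described above can be sketched in a few lines. This is a minimal illustration of Shannon entropy over character frequencies, not the package's actual implementation; the function name `shannon_entropy` is an assumption made for this example, and pattern boosting is left out.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    # H = sum(p * log2(1 / p)) over the frequency p of each distinct character.
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


print(round(shannon_entropy("abc123def456"), 3))  # 3.585
print(round(shannon_entropy("aaaaaaaa"), 3))      # 0.0
```

A string of length n can reach at most log2(n) bits, so a four-character string tops out at 2.0 bits before any boosting. That ceiling is one reason a minimum length matters alongside the threshold.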

Two-Phase Detection

  1. Pattern-based detection: Identifies structured data using regex patterns
  2. Entropy-based analysis: Catches random strings missed by patterns
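
The two phases can be sketched as a regex pass followed by a per-token entropy check. Everything here is an assumption for illustration: the `IPV4` pattern is a simplified stand-in for an entry in redaction_patterns.yaml, the constants mirror the default settings shown earlier, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for one pattern from redaction_patterns.yaml.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
THRESHOLD = 2.5   # mirrors default_threshold
MIN_LENGTH = 4    # mirrors min_length


def entropy(text):
    """Shannon entropy of the characters in `text`, in bits."""
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def redact(text):
    # Phase 1: mask structured data matched by a regex pattern.
    text = IPV4.sub(lambda m: "*" * len(m.group(0)), text)

    # Phase 2: mask remaining alphanumeric tokens whose entropy is high.
    def by_entropy(match):
        token = match.group(0)
        if len(token) >= MIN_LENGTH and entropy(token) > THRESHOLD:
            return "*" * len(token)
        return token

    return re.sub(r"[A-Za-z0-9]+", by_entropy, text)


print(redact("host 10.0.0.1 key abc123def456 ok"))
# host ******** key ************ ok
```

On its own, the entropy check in this sketch would also mask an ordinary word such as `production` (about 3.12 bits), which is why a word preservation step is needed before tokens are redacted.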

Word Preservation

  • Extensive dictionary of common English and technical terms
  • Pattern matching for word prefixes/suffixes
  • Contextual analysis to avoid redacting legitimate words
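
A dictionary lookup combined with prefix and suffix stripping might look like the sketch below. The word lists are abbreviated and hypothetical (the real ones live in common_words.yaml), the name `is_preserved` is an assumption, and the contextual analysis step is not modelled.

```python
# Hypothetical, abbreviated lists standing in for common_words.yaml.
COMMON_WORDS = {"database", "connection", "server", "production", "deploy", "cache"}
PREFIXES = ("pre", "post", "auto")
SUFFIXES = ("able", "ed", "er", "ful")


def is_preserved(token: str) -> bool:
    """Return True if `token` is a known word, optionally with one affix."""
    word = token.lower()
    if word in COMMON_WORDS:
        return True
    # Strip one known prefix and look the remaining stem up again.
    for prefix in PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in COMMON_WORDS:
            return True
    # Do the same for one known suffix.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in COMMON_WORDS:
            return True
    return False


for token in ["Database", "predeploy", "deployed", "abc123def456"]:
    print(token, is_preserved(token))
```

Tokens that pass a check like this would be skipped by the entropy phase, so ordinary words with many distinct letters are left intact.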

📁 Project Structure

python-entrophier/
├── src/
│   └── entrophier/
│       ├── __init__.py          # Public API exports
│       ├── __main__.py          # Module entry point
│       ├── cli.py               # Command-line interface
│       ├── config.py            # Configuration loading
│       ├── core.py              # Core redaction logic
│       ├── common_words.yaml    # Word lists and patterns
│       ├── entropy_settings.yaml # Detection and output settings
│       └── redaction_patterns.yaml # Regex patterns for structured data
├── tests/
│   └── test_entropy.py          # Comprehensive test suite
├── pyproject.toml               # Project metadata and dependencies
├── README.md                    # This documentation
└── .python-version              # Python version (3.13)

⚠️ Configuration Requirements

The package will exit with an error if any required configuration files or settings are missing. This ensures consistent behavior and prevents fallback to hardcoded values.

Default Configuration: Config files are bundled with the package in src/entrophier/. For custom configurations, use the --config-dir CLI option or load_config(config_dir="/path/to/config") in Python.

Required sections in YAML files:

  • common_words.yaml: common_words, word_patterns
  • entropy_settings.yaml: entropy_detection, pattern_detection
  • redaction_patterns.yaml: Pattern sections as needed
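
Before pointing the tool at a custom config directory, the required sections can be checked with a few lines of Python. This sketch works on mappings that have already been parsed from YAML; the names `REQUIRED` and `missing_sections` are hypothetical and not part of the package.

```python
# Required top-level sections per file, taken from the list above.
REQUIRED = {
    "common_words.yaml": ["common_words", "word_patterns"],
    "entropy_settings.yaml": ["entropy_detection", "pattern_detection"],
}


def missing_sections(filename: str, loaded: dict) -> list:
    """Return required section names absent from an already-parsed YAML mapping."""
    return [key for key in REQUIRED.get(filename, []) if key not in loaded]


settings = {"entropy_detection": {"default_threshold": 2.5}}
print(missing_sections("entropy_settings.yaml", settings))  # ['pattern_detection']
```

An empty list means every required section for that file is present.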

🎛️ Customization

Adding Custom Patterns

Add to redaction_patterns.yaml:

custom_patterns:
  my_identifier: 'pattern-\d{6}-[a-z]{4}'
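
It can help to try a custom pattern against a few sample strings before adding it to the configuration. The sketch below uses Python's `re` module on the example pattern; the sample strings are made up for illustration.

```python
import re

# Pattern copied from the custom_patterns example above.
my_identifier = re.compile(r"pattern-\d{6}-[a-z]{4}")

samples = ["pattern-123456-abcd", "pattern-12345-abcd", "pattern-123456-ABCD"]
for sample in samples:
    # fullmatch requires the whole string to fit the pattern.
    print(sample, bool(my_identifier.fullmatch(sample)))
```

Only the first sample matches: the second has five digits instead of six, and the third uses uppercase letters where the pattern expects lowercase.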

Adjusting Sensitivity

Modify entropy_settings.yaml:

entropy_detection:
  default_threshold: 3.0    # Less sensitive (higher threshold)
  min_length: 6            # Only redact longer strings

Adding Words to Preserve

Add to common_words.yaml:

common_words:
  - mycompany
  - customterm
  - projectname

📦 Building and Distribution

Build Package

# Build wheel and source distribution
uv build

# Output files in dist/:
# - entrophier-0.1.0-py3-none-any.whl
# - entrophier-0.1.0.tar.gz

Install Built Package

# Install wheel locally
uv pip install dist/entrophier-0.1.0-py3-none-any.whl

# Or install from source distribution
uv pip install dist/entrophier-0.1.0.tar.gz

🤝 Contributing

This is a defensive security tool. Contributions should focus on:

  • Improving detection accuracy
  • Adding new pattern types
  • Enhancing performance
  • Expanding test coverage

📄 License

MIT License - See LICENSE file for details

🔒 Security Note

This tool provides fully configurable sensitive data redaction with no hardcoded assumptions about your data formats. Always verify the redaction results against your own data before using it in production.
