High-Performance Semantic Deduplication Tool for RAG Pipelines
EntropyGuard v1.22.1
The Unbreakable RAG Data Cleaner
Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.
Features • Quick Start • Installation • Documentation
Why EntropyGuard?
The Problem: Dirty Data = Hallucinations & Wasted Money
Training Large Language Models on contaminated, redundant, or low-quality data leads to:
- Model Collapse: degraded performance from duplicate content
- Hallucinations: inaccurate outputs from poor training data
- Wasted Compute: paying to process the same data multiple times
- Compliance Risks: PII and sensitive data in training sets
The Solution: Local CPU Processing with Hybrid Deduplication
EntropyGuard runs 100% locally on your CPU; no data ever leaves your machine. Perfect for:
- Air-gapped environments (no cloud dependencies)
- Privacy compliance (GDPR, HIPAA, SOC 2)
- Cost efficiency (no API calls, no cloud fees)
- Enterprise security (complete data sovereignty)
Key Features
Fault Tolerant
- Checkpoint/Resume System: automatic recovery from failures
- Memory Safety: chunked processing prevents OOM errors
- Graceful Shutdown: SIGINT/SIGTERM handling (Windows + Unix)
- Error Recovery: automatic retry with exponential backoff
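As an illustration of the retry behaviour listed above, here is a minimal sketch of retrying with exponential backoff. The function name `retry_with_backoff`, its parameters, and the delay values are assumptions made for this example; they are not EntropyGuard's actual API or defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on failure with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the logic can be tested without actually waiting.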
High Performance
- Hybrid Engine: hash-based exact dedup + AI semantic similarity
- Unix Pipes Support: stream processing for data engineering workflows
- Lazy Evaluation: Polars LazyFrame for datasets larger than RAM
- Optimized Memory: pre-materialization checks prevent OOM
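The two-stage idea behind a hybrid engine can be sketched as follows: a cheap hash pass removes exact duplicates, then a similarity pass removes near-duplicates above a threshold. This is an illustrative sketch only. `toy_embed` is a bag-of-words stand-in for a real sentence embedding, the whitespace/case normalization before hashing is an assumption, and the linear scan takes the place of a vector index such as FAISS; none of these names come from EntropyGuard's code.

```python
import hashlib
import math
from collections import Counter


def exact_key(text: str) -> str:
    """Hash of whitespace- and case-normalized text (normalization is an assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def toy_embed(text: str) -> Counter:
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_dedup(texts: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first occurrence of each text, dropping exact and near duplicates."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in texts:
        key = exact_key(text)
        if key in seen_hashes:
            continue  # stage 1: exact duplicate, no embedding needed
        seen_hashes.add(key)
        vector = toy_embed(text)
        # Stage 2: linear scan over kept vectors; an index would replace this at scale.
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            continue
        kept.append(text)
        kept_vectors.append(vector)
    return kept
```

Running the hash pass first means the more expensive similarity comparison only sees texts that are not byte-for-byte repeats after normalization.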
Memory Safe
- Chunked Processing: process datasets larger than available RAM
- Memory Profiling: track memory usage per pipeline stage
- Resource Guards: disk space and memory checks before operations
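To show why chunked processing bounds memory, the sketch below streams NDJSON rows and groups them into fixed-size batches, so only one batch is held in memory at a time. The helper names `read_ndjson` and `batched` are hypothetical and written for this example; EntropyGuard itself is described as using Polars LazyFrame, which this stdlib sketch does not reproduce.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator, TextIO


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group a row stream into lists of at most `batch_size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Usage: any text stream works, including sys.stdin or an open file.
sample = io.StringIO('{"text": "a"}\n\n{"text": "b"}\n{"text": "c"}\n')
for batch in batched(read_ndjson(sample), batch_size=2):
    print(len(batch), "rows in this batch")
```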
Observability
- Prometheus Metrics: export pipeline metrics for monitoring
- Structured Logging: JSON logs with correlation IDs
- Progress Tracking: real-time ETA and throughput estimation
- Audit Logs: complete audit trail of all operations
Enterprise Ready
- Standard Exit Codes: sysexits.h-style exit codes for automation
- Type Safety: full type hints (MyPy strict compatible)
- Configuration Validation: Pydantic-based schema validation
- Input Validation: format detection and consistency checks
Quick Start
The "Magic" Command
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
Basic Usage
# File-to-file processing
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--text-column text \
--dedup-threshold 0.95
# With custom settings
entropyguard \
--input data.ndjson \
--output cleaned.ndjson \
--text-column content \
--min-length 100 \
--dedup-threshold 0.9 \
--chunk-size 500
Advanced: Checkpoint & Resume
# Enable automatic checkpoint recovery
entropyguard \
--input large_dataset.jsonl \
--output clean.jsonl \
--checkpoint-dir ./checkpoints \
--text-column text
# Resume from checkpoint manually
entropyguard \
--input large_dataset.jsonl \
--output clean.jsonl \
--checkpoint-dir ./checkpoints \
--resume \
--text-column text
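The checkpoint and resume flow can be pictured with the following minimal sketch: after each batch completes, the index of the next batch is written to disk, and a later run starts from that index. The file layout, the `next_batch` field, and the function names are assumptions for illustration and do not describe EntropyGuard's actual checkpoint format.

```python
import json
import os
from pathlib import Path
from typing import Callable, Sequence


def load_checkpoint(path: Path) -> int:
    """Return the index of the next unprocessed batch (0 if no checkpoint exists)."""
    if not path.exists():
        return 0
    return json.loads(path.read_text(encoding="utf-8"))["next_batch"]


def save_checkpoint(path: Path, next_batch: int) -> None:
    """Persist progress by writing a temp file and renaming it over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_batch": next_batch}), encoding="utf-8")
    # os.replace is atomic on POSIX, so a crash cannot leave a half-written file.
    os.replace(tmp, path)


def run(batches: Sequence, process: Callable, checkpoint: Path) -> None:
    """Process batches in order, skipping those a previous run already finished."""
    start = load_checkpoint(checkpoint)
    for index in range(start, len(batches)):
        process(batches[index])
        save_checkpoint(checkpoint, index + 1)  # only after the batch succeeded
```

Because the checkpoint is saved only after a batch succeeds, a crash mid-batch causes that batch to be reprocessed on resume, so `process` should be safe to repeat.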
Installation
Option 1: pip from PyPI (Recommended)
pip install entropyguard
Requirements:
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
Option 2: Install from Git
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
Requirements:
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
- git available on your system
Option 3: Docker
# Build image
docker build -t entropyguard:latest .
# Run container
docker run -v "$(pwd):/data" entropyguard:latest \
--input /data/input.jsonl \
--output /data/output.jsonl \
--text-column text
Option 4: Development Setup
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
CLI Flags Reference
Complete reference for all available flags:
| Flag | Type | Default | Description |
|---|---|---|---|
| **Input/Output** | | | |
| --input | string | - (stdin) | Path to input file (CSV, JSON, NDJSON). Use - for stdin |
| --output | string | - (stdout) | Path to output file (NDJSON). Use - for stdout |
| --text-column | string | auto-detect | Name of text column to process. Auto-detects first string column if omitted |
| --required-columns | string | None | Comma-separated list of required columns (optional schema validation) |
| **Processing Options** | | | |
| --min-length | int | 50 | Minimum text length after sanitization (characters) |
| --dedup-threshold | float | 0.95 | Similarity threshold for semantic deduplication (0.0-1.0). Higher = stricter |
| --model-name | string | all-MiniLM-L6-v2 | Sentence-transformers model for embeddings. Use paraphrase-multilingual-MiniLM-L12-v2 for multilingual |
| --batch-size | int | 10000 | Batch size for embedding processing. Reduce for low-memory systems |
| **Chunking (RAG)** | | | |
| --chunk-size | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| --chunk-overlap | int | 50 | Overlap size (characters) between consecutive chunks. Only used with --chunk-size |
| --separators | list | default | Custom separators for chunking (space-separated). Use \n for newline, \t for tab |
| **Checkpoint & Resume** | | | |
| --checkpoint-dir | string | None | Directory to save checkpoints for error recovery |
| --resume | flag | false | Resume from last checkpoint if available. Requires --checkpoint-dir |
| --no-auto-resume | flag | false | Disable automatic checkpoint recovery (requires explicit --resume) |
| **Logging & Output** | | | |
| --verbose | flag | false | Enable verbose logging (INFO level) |
| --debug | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies --verbose |
| --demo | flag | false | Demo mode: Hide INFO logs, show only progress bars and final summary |
| --quiet | flag | false | Disable progress bars (useful for CI/CD) |
| --json | flag | false | Output results as JSON (machine-readable format) |
| --json-logs | flag | false | Output logs as JSON (for log aggregation systems) |
| **Monitoring & Profiling** | | | |
| --profile-memory | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| --memory-report-path | string | None | Path to save memory profiling report (JSON). Requires --profile-memory |
| --metrics-port | int | None | Start Prometheus metrics HTTP server on specified port |
| --audit-log | string | None | Path to JSON file for audit log of dropped/duplicate rows |
| **Configuration** | | | |
| --config | string | auto-detect | Path to config file (JSON/YAML/TOML). Auto-detects .entropyguardrc in current/home dir |
| **Utility** | | | |
| --dry-run | flag | false | Simulate processing without expensive operations. Shows statistics only |
| --version | flag | - | Show version number and exit |
Flag Categories Explained
Input/Output: Control where data comes from and goes to. Supports Unix pipes (- for stdin/stdout).
Processing Options: Core deduplication settings. --dedup-threshold controls how similar texts must be to be considered duplicates (0.95 = 95% similarity).
Chunking (RAG): For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
Checkpoint & Resume: Fault tolerance features. Automatically saves progress and can resume from failures.
Logging & Output: Control verbosity and output format. --demo is perfect for video demonstrations.
Monitoring & Profiling: Production observability. Memory profiling helps debug OOM issues, Prometheus metrics enable monitoring.
Configuration: Use config files to avoid repeating flags. CLI arguments override config file values.
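To make the interaction between --chunk-size and --chunk-overlap concrete, here is a small sketch of fixed-width character chunking with overlap. It is a simplified assumption of how such splitting can work: the function `chunk_text` is hypothetical, and it ignores the separator-aware splitting that --separators implies.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which helps keep sentences that straddle a boundary retrievable from at least one chunk.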
Enterprise / Advanced Usage
Configuration File (.entropyguardrc.json)
Create a configuration file in your home directory or project root:
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}
Then run:
entropyguard --input data.jsonl --output clean.jsonl
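The precedence described in this README (built-in defaults, then config file, then CLI arguments) can be sketched as a simple layered merge. The function `resolve_settings` and the convention that an unset flag is `None` are assumptions for this example; the default values are taken from the flags table, but the real tool validates its configuration with Pydantic, which this sketch omits.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Defaults copied from the CLI flags table.
DEFAULTS: dict[str, Any] = {
    "min_length": 50,
    "dedup_threshold": 0.95,
    "chunk_overlap": 50,
}


def resolve_settings(config_path: Optional[Path], cli_args: dict[str, Any]) -> dict[str, Any]:
    """Merge settings so that CLI arguments beat the config file, which beats defaults."""
    settings = dict(DEFAULTS)
    if config_path is not None and config_path.exists():
        settings.update(json.loads(config_path.read_text(encoding="utf-8")))
    # Only flags the user actually passed (non-None here) override the file.
    settings.update({key: value for key, value in cli_args.items() if value is not None})
    return settings
```

With a config file that sets `min_length` to 100 and a command line that passes only `--dedup-threshold 0.9`, the merged result keeps 100 from the file and 0.9 from the CLI.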
Monitoring & Observability
# Enable Prometheus metrics
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--metrics-port 9090 \
--text-column text
# Enable memory profiling
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--profile-memory \
--text-column text
# JSON logs for machine parsing
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--json-logs \
--text-column text
Exit Codes
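EntropyGuard uses exit codes modeled on sysexits.h, alongside the shell conventions 0, 1, 2, and 130 (128 + SIGINT). The meanings it assigns to 64 through 66 differ from the sysexits.h definitions (EX_USAGE, EX_DATAERR, EX_NOINPUT), so automation should rely on this table: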
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid arguments) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (internal bug) |
| 130 | Process interrupted (SIGINT/Ctrl+C) |
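For automation, a wrapper script can translate these exit codes into readable messages. The sketch below is one hedged way to do that from Python; the `EXIT_MEANINGS` mapping mirrors the table above, while `describe_exit` and `run_and_report` are hypothetical helper names, not part of EntropyGuard.

```python
import subprocess
import sys

# Meanings copied from the exit code table in this README.
EXIT_MEANINGS = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid arguments)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (internal bug)",
    130: "Process interrupted (SIGINT/Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Return the documented meaning of an exit code, or a fallback message."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_and_report(argv: list[str]) -> int:
    """Run a command, print what its exit code means to stderr, and return the code."""
    completed = subprocess.run(argv)
    print(describe_exit(completed.returncode), file=sys.stderr)
    return completed.returncode
```

A pipeline step could call `run_and_report(["entropyguard", "--input", "data.jsonl", "--output", "clean.jsonl"])` and branch on the returned code, for example retrying on 1 but failing fast on 2.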
Comparison
| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---|---|---|---|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | โ |
| Semantic Deduplication | ✅ AI-powered | โ | โ |
| Local Processing | ✅ 100% local | โ | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | โ | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | โ |
| Observability | ✅ Metrics + Logs | โ | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | โ | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | โ | ⚠️ Depends on language |
Tech Stack
- Core: Python 3.10+, Polars (LazyFrame)
- AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
- Validation: Pydantic v2
- Logging: structlog (optional)
- Metrics: Prometheus Client (optional)
- Infrastructure: Poetry, Docker-ready
Edition Comparison
EntropyGuard is available in two editions:
| Feature | Community (Open Source) | Enterprise |
|---|---|---|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |
Legal Notice: Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.
Documentation
Contributing
Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with ❤️ by the EntropyGuard Team
Special thanks to:
- Polars for the amazing DataFrame library
- Sentence-Transformers for semantic embeddings
- FAISS for vector similarity search
Made with ❤️ for the LLM community