High-Performance Semantic Deduplication Tool for RAG Pipelines
EntropyGuard v1.22.0
The Unbreakable RAG Data Cleaner
Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.
Features • Quick Start • Installation • Documentation
Why EntropyGuard?
The Problem: Dirty Data = Hallucinations & Wasted Money
Training Large Language Models on contaminated, redundant, or low-quality data leads to:
- Model Collapse: Degraded performance from duplicate content
- Hallucinations: Inaccurate outputs from poor training data
- Wasted Compute: Paying to process the same data multiple times
- Compliance Risks: PII and sensitive data in training sets
The Solution: Local CPU Processing with Hybrid Deduplication
EntropyGuard runs 100% locally on your CPU; no data ever leaves your machine. Perfect for:
- Air-gapped environments (no cloud dependencies)
- Privacy compliance (GDPR, HIPAA, SOC 2)
- Cost efficiency (no API calls, no cloud fees)
- Enterprise security (complete data sovereignty)
Key Features
Fault Tolerant
- Checkpoint/Resume System: Automatic recovery from failures
- Memory Safety: Chunked processing prevents OOM errors
- Graceful Shutdown: SIGINT/SIGTERM handling (Windows + Unix)
- Error Recovery: Automatic retry with exponential backoff
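As an illustration of the "retry with exponential backoff" idea, here is a minimal sketch. It is not EntropyGuard's internal code: the function name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `sleep`) are hypothetical and chosen only for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting base_delay * 2**n seconds after the n-th failure.

    Hypothetical helper for illustration; not part of the EntropyGuard API.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # After the final attempt, surface the original error unchanged.
            if attempt == max_attempts - 1:
                raise
            # Delays double each time: 0.5 s, 1 s, 2 s, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the delay schedule can be checked without actually waiting.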
High Performance
- Hybrid Engine: Hash-based exact dedup + AI semantic similarity
- Unix Pipes Support: Stream processing for data engineering workflows
- Lazy Evaluation: Polars LazyFrame for datasets larger than RAM
- Optimized Memory: Pre-materialization checks prevent OOM
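The two-stage idea behind a hybrid engine can be sketched as follows: a cheap hash pass removes exact duplicates, then a similarity pass removes near-duplicates above a threshold. This is a simplified illustration under stated assumptions, not EntropyGuard's implementation. In particular, `toy_embedding` is a bag-of-words stand-in for a real sentence embedding, the pairwise comparison stands in for a vector index such as FAISS, and all function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def exact_key(text: str) -> str:
    """SHA-256 of case- and whitespace-normalized text, for exact matching."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def toy_embedding(text: str) -> Counter:
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_dedup(texts: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in texts:
        # Stage 1: exact duplicates are rejected by hash lookup.
        key = exact_key(text)
        if key in seen_hashes:
            continue
        seen_hashes.add(key)
        # Stage 2: near duplicates are rejected by similarity to kept texts.
        vector = toy_embedding(text)
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            continue
        kept.append(text)
        kept_vectors.append(vector)
    return kept
```

Running the hash pass first means the more expensive similarity comparison only sees texts that are not already byte-for-byte repeats after normalization.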
Memory Safe
- Chunked Processing: Process datasets larger than available RAM
- Memory Profiling: Track memory usage per pipeline stage
- Resource Guards: Disk space and memory checks before operations
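Chunked processing in general means holding only a bounded batch of records in memory at a time. The sketch below shows that pattern for JSONL input. The helper `iter_batches` and its `batch_size` parameter are hypothetical names for this example and are not the same thing as the tool's `--chunk-size` flag.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int = 1000) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` parsed JSONL records.

    Hypothetical helper for illustration; blank lines are skipped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    non_empty = (line for line in lines if line.strip())
    while True:
        # islice pulls the next batch_size lines without reading further ahead.
        batch = [json.loads(line) for line in islice(non_empty, batch_size)]
        if not batch:
            return
        yield batch


# Usage sketch: an open file is an iterable of lines, so memory use is
# bounded by batch_size rather than by the file size.
#
# with open("data.jsonl", encoding="utf-8") as handle:
#     for batch in iter_batches(handle):
#         ...  # process one batch, then let it be garbage collected
```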
Observability
- Prometheus Metrics: Export pipeline metrics for monitoring
- Structured Logging: JSON logs with correlation IDs
- Progress Tracking: Real-time ETA and throughput estimation
- Audit Logs: Complete audit trail of all operations
Enterprise Ready
- Standard Exit Codes: sysexits.h-style exit codes for automation
- Type Safety: Full type hints (MyPy strict compatible)
- Configuration Validation: Pydantic-based schema validation
- Input Validation: Format detection and consistency checks
Quick Start
The "Magic" Command
# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
Basic Usage
# File-to-file processing
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--text-column text \
--dedup-threshold 0.95
# With custom settings
entropyguard \
--input data.ndjson \
--output cleaned.ndjson \
--text-column content \
--min-length 100 \
--dedup-threshold 0.9 \
--chunk-size 500
Advanced: Checkpoint & Resume
# Enable automatic checkpoint recovery
entropyguard \
--input large_dataset.jsonl \
--output clean.jsonl \
--checkpoint-dir ./checkpoints \
--text-column text
# Resume from checkpoint manually
entropyguard \
--input large_dataset.jsonl \
--output clean.jsonl \
--checkpoint-dir ./checkpoints \
--resume \
--text-column text
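To make the checkpoint and resume idea concrete, here is a minimal sketch of how progress could be persisted and read back. It is an assumption-laden illustration, not EntropyGuard's on-disk format: the file name `progress.json`, the `rows_done` field, and both function names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(checkpoint_dir: Path, rows_done: int) -> None:
    """Atomically record how many input rows have been fully processed."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so a crash never leaves a half-written
    # checkpoint behind; os.replace then swaps it in as a single step.
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump({"rows_done": rows_done}, tmp)
    os.replace(tmp_name, checkpoint_dir / "progress.json")


def load_checkpoint(checkpoint_dir: Path) -> int:
    """Return the number of rows already processed, or 0 if starting fresh."""
    path = checkpoint_dir / "progress.json"
    if not path.exists():
        return 0
    return json.loads(path.read_text(encoding="utf-8"))["rows_done"]
```

On resume, a pipeline built this way would skip the first `load_checkpoint(...)` rows of the input and continue from there.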
Installation
Option 1: pip from PyPI (Recommended)
pip install entropyguard
Requirements:
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
Option 2: Install from Git
pip install "git+https://github.com/DamianSiuta/entropyguard.git"
Requirements:
- Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
- git available on your system
Option 3: Docker
# Build image
docker build -t entropyguard:latest .
# Run container
docker run -v "$(pwd)":/data entropyguard:latest \
--input /data/input.jsonl \
--output /data/output.jsonl \
--text-column text
Option 4: Development Setup
git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install
Enterprise / Advanced Usage
Configuration File (.entropyguardrc.json)
Create a configuration file in your home directory or project root:
{
"text_column": "text",
"min_length": 100,
"dedup_threshold": 0.95,
"chunk_size": 500,
"chunk_overlap": 50,
"remove_pii": true,
"normalize_text": true,
"show_progress": true
}
Then run:
entropyguard --input data.jsonl --output clean.jsonl
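The following sketch shows what schema validation of such a file might look like. EntropyGuard is described as using Pydantic for this; to stay dependency-free, the example swaps in a standard-library dataclass instead. The key names come from the example configuration above, but the default values simply mirror that example, and the range checks and the `parse_settings` helper are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the example file above; they are not documented defaults.
    text_column: str = "text"
    min_length: int = 100
    dedup_threshold: float = 0.95
    chunk_size: int = 500
    chunk_overlap: int = 50
    remove_pii: bool = True
    normalize_text: bool = True
    show_progress: bool = True

    def __post_init__(self) -> None:
        # Assumed constraints, chosen for illustration only.
        if not 0.0 < self.dedup_threshold <= 1.0:
            raise ValueError("dedup_threshold must be in (0, 1]")
        if self.min_length < 0:
            raise ValueError("min_length must be non-negative")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


def parse_settings(raw: str) -> Settings:
    """Parse a JSON config string, rejecting keys the schema does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(Settings)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return Settings(**data)
```

Rejecting unknown keys catches typos such as `"threshold"` for `"dedup_threshold"` before a long run starts.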
Monitoring & Observability
# Enable Prometheus metrics
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--metrics-port 9090 \
--text-column text
# Enable memory profiling
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--profile-memory \
--text-column text
# JSON logs for machine parsing
entropyguard \
--input data.jsonl \
--output clean.jsonl \
--json-logs \
--text-column text
Exit Codes
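EntropyGuard uses exit codes modeled on the sysexits.h convention. The meanings below are the ones EntropyGuard assigns; in sysexits.h itself, 64 is EX_USAGE, 65 is EX_DATAERR, and 66 is EX_NOINPUT.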
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid arguments) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (internal bug) |
| 130 | Process interrupted (SIGINT/Ctrl+C) |
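For automation, the codes in the table above can be turned into readable messages by a small wrapper. This is a sketch only: the mapping copies the table, while `describe_exit_code` and `run_and_describe` are hypothetical helper names, not part of EntropyGuard.

```python
import subprocess

# Exit code meanings copied from the table above.
EXIT_MEANINGS = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid arguments)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (internal bug)",
    130: "Process interrupted (SIGINT/Ctrl+C)",
}


def describe_exit_code(code: int) -> str:
    """Return the documented meaning of an exit code, or a fallback message."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_and_describe(args: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code with a readable description."""
    completed = subprocess.run(args, check=False)
    return completed.returncode, describe_exit_code(completed.returncode)


# Usage sketch (assumes entropyguard is installed and on PATH):
#
# code, meaning = run_and_describe(
#     ["entropyguard", "--input", "data.jsonl", "--output", "clean.jsonl"]
# )
# print(code, meaning)
```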
Comparison
| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---|---|---|---|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | ❌ |
| Semantic Deduplication | ✅ AI-powered | ❌ | ❌ |
| Local Processing | ✅ 100% local | ❌ | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | ❌ | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | ❌ |
| Observability | ✅ Metrics + Logs | ❌ | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | ❌ | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | ❌ | ⚠️ Depends on language |
Tech Stack
- Core: Python 3.10+, Polars (LazyFrame)
- AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
- Validation: Pydantic v2
- Logging: structlog (optional)
- Metrics: Prometheus Client (optional)
- Infrastructure: Poetry, Docker-ready
Edition Comparison
EntropyGuard is available in two editions:
| Feature | Community (Open Source) | Enterprise |
|---|---|---|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |
Legal Notice: Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.
Documentation
Contributing
Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built with ❤️ by the EntropyGuard Team
Special thanks to:
- Polars for the amazing DataFrame library
- Sentence-Transformers for semantic embeddings
- FAISS for vector similarity search
Made with ❤️ for the LLM community
File details
Details for the file entropyguard-1.22.0.tar.gz.
File metadata
- Download URL: entropyguard-1.22.0.tar.gz
- Upload date:
- Size: 57.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a2601718b8210593afe56009f5f68cf7332593c13482bd9a01fa6e8d595ec948 |
| MD5 | 9fef12c3ae536710bc9fca9c4715ce6a |
| BLAKE2b-256 | 5263a3c8bda6605fd0cc901b5a4c546dc81cb8c8a75da739b7cd4a155a972dad |
File details
Details for the file entropyguard-1.22.0-py3-none-any.whl.
File metadata
- Download URL: entropyguard-1.22.0-py3-none-any.whl
- Upload date:
- Size: 66.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.11.9 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ada7a295bb252c92c9bb83092e24eb00cf5a4c67a04be703972a634fee44fe3f |
| MD5 | 0c19627a32c19306a1a7065654e4dda5 |
| BLAKE2b-256 | e58bce84345053e7e81535b26f64b73e778f91b459c07f3ccb71025fa86f1ff1 |