
High-Performance Semantic Deduplication Tool for RAG Pipelines


🛡️ EntropyGuard v1.22.0

The Unbreakable RAG Data Cleaner

License: MIT | Python 3.10+ | Docker | Production Ready

Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.



Why EntropyGuard?

The Problem: Dirty Data = Hallucinations & Wasted Money

Training Large Language Models on contaminated, redundant, or low-quality data leads to:

  • Model Collapse: Degraded performance from duplicate content
  • Hallucinations: Inaccurate outputs from poor training data
  • Wasted Compute: Paying to process the same data multiple times
  • Compliance Risks: PII and sensitive data in training sets

The Solution: Local CPU Processing with Hybrid Deduplication

EntropyGuard runs 100% locally on your CPU, so no data ever leaves your machine. Perfect for:

  • Air-gapped environments (no cloud dependencies)
  • Privacy compliance (GDPR, HIPAA, SOC 2)
  • Cost efficiency (no API calls, no cloud fees)
  • Enterprise security (complete data sovereignty)

✨ Key Features

🛡️ Fault Tolerant

  • Checkpoint/Resume System: Automatic recovery from failures
  • Memory Safety: Chunked processing prevents OOM errors
  • Graceful Shutdown: SIGINT/SIGTERM handling (Windows + Unix)
  • Error Recovery: Automatic retry with exponential backoff
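
The retry-with-exponential-backoff idea can be sketched in a few lines of Python. This is an illustration of the general technique only, not EntropyGuard's internal code: the function name `retry_with_backoff`, its parameters, and the choice to retry only on `OSError` are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is used up.

    Hypothetical helper for illustration; not part of the EntropyGuard API.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on every attempt (0.5s, 1s, 2s, ...),
            # is capped at max_delay, and gets up to 10% random jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behaviour can be checked without actually waiting.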

🚀 High Performance

  • Hybrid Engine: Hash-based exact dedup + AI semantic similarity
  • Unix Pipes Support: Stream processing for data engineering workflows
  • Lazy Evaluation: Polars LazyFrame for datasets larger than RAM
  • Optimized Memory: Pre-materialization checks prevent OOM
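
To make the hybrid idea concrete, here is a minimal two-stage sketch: a hash pass removes exact duplicates cheaply, then a similarity pass drops near-duplicates above a threshold. It is a toy under stated assumptions, not EntropyGuard's implementation. The Tech Stack lists Sentence-Transformers and FAISS; this sketch swaps in a bag-of-words vector and a plain pairwise loop so it runs on the standard library alone. All function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_dedup(texts: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_vectors: list[Counter] = []
    for text in texts:
        # Stage 1: exact duplicates are caught by a hash lookup.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 2: near duplicates are caught by vector similarity.
        vector = embed(text)
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            continue
        kept.append(text)
        kept_vectors.append(vector)
    return kept
```

The pairwise loop in stage 2 is quadratic in the number of kept documents, which is why real pipelines use an approximate nearest-neighbour index for this step.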

📉 Memory Safe

  • Chunked Processing: Process datasets larger than available RAM
  • Memory Profiling: Track memory usage per pipeline stage
  • Resource Guards: Disk space and memory checks before operations
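
The general pattern behind chunked processing is to read a bounded number of records at a time rather than loading the whole file. The sketch below shows that pattern for JSONL input using only the standard library. It is not EntropyGuard's code (which, per the feature list, builds on Polars LazyFrame), and the name `iter_batches` and its `batch_size` parameter are hypothetical; `batch_size` is not meant to correspond to the `--chunk-size` flag.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` parsed JSONL records.

    Only one batch is held in memory at a time. Blank lines are skipped.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A generator expression keeps parsing lazy: nothing is read up front.
    records = (json.loads(line) for line in lines if line.strip())
    while batch := list(islice(records, batch_size)):
        yield batch
```

Because an open file object is itself an iterable of lines, `iter_batches(open("data.jsonl", encoding="utf-8"), 1000)` would stream a file of any size.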

📊 Observability

  • Prometheus Metrics: Export pipeline metrics for monitoring
  • Structured Logging: JSON logs with correlation IDs
  • Progress Tracking: Real-time ETA and throughput estimation
  • Audit Logs: Complete audit trail of all operations

🔒 Enterprise Ready

  • Standard Exit Codes: sysexits.h compliant for automation
  • Type Safety: Full type hints (MyPy strict compatible)
  • Configuration Validation: Pydantic-based schema validation
  • Input Validation: Format detection and consistency checks

⚡ Quick Start

The "Magic" Command

# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl

Basic Usage

# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95

# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500

Advanced: Checkpoint & Resume

# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text

# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text
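
For readers curious how checkpoint-and-resume works in principle, the following sketch persists a progress marker after each item and picks up from it on the next run. It illustrates the concept only: the checkpoint file layout (`{"next_index": ...}`) and the function `process_with_checkpoint` are invented for this example and say nothing about what EntropyGuard writes into `--checkpoint-dir`.

```python
import json
import os
from pathlib import Path
from typing import Callable, Sequence


def process_with_checkpoint(
    items: Sequence[str],
    checkpoint_path: Path,
    handle: Callable[[str], None],
) -> int:
    """Process `items` in order, recording progress after each one.

    Returns the index processing started from (0 on a fresh run).
    Hypothetical helper for illustration only.
    """
    start = 0
    if checkpoint_path.exists():
        start = json.loads(checkpoint_path.read_text())["next_index"]
    for index in range(start, len(items)):
        handle(items[index])
        # Write to a temporary file and rename it into place so a crash
        # mid-write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps({"next_index": index + 1}))
        os.replace(tmp_path, checkpoint_path)
    return start
```

If `handle` raises partway through, the checkpoint still points at the failed item, so a second call retries it and continues from there.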

📦 Installation

Option 1: pip from PyPI (Recommended)

pip install entropyguard

Requirements:

  • Python 3.10, 3.11, or 3.12 (3.13 not supported yet)

Option 2: Install from Git

pip install "git+https://github.com/DamianSiuta/entropyguard.git"

Requirements:

  • Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
  • git available on your system

Option 3: Docker

# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v "$(pwd)":/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text

Option 4: Development Setup

git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install

🏢 Enterprise / Advanced Usage

Configuration File (.entropyguardrc.json)

Create a configuration file in your home directory or project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}

Then run:

entropyguard --input data.jsonl --output clean.jsonl
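
If you generate this file from a script, it can help to sanity-check it before launching a long run. The sketch below does so with a standard-library dataclass. It is a stand-in for illustration, not the tool's Pydantic schema: the field names come from the sample file above, but the defaults simply mirror that sample, and the range rules (threshold between 0 and 1, overlap smaller than chunk size) are assumptions.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical mirror of the sample .entropyguardrc.json keys."""

    text_column: str = "text"
    min_length: int = 100
    dedup_threshold: float = 0.95
    chunk_size: int = 500
    chunk_overlap: int = 50
    remove_pii: bool = True
    normalize_text: bool = True
    show_progress: bool = True

    def __post_init__(self) -> None:
        # Assumed constraints; the real schema may differ.
        if not 0.0 <= self.dedup_threshold <= 1.0:
            raise ValueError("dedup_threshold must be between 0 and 1")
        if self.min_length < 0:
            raise ValueError("min_length must not be negative")
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


def load_config(raw_json: str) -> PipelineConfig:
    """Parse a JSON string and reject keys the sketch does not know."""
    data = json.loads(raw_json)
    known = {field.name for field in fields(PipelineConfig)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return PipelineConfig(**data)
```

Rejecting unknown keys catches typos such as `dedup_treshold` that would otherwise be ignored silently.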

Monitoring & Observability

# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text

# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text

# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text

Exit Codes

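EntropyGuard uses exit codes modeled on sysexits.h. The meanings listed here are the project's own: sysexits.h itself assigns 64 to usage errors (EX_USAGE), 65 to data format errors (EX_DATAERR), and 66 to missing input (EX_NOINPUT), so scripts should rely on this table rather than on the header's names: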

Code | Meaning
---- | -------
0    | Success
1    | General error
2    | Usage error (invalid arguments)
64   | Data format error
65   | Input file error
66   | Output file error
70   | Software error (internal bug)
130  | Process interrupted (SIGINT/Ctrl+C)
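
In automation it is often handy to turn these numbers into names. The sketch below maps the codes listed above to an enum and classifies the return code of a subprocess. The enum member names and the helpers `classify` and `run_and_classify` are hypothetical, chosen for this example; only the numeric values come from the list above.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as documented above; member names are illustrative."""

    SUCCESS = 0
    GENERAL_ERROR = 1
    USAGE_ERROR = 2
    DATA_FORMAT_ERROR = 64
    INPUT_FILE_ERROR = 65
    OUTPUT_FILE_ERROR = 66
    SOFTWARE_ERROR = 70
    INTERRUPTED = 130


def classify(returncode: int) -> str:
    """Return a readable name for a return code, or UNKNOWN(n)."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def run_and_classify(argv: list[str]) -> str:
    """Run a command and classify how it exited."""
    completed = subprocess.run(argv, check=False)
    return classify(completed.returncode)
```

For example, `run_and_classify(["entropyguard", "--input", "data.jsonl", "--output", "clean.jsonl"])` would let a scheduler branch on the result name instead of a bare integer.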

📊 Comparison

Feature                | EntropyGuard           | Basic Scripts | Vector DBs
---------------------- | ---------------------- | ------------- | ----------------------
Exact Deduplication    | ✅ Hash-based (fast)   | ⚠️ Manual     | ❌
Semantic Deduplication | ✅ AI-powered          | ❌            | ✅
Local Processing       | ✅ 100% local          | ✅            | ⚠️ Requires DB
Memory Safety          | ✅ Chunked processing  | ⚠️ Manual     | ⚠️ Depends on DB
Fault Tolerance        | ✅ Checkpoint/Resume   | ❌            | ⚠️ Depends on DB
Unix Pipes             | ✅ Native support      | ⚠️ Manual     | ❌
Observability          | ✅ Metrics + Logs      | ❌            | ⚠️ Depends on DB
Configuration          | ✅ Pydantic validation | ❌            | ⚠️ DB-specific
Type Safety            | ✅ Full type hints     | ❌            | ⚠️ Depends on language

🛠️ Tech Stack

  • Core: Python 3.10+, Polars (LazyFrame)
  • AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
  • Validation: Pydantic v2
  • Logging: structlog (optional)
  • Metrics: Prometheus Client (optional)
  • Infrastructure: Poetry, Docker-ready

📋 Edition Comparison

EntropyGuard is available in two editions:

Feature                | Community (Open Source) | Enterprise
---------------------- | ----------------------- | ----------------------------------
CLI Tool               | ✅ Full-featured        | ✅ Full-featured
Semantic Deduplication | ✅ Unlimited            | ✅ Unlimited
PII Removal            | ✅ Unlimited            | ✅ Unlimited
Data Formats           | ✅ All formats          | ✅ All formats
Docker Support         | ✅ Yes                  | ✅ Yes
Audit Logs             | ✅ Yes                  | ✅ Enhanced
Web Dashboard          | ❌                      | ✅ Professional Analytics Platform
Real-time Monitoring   | ❌                      | ✅ Live telemetry & metrics
Alert System           | ❌                      | ✅ Custom alert rules (Watchtower)
API Access             | ❌                      | ✅ RESTful API
SSO Integration        | ❌                      | ✅ SAML 2.0, OAuth 2.0
Support                | Community               | Priority support with SLA
License                | MIT License             | Commercial license required

📌 Legal Notice: Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.


📚 Documentation


🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


🙏 Acknowledgments

Built with ❤️ by the EntropyGuard Team

Special thanks to:



Made with ❤️ for the LLM community


