
High-Performance Semantic Deduplication Tool for RAG Pipelines


🛡️ EntropyGuard v1.22.1

The Unbreakable RAG Data Cleaner

License: MIT Python 3.10+ Docker Production Ready

Enterprise-grade semantic data deduplication and sanitization engine for LLM training data.

Features • Quick Start • Installation • Documentation


Why EntropyGuard?

The Problem: Dirty Data = Hallucinations & Wasted Money

Training Large Language Models on contaminated, redundant, or low-quality data leads to:

  • Model Collapse: Degraded performance from duplicate content
  • Hallucinations: Inaccurate outputs from poor training data
  • Wasted Compute: Paying for processing duplicate data multiple times
  • Compliance Risks: PII and sensitive data in training sets

The Solution: Local CPU Processing with Hybrid Deduplication

EntropyGuard runs 100% locally on your CPU; no data ever leaves your machine. Perfect for:

  • Air-gapped environments (no cloud dependencies)
  • Privacy compliance (GDPR, HIPAA, SOC 2)
  • Cost efficiency (no API calls, no cloud fees)
  • Enterprise security (complete data sovereignty)

✨ Key Features

🛡️ Fault Tolerant

  • Checkpoint/Resume System: Automatic recovery from failures
  • Memory Safety: Chunked processing prevents OOM errors
  • Graceful Shutdown: SIGINT/SIGTERM handling (Windows + Unix)
  • Error Recovery: Automatic retry with exponential backoff
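
The retry behaviour in the last bullet can be pictured with a minimal sketch. This is not EntropyGuard's own code: `retry_with_backoff`, its parameters, and the delay values are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each failure (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable only so the sketch can be exercised without real waiting.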

🚀 High Performance

  • Hybrid Engine: Hash-based exact dedup + AI semantic similarity
  • Unix Pipes Support: Stream processing for data engineering workflows
  • Lazy Evaluation: Polars LazyFrame for datasets larger than RAM
  • Optimized Memory: Pre-materialization checks prevent OOM
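
As a rough illustration of the hash-based half of the hybrid engine, the sketch below drops exact duplicates by hashing each text. The whitespace and case normalization and the choice of SHA-256 are assumptions made for this example; the README only states that exact deduplication is hash-based.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return " ".join(text.split()).lower()


def drop_exact_duplicates(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text the first time its normalized SHA-256 digest is seen."""
    seen: set[str] = set()
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text  # the first occurrence is kept unchanged
```

Because the function is a generator that only stores digests, it can sit in front of a more expensive semantic stage without holding the full texts in memory.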

📉 Memory Safe

  • Chunked Processing: Process datasets larger than available RAM
  • Memory Profiling: Track memory usage per pipeline stage
  • Resource Guards: Disk space and memory checks before operations
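
A resource guard of the kind listed above might look like the following sketch, which checks free disk space before a write. The function name `has_enough_disk` and the 1.2 safety factor are hypothetical; how EntropyGuard performs its checks is not documented here.

```python
import shutil


def has_enough_disk(path: str, required_bytes: int, safety_factor: float = 1.2) -> bool:
    """Return True if the filesystem holding `path` has room for the write.

    `safety_factor` pads the estimate so the disk is not filled to the brim.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * safety_factor
```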

📊 Observability

  • Prometheus Metrics: Export pipeline metrics for monitoring
  • Structured Logging: JSON logs with correlation IDs
  • Progress Tracking: Real-time ETA and throughput estimation
  • Audit Logs: Complete audit trail of all operations
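
To show what JSON logs with a correlation ID can look like, here is a small sketch built on the standard `logging` module. It is an illustration only: the tech stack lists structlog as the optional logging library, and the field names (`level`, `message`, `correlation_id`, `stage`) are assumptions rather than EntropyGuard's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object tagged with a run-wide ID."""

    def __init__(self, correlation_id: str) -> None:
        super().__init__()
        self.correlation_id = correlation_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "correlation_id": self.correlation_id,
                # `stage` is only present if the caller passed extra={"stage": ...}.
                "stage": getattr(record, "stage", None),
            }
        )
```

Attach it with `handler.setFormatter(JsonFormatter(run_id))` and log with `logger.info("dropped %d rows", 3, extra={"stage": "dedup"})`; every line from the same run then carries the same `correlation_id`.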

🔒 Enterprise Ready

  • Standard Exit Codes: sysexits.h compliant for automation
  • Type Safety: Full type hints (MyPy strict compatible)
  • Configuration Validation: Pydantic-based schema validation
  • Input Validation: Format detection and consistency checks

⚡ Quick Start

The "Magic" Command

# Unix pipe example (the most common use case)
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl

Basic Usage

# File-to-file processing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --text-column text \
  --dedup-threshold 0.95

# With custom settings
entropyguard \
  --input data.ndjson \
  --output cleaned.ndjson \
  --text-column content \
  --min-length 100 \
  --dedup-threshold 0.9 \
  --chunk-size 500

Advanced: Checkpoint & Resume

# Enable automatic checkpoint recovery
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --text-column text

# Resume from checkpoint manually
entropyguard \
  --input large_dataset.jsonl \
  --output clean.jsonl \
  --checkpoint-dir ./checkpoints \
  --resume \
  --text-column text

📦 Installation

Option 1: pip from PyPI (Recommended)

pip install entropyguard

Requirements:

  • Python 3.10, 3.11, or 3.12 (3.13 not supported yet)

Option 2: Install from Git

pip install "git+https://github.com/DamianSiuta/entropyguard.git"

Requirements:

  • Python 3.10, 3.11, or 3.12 (3.13 not supported yet)
  • git available on your system

Option 3: Docker

# Build image
docker build -t entropyguard:latest .

# Run container
docker run -v "$(pwd)":/data entropyguard:latest \
  --input /data/input.jsonl \
  --output /data/output.jsonl \
  --text-column text

Option 4: Development Setup

git clone https://github.com/DamianSiuta/entropyguard.git
cd entropyguard
poetry install

📋 CLI Flags Reference

Complete reference for all available flags:

| Flag | Type | Default | Description |
|---|---|---|---|
| Input/Output | | | |
| --input | string | - (stdin) | Path to input file (CSV, JSON, NDJSON). Use - for stdin |
| --output | string | - (stdout) | Path to output file (NDJSON). Use - for stdout |
| --text-column | string | auto-detect | Name of text column to process. Auto-detects first string column if omitted |
| --required-columns | string | None | Comma-separated list of required columns (optional schema validation) |
| Processing Options | | | |
| --min-length | int | 50 | Minimum text length after sanitization (characters) |
| --dedup-threshold | float | 0.95 | Similarity threshold for semantic deduplication (0.0-1.0). Higher = stricter |
| --model-name | string | all-MiniLM-L6-v2 | Sentence-transformers model for embeddings. Use paraphrase-multilingual-MiniLM-L12-v2 for multilingual |
| --batch-size | int | 10000 | Batch size for embedding processing. Reduce for low-memory systems |
| Chunking (RAG) | | | |
| --chunk-size | int | None | Chunk size (characters) for splitting long texts. Disabled if not set |
| --chunk-overlap | int | 50 | Overlap size (characters) between consecutive chunks. Only used with --chunk-size |
| --separators | list | default | Custom separators for chunking (space-separated). Use \n for newline, \t for tab |
| Checkpoint & Resume | | | |
| --checkpoint-dir | string | None | Directory to save checkpoints for error recovery |
| --resume | flag | false | Resume from last checkpoint if available. Requires --checkpoint-dir |
| --no-auto-resume | flag | false | Disable automatic checkpoint recovery (requires explicit --resume) |
| Logging & Output | | | |
| --verbose | flag | false | Enable verbose logging (INFO level) |
| --debug | flag | false | Enable debug mode (DEBUG level + full tracebacks). Implies --verbose |
| --demo | flag | false | Demo mode: Hide INFO logs, show only progress bars and final summary |
| --quiet | flag | false | Disable progress bars (useful for CI/CD) |
| --json | flag | false | Output results as JSON (machine-readable format) |
| --json-logs | flag | false | Output logs as JSON (for log aggregation systems) |
| Monitoring & Profiling | | | |
| --profile-memory | flag | false | Enable memory profiling. Tracks usage at each pipeline stage |
| --memory-report-path | string | None | Path to save memory profiling report (JSON). Requires --profile-memory |
| --metrics-port | int | None | Start Prometheus metrics HTTP server on specified port |
| --audit-log | string | None | Path to JSON file for audit log of dropped/duplicate rows |
| Configuration | | | |
| --config | string | auto-detect | Path to config file (JSON/YAML/TOML). Auto-detects .entropyguardrc in current/home dir |
| Utility | | | |
| --dry-run | flag | false | Simulate processing without expensive operations. Shows statistics only |
| --version | flag | - | Show version number and exit |

Flag Categories Explained

Input/Output: Control where data comes from and goes to. Supports Unix pipes (- for stdin/stdout).

Processing Options: Core deduplication settings. --dedup-threshold controls how similar texts must be to be considered duplicates (0.95 = 95% similarity).
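
To make the threshold concrete, the sketch below applies a similarity cutoff to embedding vectors that have already been computed. It is a toy illustration under stated assumptions: cosine similarity as the metric, a greedy keep-first strategy, and treating a score equal to the threshold as a duplicate are all choices made for this example, and plain Python lists stand in for the sentence-transformers embeddings and FAISS index named in the tech stack.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_indices(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> list[int]:
    """Keep a row only if it is below `threshold` similarity to every kept row."""
    kept: list[int] = []
    for i, vector in enumerate(embeddings):
        if all(cosine_similarity(vector, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

With this logic, raising the threshold means fewer rows are treated as duplicates, since texts must be closer to each other before one of them is dropped. The pairwise loop is quadratic and only suitable for small inputs.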

Chunking (RAG): For Retrieval-Augmented Generation workflows. Splits long texts into smaller chunks with configurable overlap.
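
The interaction between --chunk-size and --chunk-overlap can be sketched with fixed-width character windows. This simplified version ignores --separators entirely, and `chunk_text` is a hypothetical helper, so it should be read as an illustration of overlap rather than a description of EntropyGuard's splitter.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For example, `chunk_text("abcdefghij", 4, 2)` advances two characters at a time, so each chunk repeats the last two characters of the one before it.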

Checkpoint & Resume: Fault tolerance features. Automatically saves progress and can resume from failures.
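
One common way to implement this kind of resume is to record the index of the next unprocessed batch after each successful step. The sketch below assumes that design; the file layout, the `next_batch` key, and the function name are hypothetical and do not describe EntropyGuard's checkpoint format.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def process_with_checkpoint(
    batches: Sequence[str],
    handle: Callable[[str], None],
    checkpoint_file: Path,
) -> int:
    """Process batches in order, resuming after the last completed one.

    Returns the number of batches processed during this call.
    """
    start = 0
    if checkpoint_file.exists():
        start = json.loads(checkpoint_file.read_text())["next_batch"]

    processed = 0
    for index in range(start, len(batches)):
        handle(batches[index])
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = checkpoint_file.with_suffix(".tmp")
        tmp.write_text(json.dumps({"next_batch": index + 1}))
        tmp.replace(checkpoint_file)
        processed += 1
    return processed
```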

Logging & Output: Control verbosity and output format. --demo is perfect for video demonstrations.

Monitoring & Profiling: Production observability. Memory profiling helps debug OOM issues, Prometheus metrics enable monitoring.

Configuration: Use config files to avoid repeating flags. CLI arguments override config file values.
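
The precedence rule (CLI arguments over config file values over defaults) can be sketched as a simple layered merge. The defaults below are taken from the flags reference; `resolve_settings` and the convention that an unset CLI flag is represented by `None` are assumptions for this example.

```python
from typing import Any, Mapping

# Defaults as listed in the CLI flags reference.
DEFAULTS: dict[str, Any] = {
    "min_length": 50,
    "dedup_threshold": 0.95,
    "chunk_overlap": 50,
}


def resolve_settings(
    file_config: Mapping[str, Any],
    cli_args: Mapping[str, Any],
) -> dict[str, Any]:
    """Layer settings: defaults first, then the config file, then CLI flags."""
    settings = dict(DEFAULTS)
    settings.update(file_config)
    # Only flags the user actually passed (non-None) override earlier layers.
    settings.update({key: value for key, value in cli_args.items() if value is not None})
    return settings
```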


๐Ÿข Enterprise / Advanced Usage

Configuration File (.entropyguardrc.json)

Create a configuration file in your home directory or project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "chunk_size": 500,
  "chunk_overlap": 50,
  "remove_pii": true,
  "normalize_text": true,
  "show_progress": true
}

Then run:

entropyguard --input data.jsonl --output clean.jsonl

Monitoring & Observability

# Enable Prometheus metrics
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --metrics-port 9090 \
  --text-column text

# Enable memory profiling
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --profile-memory \
  --text-column text

# JSON logs for machine parsing
entropyguard \
  --input data.jsonl \
  --output clean.jsonl \
  --json-logs \
  --text-column text

Exit Codes

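EntropyGuard's exit codes are modeled on sysexits.h, with some differences: in sysexits.h itself, 64 is EX_USAGE, 65 is EX_DATAERR, and 66 is EX_NOINPUT, so automation should rely on the meanings listed here: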

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid arguments) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (internal bug) |
| 130 | Process interrupted (SIGINT/Ctrl+C) |
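
For automation, the codes above can be mapped to messages in a small wrapper. This is a sketch: the mapping is copied from the table, while `describe_exit_code` and `run_entropyguard` are hypothetical helper names, and the wrapper assumes the `entropyguard` executable is on PATH.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid arguments)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (internal bug)",
    130: "Process interrupted (SIGINT/Ctrl+C)",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit status into the meaning documented above."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_entropyguard(args: list[str]) -> str:
    """Run the CLI with `args` and return a description of its exit status."""
    completed = subprocess.run(["entropyguard", *args])
    return describe_exit_code(completed.returncode)
```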

📊 Comparison

| Feature | EntropyGuard | Basic Scripts | Vector DBs |
|---|---|---|---|
| Exact Deduplication | ✅ Hash-based (fast) | ⚠️ Manual | ❌ |
| Semantic Deduplication | ✅ AI-powered | ❌ | ✅ |
| Local Processing | ✅ 100% local | ✅ | ⚠️ Requires DB |
| Memory Safety | ✅ Chunked processing | ⚠️ Manual | ⚠️ Depends on DB |
| Fault Tolerance | ✅ Checkpoint/Resume | ❌ | ⚠️ Depends on DB |
| Unix Pipes | ✅ Native support | ⚠️ Manual | ❌ |
| Observability | ✅ Metrics + Logs | ❌ | ⚠️ Depends on DB |
| Configuration | ✅ Pydantic validation | ❌ | ⚠️ DB-specific |
| Type Safety | ✅ Full type hints | ❌ | ⚠️ Depends on language |

🛠️ Tech Stack

  • Core: Python 3.10+, Polars (LazyFrame)
  • AI/ML: PyTorch (CPU), FAISS, Sentence-Transformers
  • Validation: Pydantic v2
  • Logging: structlog (optional)
  • Metrics: Prometheus Client (optional)
  • Infrastructure: Poetry, Docker-ready

📋 Edition Comparison

EntropyGuard is available in two editions:

| Feature | Community (Open Source) | Enterprise |
|---|---|---|
| CLI Tool | ✅ Full-featured | ✅ Full-featured |
| Semantic Deduplication | ✅ Unlimited | ✅ Unlimited |
| PII Removal | ✅ Unlimited | ✅ Unlimited |
| Data Formats | ✅ All formats | ✅ All formats |
| Docker Support | ✅ Yes | ✅ Yes |
| Audit Logs | ✅ Yes | ✅ Enhanced |
| Web Dashboard | ❌ | ✅ Professional Analytics Platform |
| Real-time Monitoring | ❌ | ✅ Live telemetry & metrics |
| Alert System | ❌ | ✅ Custom alert rules (Watchtower) |
| API Access | ❌ | ✅ RESTful API |
| SSO Integration | ❌ | ✅ SAML 2.0, OAuth 2.0 |
| Support | Community | Priority support with SLA |
| License | MIT License | Commercial license required |

📌 Legal Notice: Enterprise features (Control Plane, Dashboard, API, Alerting System) are proprietary software covered by a commercial license. These components are NOT included in the Open Source release and are NOT subject to the MIT license terms.


📚 Documentation


๐Ÿค Contributing

Contributions are welcome! Please read our contributing guidelines and code of conduct before submitting pull requests.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

Built with โค๏ธ by the EntropyGuard Team

Special thanks to:



Made with ❤️ for the LLM community
