Enterprise-grade semantic data deduplication and sanitization engine.
EntropyGuard
I built this because processing 100GB datasets with Pandas kept crashing my laptop (OOM). This is a CLI wrapper around Polars and FAISS to dedup text locally.
Problem
Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and leak data. Custom scripts OOM on large files.
Solution
Two-stage deduplication:
- Exact dedup: xxHash on normalized text (~5K rows/sec)
- Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)
Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.
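The exact-dedup stage can be sketched roughly as below. This is a minimal illustration, not EntropyGuard's actual code: the normalization rule (lowercase, collapse whitespace) is an assumption, and `hashlib.blake2b` stands in for xxHash so the example runs with only the standard library. The names `normalize` and `exact_dedup` are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text whose normalized form has not been seen before."""
    seen: set[bytes] = set()
    for text in texts:
        # blake2b is a stand-in for xxHash; any fast, well-distributed hash works here.
        digest = hashlib.blake2b(normalize(text).encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


rows = ["What's the weather?", "what's  the WEATHER?", "How's the weather?"]
unique = list(exact_dedup(rows))
print(unique)
```

The second row is dropped as an exact duplicate after normalization, but "How's the weather?" survives. Catching that pair is the job of the semantic stage.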
Tech Stack
- Polars LazyFrame: Lazy evaluation, processes data > RAM
- FAISS: Vector similarity search (IndexFlatL2)
- xxHash: Fast non-crypto hashing for exact duplicates
- sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
- Python 3.10-3.12: Full type hints, MyPy strict compatible
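One thing worth spelling out is how a similarity threshold like 0.95 relates to an L2 index. For unit-length vectors, squared L2 distance and cosine similarity are tied by d² = 2 - 2·cos, and FAISS's IndexFlatL2 reports squared distances. The sketch below assumes embeddings are L2-normalized and that the threshold is a cosine similarity. Whether EntropyGuard converts it exactly this way is an assumption, and the helper names are hypothetical.

```python
import math


def similarity_to_squared_l2(similarity: float) -> float:
    """Squared L2 distance between two unit vectors with the given cosine similarity."""
    return 2.0 - 2.0 * similarity


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; valid as a plain dot product only for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dim vectors standing in for 384-dim sentence embeddings.
a = normalize([1.0, 2.0, 3.0])
b = normalize([1.0, 2.0, 3.5])

max_dist = similarity_to_squared_l2(0.95)  # about 0.1 for the default threshold
is_duplicate = squared_l2(a, b) <= max_dist
print(f"cos={cosine(a, b):.4f}  d2={squared_l2(a, b):.4f}  duplicate={is_duplicate}")
```

Under these assumptions, raising the threshold to 0.98 shrinks the allowed squared distance to about 0.04, which is why a higher value is stricter.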
Installation
pip install entropyguard
Requires Python 3.10, 3.11, or 3.12. Python 3.13 not supported (missing FAISS wheels).
Usage
Basic run:
entropyguard --input data.jsonl --output clean.jsonl --text-column text
Strict mode (higher similarity threshold):
entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100
Unix pipe:
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
With audit log (for compliance):
entropyguard --input data.jsonl --output clean.jsonl --text-column text --audit-log audit.json
Checkpoint/resume (for large datasets) - available via config file (see Configuration File section):
{
"checkpoint_dir": "./checkpoints",
"resume": true
}
Benchmarks
Tested on 16GB RAM laptop, Python 3.11:
| Dataset Size | Time | Peak Memory | Duplicates Removed |
|---|---|---|---|
| 1K rows | ~2s | ~150MB | ~30% |
| 10K rows | ~15s | ~400MB | ~45% |
| 65K rows | ~2m | ~900MB | ~52% |
For comparison, a naive Pandas approach on the same 65K dataset:
- Ran out of memory (OOM) at ~40K rows (16GB RAM limit)
- Estimated ~15-20 minutes to finish if it hadn't crashed
Features
- Local-first: No data leaves your machine
- Resumable: Checkpoint system for fault tolerance
- Pipe-friendly: Works with stdin/stdout
- Memory-safe: Chunked processing, handles datasets > RAM
- Format support: JSONL, CSV, Parquet, Excel
- Exit codes: sysexits.h-inspired (0=success, 1=error, 2=usage error, 64 and up for specific failures; see Exit Codes)
Known Limitations
- CLI-only: No web UI, no API. It's a command-line tool.
- English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
- Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
- CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
- FAISS IndexFlatL2: Exact brute-force search, so duplicate detection is O(n²) overall. For 10M+ rows, consider approximate search (not implemented).
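To get a feel for where the flat index stops being practical, here is a back-of-the-envelope estimate. It assumes 384-dim float32 vectors stored uncompressed (which is what a flat index does) and treats n × n as an upper bound on distance computations. The function names are made up for this example.

```python
def flat_index_bytes(n_vectors: int, dim: int = 384, bytes_per_float: int = 4) -> int:
    """Raw vector storage for a flat (uncompressed) float32 index."""
    return n_vectors * dim * bytes_per_float


def pairwise_comparisons(n_vectors: int) -> int:
    """Upper bound on distance computations: every vector queried against all n."""
    return n_vectors * n_vectors


for n in (65_000, 1_000_000, 10_000_000):
    gb = flat_index_bytes(n) / 1e9
    print(f"{n:>10,} rows: {gb:6.2f} GB of vectors, "
          f"up to {pairwise_comparisons(n):.1e} distance computations")
```

At 10M rows the vectors alone take about 15 GB, which roughly matches the 16GB laptop used for the benchmarks, before counting the 10^14 distance computations.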
CLI Flags
Essential flags:
| Flag | Default | Description |
|---|---|---|
| --input | required | Input file path (or - for stdin) |
| --output | required | Output file path (or - for stdout) |
| --text-column | auto-detect | Column name containing text |
| --dedup-threshold | 0.95 | Similarity threshold (0.0-1.0, higher = stricter) |
| --min-length | 50 | Minimum text length after sanitization |
| --model-name | all-MiniLM-L6-v2 | Sentence-transformers model for embeddings |
| --required-columns | None | Comma-separated list of required columns (schema validation) |
| --audit-log | None | Path to JSON file for audit log of dropped/duplicate rows |
| --chunk-size | None | Chunk size (characters) for splitting long texts before embedding |
| --chunk-overlap | 50 | Overlap size (characters) between consecutive chunks |
| --separators | None | Custom separators for text chunking (space-separated list) |
| --profile-memory | false | Enable memory profiling during processing |
Note: --batch-size and checkpoint-related flags (--checkpoint-dir, --resume) are available via configuration file only. See Configuration File section below.
Full flag reference: entropyguard --help
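How --chunk-size and --chunk-overlap interact is easiest to see with a small example. The sketch below does plain fixed-width character windows and ignores --separators entirely. The real splitter probably prefers separator boundaries, so treat this as an illustration of the overlap arithmetic only. `chunk_text` is a hypothetical helper, not part of the package API.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(text, chunk_size=12, chunk_overlap=4)
for c in chunks:
    print(repr(c))
```

Each window starts 8 characters after the previous one (12 minus 4), so the last 4 characters of one chunk are the first 4 of the next.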
Configuration File
Create .entropyguardrc.json in your project root:
{
"text_column": "text",
"min_length": 100,
"dedup_threshold": 0.95,
"batch_size": 10000,
"checkpoint_dir": "./checkpoints",
"resume": false,
"auto_resume": true
}
CLI flags override config file values. Some options (like batch_size, checkpoint_dir, resume) are only available via configuration file.
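The precedence rule (defaults, then config file, then CLI flags) can be sketched like this. The defaults are taken from the CLI Flags table. `resolve_settings` and the convention that unset flags arrive as None are assumptions made for the example, not EntropyGuard's actual internals.

```python
import json
from typing import Any

# Defaults as listed in the CLI Flags table.
DEFAULTS: dict[str, Any] = {"dedup_threshold": 0.95, "min_length": 50, "chunk_overlap": 50}


def resolve_settings(config_text: str, cli_args: dict[str, Any]) -> dict[str, Any]:
    """Merge defaults, config-file values, and CLI flags, in increasing priority."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(config_text))
    # Flags the user did not pass are assumed to be None and must not override anything.
    settings.update({key: value for key, value in cli_args.items() if value is not None})
    return settings


config_text = '{"text_column": "text", "min_length": 100, "batch_size": 10000}'
settings = resolve_settings(config_text, {"min_length": 200, "dedup_threshold": None})
print(settings)
```

Here min_length ends up as 200 (the CLI wins over the file's 100), dedup_threshold stays at the default 0.95, and batch_size comes from the file only.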
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid args) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (bug) |
| 130 | Interrupted (Ctrl+C) |
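If you call the tool from a script, the table above maps directly onto a lookup. In this sketch the child process is a stand-in Python one-liner that exits with 65, so the example runs without EntropyGuard installed. In practice `argv` would be the entropyguard command line. `describe_exit` and `run_and_describe` are hypothetical helper names.

```python
import subprocess
import sys

# Meanings copied from the Exit Codes table.
EXIT_CODES: dict[int, str] = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning from the table."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_and_describe(argv: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code with a readable meaning."""
    code = subprocess.run(argv, check=False).returncode
    return code, describe_exit(code)


# Stand-in child process that simply exits with code 65.
code, meaning = run_and_describe([sys.executable, "-c", "raise SystemExit(65)"])
print(code, meaning)
```

A wrapper like this makes it easy to decide, for example, to abort a pipeline on 64-66 (bad data or paths) but file a bug report on 70.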
License
MIT License. See LICENSE file.
Links
- GitHub: https://github.com/DamianSiuta/entropyguard
- PyPI: https://pypi.org/project/entropyguard/
- Documentation: See ARCHITECTURE.md