High-Performance Semantic Deduplication Tool for RAG Pipelines
EntropyGuard
I built this because processing 100GB datasets with Pandas kept crashing my laptop (OOM). This is a CLI wrapper around Polars and FAISS to dedup text locally.
Problem
Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and require sending your data to a third party. Custom scripts run out of memory on large files.
Solution
Two-stage deduplication:
- Exact dedup: xxHash on normalized text (~5K rows/sec)
- Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)
Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.
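The exact stage can be sketched roughly as follows. This is an illustration only, not EntropyGuard's actual code: the normalization rules (lowercasing, collapsing whitespace) and the function names are assumptions, and the sketch falls back to `hashlib.blake2b` when the `xxhash` package is not installed.

```python
import hashlib
import re

try:
    import xxhash  # third-party; the hasher named in this README

    def digest(data: bytes) -> str:
        return xxhash.xxh64(data).hexdigest()
except ImportError:
    # Stand-in hasher so the sketch still runs without xxhash installed.
    def digest(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=8).hexdigest()


def normalize(text: str) -> str:
    """Assumed normalization: collapse whitespace runs, strip, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(rows):
    """Keep the first occurrence of each normalized text."""
    seen = set()
    kept = []
    for row in rows:
        key = digest(normalize(row).encode("utf-8"))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = ["What's the weather?", "what's  the WEATHER?", "How's the weather?"]
print(exact_dedup(rows))
```

The second row collapses into the first after normalization, while "How's the weather?" survives. That remaining pair is the kind of duplicate the semantic stage is meant to catch.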
Tech Stack
- Polars LazyFrame: Lazy evaluation, processes data > RAM
- FAISS: Vector similarity search (IndexFlatL2)
- xxHash: Fast non-crypto hashing for exact duplicates
- sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
- Python 3.10-3.12: Full type hints, MyPy strict compatible
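`IndexFlatL2` ranks neighbors by L2 distance, while the dedup threshold is expressed as a similarity, so the two need to be related somehow. One common way, assuming the embeddings are L2-normalized and the threshold is a cosine similarity (this README confirms neither), is the identity ‖a − b‖² = 2 − 2·cos(a, b). FAISS flat L2 indexes report squared distances, so under those assumptions a threshold of 0.95 would correspond to a squared distance of at most about 0.1. The helper names below are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_squared_l2(similarity_threshold):
    """Largest squared L2 distance that still meets the cosine threshold."""
    return 2.0 - 2.0 * similarity_threshold


a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([1.0, 2.0, 3.5])

# The two printed values agree up to floating-point rounding.
print(squared_l2(a, b), 2.0 - 2.0 * dot(a, b))
print(max_squared_l2(0.95))
```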
Installation
pip install entropyguard
Requires Python 3.10, 3.11, or 3.12. Python 3.13 is not supported (missing FAISS wheels).
Usage
Basic run:
entropyguard --input data.jsonl --output clean.jsonl --text-column text
Strict mode (higher similarity threshold):
entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100
Unix pipe:
cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
Checkpoint/resume (for large datasets):
entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints
# If it crashes, resume:
entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints --resume
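For intuition, resumable processing can be sketched as recording how many rows are done after each chunk and skipping that many on restart. Everything below is hypothetical: the `progress.json` file name, its format, and the function names are assumptions, not EntropyGuard's actual checkpoint layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_offset(checkpoint_dir: Path) -> int:
    """Return the number of rows already processed, or 0 without a checkpoint."""
    state_file = checkpoint_dir / "progress.json"  # hypothetical file name
    if not state_file.exists():
        return 0
    return json.loads(state_file.read_text())["rows_done"]


def save_offset(checkpoint_dir: Path, rows_done: int) -> None:
    """Write a temp file, then rename it, so a crash never leaves a partial file."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    tmp = checkpoint_dir / "progress.json.tmp"
    tmp.write_text(json.dumps({"rows_done": rows_done}))
    tmp.replace(checkpoint_dir / "progress.json")


def process(rows, checkpoint_dir: Path, chunk_size: int = 2,
            fail_at: Optional[int] = None):
    """Process rows in chunks, starting after the last checkpointed row."""
    out = []
    for i in range(load_offset(checkpoint_dir), len(rows), chunk_size):
        if fail_at is not None and i >= fail_at:
            raise RuntimeError("simulated crash")
        chunk = rows[i:i + chunk_size]
        out.extend(r.upper() for r in chunk)  # placeholder for the real work
        save_offset(checkpoint_dir, i + len(chunk))
    return out


with tempfile.TemporaryDirectory() as d:
    rows = ["a", "b", "c", "d", "e"]
    try:
        process(rows, Path(d), fail_at=2)  # crashes after the first chunk
    except RuntimeError:
        pass
    resumed = process(rows, Path(d))  # picks up at row 2
    print(resumed)
```

The resumed run only handles the rows after the checkpoint, so work done before the crash is not repeated.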
Benchmarks
Tested on 16GB RAM laptop, Python 3.11:
| Dataset Size | Time | Peak Memory | Duplicates Removed |
|---|---|---|---|
| 1K rows | ~2s | ~150MB | ~30% |
| 10K rows | ~15s | ~400MB | ~45% |
| 65K rows | ~2m | ~900MB | ~52% |
For comparison, a naive Pandas approach on the same 65K dataset:
- Ran out of memory at ~40K rows (16GB RAM limit)
- Estimated runtime had it finished: ~15-20 minutes
Features
- Local-first: No data leaves your machine
- Resumable: Checkpoint system for fault tolerance
- Pipe-friendly: Works with stdin/stdout
- Memory-safe: Chunked processing, handles datasets > RAM
- Format support: JSONL, CSV, Parquet, Excel
- Exit codes: sysexits.h-style (0 = success, 1 = general error, 2 = usage error, 64 and above for specific failures)
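The idea behind chunked, pipe-friendly processing can be shown with a small generator that never holds more than one batch in memory. This is a plain-Python stand-in for illustration, not the Polars-based code path EntropyGuard uses; `iter_batches` is a hypothetical name.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[list]:
    """Yield lists of parsed JSONL records, at most batch_size per list."""
    it = iter(lines)
    while True:
        raw = list(islice(it, batch_size))
        if not raw:
            return  # input exhausted
        # Skip blank lines; parse everything else as one JSON record per line.
        batch = [json.loads(line) for line in raw if line.strip()]
        if batch:
            yield batch


# A file object (or sys.stdin) is an iterable of lines; StringIO stands in here.
source = io.StringIO("\n".join(json.dumps({"text": f"row {i}"}) for i in range(5)))
sizes = [len(batch) for batch in iter_batches(source, batch_size=2)]
print(sizes)
```

Because the generator only pulls `batch_size` lines at a time, peak memory depends on the batch size rather than on the size of the input.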
Known Limitations
- CLI-only: No web UI, no API. It's a command-line tool.
- English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
- Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
- CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
- FAISS IndexFlatL2: O(n²) duplicate detection. For 10M+ rows, consider approximate search (not implemented).
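To see where the quadratic cost comes from, here is a sketch of greedy near-duplicate filtering with exhaustive comparison. It uses tiny hand-made vectors instead of real embeddings and plain Python instead of FAISS; the function names are illustrative, and EntropyGuard's actual loop may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_dedup(vectors, threshold=0.95):
    """Return (indices of kept vectors, number of pairwise comparisons)."""
    kept = []
    comparisons = 0
    for i, vec in enumerate(vectors):
        is_duplicate = False
        for j in kept:  # compare against every vector kept so far
            comparisons += 1
            if cosine(vec, vectors[j]) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(i)
    return kept, comparisons


vectors = [
    [1.0, 0.0],
    [0.99, 0.05],  # nearly parallel to vector 0
    [0.0, 1.0],
    [0.05, 0.99],  # nearly parallel to vector 2
]
print(greedy_dedup(vectors))
```

When nothing is a duplicate, every new row is compared against all earlier rows, which is n(n-1)/2 comparisons for n rows. That growth is why approximate search becomes attractive at the 10M+ scale.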
CLI Flags
Essential flags:
| Flag | Default | Description |
|---|---|---|
| --input | stdin | Input file path (or - for stdin) |
| --output | stdout | Output file path (or - for stdout) |
| --text-column | auto-detect | Column name containing text |
| --dedup-threshold | 0.95 | Similarity threshold (0.0-1.0, higher = stricter) |
| --min-length | 50 | Minimum text length after sanitization |
| --batch-size | 10000 | Embedding batch size (reduce if OOM) |
| --checkpoint-dir | None | Directory for checkpoint files |
| --resume | false | Resume from last checkpoint |
Full flag reference: entropyguard --help
Configuration File
Create .entropyguardrc.json in your project root:
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000
}
CLI flags override config file values.
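The precedence rule (defaults, then config file, then CLI flags) can be sketched with `argparse` and `json`. This is an assumed illustration of the layering, not EntropyGuard's actual loader; `resolve_settings` is a hypothetical name, and the defaults are taken from the flag table above.

```python
import argparse
import json

# Defaults as listed in the CLI flag table; None stands for "auto-detect".
DEFAULTS = {
    "text_column": None,
    "min_length": 50,
    "dedup_threshold": 0.95,
    "batch_size": 10000,
}


def resolve_settings(argv, config_text=None):
    """Layer settings: defaults < config file < explicitly passed CLI flags."""
    parser = argparse.ArgumentParser()
    # No argparse defaults, so an omitted flag shows up as None.
    parser.add_argument("--text-column")
    parser.add_argument("--min-length", type=int)
    parser.add_argument("--dedup-threshold", type=float)
    parser.add_argument("--batch-size", type=int)
    args = parser.parse_args(argv)

    settings = dict(DEFAULTS)
    if config_text:
        settings.update(json.loads(config_text))
    settings.update({k: v for k, v in vars(args).items() if v is not None})
    return settings


config = '{"min_length": 100, "dedup_threshold": 0.95}'
print(resolve_settings(["--dedup-threshold", "0.98"], config))
```

Here `min_length` comes from the config file, `dedup_threshold` from the command line, and `batch_size` from the defaults.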
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid args) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (bug) |
| 130 | Interrupted (Ctrl+C) |
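When calling the tool from a script, the codes in the table above can be mapped to names before deciding how to react. The sketch below is one way to do that; the enum member names are made up for illustration, and `run_entropyguard` assumes the `entropyguard` command is on your PATH.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above; member names are illustrative."""
    SUCCESS = 0
    GENERAL_ERROR = 1
    USAGE_ERROR = 2
    DATA_FORMAT_ERROR = 64
    INPUT_FILE_ERROR = 65
    OUTPUT_FILE_ERROR = 66
    SOFTWARE_ERROR = 70
    INTERRUPTED = 130


def describe(returncode: int) -> str:
    """Return a symbolic name for a return code, or UNKNOWN(n) if unlisted."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def run_entropyguard(args):
    """Run the CLI with the given arguments and describe its exit code."""
    completed = subprocess.run(["entropyguard", *args])
    return describe(completed.returncode)


print(describe(65))
```

A wrapper script might retry on `INTERRUPTED`, fix paths on `INPUT_FILE_ERROR`, and file a bug report on `SOFTWARE_ERROR`.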
License
MIT License. See LICENSE file.
Links
- GitHub: https://github.com/DamianSiuta/entropyguard
- PyPI: https://pypi.org/project/entropyguard/
- Documentation: See PROJECT_COMPREHENSIVE_DOCUMENTATION.md