High-Performance Semantic Deduplication Tool for RAG Pipelines

EntropyGuard

I built this because processing 100GB datasets with Pandas kept crashing my laptop (OOM). This is a CLI wrapper around Polars and FAISS to dedup text locally.

Problem

Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and leak data. Custom scripts OOM on large files.

Solution

Two-stage deduplication:

Exact dedup: xxHash on normalized text (~5K rows/sec)
Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)

Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.
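
A minimal sketch of the exact-dedup stage, to show why "normalized text" matters before hashing. The normalization rules here (NFKC, lowercase, collapsed whitespace) are an assumption, not necessarily what EntropyGuard does, and `hashlib.blake2b` stands in for xxHash so the example needs only the standard library. The names `normalize` and `exact_dedup` are illustrative.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC, lowercase, and collapse runs of whitespace (assumed rules)."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return " ".join(folded.split())


def exact_dedup(rows: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized text, preserving input order."""
    seen: set[bytes] = set()
    kept: list[str] = []
    for row in rows:
        # blake2b is a stand-in for xxHash so this sketch has no third-party imports.
        digest = hashlib.blake2b(normalize(row).encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


rows = ["What's the weather?", "what's   the WEATHER?", "How's the weather?"]
print(exact_dedup(rows))
```

The second row collapses into the first after normalization. The third row survives this stage, which is exactly the kind of near-duplicate the semantic stage is meant to catch.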

Tech Stack

Polars LazyFrame: Lazy evaluation, processes data > RAM
FAISS: Vector similarity search (IndexFlatL2)
xxHash: Fast non-crypto hashing for exact duplicates
sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
Python 3.10-3.12: Full type hints, compatible with mypy --strict
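
One thing worth spelling out: FAISS's IndexFlatL2 reports squared L2 distances, while --dedup-threshold is a similarity between 0.0 and 1.0. If the embeddings are unit-normalized and the threshold is read as cosine similarity (both are assumptions here, not confirmed details of EntropyGuard), the two are linked by ||a - b||² = 2 - 2·cos(a, b). A small sketch of that conversion, with illustrative helper names:

```python
import math


def similarity_to_squared_l2(similarity: float) -> float:
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 2.0 - 2.0 * similarity


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two unit vectors separated by an angle of 0.3 radians.
a = [1.0, 0.0]
b = [math.cos(0.3), math.sin(0.3)]

# Both sides of the identity, computed independently.
print(squared_l2(a, b), similarity_to_squared_l2(dot(a, b)))

# Under these assumptions, a 0.95 similarity cutoff is a squared distance of about 0.1.
print(similarity_to_squared_l2(0.95))
```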

Installation

pip install entropyguard

Requires Python 3.10, 3.11, or 3.12. Python 3.13 not supported (missing FAISS wheels).

Usage

Basic run:

entropyguard --input data.jsonl --output clean.jsonl --text-column text

Strict mode (higher similarity threshold):

entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100

Unix pipe:

cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl

Checkpoint/resume (for large datasets):

entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints
# If it crashes, resume:
entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints --resume
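
If you drive the CLI from a script, the checkpoint/resume pair above can be wrapped roughly like this. It is a sketch only: `build_command` and `run_with_one_resume` are hypothetical helper names, only flags shown in this README are used, and the "retry once with --resume unless it was a usage error" policy is my assumption rather than documented behavior.

```python
import subprocess


def build_command(input_path: str, output_path: str, checkpoint_dir: str,
                  resume: bool = False) -> list[str]:
    """Assemble the entropyguard argv using only flags listed in this README."""
    cmd = [
        "entropyguard",
        "--input", input_path,
        "--output", output_path,
        "--checkpoint-dir", checkpoint_dir,
    ]
    if resume:
        cmd.append("--resume")
    return cmd


def run_with_one_resume(input_path: str, output_path: str, checkpoint_dir: str) -> int:
    """Run once; on a failure other than a usage error, retry once with --resume."""
    first = subprocess.run(build_command(input_path, output_path, checkpoint_dir))
    if first.returncode in (0, 2):  # success, or a usage error a retry cannot fix
        return first.returncode
    second = subprocess.run(
        build_command(input_path, output_path, checkpoint_dir, resume=True)
    )
    return second.returncode


# Only builds the argument list; nothing is executed here.
print(build_command("large.jsonl", "clean.jsonl", "./checkpoints", resume=True))
```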

Benchmarks

Tested on 16GB RAM laptop, Python 3.11:

Dataset Size	Time	Peak Memory	Duplicates Removed
1K rows	~2s	~150MB	~30%
10K rows	~15s	~400MB	~45%
65K rows	~2m	~900MB	~52%

For comparison, a naive Pandas approach on the same 65K dataset:

OOM at ~40K rows (16GB RAM limit)
Estimated ~15-20 minutes if it had not crashed
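
As a sanity check, the end-to-end throughput implied by the benchmark table can be worked out directly. The row counts and times are the approximate figures from the table (with ~2m taken as 120 seconds); `throughput` is just an illustrative helper.

```python
def throughput(rows: int, seconds: float) -> float:
    """Rows processed per second for one benchmark run."""
    return rows / seconds


# (rows, seconds) taken from the benchmark table; "~2m" is read as 120 s.
benchmarks = {"1K": (1_000, 2), "10K": (10_000, 15), "65K": (65_000, 120)}

for label, (rows, seconds) in benchmarks.items():
    print(f"{label}: ~{throughput(rows, seconds):.0f} rows/sec")
```

That works out to roughly 500, 667, and 542 rows/sec, which sits inside the ~500-1000 rows/sec quoted for the semantic stage.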

Features

Local-first: No data leaves your machine
Resumable: Checkpoint system for fault tolerance
Pipe-friendly: Works with stdin/stdout
Memory-safe: Chunked processing, handles datasets > RAM
Format support: JSONL, CSV, Parquet, Excel
Exit codes: sysexits.h-style (0=success, 1=error, 2=usage error; see Exit Codes for the full list)
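
To illustrate what "chunked processing" means in practice, here is a generic batching sketch. It is not EntropyGuard's internal code; `batched` is an illustrative helper, and the point is only that a fixed number of rows is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    # islice pulls only the next batch_size items, so the source is never fully loaded.
    while batch := list(islice(iterator, batch_size)):
        yield batch


for batch in batched(["a", "b", "c", "d", "e"], batch_size=2):
    print(batch)
```

Lowering the batch size trades speed for a smaller peak footprint, which is the same idea behind reducing --batch-size when you hit OOM.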

Known Limitations

CLI-only: No web UI, no API. It's a command-line tool.
English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
Slower than hash-only: Semantic dedup (~500-1000 rows/sec) is roughly 5-10x slower than exact hash dedup (~5K rows/sec). That is the trade-off for catching near-duplicates.
CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
FAISS IndexFlatL2: Exact brute-force search, so duplicate detection is O(n²) overall. For 10M+ rows you would want approximate search, which is not implemented.
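
To make the O(n²) point concrete, here is a pure-Python sketch of greedy threshold dedup over toy 2-D vectors. It does not use FAISS and is not the tool's actual algorithm; `cosine` and `greedy_semantic_dedup` are illustrative names, and treating the threshold as cosine similarity is an assumption.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def greedy_semantic_dedup(vectors: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Return indices of rows kept: a row is dropped if it is too similar to a kept row."""
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        # Each row is compared against every kept row, hence quadratic worst case.
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(greedy_semantic_dedup(vectors))

# Worst-case pairwise comparisons for n rows is n * (n - 1) / 2.
n = 10_000_000
print(n * (n - 1) // 2)
```

At 10M rows the worst case is about 5 × 10¹³ comparisons, which is why brute-force search stops being practical at that scale.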

CLI Flags

Essential flags:

Flag	Default	Description
`--input`	stdin	Input file path (or `-` for stdin)
`--output`	stdout	Output file path (or `-` for stdout)
`--text-column`	auto-detect	Column name containing text
`--dedup-threshold`	0.95	Similarity threshold (0.0-1.0, higher = stricter)
`--min-length`	50	Minimum text length after sanitization
`--batch-size`	10000	Embedding batch size (reduce if OOM)
`--checkpoint-dir`	None	Directory for checkpoint files
`--resume`	false	Resume from last checkpoint

Full flag reference: entropyguard --help

Configuration File

Create .entropyguardrc.json in your project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000
}

CLI flags override config file values.
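
The precedence can be pictured as a three-layer merge: built-in defaults, then the config file, then any flags given explicitly. This is a sketch of that rule, not the tool's real loader. The default values come from the CLI Flags table, `None` stands in for auto-detect, and `resolve_settings` is a hypothetical helper.

```python
import json

# Defaults as listed in the CLI Flags table; None stands in for "auto-detect".
DEFAULTS = {
    "text_column": None,
    "min_length": 50,
    "dedup_threshold": 0.95,
    "batch_size": 10000,
}


def resolve_settings(config_text: str, cli_overrides: dict) -> dict:
    """Merge defaults, config file values, and explicitly set CLI flags, in that order."""
    file_values = json.loads(config_text)
    # A flag left unset (None) must not clobber a value from the config file.
    explicit_cli = {key: value for key, value in cli_overrides.items() if value is not None}
    return {**DEFAULTS, **file_values, **explicit_cli}


config_text = '{"text_column": "text", "min_length": 100, "dedup_threshold": 0.95, "batch_size": 10000}'
settings = resolve_settings(config_text, {"dedup_threshold": 0.98, "min_length": None})
print(settings)
```

Here min_length stays at 100 from the file, while dedup_threshold becomes 0.98 because it was passed on the command line.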

Exit Codes

Code	Meaning
0	Success
1	General error
2	Usage error (invalid args)
64	Data format error
65	Input file error
66	Output file error
70	Software error (bug)
130	Interrupted (Ctrl+C)
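
If you call EntropyGuard from a script, a lookup like the following can turn a return code into a readable message. The mapping is copied from the table above; `describe_exit` is an illustrative helper, not part of the package.

```python
import signal

# Exit codes exactly as listed in the Exit Codes table.
EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}

# 130 follows the common shell convention of 128 + signal number (SIGINT is 2).
assert 128 + int(signal.SIGINT) == 130


def describe_exit(code: int) -> str:
    """Translate a process return code into the meaning given in the table."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


for code in (0, 65, 130, 99):
    print(code, describe_exit(code))
```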

License

MIT License. See LICENSE file.
