High-Performance Semantic Deduplication Tool for RAG Pipelines

EntropyGuard

I built this because processing 100GB datasets with Pandas kept crashing my laptop with out-of-memory (OOM) errors. EntropyGuard is a CLI wrapper around Polars and FAISS that deduplicates text locally.

Problem

Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and require sending your data to a third party. Custom scripts run out of memory on large files.

Solution

Two-stage deduplication:

  1. Exact dedup: xxHash on normalized text (~5K rows/sec)
  2. Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)

Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.
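A minimal sketch of the first stage, to show what "hash on normalized text" means in practice. The normalization rule (lowercase, trim, collapse whitespace) is an assumption, and hashlib.blake2b stands in for xxHash so the example needs only the standard library. The function names are hypothetical, not EntropyGuard's API.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each normalized text, one row at a time."""
    seen: set[bytes] = set()
    for text in texts:
        # blake2b is a stand-in for xxHash; any fast 64-bit hash fills this role.
        digest = hashlib.blake2b(
            normalize(text).encode("utf-8"), digest_size=8
        ).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


rows = ["What's the weather?", "what's  the weather? ", "How's the weather?"]
unique_rows = list(exact_dedup(rows))
```

The second row collapses into the first after normalization. The third survives, which is exactly the kind of near-duplicate the semantic stage is meant to catch.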

Tech Stack

  • Polars LazyFrame: Lazy evaluation, processes data > RAM
  • FAISS: Vector similarity search (IndexFlatL2)
  • xxHash: Fast non-crypto hashing for exact duplicates
  • sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
  • Python 3.10-3.12: Full type hints, compatible with mypy --strict
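The index listed above is L2-based, while --dedup-threshold reads like a similarity score. The sketch below shows one way the two could be connected, assuming the embeddings are unit-normalized and the threshold is a cosine similarity (the README does not state either). For unit vectors, squared L2 distance equals 2 - 2·cos, and IndexFlatL2 reports squared distances. Plain Python lists stand in for FAISS and for real 384-dim embeddings; all function names are hypothetical.

```python
import math

Vector = list[float]


def unit(v: Vector) -> Vector:
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, the quantity a flat L2 index compares."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_squared_l2(cosine_threshold: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 2.0 - 2.0 * cosine_threshold


def semantic_dedup(vectors: list[Vector], threshold: float = 0.95) -> list[int]:
    """Greedy pass: keep a vector only if it is far from everything kept so far."""
    limit = max_squared_l2(threshold)
    unit_vectors = [unit(v) for v in vectors]
    kept: list[int] = []
    for i, v in enumerate(unit_vectors):
        if all(squared_l2(v, unit_vectors[j]) > limit for j in kept):
            kept.append(i)
    return kept


# Toy 2-dim "embeddings": the second is nearly parallel to the first.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = semantic_dedup(toy)
```

Under these assumptions a threshold of 0.95 corresponds to a squared L2 distance of about 0.1, so a stricter threshold like 0.98 shrinks the radius within which rows count as duplicates.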

Installation

pip install entropyguard

Requires Python 3.10, 3.11, or 3.12. Python 3.13 is not supported (FAISS wheels are missing).

Usage

Basic run:

entropyguard --input data.jsonl --output clean.jsonl --text-column text

Strict mode (higher similarity threshold):

entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100

Unix pipe:

cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl

Checkpoint/resume (for large datasets):

entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints
# If it crashes, resume:
entropyguard --input large.jsonl --output clean.jsonl --checkpoint-dir ./checkpoints --resume
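
To try the commands above on a small file, something like this writes a data.jsonl with one JSON object per line and a text field matching --text-column text. The records and the id field are made up for illustration. The texts are deliberately longer than 50 characters, on the assumption that the default --min-length would otherwise drop them.

```python
import json
from pathlib import Path

# Made-up sample rows; only the "text" field matters for --text-column text.
records = [
    {"id": 1, "text": "What's the weather going to be like in Berlin tomorrow afternoon?"},
    {"id": 2, "text": "How's the weather going to be in Berlin tomorrow in the afternoon?"},
    {"id": 3, "text": "What's the weather going to be like in Berlin tomorrow afternoon?"},
]

path = Path("data.jsonl")
with path.open("w", encoding="utf-8") as handle:
    for record in records:
        # JSONL: exactly one JSON document per line, newline-terminated.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

lines = path.read_text(encoding="utf-8").splitlines()
```

Rows 1 and 3 are exact duplicates and row 2 is a paraphrase, so the file exercises both dedup stages.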

Benchmarks

Tested on 16GB RAM laptop, Python 3.11:

Dataset Size    Time    Peak Memory    Duplicates Removed
1K rows         ~2s     ~150MB         ~30%
10K rows        ~15s    ~400MB         ~45%
65K rows        ~2m     ~900MB         ~52%

For comparison, a naive Pandas approach on the same 65K dataset:

  • Runs out of memory at ~40K rows (16GB RAM limit)
  • Estimated ~15-20 minutes if it had finished

Features

  • Local-first: No data leaves your machine
  • Resumable: Checkpoint system for fault tolerance
  • Pipe-friendly: Works with stdin/stdout
  • Memory-safe: Chunked processing, handles datasets > RAM
  • Format support: JSONL, CSV, Parquet, Excel
  • Exit codes: Distinct codes per failure class, in the style of sysexits.h (0 = success, 1 = general error, 2 = usage error; full list under Exit Codes)

Known Limitations

  • CLI-only: No web UI, no API. It's a command-line tool.
  • English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
  • Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
  • CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
  • FAISS IndexFlatL2: Exact brute-force search, so duplicate detection is O(n²) in the number of rows. For 10M+ rows, consider approximate search (not implemented).
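
A back-of-the-envelope estimate of why the flat index stops scaling. It assumes 384-dim float32 embeddings (the default model) held in memory uncompressed, and treats every row as one query against the full index, which is an upper bound rather than a measurement of what EntropyGuard actually does.

```python
EMBEDDING_DIM = 384       # all-MiniLM-L6-v2, per the Tech Stack section
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(n_rows: int, dim: int = EMBEDDING_DIM) -> int:
    """Raw vector storage for a flat (uncompressed) float32 index."""
    return n_rows * dim * BYTES_PER_FLOAT32


def brute_force_distances(n_rows: int) -> int:
    """Distance computations if every row is searched against all rows."""
    return n_rows * n_rows


for n in (65_000, 10_000_000):
    gib = flat_index_bytes(n) / 2**30
    print(
        f"{n:>12,} rows: {gib:8.2f} GiB of vectors, "
        f"{brute_force_distances(n):.2e} distance computations"
    )
```

Going from 65K to 10M rows multiplies the vector storage by about 154 and the distance computations by about 23,700, which is why approximate search becomes attractive at that scale.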

CLI Flags

Essential flags:

Flag                 Default        Description
--input              stdin          Input file path (or - for stdin)
--output             stdout         Output file path (or - for stdout)
--text-column        auto-detect    Column name containing text
--dedup-threshold    0.95           Similarity threshold (0.0-1.0, higher = stricter)
--min-length         50             Minimum text length after sanitization
--batch-size         10000          Embedding batch size (reduce if OOM)
--checkpoint-dir     None           Directory for checkpoint files
--resume             false          Resume from last checkpoint

Full flag reference: entropyguard --help

Configuration File

Create .entropyguardrc.json in your project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000
}

CLI flags override config file values.
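
The precedence rule can be pictured as three layers merged in order: built-in defaults, then the config file, then any flags passed on the command line. This is an illustration of the rule, not EntropyGuard's actual loader. resolve_settings is a hypothetical helper, and the defaults are copied from the flag table.

```python
import json
from typing import Any

# Defaults as listed in the CLI Flags table.
DEFAULTS: dict[str, Any] = {
    "text_column": None,   # None stands for auto-detect
    "min_length": 50,
    "dedup_threshold": 0.95,
    "batch_size": 10000,
}


def resolve_settings(config_text: str, cli_overrides: dict[str, Any]) -> dict[str, Any]:
    """Merge defaults < config file < CLI flags; later layers win."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(config_text))
    # A value of None means the flag was not given on the command line.
    settings.update({k: v for k, v in cli_overrides.items() if v is not None})
    return settings


config_text = """
{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000
}
"""
settings = resolve_settings(config_text, {"dedup_threshold": 0.98, "min_length": None})
```

Here min_length comes from the config file (100), while dedup_threshold comes from the command line (0.98).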

Exit Codes

Code    Meaning
0       Success
1       General error
2       Usage error (invalid args)
64      Data format error
65      Input file error
66      Output file error
70      Software error (bug)
130     Interrupted (Ctrl+C)
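
If you call the tool from a script, the table above can be turned into a small lookup so failures are reported by meaning rather than by number. This is a sketch: describe_exit and run_and_report are hypothetical helpers, and a stand-in Python command is used so the example runs without EntropyGuard installed.

```python
import subprocess
import sys

# Mirrors the exit code table above.
EXIT_MEANINGS = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning listed in the table."""
    return EXIT_MEANINGS.get(code, f"Unknown exit code {code}")


def run_and_report(command: list[str]) -> int:
    """Run a command, print what its exit code means, and return the code."""
    completed = subprocess.run(command, check=False)
    print(describe_exit(completed.returncode), file=sys.stderr)
    return completed.returncode


# Stand-in for an entropyguard invocation: a child process that exits with 2.
status = run_and_report([sys.executable, "-c", "raise SystemExit(2)"])
```

In a real pipeline the command list would be the entropyguard invocation itself, for example ["entropyguard", "--input", "data.jsonl", "--output", "clean.jsonl"].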

License

MIT License. See LICENSE file.
