Enterprise-grade semantic data deduplication and sanitization engine.

These details have not been verified by PyPI

Project description

EntropyGuard

I built this because processing 100GB datasets with Pandas kept crashing my laptop (OOM). This is a CLI wrapper around Polars and FAISS to dedup text locally.

Problem

Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and leak data. Custom scripts OOM on large files.

Solution

Two-stage deduplication:

Exact dedup: xxHash on normalized text (~5K rows/sec)
Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)

Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.
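
The exact stage can be pictured as hash-and-skip over a stream of rows. The sketch below is an illustration under stated assumptions, not EntropyGuard's code: `hashlib.blake2b` from the standard library stands in for xxHash, and the normalization rules (lowercase, collapse whitespace) and the function names are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_dedup(rows: Iterable[str]) -> Iterator[str]:
    """Yield each row whose normalized form has not been seen before."""
    seen: set[bytes] = set()
    for row in rows:
        # blake2b is a stand-in for xxHash; only the 8-byte digest is stored.
        digest = hashlib.blake2b(
            normalize(row).encode("utf-8"), digest_size=8
        ).digest()
        if digest not in seen:
            seen.add(digest)
            yield row


rows = ["What's the weather?", "what's  the WEATHER?", "How's the weather?"]
kept = list(exact_dedup(rows))
print(kept)
```

The second row collapses onto the first after normalization and is dropped. The third row survives this stage, which is the gap the semantic stage is meant to close.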

Tech Stack

Polars LazyFrame: Lazy evaluation, processes data > RAM
FAISS: Vector similarity search (IndexFlatL2)
xxHash: Fast non-crypto hashing for exact duplicates
sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
Python 3.10-3.12: Full type hints, compatible with MyPy strict mode
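
IndexFlatL2 ranks neighbours by squared Euclidean distance, while the CLI threshold is a similarity between 0 and 1. If the embeddings are L2-normalized (an assumption; the source does not say how EntropyGuard maps one to the other), the two are related by d² = 2 - 2·cos. A small pure-Python check of that identity, with hypothetical helper names:

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance, the quantity a flat L2 index reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_to_squared_l2(similarity: float) -> float:
    """Convert a cosine threshold into a squared-L2 cutoff for unit vectors."""
    return 2.0 - 2.0 * similarity


a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([1.0, 2.0, 3.5])

# The identity holds up to floating-point error for unit vectors.
identity_holds = math.isclose(
    squared_l2(a, b), 2.0 - 2.0 * cosine(a, b), abs_tol=1e-12
)
max_distance = similarity_to_squared_l2(0.95)
print(identity_holds, round(max_distance, 6))
```

Under this assumption, the default threshold of 0.95 corresponds to a squared distance of at most 0.1.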

Installation

pip install entropyguard

For PDF support (optional):

pip install entropyguard[pdf]

Requires Python 3.10, 3.11, or 3.12. Python 3.13 is not supported because FAISS wheels are missing for it.

Usage

Basic run:

entropyguard --input data.jsonl --output clean.jsonl --text-column text
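
For reference, a JSONL input is one JSON object per line, with the text in the column named by --text-column. A minimal sketch for producing such a file; the sample rows and the `write_jsonl` helper are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

records = [
    {"text": "What's the weather?"},
    {"text": "How's the weather?"},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    write_jsonl(path, records)
    # Read the file back to confirm every line is a standalone JSON object.
    loaded = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
    ]

print(loaded)
```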

Strict mode (higher similarity threshold):

entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100

Unix pipe:

cat data.jsonl | entropyguard --input - --output - --dedup-threshold 0.95 > clean.jsonl

With audit log (for compliance):

entropyguard --input data.jsonl --output clean.jsonl --text-column text --audit-log audit.json

PDF directory (requires entropyguard[pdf]):

entropyguard --input ./my_pdfs_folder --output clean.jsonl --text-column text

Checkpoint/resume (for large datasets) - available via config file (see Configuration File section):

{
  "checkpoint_dir": "./checkpoints",
  "resume": true
}

Benchmarks

Tested on a laptop with 16GB RAM, Python 3.11:

Dataset Size	Time	Peak Memory	Duplicates Removed
1K rows	~2s	~150MB	~30%
10K rows	~15s	~400MB	~45%
65K rows	~2m	~900MB	~52%
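
The timings above can be converted to end-to-end throughput for comparison with the per-stage figures. This is only arithmetic on the rounded numbers in the table (with ~2m taken as 120 seconds), so the results are approximate:

```python
# Rows processed -> approximate wall-clock seconds, from the table above.
benchmarks = {1_000: 2, 10_000: 15, 65_000: 120}

throughput = {rows: rows / seconds for rows, seconds in benchmarks.items()}

for rows, rate in throughput.items():
    print(f"{rows:>6} rows: ~{rate:.0f} rows/sec")
```

All three runs work out to roughly 500-700 rows/sec, which sits inside the ~500-1000 rows/sec range quoted for the semantic stage.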

For comparison, a naive Pandas approach on the same 65K dataset:

Runs out of memory (OOM) at ~40K rows (16GB RAM limit)
Estimated to take ~15-20 minutes if it did not crash

Features

Local-first: No data leaves your machine
Resumable: Checkpoint system for fault tolerance
Pipe-friendly: Works with stdin/stdout
Memory-safe: Chunked processing, handles datasets > RAM
Format support: JSONL, CSV, Parquet, Excel, PDF directories (optional)
Exit codes: A distinct code per failure class, in the style of sysexits.h (0=success, 1=error, 2=usage error, etc.)

PDF Support (Optional)

EntropyGuard can process PDF directories directly when installed with the pdf extra:

pip install entropyguard[pdf]
entropyguard --input ./pdf_folder --output clean.jsonl

Features:

Recursively scans directories for PDF files
Converts PDFs to Markdown (preserves table structure via IBM's docling)
Memory-safe: processes PDFs one at a time using generators
Includes source filename metadata (source_file column)
Graceful error handling: skips corrupted PDFs and continues processing

Output format: PDF directories are converted to JSONL with text and source_file columns. The text column contains the extracted Markdown content.
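
The one-at-a-time behaviour described above can be sketched with a generator. This is an illustration only: `iter_pdf_records` is a hypothetical name, the `convert` callable stands in for the real PDF-to-Markdown step (docling is not called here), and using the bare filename for source_file is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def iter_pdf_records(
    root: Path, convert: Callable[[Path], str]
) -> Iterator[dict[str, str]]:
    """Yield one record per readable PDF, holding one file in memory at a time."""
    for pdf_path in sorted(root.rglob("*.pdf")):
        try:
            text = convert(pdf_path)
        except Exception:
            # Skip files that fail to convert and keep going.
            continue
        yield {"text": text, "source_file": pdf_path.name}


def fake_convert(path: Path) -> str:
    """Placeholder converter used only for this demo."""
    if path.stem == "broken":
        raise ValueError("unreadable PDF")
    return f"# {path.stem}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.pdf", "broken.pdf", "nested/b.pdf"):
        (root / name).write_bytes(b"%PDF-")
    lines = [json.dumps(record) for record in iter_pdf_records(root, fake_convert)]

print("\n".join(lines))
```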

Known Limitations

CLI-only: No web UI, no API. It's a command-line tool.
English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
FAISS IndexFlatL2: Exact brute-force search, so duplicate detection is O(n²) in the number of rows. For 10M+ rows, approximate search would be the better fit, but it is not implemented.
PDF support: Requires optional pdf extra and IBM's docling library
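
To make the O(n²) limitation concrete, here is a brute-force sketch of threshold-based dedup that counts its own comparisons, plus the worst-case pair count and the raw float32 storage for 384-dim vectors. The greedy keep-first strategy and all function names are assumptions for illustration; EntropyGuard's actual search runs through FAISS.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def greedy_semantic_dedup(
    vectors: list[list[float]], threshold: float = 0.95
) -> tuple[list[int], int]:
    """Keep a row only if no already-kept row is at least `threshold` similar."""
    kept: list[int] = []
    comparisons = 0
    for i, vec in enumerate(vectors):
        duplicate = False
        for j in kept:
            comparisons += 1
            if cosine(vec, vectors[j]) >= threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(i)
    return kept, comparisons


def pair_count(n: int) -> int:
    """Worst-case number of pairwise comparisons for n rows."""
    return n * (n - 1) // 2


def flat_index_bytes(n: int, dim: int = 384) -> int:
    """Raw storage for n float32 vectors of `dim` dimensions."""
    return n * dim * 4


kept, comparisons = greedy_semantic_dedup([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(kept, comparisons)
print(pair_count(65_000), pair_count(10_000_000))
print(flat_index_bytes(10_000_000) / 1e9, "GB")
```

At 10M rows the worst case is about 5 × 10¹³ comparisons, and the uncompressed vectors alone need roughly 15 GB, which is why an exact flat index stops being practical at that scale.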

CLI Flags

Essential flags:

Flag	Default	Description
`--input`	required	Input file path (CSV, JSONL, Parquet, Excel) or directory (PDF files). Use `-` for stdin
`--output`	required	Output file path (or `-` for stdout)
`--text-column`	auto-detect	Column name containing text (defaults to 'text' for PDF directories)
`--dedup-threshold`	0.95	Similarity threshold (0.0-1.0). Higher values are stricter: only closer matches are treated as duplicates
`--min-length`	50	Minimum text length after sanitization
`--model-name`	all-MiniLM-L6-v2	Sentence-transformers model for embeddings
`--required-columns`	None	Comma-separated list of required columns (schema validation)
`--audit-log`	None	Path to JSON file for audit log of dropped/duplicate rows
`--chunk-size`	None	Chunk size (characters) for splitting long texts before embedding
`--chunk-overlap`	50	Overlap size (characters) between consecutive chunks
`--separators`	None	Custom separators for text chunking (space-separated list)
`--profile-memory`	false	Enable memory profiling during processing
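
As a rough picture of what --chunk-size and --chunk-overlap control, the sketch below splits text into fixed-width character windows that share `chunk_overlap` characters with their predecessor. It is a simplified assumption: it ignores --separators entirely, and `chunk_text` is a hypothetical helper, not the tool's splitter.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` characters with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks repeat a little context across the boundary.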

Note: batch_size and the checkpoint options (checkpoint_dir, resume) have no CLI flags; set them in the configuration file. See Configuration File section below.

Full flag reference: entropyguard --help

Configuration File

Create .entropyguardrc.json in your project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000,
  "checkpoint_dir": "./checkpoints",
  "resume": false,
  "auto_resume": true
}

CLI flags override config file values. Some options (like batch_size, checkpoint_dir, resume) are only available via configuration file.
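
The precedence rule can be sketched as a layered merge: defaults first, then the config file, then any CLI flag that was actually passed. The defaults below come from the flag table; `resolve_settings` and the idea that unset flags arrive as None are assumptions for illustration.

```python
import json

# Defaults taken from the CLI flag table.
DEFAULTS = {"dedup_threshold": 0.95, "min_length": 50}


def resolve_settings(config_text: str, cli_args: dict) -> dict:
    """Merge defaults, config file values, and explicitly passed CLI flags."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(config_text))
    # Only flags the user actually supplied (not None) override the file.
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings


config_text = '{"min_length": 100, "batch_size": 10000}'
cli_args = {"dedup_threshold": 0.98, "min_length": None}

settings = resolve_settings(config_text, cli_args)
print(settings)
```

Here min_length comes from the file, dedup_threshold from the command line, and batch_size can only come from the file.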

Exit Codes

Code	Meaning
0	Success
1	General error
2	Usage error (invalid args)
64	Data format error
65	Input file error
66	Output file error
70	Software error (bug)
130	Interrupted (Ctrl+C)
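
In a pipeline script, these codes can be turned back into readable messages. A minimal sketch, assuming only the table above; `describe_exit` and `run_and_describe` are hypothetical helpers, and the demo launches a stand-in Python process rather than entropyguard itself.

```python
import subprocess
import sys

# Meanings copied from the exit code table above.
EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_and_describe(command: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code with a description."""
    completed = subprocess.run(command, check=False)
    return completed.returncode, describe_exit(completed.returncode)


# Stand-in process that exits with 64, in place of a real entropyguard run.
code, message = run_and_describe([sys.executable, "-c", "raise SystemExit(64)"])
print(code, message)
```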

License

MIT License. See LICENSE file.

Project details

Release history

Version	Released
1.23.0 (this version)	Dec 31, 2025
1.22.3	Dec 29, 2025
1.22.2	Dec 27, 2025
1.22.1	Dec 25, 2025
1.22.0	Dec 23, 2025

Download files

Source Distribution

entropyguard-1.23.0.tar.gz (52.9 kB), uploaded Dec 31, 2025, Source

Built Distribution

entropyguard-1.23.0-py3-none-any.whl (63.7 kB), uploaded Dec 31, 2025, Python 3

File details

Details for the file entropyguard-1.23.0.tar.gz.

File metadata

Download URL: entropyguard-1.23.0.tar.gz
Upload date: Dec 31, 2025
Size: 52.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.11.9 Windows/10

File hashes

Hashes for entropyguard-1.23.0.tar.gz
Algorithm	Hash digest
SHA256	`ce46cc7997458f6b97152663bfa5b2b298b9b521f22c77b92be3d30c7a13ad0c`
MD5	`5f65f9b458c6cde1c060a73dcb8feef0`
BLAKE2b-256	`dbd0b2365ed2224fae699a3faa401b59770e30b9908886fea905be12cb1d663e`

File details

Details for the file entropyguard-1.23.0-py3-none-any.whl.

File metadata

Download URL: entropyguard-1.23.0-py3-none-any.whl
Upload date: Dec 31, 2025
Size: 63.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.11.9 Windows/10

File hashes

Hashes for entropyguard-1.23.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0905f54da9fa58d4b8bb3ce310df5bd2f3f6705bc84291a9d5dd88b6e3b5aeb9`
MD5	`1150c2da0ae2ea1c166c55d422aef96f`
BLAKE2b-256	`6781a5e206289dc28b79144c85bcb7e1318c217512efbc23459bb1bf8ba7197a`
