Enterprise-grade semantic data deduplication and sanitization engine.

EntropyGuard

I built this because processing 100GB datasets with Pandas kept crashing my laptop with out-of-memory (OOM) errors. It is a CLI wrapper around Polars and FAISS that deduplicates text locally.

Problem

Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and leak data. Custom scripts OOM on large files.

Solution

Two-stage deduplication:

  1. Exact dedup: xxHash on normalized text (~5K rows/sec)
  2. Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)
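
Stage 1 can be sketched roughly as follows. This is a minimal illustration, not EntropyGuard's actual code: the normalization rules (lowercasing, collapsing whitespace) are assumptions, and `hashlib.blake2b` from the standard library stands in for xxHash so the example has no dependencies.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_dedup(rows: Iterable[str]) -> Iterator[str]:
    """Yield each row whose normalized form has not been seen before."""
    seen: set[bytes] = set()
    for row in rows:
        # blake2b stands in for xxHash here; only the 8-byte digest is stored.
        digest = hashlib.blake2b(
            normalize(row).encode("utf-8"), digest_size=8
        ).digest()
        if digest not in seen:
            seen.add(digest)
            yield row


rows = ["What's the weather?", "what's  the WEATHER?", "How's the weather?"]
unique_rows = list(exact_dedup(rows))
print(unique_rows)
```

The second row is dropped because it matches the first after normalization. The third row survives, which is the case stage 2 is meant to catch.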

Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.

Tech Stack

  • Polars LazyFrame: Lazy evaluation, processes data > RAM
  • FAISS: Vector similarity search (IndexFlatL2)
  • xxHash: Fast non-crypto hashing for exact duplicates
  • sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
  • Python 3.10+: Full type hints, MyPy strict compatible
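
A note on how a similarity threshold can relate to an L2 index: FAISS's IndexFlatL2 reports squared Euclidean distances, while --dedup-threshold is a similarity between 0.0 and 1.0. For unit-length vectors the two are linked by ||a - b||² = 2 - 2·cos(a, b). The sketch below checks that identity in plain Python. Whether EntropyGuard normalizes embeddings and applies exactly this mapping is an assumption, and the helper names are hypothetical.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    # For unit-length inputs the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_squared_l2(similarity_threshold: float) -> float:
    """Squared L2 distance equivalent to a cosine threshold (unit vectors only)."""
    return 2.0 * (1.0 - similarity_threshold)


a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([1.0, 2.0, 3.5])

# The identity holds up to floating-point rounding.
identity_holds = math.isclose(
    squared_l2(a, b), 2.0 - 2.0 * cosine(a, b), abs_tol=1e-12
)
print(identity_holds, max_squared_l2(0.95))
```

Under this assumption, a threshold of 0.95 corresponds to a squared L2 distance of about 0.1, and a stricter threshold of 0.98 to about 0.04.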

Installation

pip install entropyguard

For PDF support (optional):

pip install entropyguard[pdf]

Requires Python 3.10, 3.11, or 3.12. Python 3.13 is not supported yet because FAISS wheels are missing for it.

Usage

Basic run:

entropyguard --input data.jsonl --output clean.jsonl --text-column text

Strict mode (higher similarity threshold):

entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100

Unix pipe:

cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl

With audit log (for compliance):

entropyguard --input data.jsonl --output clean.jsonl --text-column text --audit-log audit.json

PDF directory (requires entropyguard[pdf]):

entropyguard --input ./my_pdfs_folder --output clean.jsonl --text-column text
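
To drive the same invocations from a Python script, a thin wrapper around `subprocess` is usually enough. The helper names `build_command` and `run` are hypothetical; the flags they emit are the ones documented in this README.

```python
import subprocess
from typing import Optional


def build_command(
    input_path: str,
    output_path: str,
    text_column: Optional[str] = None,
    dedup_threshold: Optional[float] = None,
    audit_log: Optional[str] = None,
) -> list[str]:
    """Assemble an entropyguard argv list from documented flags."""
    cmd = ["entropyguard", "--input", input_path, "--output", output_path]
    if text_column is not None:
        cmd += ["--text-column", text_column]
    if dedup_threshold is not None:
        cmd += ["--dedup-threshold", str(dedup_threshold)]
    if audit_log is not None:
        cmd += ["--audit-log", audit_log]
    return cmd


def run(cmd: list[str]) -> int:
    # Requires entropyguard to be installed and on PATH.
    return subprocess.run(cmd, check=False).returncode


cmd = build_command(
    "data.jsonl", "clean.jsonl", text_column="text", audit_log="audit.json"
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.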

Checkpoint/resume (for large datasets) is set in the config file rather than through CLI flags (see Configuration File section):

{
  "checkpoint_dir": "./checkpoints",
  "resume": true
}

Benchmarks

Tested on 16GB RAM laptop, Python 3.11:

| Dataset Size | Time | Peak Memory | Duplicates Removed |
|--------------|------|-------------|--------------------|
| 1K rows      | ~2s  | ~150MB      | ~30%               |
| 10K rows     | ~15s | ~400MB      | ~45%               |
| 65K rows     | ~2m  | ~900MB      | ~52%               |

For comparison, a naive Pandas approach on the same 65K dataset:

  • OOM at ~40K rows (16GB RAM limit)
  • Would take ~15-20 minutes if it didn't crash
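
As a quick sanity check, the end-to-end throughput implied by the EntropyGuard timings can be worked out directly. The timings are the approximate values from the benchmark table, with "~2m" taken as 120 seconds.

```python
# Rows processed -> approximate wall-clock seconds, from the benchmark table.
benchmarks = {1_000: 2, 10_000: 15, 65_000: 120}

throughput = {rows: rows / seconds for rows, seconds in benchmarks.items()}

for rows, rate in throughput.items():
    print(f"{rows:>6} rows: ~{rate:.0f} rows/sec")
```

All three runs land between roughly 500 and 700 rows/sec, which sits inside the ~500-1000 rows/sec range quoted for the semantic stage.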

Features

  • Local-first: No data leaves your machine
  • Resumable: Checkpoint system for fault tolerance
  • Pipe-friendly: Works with stdin/stdout
  • Memory-safe: Chunked processing, handles datasets > RAM
  • Format support: JSONL, CSV, Parquet, Excel, PDF directories (optional)
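  • Exit codes: sysexits.h-style (0=success, 1=general error, 2=usage error; full list under Exit Codes). The numbering is loosely modeled on sysexits.h and does not match it value for value.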
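
The chunked-processing idea behind the "Memory-safe" point can be illustrated without Polars. The sketch below is not how EntropyGuard is implemented (it relies on Polars LazyFrame); it only shows the pattern of holding one batch in memory at a time. `iter_batches` is a hypothetical helper.

```python
import io
import json
from itertools import islice
from typing import Iterator, TextIO


def iter_batches(stream: TextIO, batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of up to batch_size parsed JSONL records."""
    # Lazily skip blank lines; nothing is read until a batch is requested.
    lines = (line for line in stream if line.strip())
    while True:
        batch = [json.loads(line) for line in islice(lines, batch_size)]
        if not batch:
            return
        yield batch


stream = io.StringIO('{"text": "a"}\n{"text": "b"}\n\n{"text": "c"}\n')
batch_sizes = [len(batch) for batch in iter_batches(stream, batch_size=2)]
print(batch_sizes)
```

Because `stream` can be any text file object, the same generator works for a file on disk or for `sys.stdin` in a pipe.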

PDF Support (Optional)

EntropyGuard can process PDF directories directly when installed with the pdf extra:

pip install entropyguard[pdf]
entropyguard --input ./pdf_folder --output clean.jsonl

Features:

  • Recursively scans directories for PDF files
  • Converts PDFs to Markdown (preserves table structure via IBM's docling)
  • Memory-safe: processes PDFs one at a time using generators
  • Includes source filename metadata (source_file column)
  • Graceful error handling: skips corrupted PDFs and continues processing

Output format: PDF directories are converted to JSONL with text and source_file columns. The text column contains the extracted Markdown content.
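
A downstream script can read that output like any other JSONL file. The records below are made up for illustration; only the column names `text` and `source_file` come from this README, and whether a single PDF produces one row or several is not specified here.

```python
import json
from collections import defaultdict

# Illustrative records shaped like the documented output (values are invented).
sample_output = "\n".join(
    [
        json.dumps({"text": "# Report\n\nTotal: 42", "source_file": "report.pdf"}),
        json.dumps({"text": "# Invoice\n\n| item | price |", "source_file": "invoice.pdf"}),
    ]
)

# Sum the extracted Markdown length per source PDF.
chars_per_file: dict[str, int] = defaultdict(int)
for line in sample_output.splitlines():
    record = json.loads(line)
    chars_per_file[record["source_file"]] += len(record["text"])

print(dict(chars_per_file))
```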

Known Limitations

  • CLI-only: No web UI, no API. It's a command-line tool.
  • English-optimized: Default model (all-MiniLM-L6-v2) is English-only. Multilingual model available (--model-name paraphrase-multilingual-MiniLM-L12-v2) but slower.
  • Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
  • CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
  • FAISS IndexFlatL2: Exact (brute-force) search, so duplicate detection is O(n²) overall. For 10M+ rows, consider approximate search (not implemented).
  • PDF support: Requires optional pdf extra and IBM's docling library
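
To make the O(n²) limitation concrete, here is a brute-force semantic dedup on toy 2-D vectors. It is a sketch under stated assumptions: real embeddings are 384-dimensional, the greedy "keep the first of each group" policy is a guess at the behavior, and the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def semantic_dedup(
    vectors: list[list[float]], threshold: float
) -> tuple[list[int], int]:
    """Return the indices of kept rows and the number of comparisons made."""
    kept: list[int] = []
    comparisons = 0
    for i, vector in enumerate(vectors):
        is_duplicate = False
        # Compare against every row kept so far (brute force).
        for j in kept:
            comparisons += 1
            if cosine(vector, vectors[j]) >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(i)
    return kept, comparisons


def worst_case_pairs(n: int) -> int:
    # With no duplicates, every pair of rows is compared once.
    return n * (n - 1) // 2


vectors = [[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]]
kept, comparisons = semantic_dedup(vectors, threshold=0.95)
print(kept, comparisons, worst_case_pairs(10_000_000))
```

For 10M rows with few duplicates, the worst case is about 5 × 10¹³ pairwise comparisons, which is why approximate search becomes attractive at that scale.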

CLI Flags

Essential flags:

| Flag | Default | Description |
|------|---------|-------------|
| --input | required | Input file path (CSV, JSONL, Parquet, Excel) or directory (PDF files). Use - for stdin |
| --output | required | Output file path (or - for stdout) |
| --text-column | auto-detect | Column name containing text (defaults to 'text' for PDF directories) |
| --dedup-threshold | 0.95 | Similarity threshold (0.0-1.0, higher = stricter) |
| --min-length | 50 | Minimum text length after sanitization |
| --model-name | all-MiniLM-L6-v2 | Sentence-transformers model for embeddings |
| --required-columns | None | Comma-separated list of required columns (schema validation) |
| --audit-log | None | Path to JSON file for audit log of dropped/duplicate rows |
| --chunk-size | None | Chunk size (characters) for splitting long texts before embedding |
| --chunk-overlap | 50 | Overlap size (characters) between consecutive chunks |
| --separators | None | Custom separators for text chunking (space-separated list) |
| --profile-memory | false | Enable memory profiling during processing |

Note: batch_size and the checkpoint options (checkpoint_dir, resume) are not exposed as CLI flags. Set them in the configuration file. See Configuration File section below.

Full flag reference: entropyguard --help
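
The interplay of --chunk-size and --chunk-overlap is easiest to see on a short string. The sketch below uses a fixed-width sliding window over characters and ignores --separators entirely; whether EntropyGuard splits this way is an assumption, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size chars sharing chunk_overlap chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk is repeated as the first character of the next.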

Configuration File

Create .entropyguardrc.json in your project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000,
  "checkpoint_dir": "./checkpoints",
  "resume": false,
  "auto_resume": true
}

CLI flags override config file values. Some options (like batch_size, checkpoint_dir, resume) are only available via configuration file.
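
The precedence rule can be pictured as a layered merge: built-in defaults first, then the config file, then any flags given on the command line. This is a sketch of the rule as stated, not the tool's actual resolution code; `resolve_settings` is a hypothetical name, and the defaults shown are the ones listed in the flag table.

```python
from typing import Any

# Defaults taken from the CLI flag table.
DEFAULTS: dict[str, Any] = {"dedup_threshold": 0.95, "min_length": 50}


def resolve_settings(
    defaults: dict[str, Any],
    config_file: dict[str, Any],
    cli_flags: dict[str, Any],
) -> dict[str, Any]:
    """Merge settings so that CLI flags win over the config file."""
    merged = dict(defaults)
    merged.update(config_file)
    # A value of None means the flag was not passed on the command line.
    merged.update({key: value for key, value in cli_flags.items() if value is not None})
    return merged


config_file = {"min_length": 100, "batch_size": 10000}
cli_flags = {"dedup_threshold": 0.98, "min_length": None}

settings = resolve_settings(DEFAULTS, config_file, cli_flags)
print(settings)
```

Here `min_length` comes from the config file, `dedup_threshold` from the command line, and `batch_size` can only come from the config file.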

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Usage error (invalid args) |
| 64 | Data format error |
| 65 | Input file error |
| 66 | Output file error |
| 70 | Software error (bug) |
| 130 | Interrupted (Ctrl+C) |
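
In a wrapper script, these codes can be turned back into readable messages. The mapping below copies the list above; `describe_exit` is a hypothetical helper, not part of EntropyGuard.

```python
# Exit codes as documented in this README.
EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Return the documented meaning of an exit code, if there is one."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


print(describe_exit(65))
```

A typical use is `describe_exit(subprocess.run(cmd).returncode)` after launching the CLI from Python.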

License

MIT License. See LICENSE file.
