
Enterprise-grade semantic data deduplication and sanitization engine.

EntropyGuard

I built this because processing 100GB datasets with Pandas kept crashing my laptop (OOM). This is a CLI wrapper around Polars and FAISS to dedup text locally.

Problem

Training data for LLMs is usually full of duplicates. Hash-based dedup misses semantic duplicates ("What's the weather?" vs "How's the weather?"). Cloud APIs cost money and leak data. Custom scripts OOM on large files.

Solution

Two-stage deduplication:

  1. Exact dedup: xxHash on normalized text (~5K rows/sec)
  2. Semantic dedup: FAISS vector search on sentence-transformers embeddings (~500-1000 rows/sec)
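The exact-dedup stage can be pictured with the following minimal sketch. The normalization rules (NFKC, lowercasing, whitespace collapsing) and the helper names `normalize` and `exact_dedup` are assumptions for illustration, not EntropyGuard's actual code. If the `xxhash` package is not installed, the sketch falls back to `hashlib.blake2b` as a stand-in.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

try:
    import xxhash  # third-party; optional in this sketch

    def _digest(data: bytes) -> str:
        return xxhash.xxh64(data).hexdigest()
except ImportError:
    def _digest(data: bytes) -> str:
        # Stand-in so the sketch runs without xxhash installed.
        return hashlib.blake2b(data, digest_size=8).hexdigest()


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each text, comparing by normalized hash."""
    seen: set[str] = set()
    for text in texts:
        key = _digest(normalize(text).encode("utf-8"))
        if key not in seen:
            seen.add(key)
            yield text


rows = ["What's the weather?", "what's  the WEATHER?", "How's the weather?"]
# The second row collapses onto the first; the third survives because it is
# only a semantic duplicate, which is what stage 2 is for.
print(list(exact_dedup(rows)))
```

Only the set of hashes is held in memory, so this stage streams over the input one row at a time.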

Uses Polars LazyFrame so you can process datasets larger than RAM. Everything runs locally. No cloud calls.

Tech Stack

  • Polars LazyFrame: Lazy evaluation, processes data > RAM
  • FAISS: Vector similarity search (IndexFlatL2)
  • xxHash: Fast non-crypto hashing for exact duplicates
  • sentence-transformers: Embeddings (default: all-MiniLM-L6-v2, 384-dim)
  • Python 3.10-3.12: Full type hints, MyPy strict compatible
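IndexFlatL2 reports squared Euclidean distances, while a threshold such as 0.95 reads most naturally as a cosine similarity. The sketch below shows how the two relate, assuming the embeddings are L2-normalized and the threshold is a cosine similarity. Neither assumption is confirmed by this README, and the function names are illustrative.

```python
import math
from typing import Sequence


def unit(v: Sequence[float]) -> list[float]:
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; equals the dot product because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_to_squared_l2(similarity: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 2.0 - 2.0 * similarity


a = unit([1.0, 2.0, 3.0])
b = unit([1.0, 2.0, 3.5])
# The identity holds for any pair of unit vectors.
assert math.isclose(
    squared_l2(a, b), similarity_to_squared_l2(cosine(a, b)), abs_tol=1e-12
)
# A similarity threshold of 0.95 corresponds to a squared distance of about 0.10.
print(similarity_to_squared_l2(0.95))
```

Under these assumptions, "similarity at least 0.95" and "squared L2 distance at most about 0.10" select the same pairs.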

Installation

pip install entropyguard

Requires Python 3.10, 3.11, or 3.12. Python 3.13 is not supported (FAISS wheels are missing).

Usage

Basic run:

entropyguard --input data.jsonl --output clean.jsonl --text-column text

Strict mode (higher similarity threshold):

entropyguard --input data.jsonl --output clean.jsonl --dedup-threshold 0.98 --min-length 100

Unix pipe:

cat data.jsonl | entropyguard --dedup-threshold 0.95 > clean.jsonl
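Because the tool reads stdin and writes stdout, it can sit between other line-oriented JSONL filters. The sketch below is a hypothetical pre-filter of that kind. `filter_jsonl` is not part of EntropyGuard, and the `text` column name is an assumption.

```python
import io
import json
from typing import TextIO


def filter_jsonl(
    src: TextIO, dst: TextIO, text_column: str = "text", min_length: int = 50
) -> int:
    """Copy JSONL rows from src to dst, keeping rows whose text is long enough.

    Returns the number of rows written.
    """
    kept = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if len(str(row.get(text_column, ""))) >= min_length:
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            kept += 1
    return kept


# In-memory demonstration; in a real pipe, pass sys.stdin and sys.stdout.
source = io.StringIO('{"text": "short"}\n\n{"text": "long enough to keep"}\n')
sink = io.StringIO()
print(filter_jsonl(source, sink, min_length=10), "row(s) kept")
print(sink.getvalue(), end="")
```

Reading and writing one line at a time keeps memory use flat regardless of input size, which is the property that makes such filters safe to chain.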

With audit log (for compliance):

entropyguard --input data.jsonl --output clean.jsonl --text-column text --audit-log audit.json

Checkpoint/resume (for large datasets) is configured through the config file rather than CLI flags (see Configuration File section):

{
  "checkpoint_dir": "./checkpoints",
  "resume": true
}

Benchmarks

Tested on 16GB RAM laptop, Python 3.11:

Dataset Size  Time  Peak Memory  Duplicates Removed
1K rows       ~2s   ~150MB       ~30%
10K rows      ~15s  ~400MB       ~45%
65K rows      ~2m   ~900MB       ~52%
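As a quick sanity check, the end-to-end throughput implied by these figures can be computed directly. The timings are approximate, and "~2m" is taken as 120 seconds here.

```python
# Rows and approximate wall-clock seconds from the benchmark figures above.
benchmarks = {1_000: 2, 10_000: 15, 65_000: 120}


def rows_per_second(rows: int, seconds: float) -> float:
    """Average end-to-end throughput for one benchmark run."""
    return rows / seconds


for rows, seconds in benchmarks.items():
    print(f"{rows:>6} rows: ~{rows_per_second(rows, seconds):.0f} rows/sec")
```

All three runs land in the 500-1000 rows/sec range quoted for the semantic stage, which suggests that embedding and vector search dominate the total runtime.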

For comparison, a naive Pandas approach on the same 65K dataset:

  • OOM at ~40K rows (16GB RAM limit)
  • Would take ~15-20 minutes if it didn't crash

Features

  • Local-first: No data leaves your machine
  • Resumable: Checkpoint system for fault tolerance
  • Pipe-friendly: Works with stdin/stdout
  • Memory-safe: Chunked processing, handles datasets > RAM
  • Format support: JSONL, CSV, Parquet, Excel
  • Exit codes: sysexits.h-style (0=success, 1=general error, 2=usage error; full list under Exit Codes)
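Chunked processing, in general terms, means pulling a bounded number of rows at a time from a stream instead of loading the whole file. The sketch below shows the idea with the standard library only. `batches` is a hypothetical helper and says nothing about how EntropyGuard or Polars implement it internally.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is materialized at a time, so peak memory depends on
# batch_size rather than on the total number of rows.
for batch in batches(range(7), batch_size=3):
    print(batch)
```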

Known Limitations

  • CLI-only: No web UI, no API. It's a command-line tool.
  • English-optimized: The default model (all-MiniLM-L6-v2) is English-only. A multilingual model is available (--model-name paraphrase-multilingual-MiniLM-L12-v2), but it is slower.
  • Slower than hash-only: Semantic dedup is ~10x slower than pure hash dedup. Trade-off for accuracy.
  • CPU-only: No GPU acceleration (yet). Uses PyTorch CPU backend.
  • FAISS IndexFlatL2: Exact brute-force search, so each query scans every stored vector and duplicate detection is O(n²) overall. For 10M+ rows, consider approximate search (not implemented).
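To make the quadratic cost concrete, here is a pure-Python sketch of brute-force near-duplicate removal. It compares each vector against every vector kept so far, which is the same amount of work a flat index performs. The greedy "keep the first occurrence" policy and the use of cosine similarity are assumptions for illustration, not a description of EntropyGuard's internals.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_semantic_dedup(
    vectors: Sequence[Sequence[float]], threshold: float = 0.95
) -> list[int]:
    """Return indices of rows to keep.

    A row is dropped if it is at least `threshold` similar to an earlier
    kept row. Worst case is n * (n - 1) / 2 comparisons, i.e. O(n^2).
    """
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],  # nearly parallel to row 0, so treated as a duplicate
    [0.0, 1.0],    # orthogonal to row 0, so kept
]
print(greedy_semantic_dedup(embeddings))
```

Doubling the number of rows roughly quadruples the comparisons, which is why approximate indexes become attractive at the 10M+ scale.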

CLI Flags

Essential flags:

Flag                Default            Description
--input             required           Input file path (or - for stdin)
--output            required           Output file path (or - for stdout)
--text-column       auto-detect        Column name containing text
--dedup-threshold   0.95               Similarity threshold (0.0-1.0, higher = stricter)
--min-length        50                 Minimum text length after sanitization
--model-name        all-MiniLM-L6-v2   Sentence-transformers model for embeddings
--required-columns  None               Comma-separated list of required columns (schema validation)
--audit-log         None               Path to JSON file for audit log of dropped/duplicate rows
--chunk-size        None               Chunk size (characters) for splitting long texts before embedding
--chunk-overlap     50                 Overlap size (characters) between consecutive chunks
--separators        None               Custom separators for text chunking (space-separated list)
--profile-memory    false              Enable memory profiling during processing

Note: batch size and the checkpoint options (checkpoint_dir, resume) have no CLI flags; they can only be set in the configuration file. See Configuration File section below.

Full flag reference: entropyguard --help
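The interaction between --chunk-size and --chunk-overlap can be illustrated with a simple sliding window over characters. This is a sketch under stated assumptions: `chunk_text` is a hypothetical helper, it ignores --separators entirely, and the tool's real splitting logic may differ.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap preserves more context across chunk boundaries at the cost of embedding more characters in total.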

Configuration File

Create .entropyguardrc.json in your project root:

{
  "text_column": "text",
  "min_length": 100,
  "dedup_threshold": 0.95,
  "batch_size": 10000,
  "checkpoint_dir": "./checkpoints",
  "resume": false,
  "auto_resume": true
}

CLI flags override config file values. Some options (like batch_size, checkpoint_dir, resume) are only available via configuration file.
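The precedence rule (defaults, then config file, then CLI flags) can be sketched as a layered dictionary merge. `merge_settings` is a hypothetical helper, and treating an unset CLI flag as `None` is an assumption made for this example.

```python
import json
from typing import Any

# Defaults taken from the CLI flags table above.
DEFAULTS: dict[str, Any] = {"dedup_threshold": 0.95, "min_length": 50}


def merge_settings(
    defaults: dict[str, Any],
    file_config: dict[str, Any],
    cli_args: dict[str, Any],
) -> dict[str, Any]:
    """Layer settings so later sources win: defaults < config file < CLI."""
    merged = dict(defaults)
    merged.update(file_config)
    # Assumption: a CLI value of None means the flag was not passed.
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged


file_config = json.loads('{"min_length": 100, "batch_size": 10000}')
cli_args = {"dedup_threshold": 0.98, "min_length": None}
print(merge_settings(DEFAULTS, file_config, cli_args))
```

Here dedup_threshold comes from the CLI, min_length from the config file, and batch_size exists only because the config file supplies it.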

Exit Codes

Code  Meaning
0     Success
1     General error
2     Usage error (invalid args)
64    Data format error
65    Input file error
66    Output file error
70    Software error (bug)
130   Interrupted (Ctrl+C)
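When calling the tool from a script, the exit code can be turned back into a readable message. The sketch below copies the table above into a dictionary. `describe_exit_code` and `run_and_describe` are hypothetical helpers, and the demonstration runs a tiny Python child process in place of `entropyguard` so it works without the package installed.

```python
import subprocess
import sys

# Meanings copied from the exit code table above.
EXIT_CODES = {
    0: "Success",
    1: "General error",
    2: "Usage error (invalid args)",
    64: "Data format error",
    65: "Input file error",
    66: "Output file error",
    70: "Software error (bug)",
    130: "Interrupted (Ctrl+C)",
}


def describe_exit_code(code: int) -> str:
    """Map an exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_and_describe(command: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code with a description."""
    completed = subprocess.run(command, check=False)
    return completed.returncode, describe_exit_code(completed.returncode)


# Stand-in child process that exits with code 64.
code, meaning = run_and_describe([sys.executable, "-c", "raise SystemExit(64)"])
print(code, meaning)
```

In a real script the command list would be something like `["entropyguard", "--input", "data.jsonl", "--output", "clean.jsonl"]`, using the flags documented above.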

License

MIT License. See LICENSE file.
