FastRustRAG: Blazing-fast document deduplication using MinHash and LSH - 50-100x faster than Python
FastRustRAG
Rust-based document deduplication for RAG pipelines. In the benchmarks below it runs 56-121x faster than datasketch on batches of 500-5,000 documents.
Why?
Python's MinHash libraries are slow. FastRustRAG uses Rust + Rayon for parallel processing. I built it because I needed fast deduplication for preprocessing large document collections.
Performance
Benchmarked against datasketch (popular Python MinHash library):
| Documents | Python | FastRustRAG | Speedup |
|---|---|---|---|
| 500 | 246ms | 2ms | 121x |
| 1,000 | 414ms | 5ms | 81x |
| 2,000 | 838ms | 10ms | 79x |
| 5,000 | 2.1s | 38ms | 56x |
| 10,000 | 4.2s | 360ms | 11x |
| 50,000 | 21s | 2.4s | 8x |
The speedup is largest on batches of 500-5,000 documents (typical RAG use case) and drops to 8-11x at 10,000-50,000 documents.
Installation
pip install fastrustrag
Usage
import fastrustrag
# Create pipeline
pipeline = fastrustrag.DeduplicationPipeline(
    num_bands=20,
    num_hashes=128,
    shingle_size=3,
    similarity_threshold=0.8
)
# Your documents
docs = [
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox jumps over a lazy dog",
    "Completely different content",
]
# Process and find duplicates
pipeline.process_documents(docs)
duplicates = pipeline.deduplicate_corpus()
for i, j, similarity in duplicates:
    print(f"Docs {i} and {j} are {similarity*100:.1f}% similar")
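To actually drop the duplicates, the pairs returned by `deduplicate_corpus()` can be turned into a list of documents to keep. The helper below is a minimal sketch and not part of FastRustRAG: `drop_duplicates` is a hypothetical name, and it assumes the IDs in each `(i, j, similarity)` tuple are indices into the list passed to `process_documents`. It keeps the lower-indexed document of every pair.

```python
def drop_duplicates(docs, duplicate_pairs):
    """Return docs with the higher-indexed member of each duplicate pair removed.

    Assumes each pair is (i, j, similarity) and that i and j index into docs.
    """
    to_drop = {max(i, j) for i, j, _similarity in duplicate_pairs}
    return [doc for idx, doc in enumerate(docs) if idx not in to_drop]


# Hand-written pairs in the shape deduplicate_corpus() is documented to return.
example_docs = ["doc a", "doc a (copy)", "doc b", "doc a (second copy)"]
example_pairs = [(0, 1, 0.93), (0, 3, 0.88), (1, 3, 0.91)]

unique_docs = drop_duplicates(example_docs, example_pairs)
print(unique_docs)  # ['doc a', 'doc b']
```

Keeping the lowest index is only one possible policy; keeping the longest or most recent document of each pair would work the same way with a different choice of which ID to drop.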
API
DeduplicationPipeline
Parameters:
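- num_bands (int): Number of LSH bands. In standard LSH banding with num_hashes fixed, more bands means fewer rows per band: more candidate pairs and higher recall, at the cost of more false-positive candidates to verify. Default: 20
- num_hashes (int): Number of MinHash functions. Higher = more accurate similarity estimates, slower. Default: 128
- shingle_size (int): n-gram size. Use 2-3 for short texts, 3-5 for longer documents. Default: 3
- similarity_threshold (float): Minimum similarity (0-1) for two documents to count as duplicates. Default: 0.8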
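How `num_bands` and `num_hashes` interact can be seen from the textbook LSH banding formula: with b bands of r rows each, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve for the default parameters. It assumes FastRustRAG follows the textbook scheme with r = num_hashes // num_bands, which this README does not state; `candidate_probability` is an illustrative helper, not a library function.

```python
def candidate_probability(similarity, num_bands, rows_per_band):
    """Probability that two documents share at least one identical LSH band.

    Textbook banding formula: 1 - (1 - s^r)^b.
    """
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


num_hashes, num_bands = 128, 20
# Assumption: rows per band comes from integer division, leftover hashes unused.
rows_per_band = num_hashes // num_bands

for s in (0.3, 0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands, rows_per_band)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

Under this assumption, raising `num_bands` with `num_hashes` fixed moves the curve toward lower similarities (more candidates), and lowering it does the opposite.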
Methods:
# Process documents (returns count)
count = pipeline.process_documents(documents: list[str]) -> int
# Find all duplicate pairs
duplicates = pipeline.deduplicate_corpus() -> list[tuple[int, int, float]]
# Returns: [(doc_id1, doc_id2, similarity), ...]
# Find duplicates for specific query
results = pipeline.find_duplicates(query: str) -> list[tuple[int, str, float]]
# Get document by ID
doc = pipeline.get_document(doc_id: int) -> str | None
How it works
- MinHash: Generates hash signatures for fast similarity estimation
- LSH: Locality-sensitive hashing for efficient candidate generation
- Rayon: Automatic parallelization across CPU cores
The speedup comes from:
- Compiled Rust (no Python interpreter overhead)
- Parallel processing with Rayon (uses all cores)
- Efficient memory layout
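The MinHash and LSH steps above can be sketched in pure Python. This illustrates the general technique under stated assumptions and is not FastRustRAG's Rust implementation: it shingles on words (the README does not say whether shingles are words or characters), simulates the hash family by salting BLAKE2b instead of using AHash, and every function name is made up for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, size=3):
    """Set of word n-grams (assumption: word-level shingles)."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def hash64(token, seed):
    """64-bit hash of token; the seed selects one function of the family."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=128):
    """Minimum hash value of the set under each of num_hashes functions."""
    return [min(hash64(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def candidate_pairs(signatures, num_bands=20):
    """Pairs of document indices whose signatures agree on at least one band."""
    rows = len(signatures[0]) // num_bands
    pairs = set()
    for band in range(num_bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return pairs


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "completely different content about something else entirely",
]
signatures = [minhash_signature(shingles(d)) for d in docs]

# Only candidate pairs are compared, which avoids scoring every pair of documents.
for i, j in sorted(candidate_pairs(signatures)):
    print(i, j, estimate_similarity(signatures[i], signatures[j]))
```

The signature loop over documents is independent per document, which is the kind of work a data-parallel library such as Rayon can spread across cores.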
Use cases
- Remove duplicate documents before indexing for RAG
- Deduplicate web scraping results
- Find plagiarized or copied content
- Clean datasets before training
Technical details
- Built with PyO3 for Python bindings
- Uses Rayon for data parallelism
- Thread-safe with RwLock for concurrent access
- AHash for fast non-cryptographic hashing
License
MIT
Contributing
Issues and PRs welcome. Main areas for improvement:
- Streaming API for very large datasets
- Additional distance metrics
- Persistence (save/load index)