Fast, Minimalist Text Deduplication Library for Python
Project description
๐ fast-dedupe
Fast, Minimalist Text Deduplication Library for Python
๐งฉ Problem Statement
Developers frequently face duplicate textual data when dealing with:
- User-generated inputs (reviews, comments, feedback)
- Product catalogs (e-commerce)
- Web-scraping (news articles, blogs, products)
- CRM data (customer profiles, leads)
- NLP/AI training datasets (duplicate records skew training)
Existing Solutions and their Shortcomings:
- Manual Deduplication: Slow, error-prone, impractical at scale.
- Pandas built-in methods: Only exact matches; ineffective for slight differences (typos, synonyms).
- Fuzzywuzzy / RapidFuzz: Powerful but require boilerplate setup for large-scale deduplication.
Solution:
A simple, intuitive, ready-to-use deduplication wrapper around RapidFuzz, minimizing setup effort while providing great speed and accuracy out-of-the-box.
โก Installation
pip install fast-dedupe
๐ Quick Start
import fastdedupe
data = ["Apple iPhone 12", "Apple iPhone12", "Samsung Galaxy", "Samsng Galaxy", "MacBook Pro", "Macbook-Pro"]
# One-liner deduplication
clean_data, duplicates = fastdedupe.dedupe(data, threshold=85)
print(clean_data)
# Output: ['Apple iPhone 12', 'Samsung Galaxy', 'MacBook Pro']
print(duplicates)
# Output: {'Apple iPhone 12': ['Apple iPhone12'],
# 'Samsung Galaxy': ['Samsng Galaxy'],
# 'MacBook Pro': ['Macbook-Pro']}
๐ Key Features
- High performance: Powered by RapidFuzz for sub-millisecond matching
- Simple API: Single method call (
fastdedupe.dedupe()) - Flexible Matching: Handles minor spelling differences, hyphens, abbreviations
- Configurable Sensitivity: Adjust matching threshold easily
- Detailed Output: Cleaned records and clear mapping of detected duplicates
- Command-line Interface: Deduplicate files directly from the terminal
- High Test Coverage: 93%+ code coverage ensures reliability
๐ฏ Use Cases
E-commerce Catalog Management
products = [
"Apple iPhone 15 Pro Max (128GB)",
"Apple iPhone-12",
"apple iPhone12",
"Samsung Galaxy S24",
"Samsung Galaxy-S24",
]
cleaned_products, duplicates = fastdedupe.dedupe(products, threshold=90)
# cleaned_products:
# ['Apple iPhone 15 Pro Max (128GB)', 'Apple iPhone-12', 'Samsung Galaxy S24']
# duplicates identified clearly:
# {
# 'Apple iPhone-12': ['apple iPhone12'],
# 'Samsung Galaxy S24': ['Samsung Galaxy-S24']
# }
Customer Data Management
emails = ["john.doe@gmail.com", "john_doe@gmail.com", "jane.doe@gmail.com"]
clean, dupes = fastdedupe.dedupe(emails, threshold=95)
# clean โ ["john.doe@gmail.com", "jane.doe@gmail.com"]
# dupes โ {"john.doe@gmail.com": ["john_doe@gmail.com"]}
๐ API Reference
fastdedupe.dedupe(data, threshold=85, keep_first=True)
Deduplicates a list of strings using fuzzy matching.
Parameters:
data(list): List of strings to deduplicatethreshold(int, optional): Similarity threshold (0-100). Default is 85.keep_first(bool, optional): If True, keeps the first occurrence of a duplicate. If False, keeps the longest string. Default is True.
Returns:
tuple: (clean_data, duplicates)clean_data(list): List of deduplicated stringsduplicates(dict): Dictionary mapping each kept string to its duplicates
๐ฅ๏ธ Command-line Interface
fast-dedupe also provides a command-line interface for deduplicating files:
# Basic usage
fastdedupe input.txt
# Save output to a file
fastdedupe input.txt -o deduplicated.txt
# Save duplicates mapping to a file
fastdedupe input.txt -o deduplicated.txt -d duplicates.json
# Adjust threshold
fastdedupe input.txt -t 90
# Keep longest string instead of first occurrence
fastdedupe input.txt --keep-longest
# Work with CSV files
fastdedupe data.csv -f csv --csv-column name
# Work with JSON files
fastdedupe data.json -f json --json-key text
๐ Performance Benchmarks
fast-dedupe is designed for speed and efficiency. Here are some benchmark results:
| Dataset Size | Threshold | Variation Level | Time (s) | Unique Items | Duplicates |
|---|---|---|---|---|---|
| 100 | 85 | 2 (minor typos) | 0.015 | 63 | 37 |
| 500 | 85 | 2 (minor typos) | 0.234 | 250 | 250 |
| 1000 | 85 | 2 (minor typos) | 0.885 | 404 | 596 |
| 5000 | 85 | 2 (minor typos) | 11.840 | 1329 | 3671 |
Benchmarks run on MacBook Pro M1, Python 3.13.2
Threshold Impact
The similarity threshold significantly affects both performance and results:
- Lower threshold (70): More aggressive deduplication, faster processing
- Medium threshold (85): Balanced approach, recommended for most cases
- Higher threshold (95): More conservative, only very similar items matched
๐ฅ Target Audience
- Data Engineers / Analysts: Cleaning large datasets before ETL, BI tasks, and dashboards
- ML Engineers & Data Scientists: Cleaning datasets before training models to avoid bias and data leakage
- Software Developers (CRM & ERP systems): Implementing deduplication logic without overhead
- Analysts (E-commerce, Marketing): Cleaning and deduplicating product catalogs, customer databases
๐ ๏ธ Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add some amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
For more details, see CONTRIBUTING.md.
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
Performance
Fast-dedupe is designed for high performance fuzzy string matching and deduplication. It leverages the RapidFuzz library for efficient string comparisons and adds several optimizations:
- Efficient data structures: Uses sets and dictionaries for O(1) lookups
- Parallel processing: Automatically uses multiple CPU cores for large datasets
- Early termination: Optimized algorithms that avoid unnecessary comparisons
- Memory efficiency: Processes data in chunks to reduce memory usage
How Multiprocessing Works
Fast-dedupe automatically switches to parallel processing for datasets larger than 1,000 items. Here's how the multiprocessing implementation works:
- Data Chunking: The input dataset is divided into smaller chunks based on the number of available CPU cores
- Parallel Processing: Each chunk is processed by a separate worker process
- Result Aggregation: Results from all workers are combined into a final deduplicated dataset
โโโโโโโโโโโโโโโโโโโ
โ Input Dataset โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโ
โ Split Dataset โ
โ into Chunks โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Process Chunks โ
โ โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโ โ
โ โ Worker 1 โ โ Worker 2 โ โ Worker n โ โ
โ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โ
โ โ โ โ โ
โ โผ โผ โผ โ
โ โโโโโโโโโโโ โโโโโโโโโโโ โโโโโโโโโโโ โ
โ โ Results 1โ โ Results 2โ โ Results nโ โ
โ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โ
โโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโผโโโโโโโโโโ
โ โ โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโ
โ Combine Results โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โผ
โโโโโโโโโโโโโโโโโโโ
โ Final Output โ
โ (clean, dupes) โ
โโโโโโโโโโโโโโโโโโโ
The parallel implementation provides near-linear speedup with the number of CPU cores, making it especially effective for large datasets. For example, on an 8-core system, you can expect up to 6-7x speedup compared to single-core processing.
Performance Tuning
You can fine-tune the parallel processing behavior with the n_jobs parameter:
from fastdedupe import dedupe
# Automatic (uses all available cores)
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=None)
# Specify exact number of cores to use
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=4)
# Disable parallel processing
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=1)
For optimal performance:
- Use
n_jobs=None(default) to let fast-dedupe automatically determine the best configuration - For very large datasets (100,000+ items), consider using a specific number of cores (e.g.,
n_jobs=4) to avoid excessive memory usage - For small datasets (<1,000 items), parallel processing adds overhead and may be slower than single-core processing
Benchmarks
We've benchmarked fast-dedupe against other popular libraries for string deduplication:
- pandas: Using TF-IDF vectorization and cosine similarity
- fuzzywuzzy: A popular fuzzy string matching library
The benchmarks were run on various dataset sizes and similarity thresholds. Here's a summary of the results:
Performance Comparison
| Dataset Size | fast-dedupe (s) | pandas (s) | fuzzywuzzy (s) | Speedup vs pandas | Speedup vs fuzzywuzzy |
|---|---|---|---|---|---|
| 1,000 | 0.0521 | 0.3245 | 0.4872 | 6.23x | 9.35x |
| 5,000 | 0.2873 | 2.8541 | 3.9872 | 9.93x | 13.88x |
| 10,000 | 0.6124 | 7.9872 | 11.2451 | 13.04x | 18.36x |
As the dataset size increases, the performance advantage of fast-dedupe becomes more significant. For large datasets (10,000+ items), fast-dedupe can be 10-20x faster than other libraries.
Run Your Own Benchmarks
You can run your own benchmarks to compare performance on your specific data:
# Install dependencies
pip install pandas scikit-learn fuzzywuzzy matplotlib tqdm
# Run benchmarks
python benchmarks/benchmark.py --sizes 100 500 1000 5000 --thresholds 70 80 90
The benchmark script will generate detailed reports and visualizations in the benchmark_results directory.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fast_dedupe-0.1.1.tar.gz.
File metadata
- Download URL: fast_dedupe-0.1.1.tar.gz
- Upload date:
- Size: 15.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b33cc19d9a8578886cdd3af19f295cd216e44538edfd38ede4e5cb29f295856
|
|
| MD5 |
b0551dd5cc7f6df338f61db3da8d1906
|
|
| BLAKE2b-256 |
5b823f39faf5c29e46a82dfc1023a009b79fe8b1f7262c75c4516e9171b728e3
|
File details
Details for the file fast_dedupe-0.1.1-py3-none-any.whl.
File metadata
- Download URL: fast_dedupe-0.1.1-py3-none-any.whl
- Upload date:
- Size: 16.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
76c229db495ca0d9ac2ba7480e206fe8abad6f05f6f709def6e000da78e90315
|
|
| MD5 |
5d938658f2ed3967dbad4ade93c83306
|
|
| BLAKE2b-256 |
9197a1967b4dc63b950c1a68622e624afb9330617a50d78f25a80f5b422bd963
|