
🚀 fast-dedupe


Fast, Minimalist Text Deduplication Library for Python

🧩 Problem Statement

Developers frequently face duplicate textual data when dealing with:

  • User-generated inputs (reviews, comments, feedback)
  • Product catalogs (e-commerce)
  • Web-scraping (news articles, blogs, products)
  • CRM data (customer profiles, leads)
  • NLP/AI training datasets (duplicate records skew training)

Existing Solutions and their Shortcomings:

  • Manual Deduplication: Slow, error-prone, impractical at scale.
  • Pandas built-in methods: Only exact matches; ineffective for slight differences (typos, synonyms).
  • Fuzzywuzzy / RapidFuzz: Powerful but require boilerplate setup for large-scale deduplication.

Solution:
fast-dedupe is a simple, intuitive, ready-to-use deduplication wrapper around RapidFuzz that minimizes setup effort while delivering strong speed and accuracy out of the box.

⚡ Installation

pip install fast-dedupe

🚀 Quick Start

import fastdedupe

data = ["Apple iPhone 12", "Apple iPhone12", "Samsung Galaxy", "Samsng Galaxy", "MacBook Pro", "Macbook-Pro"]

# One-liner deduplication
clean_data, duplicates = fastdedupe.dedupe(data, threshold=85)

print(clean_data)
# Output: ['Apple iPhone 12', 'Samsung Galaxy', 'MacBook Pro']

print(duplicates)
# Output: {'Apple iPhone 12': ['Apple iPhone12'], 
#          'Samsung Galaxy': ['Samsng Galaxy'], 
#          'MacBook Pro': ['Macbook-Pro']}

📌 Key Features

  • High performance: Powered by RapidFuzz for sub-millisecond matching
  • Simple API: Single method call (fastdedupe.dedupe())
  • Flexible Matching: Handles minor spelling differences, hyphens, abbreviations
  • Configurable Sensitivity: Adjust matching threshold easily
  • Detailed Output: Cleaned records and clear mapping of detected duplicates
  • Command-line Interface: Deduplicate files directly from the terminal
  • High Test Coverage: 93%+ code coverage ensures reliability

🎯 Use Cases

E-commerce Catalog Management

import fastdedupe

products = [
    "Apple iPhone 15 Pro Max (128GB)",
    "Apple iPhone-12",
    "apple iPhone12",
    "Samsung Galaxy S24",
    "Samsung Galaxy-S24",
]

cleaned_products, duplicates = fastdedupe.dedupe(products, threshold=90)

# cleaned_products:
# ['Apple iPhone 15 Pro Max (128GB)', 'Apple iPhone-12', 'Samsung Galaxy S24']

# duplicates identified clearly:
# {
#   'Apple iPhone-12': ['apple iPhone12'],
#   'Samsung Galaxy S24': ['Samsung Galaxy-S24']
# }

Customer Data Management

emails = ["john.doe@gmail.com", "john_doe@gmail.com", "jane.doe@gmail.com"]
clean, dupes = fastdedupe.dedupe(emails, threshold=95)

# clean → ["john.doe@gmail.com", "jane.doe@gmail.com"]
# dupes → {"john.doe@gmail.com": ["john_doe@gmail.com"]}

📖 API Reference

fastdedupe.dedupe(data, threshold=85, keep_first=True)

Deduplicates a list of strings using fuzzy matching.

Parameters:

  • data (list): List of strings to deduplicate
  • threshold (int, optional): Similarity threshold (0-100). Default is 85.
  • keep_first (bool, optional): If True, keeps the first occurrence of a duplicate. If False, keeps the longest string. Default is True.

Returns:

  • tuple: (clean_data, duplicates)
    • clean_data (list): List of deduplicated strings
    • duplicates (dict): Dictionary mapping each kept string to its duplicates
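To make the return contract concrete, here is a deliberately simplified, stdlib-only sketch that produces the same `(clean_data, duplicates)` shape. It uses `difflib` rather than RapidFuzz, so its scores (and speed) differ from fast-dedupe's, and `dedupe_sketch` is a hypothetical helper, not part of the library:

```python
from difflib import SequenceMatcher

def dedupe_sketch(data, threshold=85):
    """Illustrative reimplementation of the (clean_data, duplicates)
    contract with keep_first=True behaviour. Unlike the real library,
    the duplicates dict here only lists keys that have duplicates."""
    clean, duplicates = [], {}
    for item in data:
        for kept in clean:
            # Similarity as a 0-100 score, matching the threshold scale
            score = SequenceMatcher(None, item.lower(), kept.lower()).ratio() * 100
            if score >= threshold:
                duplicates.setdefault(kept, []).append(item)
                break
        else:
            clean.append(item)
    return clean, duplicates

clean, dupes = dedupe_sketch(
    ["Apple iPhone 12", "Apple iPhone12", "Samsung Galaxy"], threshold=85
)
# clean -> ['Apple iPhone 12', 'Samsung Galaxy']
# dupes -> {'Apple iPhone 12': ['Apple iPhone12']}
```

The quadratic scan over already-kept strings is what the real library optimizes away; the shape of the inputs and outputs is the same.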

🖥️ Command-line Interface

fast-dedupe also provides a command-line interface for deduplicating files:

# Basic usage
fastdedupe input.txt

# Save output to a file
fastdedupe input.txt -o deduplicated.txt

# Save duplicates mapping to a file
fastdedupe input.txt -o deduplicated.txt -d duplicates.json

# Adjust threshold
fastdedupe input.txt -t 90

# Keep longest string instead of first occurrence
fastdedupe input.txt --keep-longest

# Work with CSV files
fastdedupe data.csv -f csv --csv-column name

# Work with JSON files
fastdedupe data.json -f json --json-key text

📊 Performance Benchmarks

fast-dedupe is designed for speed and efficiency. Here are some benchmark results:

| Dataset Size | Threshold | Variation Level  | Time (s) | Unique Items | Duplicates |
|-------------:|----------:|------------------|---------:|-------------:|-----------:|
| 100          | 85        | 2 (minor typos)  | 0.015    | 63           | 37         |
| 500          | 85        | 2 (minor typos)  | 0.234    | 250          | 250        |
| 1000         | 85        | 2 (minor typos)  | 0.885    | 404          | 596        |
| 5000         | 85        | 2 (minor typos)  | 11.840   | 1329         | 3671       |

Benchmarks run on MacBook Pro M1, Python 3.13.2

Threshold Impact

The similarity threshold significantly affects both performance and results:

  • Lower threshold (70): More aggressive deduplication, faster processing
  • Medium threshold (85): Balanced approach, recommended for most cases
  • Higher threshold (95): More conservative, only very similar items matched
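To get a feel for where a given threshold lands, it helps to score a few representative pairs. The sketch below uses stdlib `difflib` on the same 0-100 scale; fast-dedupe's RapidFuzz scorer will give similar but not identical numbers:

```python
from difflib import SequenceMatcher

def score(a, b):
    # 0-100 similarity, on the same scale as the threshold parameter
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

pairs = [
    ("Apple iPhone 12", "Apple iPhone12"),  # minor spacing difference
    ("Samsung Galaxy", "Samsng Galaxy"),    # one-character typo
    ("MacBook Pro", "MacBook Air"),         # genuinely different products
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {score(a, b)}")
```

The first two pairs score in the mid-90s, while "MacBook Pro" vs "MacBook Air" still lands in the low 80s: a threshold of 70 would wrongly merge distinct products, while the default of 85 keeps them apart.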

👥 Target Audience

  • Data Engineers / Analysts: Cleaning large datasets before ETL, BI tasks, and dashboards
  • ML Engineers & Data Scientists: Cleaning datasets before training models to avoid bias and data leakage
  • Software Developers (CRM & ERP systems): Implementing deduplication logic without overhead
  • Analysts (E-commerce, Marketing): Cleaning and deduplicating product catalogs, customer databases

🛠️ Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

For more details, see CONTRIBUTING.md.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Performance

Fast-dedupe is designed for high-performance fuzzy string matching and deduplication. It builds on the RapidFuzz library for efficient string comparisons and adds several optimizations:

  • Efficient data structures: Uses sets and dictionaries for O(1) lookups
  • Parallel processing: Automatically uses multiple CPU cores for large datasets
  • Early termination: Optimized algorithms that avoid unnecessary comparisons
  • Memory efficiency: Processes data in chunks to reduce memory usage
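As an illustration of the efficient-data-structures and early-termination points: an exact-match prefilter backed by a set removes trivial duplicates in O(n) before any expensive fuzzy comparison runs. This is a sketch of the idea, and `prefilter_exact` is a hypothetical helper, not the library's internal code:

```python
def prefilter_exact(data):
    """Drop exact duplicates in a single O(n) pass with a set, so the
    expensive pairwise fuzzy comparisons only see unique strings."""
    seen = set()
    unique = []
    for item in data:
        key = item.casefold().strip()   # cheap normalisation
        if key not in seen:             # O(1) membership test
            seen.add(key)
            unique.append(item)
    return unique

print(prefilter_exact(["Apple", "apple", "Banana", "Banana "]))
# ['Apple', 'Banana']
```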

How Multiprocessing Works

Fast-dedupe automatically switches to parallel processing for datasets larger than 1,000 items. Here's how the multiprocessing implementation works:

  1. Data Chunking: The input dataset is divided into smaller chunks based on the number of available CPU cores
  2. Parallel Processing: Each chunk is processed by a separate worker process
  3. Result Aggregation: Results from all workers are combined into a final deduplicated dataset

┌─────────────────┐
│  Input Dataset  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Split Dataset  │
│   into Chunks   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────┐
│               Process Chunks                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Worker 1 │  │ Worker 2 │  │ Worker n │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │             │             │         │
│       ▼             ▼             ▼         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Results 1│  │ Results 2│  │ Results n│   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
└───────┼─────────────┼─────────────┼─────────┘
        │             │             │
        └─────────────┼─────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │ Combine Results │
             └────────┬────────┘
                      │
                      ▼
             ┌─────────────────┐
             │  Final Output   │
             │ (clean, dupes)  │
             └─────────────────┘

The parallel implementation provides near-linear speedup with the number of CPU cores, making it especially effective for large datasets. For example, on an 8-core system, you can expect up to 6-7x speedup compared to single-core processing.
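The three steps above can be sketched as a plain chunk → map → merge pipeline. This is an illustrative outline under simplifying assumptions (exact matching per chunk, hypothetical helper names), not fast-dedupe's actual implementation:

```python
from multiprocessing import Pool

def split_into_chunks(data, n_chunks):
    """Step 1: divide the dataset into roughly equal chunks."""
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedupe_chunk(chunk):
    """Step 2: each worker deduplicates its own chunk (exact matching
    here for brevity; the real library does fuzzy matching)."""
    seen, clean = set(), []
    for item in chunk:
        if item not in seen:
            seen.add(item)
            clean.append(item)
    return clean

def merge_results(chunk_results):
    """Step 3: combine worker outputs, removing duplicates that span
    chunk boundaries (a subtlety any parallel dedupe must handle)."""
    return dedupe_chunk([item for chunk in chunk_results for item in chunk])

if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "d"] * 100
    chunks = split_into_chunks(data, n_chunks=4)
    with Pool(processes=4) as pool:       # one worker per chunk
        partial = pool.map(dedupe_chunk, chunks)
    print(merge_results(partial))         # ['a', 'b', 'c', 'd']
```

The merge pass matters because two different chunks can each keep a copy of the same string; a final pass over the combined results resolves those cross-chunk duplicates.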

Performance Tuning

You can fine-tune the parallel processing behavior with the n_jobs parameter:

from fastdedupe import dedupe

# Automatic (uses all available cores)
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=None)

# Specify exact number of cores to use
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=4)

# Disable parallel processing
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=1)

For optimal performance:

  • Use n_jobs=None (default) to let fast-dedupe automatically determine the best configuration
  • For very large datasets (100,000+ items), consider using a specific number of cores (e.g., n_jobs=4) to avoid excessive memory usage
  • For small datasets (<1,000 items), parallel processing adds overhead and may be slower than single-core processing

Benchmarks

We've benchmarked fast-dedupe against other popular libraries for string deduplication:

  • pandas (with scikit-learn): TF-IDF vectorization and cosine similarity
  • fuzzywuzzy: A popular fuzzy string matching library

The benchmarks were run on various dataset sizes and similarity thresholds. Here's a summary of the results:

Benchmark Summary

Performance Comparison

| Dataset Size | fast-dedupe (s) | pandas (s) | fuzzywuzzy (s) | Speedup vs pandas | Speedup vs fuzzywuzzy |
|-------------:|----------------:|-----------:|---------------:|------------------:|----------------------:|
| 1,000        | 0.0521          | 0.3245     | 0.4872         | 6.23x             | 9.35x                 |
| 5,000        | 0.2873          | 2.8541     | 3.9872         | 9.93x             | 13.88x                |
| 10,000       | 0.6124          | 7.9872     | 11.2451        | 13.04x            | 18.36x                |

As the dataset size increases, the performance advantage of fast-dedupe becomes more significant. For large datasets (10,000+ items), fast-dedupe can be 10-20x faster than other libraries.

Run Your Own Benchmarks

You can run your own benchmarks to compare performance on your specific data:

# Install dependencies
pip install pandas scikit-learn fuzzywuzzy matplotlib tqdm

# Run benchmarks
python benchmarks/benchmark.py --sizes 100 500 1000 5000 --thresholds 70 80 90

The benchmark script will generate detailed reports and visualizations in the benchmark_results directory.
