A Python package for deduplicating data.
deduplicate_lib
Deduplication algorithms in Python.
Key Features
- Easy to use deduplication algorithms for any vector array
- Suite of tolerance tuning algorithms to help you find the right tolerance value for your system
- Suite of benchmarking tools to ensure rigor, accuracy, and speed (not yet implemented)
- Factory Plugin architecture, for easy extensibility and modification
Implemented Algorithms
- Distance Matrix (simple, accurate, expensive): Computes the distance matrix for all vectors and marks as duplicates any pair whose distance falls below a given tolerance.
- Multi Hashing (fast): Smears the vectors using a normal distribution, rounds them, and hashes each result. Duplicates are then identified by the proportion of hash clashes.
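To make the two ideas above concrete, here is a minimal NumPy-only sketch of each. It is not the library's implementation: the function names, the parameters (`tolerance`, `cell`, `smear`, `n_rounds`, `clash_fraction`), and the greedy "keep the first occurrence" rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def distance_matrix_unique(vectors, tolerance):
    """Indices kept by a greedy distance-matrix pass (hypothetical helper)."""
    vectors = np.asarray(vectors, dtype=float)
    # Full pairwise Euclidean distance matrix: O(n^2) time and memory.
    diff = vectors[:, None, :] - vectors[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    keep = []
    for i in range(len(vectors)):
        # Keep i unless it lies within `tolerance` of an already kept vector.
        if all(dist[i, j] >= tolerance for j in keep):
            keep.append(i)
    return keep


def multi_hash_unique(vectors, cell, smear, n_rounds=20, clash_fraction=0.5, seed=0):
    """Indices kept by a smear-round-hash pass (one possible reading of the idea)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    clashes = Counter()
    for _ in range(n_rounds):
        # Smear every vector with Gaussian noise, then snap it to a grid.
        noisy = vectors + rng.normal(scale=smear, size=vectors.shape)
        cells = np.round(noisy / cell).astype(np.int64)
        buckets = {}
        for i, row in enumerate(cells):
            buckets.setdefault(hash(row.tobytes()), []).append(i)
        # Every pair that shares a bucket counts as one hash clash.
        for members in buckets.values():
            clashes.update(combinations(members, 2))
    dropped = set()
    for (i, j), count in sorted(clashes.items()):
        # Pairs that clash in a large enough share of rounds are duplicates.
        if count / n_rounds >= clash_fraction and i not in dropped:
            dropped.add(j)
    return [i for i in range(len(vectors)) if i not in dropped]


# Two exact duplicate pairs plus one isolated vector.
data = np.array([[0.25, 0.25], [0.25, 0.25], [10.25, 10.25], [10.25, 10.25], [5.25, 5.25]])
print(distance_matrix_unique(data, tolerance=0.1))
print(multi_hash_unique(data, cell=1.0, smear=0.05))
```

The distance-matrix version compares every pair, which is why it is accurate but expensive. The hashing version only compares vectors that land in the same bucket, so its cost depends on how crowded the buckets are rather than on the full number of pairs.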
Quick Start
Install using pip:
pip install deduplicate_lib
Load your data into Python:
from deduplicate_lib.plugins.deduplication_algorithms.multi_hash import MultiHash
# define your parameters in the MultiHash object
dda = MultiHash(dataset_array=your_data_numpy_array)
# return a list of all unique values
print(dda.deduplicate())
A more detailed example can be seen in the examples directory
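If you do not have data at hand, a small synthetic array with known duplicates is enough to try the snippet above. The sketch below assumes the library accepts a 2-D NumPy array with one vector per row; that layout is an assumption, not something documented here, and the array sizes are arbitrary.

```python
import numpy as np

# Five random 3-D vectors (one vector per row is an assumed layout).
rng = np.random.default_rng(seed=0)
base = rng.random((5, 3))

# Append near-copies of the first two vectors so there is something to remove.
near_copies = base[:2] + 1e-9
your_data_numpy_array = np.vstack([base, near_copies])

print(your_data_numpy_array.shape)
```

The resulting `your_data_numpy_array` can then be passed as `dataset_array` in the Quick Start snippet.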
Dependencies
- Python 3.9+
- numpy
- numba
- scipy
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/julianholland/deduplicate.git
cd deduplicate
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
ruff format .
Running Tests
# Run all tests
pytest
# Run specific test categories
pytest tests/core/
pytest tests/plugins/
pytest tests/plugins/duplicate_detection_algorithms/distance_matrix
# Run with coverage
pytest --cov
📝 Citation
If you use deduplicate_lib in your research, please cite:
@software{deduplicate2026,
title={deduplicate_lib: Auto Tolerance Finding Deduplication Algorithms in Python},
author={Julian Holland},
year={2026},
url={https://github.com/julianholland/deduplicate},
version={0.0.5}
}
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- The Fritz Haber Institute
- Juan Manuel Lombardi <3
- Maximillion Ach
- Chiara Panosetti
Project To-Do
- Add example.ipynb
- Create general pre-allocation protocol
- Add benchmarks for time and robustness
- Add Locality-Sensitive Hashing as an option
- Speed up slow tasks with Numba
- Set up Read the Docs
- Create general deduplicate function
- Speed up NTPP