A Python package for deduplicating data.
deduplicate_lib
Deduplication algorithms in Python.
Key Features
- Easy-to-use deduplication algorithms for any array of vectors
- A suite of tolerance-tuning algorithms to help you find the right tolerance value for your system
- A suite of benchmarking tools to ensure rigor, accuracy, and speed (not yet implemented)
- Factory plugin architecture for easy extensibility and modification
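To illustrate what a factory plugin architecture can look like in general, here is a minimal registry-based sketch. It is not deduplicate_lib's actual plugin API: the names `register`, `create`, and `exact_duplicates` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Deduplicator = Callable[[List[Vector]], List[int]]

# Hypothetical registry mapping a plugin name to a deduplication function.
_REGISTRY: Dict[str, Deduplicator] = {}


def register(name: str):
    """Decorator that adds a deduplication function to the registry."""
    def decorator(func: Deduplicator) -> Deduplicator:
        _REGISTRY[name] = func
        return func
    return decorator


def create(name: str) -> Deduplicator:
    """Factory: look up a registered algorithm by name."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown algorithm: {name!r}") from None


@register("exact")
def exact_duplicates(vectors):
    """Keep the index of the first occurrence of each exactly repeated vector."""
    seen, kept = set(), []
    for i, vec in enumerate(vectors):
        key = tuple(vec)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


algo = create("exact")
print(algo([(1.0, 2.0), (1.0, 2.0), (3.0, 4.0)]))  # indices of unique vectors
```

With this kind of design, adding an algorithm means registering one more function; the calling code that asks the factory for an algorithm by name does not change.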
Implemented Algorithms
- Distance Matrix (simple, accurate, expensive): Computes the distance matrix for all vectors and flags as duplicates any pair whose distance falls below a given tolerance.
- Multi Hashing (fast): Smears each vector with normally distributed noise, rounds the result, and hashes it. Duplicates are then determined by the proportion of hash clashes between vectors.
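The distance-matrix approach can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not the package's implementation: the function name `unique_by_distance`, the choice of Euclidean distance, and the "first occurrence wins" rule are all assumptions made for this example.

```python
import math


def unique_by_distance(vectors, tolerance):
    """Return indices of vectors kept as unique (first occurrence wins)."""
    # Full pairwise Euclidean distance matrix: O(n^2) time and memory,
    # which is why this approach is accurate but expensive.
    dist = [[math.dist(a, b) for b in vectors] for a in vectors]
    kept = []
    for i in range(len(vectors)):
        # Vector i is a duplicate if it lies closer than `tolerance`
        # to any vector that has already been kept.
        if all(dist[i][j] >= tolerance for j in kept):
            kept.append(i)
    return kept


data = [(0.0, 0.0), (0.0, 0.005), (1.0, 1.0), (1.0, 1.0)]
print(unique_by_distance(data, tolerance=0.01))  # prints [0, 2]
```

Here the second vector lies 0.005 from the first and the fourth is identical to the third, so only indices 0 and 2 survive.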
Quick Start
Install using pip:
pip install deduplicate_lib
Load your data into Python, then pass it to a deduplication algorithm:
from deduplicate_lib.plugins.deduplication_algorithms.multi_hash import MultiHash

# Define your parameters in the MultiHash object.
# your_data_array is a placeholder for your own array of vectors.
dda = MultiHash(
    tolerance=0.01,
    dataset_array=your_data_array,
    perturbations=200,
)
print(dda.get_dataset_unique_structures())
A more detailed example is available in the examples directory.
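For intuition about what the `tolerance` and `perturbations` arguments might control, here is a standalone sketch of the multi-hashing idea described under Implemented Algorithms. It does not use deduplicate_lib and is not its implementation. It assumes that `tolerance` sets both the noise scale and the rounding grid, and that the same noise offsets are reused for every vector. The function names are hypothetical.

```python
import hashlib
import random


def multi_hash_signatures(vectors, tolerance, perturbations, seed=0):
    """Return one list of hashes per vector, one hash per perturbation."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Draw each noise offset once and reuse it for every vector, so that
    # identical vectors always land in identical grid cells.
    offsets = [[rng.gauss(0.0, tolerance) for _ in range(dim)]
               for _ in range(perturbations)]
    signatures = []
    for vec in vectors:
        sig = []
        for offset in offsets:
            # Smear the vector, then round it onto a grid of spacing `tolerance`.
            cell = tuple(round((x + o) / tolerance) for x, o in zip(vec, offset))
            digest = hashlib.blake2b(repr(cell).encode(), digest_size=8)
            sig.append(digest.hexdigest())
        signatures.append(sig)
    return signatures


def clash_fraction(sig_a, sig_b):
    """Proportion of perturbations for which two vectors share a hash."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


data = [(0.0, 0.0), (0.001, 0.0), (1.0, 1.0)]
sigs = multi_hash_signatures(data, tolerance=0.01, perturbations=200)
print(clash_fraction(sigs[0], sigs[1]))  # near-duplicates: expected to be high
print(clash_fraction(sigs[0], sigs[2]))  # distant vectors: no shared cells
```

A pair would then be treated as duplicates when its clash fraction exceeds some threshold. To get the speed benefit, the hashes would normally be bucketed in a dictionary instead of being compared pair by pair as in this sketch.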
Download files
Source Distribution
deduplicate_lib-0.0.1.tar.gz (9.5 kB)
Built Distribution
File details
Details for the file deduplicate_lib-0.0.1.tar.gz.
File metadata
- Download URL: deduplicate_lib-0.0.1.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | efc867953462394072f096cf81e6f0f5a9ba1760a22dd28d6034e57db670c160 |
| MD5 | 711dbd0666311c01169de20f61a9a4d7 |
| BLAKE2b-256 | cf1fb17d6db9fc41e99a3fb1807a24900ba1acdac64cdb17f18e9a394aab473e |
File details
Details for the file deduplicate_lib-0.0.1-py3-none-any.whl.
File metadata
- Download URL: deduplicate_lib-0.0.1-py3-none-any.whl
- Upload date:
- Size: 12.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a9462bb4a4948a25ce6cbe71090d0985e10e62acb7e55293bafa3c24630c555 |
| MD5 | 87dc28b20d9362528b2aa5d7c0c0e735 |
| BLAKE2b-256 | 5538491a9c59e88db0a6e97af42c1864a71e5fc7875d36bf4056740f04bc32bc |