EntropyHash: near-duplicate document detection algorithm
Entropy Hash
Entropy Hash is a high-performance algorithm for near-duplicate detection in text. It is a faster and more accurate alternative to SimHash.
- ✅ 23% more accurate than SimHash
- ⚡ 6.6× faster on synthetic benchmarks
- 🚀 Built on PyTorch for maximum speed and flexibility
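For context on the baseline being compared against, here is a minimal sketch of classic SimHash. It is not the Entropy Hash algorithm and not the SimHash implementation shipped in this repository. The whitespace tokenizer, the use of MD5, and the function names `simhash` and `hamming_distance` are assumptions made for illustration only.

```python
import hashlib


def simhash(text: str, num_bits: int = 64) -> int:
    """Classic SimHash fingerprint over whitespace tokens (illustrative baseline)."""
    if not 0 < num_bits <= 128:
        raise ValueError("num_bits must be in 1..128 because MD5 yields 128 bits")
    weights = [0] * num_bits
    mask = (1 << num_bits) - 1
    for token in text.lower().split():
        # Hash each token to a num_bits-wide integer (MD5 is an arbitrary choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        # Each token votes +1 or -1 on every bit position.
        for bit in range(num_bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is 1 when its accumulated vote is positive.
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("Deep learning is a subset of machine learning.")
b = simhash("Machine learning includes deep learning.")
c = simhash("Quantum computing is a different field.")
print("a vs b:", hamming_distance(a, b))
print("a vs c:", hamming_distance(a, c))
```

A smaller Hamming distance between two fingerprints suggests more similar documents. Where to place the cutoff depends on the data.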
Installation
You can install Entropy Hash using:
pip install entropy-hash
If installing locally from source:
git clone https://github.com/saeeddhqan/entropy_hash.git
cd entropy_hash
pip install -e .
Usage
Here's a quick example to get started:
import torch.nn.functional as F
from entropy_hash.pipeline.pipeline import EntropyHash
# Initialize the model (device="cuda" requires a CUDA-capable GPU)
entropy_hash = EntropyHash(device="cuda", num_bits=64)
# Example input
docs = [
    "Deep learning is a subset of machine learning.",
    "Machine learning includes deep learning.",
    "Quantum computing is a different field.",
]
# Get hashed vectors (PyTorch tensors) without binarization
vectors = entropy_hash.batch(docs, binarization=False)
# Example: compute cosine similarity between the first two documents
similarity = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print("Similarity:", similarity.item())
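To go from a single similarity score to near-duplicate detection over a whole batch, one option is to compare every pair of vectors and keep the pairs above a threshold. The sketch below uses only the standard library and toy vectors so it runs without a GPU. The helper names `cosine` and `near_duplicate_pairs` and the 0.9 threshold are hypothetical and are not part of the Entropy Hash API. With the library, the tensors from the example above would first need converting to lists, for example with `Tensor.tolist()`.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def near_duplicate_pairs(vectors, threshold=0.9):
    """Return (i, j, score) for every pair whose cosine similarity meets the threshold."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy stand-ins for hashed document vectors (not real Entropy Hash output).
toy_vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.0, 1.0, 0.0],
]
pairs = near_duplicate_pairs(toy_vectors, threshold=0.9)
for i, j, score in pairs:
    print(f"documents {i} and {j} look like near-duplicates (cosine={score:.3f})")
```

The all-pairs loop is quadratic in the number of documents, so it is only suitable for small batches. The threshold is a tuning parameter that should be chosen on labelled data.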
Reproduce results
git clone https://github.com/saeeddhqan/entropy_hash
cd entropy_hash
apt install libssl-dev
gcc -shared -o entropy_hash/simhash/simhash/libsimhash_parallel.so -fPIC -fopenmp entropy_hash/simhash/simhash/simhash_parallel.c -lcrypto
pip install -r requirements.txt
python -m entropy_hash.benchmark.synthetic_bench
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Citation
If you use this library in your research, please cite it as:
@misc{entropyhash2025,
title={EntropyHash: near duplicate detection algorithm},
author={Saeed Dehqan},
year={2025},
howpublished={\url{https://github.com/saeeddhqan/entropy_hash}},
}