Akin is a Python library for detecting near-duplicate texts using min-hashing and locality sensitive hashing.
Akin
Python library for detecting near-duplicate texts in a corpus at scale using Locality Sensitive Hashing, adapted from the algorithm described in chapter three of [Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).
This algorithm identifies similar texts in a corpus by efficiently estimating their Jaccard similarity with sub-linear time complexity. It can be used to detect near-duplicate texts at scale or to locate different versions of a document.
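To make the idea of estimating Jaccard similarity concrete, here is a minimal pure-Python sketch. It is not akin's implementation: the helper names (`shingles`, `jaccard`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and salted BLAKE2b hashes are assumed as a stand-in for random permutations. It compares the exact Jaccard similarity of two character 9-gram sets with the fraction of matching MinHash signature positions.

```python
import hashlib


def shingles(text, n_gram=9):
    """Return the set of character n-grams (shingles) in ``text``."""
    if len(text) < n_gram:
        return {text}
    return {text[i:i + n_gram] for i in range(len(text) - n_gram + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, permutations=100, seed=3):
    """Keep the minimum 64-bit hash per salted hash function.

    Assumption: each salted hash stands in for one random permutation.
    """
    signature = []
    for i in range(permutations):
        salt = f"{seed}-{i}-".encode()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two texts agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


text_a = 'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium'
text_b = 'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium'

set_a, set_b = shingles(text_a), shingles(text_b)
exact = jaccard(set_a, set_b)
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact:.2f} estimate={estimate:.2f}")
```

The estimate converges on the exact value as the number of hash functions grows, while each comparison touches only the fixed-length signatures rather than the full shingle sets.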
Installation
Install from PyPI using pip:
python3 -m pip install akin
API Documentation
See the API documentation for the full API reference and usage guide.
Quick Start Example
from akin import UniMinHash, LSH
content = [
'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
'Jupiter moving out of the inner Solar System would have allowed the formation of inner '
'planets.',
'A helium atom has about four times as much mass as a hydrogen atom, so the composition '
'changes when described as the proportion of mass contributed by different atoms.',
'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium',
'A helium atom has about four times as much mass as a hydrogen atom and the composition '
'changes when described as a proportion of mass contributed by different atoms.',
'Theoretical models indicate that if Jupiter had much more mass than it does at present, it '
'would shrink.',
'This process causes Jupiter to shrink by about 2 cm each year.',
'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium',
'The Great Red Spot is large enough to accommodate Earth within its boundaries.'
]
# One label per text, numbered from 1 to len(content).
labels = list(range(1, len(content) + 1))
# Generate MinHash signatures.
minhash = UniMinHash(n_gram=9, permutations=100, hash_bits=64, seed=3)
signatures = minhash.transform(content)
# Create LSH model.
lsh = LSH(permutations=minhash.permutations)
lsh.update(signatures, labels)
# Query to find near duplicates for text 1.
print(lsh.query(1, min_jaccard=0.5))
>>> [8, 4]
# Generate minhash signature and add new texts to LSH model.
new_text = [
'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
'Jupiter moving out of the inner Solar System would have allowed the formation of '
'inner planets.'
]
new_labels = ['doc1', 'doc2']
# Reuse the same UniMinHash instance so the new signatures are comparable.
new_signatures = minhash.transform(new_text)
lsh.update(new_signatures, new_labels)
# Remove text and label from model.
lsh.remove(5)
# Return adjacency list for all similar texts.
adjacency_list = lsh.adjacency_list(min_jaccard=0.55)
print(adjacency_list)
>>> {
1: ['doc1', 4], 2: ['doc2'], 3: [], 4: [1, 'doc1'], 6: [],
7: [], 8: [], 9: [], 'doc1': [8, 1, 4], 'doc2': [2]
}
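The query and adjacency-list results above rely on locality sensitive hashing to avoid comparing every pair of signatures. The following sketch shows the banding technique from chapter three of Mining Massive Datasets in plain Python. It is an illustration under assumptions, not akin's internals: `candidate_pairs` and `candidate_probability` are hypothetical helpers, and the toy signatures are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands):
    """Return label pairs whose signatures match exactly in at least one band.

    ``signatures`` maps a label to an equal-length list of MinHash values.
    """
    length = len(next(iter(signatures.values())))
    if length % bands:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = length // bands
    pairs = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        # Texts whose rows are identical within this band share a bucket.
        for label, sig in signatures.items():
            buckets[tuple(sig[start:start + rows])].append(label)
        for bucket_labels in buckets.values():
            pairs.update(combinations(bucket_labels, 2))
    return pairs


def candidate_probability(similarity, bands, rows):
    """Chance that two texts with this Jaccard similarity share some band."""
    return 1 - (1 - similarity ** rows) ** bands


# Hypothetical toy signatures: "a" and "b" agree in the first and last band.
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 6],
    "c": [7, 7, 7, 7, 7, 7],
}
print(candidate_pairs(toy, bands=3))  # {('a', 'b')}
print(round(candidate_probability(0.8, bands=20, rows=5), 4))
```

More bands with fewer rows each make the model more permissive (more candidate pairs, fewer misses), while fewer bands with more rows make it stricter. Only candidate pairs then need their similarity checked against a threshold such as `min_jaccard`.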
File details
Details for the file akin-1.0.1.tar.gz.
File metadata
- Download URL: akin-1.0.1.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18009515df9a3059e1b0c852ee0798c75a97d9692411dd1469a130d78bf78d89 |
| MD5 | 86bd4c2b22da531a5d444714ead0137e |
| BLAKE2b-256 | c00db25bc06c2c6ce98829563b69050b261f908f53aa802a756a2c08e828fed4 |
File details
Details for the file akin-1.0.1-py3-none-any.whl.
File metadata
- Download URL: akin-1.0.1-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c9a518f72c03cc8d2d2f1d8f030e71e99fd84297aba81061775045057ccb4ffe |
| MD5 | 5d3591e856f60cd23baaf94c30b9b21f |
| BLAKE2b-256 | 25803a7e1e7eed4961c61507e979db6154ad5ec64262e3d76c25e64b840f29f3 |