Akin
Python library for detecting near-duplicate texts in a corpus at scale using Locality Sensitive Hashing, as described in chapter three of Mining of Massive Datasets. The algorithm identifies similar texts efficiently by estimating their Jaccard similarity in sub-linear time, which makes it useful for detecting near duplicates at scale or locating different versions of a document.
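The quantity being estimated is the Jaccard similarity between the texts' sets of shingles. As a minimal, library-independent sketch, two near-duplicate sentences can be shingled and compared directly (the helper below is illustrative, not part of the library):
def shingles(text, n=9):
    # Break text into overlapping character n-grams (shingles).
    return {text[i:i + n] for i in range(len(text) - n + 1)}
a = shingles('Jupiter is primarily composed of hydrogen with a quarter of its mass being helium')
b = shingles('Jupiter is mostly composed of hydrogen with a quarter of its mass being helium')
# Exact Jaccard similarity: intersection size over union size of the shingle sets.
print(len(a & b) / len(a | b))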
Example Usage
from akin import MinHash, LSH
content = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of inner planets.',
    'A helium atom has about four times as much mass as a hydrogen atom, so the composition changes '
    'when described as the proportion of mass contributed by different atoms.',
    'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium',
    'A helium atom has about four times as much mass as a hydrogen atom and the composition changes '
    'when described as a proportion of mass contributed by different atoms.',
    'Theoretical models indicate that if Jupiter had much more mass than it does at present, it '
    'would shrink.',
    'This process causes Jupiter to shrink by about 2 cm each year.',
    'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium',
    'The Great Red Spot is large enough to accommodate Earth within its boundaries.'
]
# Labels for each text in content.
content_labels = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Create MinHash object.
minhash = MinHash(content, n_gram=9, permutations=100, hash_bits=64, seed=3)
# Create LSH model.
lsh = LSH(minhash, content_labels, no_of_bands=50)
# Query to find near duplicates for text 1.
print(lsh.query(1, min_jaccard=0.5))
>>> [8, 4]
# Generate minhash signature and add new texts to LSH model.
new_text = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of '
    'inner planets.'
]
new_labels = ['doc1', 'doc2']
new_minhash = MinHash(new_text, n_gram=9, permutations=100, hash_bits=64, seed=3)
lsh.update(new_minhash, new_labels)
# Check contents of documents.
print(lsh.contains())
>>> [1, 2, 3, 4, 5, 6, 7, 8, 9, 'doc1', 'doc2']
# Remove text and label from model.
lsh.remove(5)
print(lsh.contains())
>>> [1, 2, 3, 4, 6, 7, 8, 9, 'doc1', 'doc2']
# Return adjacency list for all similar texts.
adjacency_list = lsh.adjacency_list(min_jaccard=0.55)
print(adjacency_list)
>>> {
1: ['doc1', 4], 2: ['doc2'], 3: [], 4: [1, 'doc1'], 6: [],
7: [], 8: [], 9: [], 'doc1': [1, 4], 'doc2': [2]
}
API Guide
MinHash
Creates a MinHash object containing a matrix of minhash signatures, one per text.
MinHash Parameters
MinHash(text, n_gram=9, n_gram_type='char', permutations=100, hash_bits=64, method='multi_hash', seed=None)
text
{list or ndarray}
Iterable containing the text strings of the corpus, one string per text.
n_gram
int, optional, default: 9
Size of each overlapping text shingle to break text into prior to hashing. Shingle size should be chosen carefully based on average text length, as too small a shingle size will yield false similarities, whereas too large a shingle size will fail to return similar documents.
n_gram_type
str, optional, default: 'char'
Type of n-gram to use for shingles: must be 'char' to split text into character shingles or 'term' to split text into overlapping sequences of words.
permutations
int, optional, default: 100
Number of randomly sampled hash values to use for generating each text's minhash signature. Intuitively, the larger the number of permutations, the more accurate the estimated Jaccard similarity between texts, but the longer the algorithm will take to run.
hash_bits
int, optional, default: 64
Hash value size used to generate minhash signatures from shingles; must be 32, 64 or 128 bit. Hash size should be chosen based on text length and the trade-off between performance and accuracy: smaller hash sizes risk false hash collisions, leading to false similarities between documents in larger corpora.
method
str, optional, default: 'multi_hash'
Method for random sampling via hashing; must be 'multi_hash' or 'k_smallest_values'.
If 'multi_hash' is selected, texts are hashed once per permutation and the minimum hash value is selected each time to construct a signature.
If 'k_smallest_values' is selected, each text is hashed once and the k smallest values are selected for k permutations. This method is much faster than 'multi_hash' but far less stable.
seed
int, optional, default: None
Seed from which to generate the random hash functions, necessary for reproducibility or to allow the LSH model to be updated with new minhash values later (see the sketch after this parameter list).
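As a brief sketch of how these parameters fit together (the texts are taken from the usage example above, and it is assumed here that n_gram counts terms when n_gram_type='term'):
from akin import MinHash
# Word-level shingles of three terms each; a fixed seed so that signatures built
# later with the same parameters can be added to an existing LSH model via update().
docs = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium'
]
term_minhash = MinHash(docs, n_gram=3, n_gram_type='term', permutations=100, hash_bits=64, seed=3)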
MinHash Properties
n_gram: int
.n_gram
Returns size of each overlapping text shingle used to create minhash signatures.
n_gram_type: str
.n_gram_type
Returns type of n-gram used for text shingling.
permutations: int
.permutations
Returns number of permutations used to create signatures.
hash_bits: int
.hash_bits
Returns hash value size used to create signatures.
method: str
.method
Returns hashing method used in minhash function.
seed: int
.seed
Returns seed value used to generate random hashes in minhash function.
signatures: numpy.array
.signatures
Returns matrix of text signatures generated by minhash function.
n = number of texts (rows), m = number of permutations (columns).
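For instance, using the minhash object from the usage example above, the signature matrix has one row per text and one column per permutation (a brief sketch; the shape assumes the nine example texts and 100 permutations):
print(minhash.permutations)
>>> 100
print(minhash.signatures.shape)
>>> (9, 100)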
LSH
Creates an LSH model of text similarity that can be used to return similar texts based on estimated Jaccard similarity.
LSH Parameters
LSH(minhash=None, labels=None, no_of_bands=None)
minhash
optional, default: None
Minhash object containing minhash signatures returned by MinHash class.
labels
{list or ndarray}, optional, default: None
List, array or Pandas series containing a unique label for each text in the minhash object's signature matrix. Labels should be provided in the same order as the texts passed to the MinHash class. Example labels include filepaths and database ids.
no_of_bands
optional, default: permutations // 2
Number of bands to break each minhash signature into before hashing into buckets. A smaller number of bands results in a stricter algorithm, requiring a larger portion of two signatures to match, possibly leading to false negatives that miss some similar texts, whereas a higher number may lead to false similarities (see the sketch after this parameter list).
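As a rough guide to choosing this value, the standard LSH banding analysis (Mining of Massive Datasets, chapter three) gives the probability that two texts with Jaccard similarity s share at least one bucket as 1 - (1 - s^r)^b, where b is the number of bands and r = permutations // no_of_bands is the number of rows per band. A minimal, library-independent sketch of that curve:
def candidate_probability(s, permutations=100, no_of_bands=50):
    # Probability that two texts with Jaccard similarity s share at least one bucket.
    rows_per_band = permutations // no_of_bands
    return 1 - (1 - s ** rows_per_band) ** no_of_bands
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s), 3))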
LSH Methods
update
Updates model from a MinHash object containing signatures generated from new texts and their corresponding labels.
.update(minhash, new_labels)
minhash: MinHash object containing signatures of new texts, parameters must match any previous MinHash objects.
new_labels: List, array or Pandas series containing text labels.
query
Takes a label and returns the labels of any similar texts.
.query(label, min_jaccard=None, sensitivity=1)
label: Label of text to return list of similar texts for.
min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
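For example, a stricter query against the model built in the usage example above might look like the following (a sketch; the labels returned depend on the corpus and thresholds chosen):
# Only return texts that share at least two buckets and exceed an estimated
# Jaccard similarity of 0.6.
print(lsh.query(1, min_jaccard=0.6, sensitivity=2))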
remove
Removes a label and its minhash signature from the model.
.remove(label)
label: Label of text to remove from LSH model.
contains
Returns list of labels contained in the model.
.contains()
adjacency_list
Returns an adjacency list that can be used to create a text similarity graph.
.adjacency_list(min_jaccard=None, sensitivity=1)
min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
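The adjacency list returned in the usage example above can be flattened into an edge list for a similarity graph, for example (a minimal sketch in plain Python; a graph library such as networkx would work equally well):
# Convert the adjacency list into a set of undirected edges between similar texts.
edges = set()
for label, neighbours in adjacency_list.items():
    for neighbour in neighbours:
        edges.add(tuple(sorted((str(label), str(neighbour)))))
print(edges)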
LSH Properties
no_of_bands: int
.no_of_bands
Number of bands used in LSH model.
permutations: int
.permutations
Number of permutations used to create minhash signatures used in LSH model.