Akin is a Python library for detecting near-duplicate texts using min-hashing and locality sensitive hashing.

Akin

License: MIT

Python library for detecting near-duplicate texts in a corpus at scale using Locality Sensitive Hashing (LSH), adapted from the algorithm described in chapter three of [Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).

The algorithm finds similar texts by estimating their Jaccard similarity from compact MinHash signatures and using LSH to retrieve likely matches, so a query does not have to compare a text against every other text in the corpus (sub-linear time complexity). This can be used to detect near-duplicate texts at scale or to locate different versions of a document.
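
To make the idea of estimating Jaccard similarity concrete, here is a minimal standard-library sketch that does not use akin itself. It assumes texts are split into overlapping character shingles of length 9 (an assumption about what `n_gram=9` means), and the helper names `shingles`, `jaccard`, `minhash_signature` and `estimate_jaccard` are hypothetical. Salted BLAKE2b hashes stand in for the random permutations described in the Mining Massive Datasets chapter; akin's own hashing scheme may differ.

```python
import hashlib


def shingles(text: str, n: int = 9) -> set:
    """Return the set of overlapping character n-grams in text."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set, permutations: int = 256) -> list:
    """Keep the minimum hash value per salted hash function.

    Each salt plays the role of one random permutation (illustrative choice).
    """
    signature = []
    for i in range(permutations):
        salt = i.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = shingles('Jupiter is primarily composed of hydrogen with a quarter of its mass being helium')
b = shingles('Jupiter is primarily composed of hydrogen and a quarter of its mass being helium')
c = shingles('The Great Red Spot is large enough to accommodate Earth within its boundaries.')

exact_ab = jaccard(a, b)
approx_ab = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact_ab:.3f} estimated={approx_ab:.3f} unrelated={jaccard(a, c):.3f}")
```

The estimate gets closer to the exact value as the number of hash functions grows, at the cost of longer signatures.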

Installation

Install from PyPI using pip:

python3 -m pip install akin

API Documentation

See the API documentation for the full API reference and usage guide.

Quick Start Example

from akin import UniMinHash, LSH

content = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of inner '
    'planets.',
    'A helium atom has about four times as much mass as a hydrogen atom, so the composition '
    'changes when described as the proportion of mass contributed by different atoms.',
    'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium',
    'A helium atom has about four times as much mass as a hydrogen atom and the composition '
    'changes when described as a proportion of mass contributed by different atoms.',
    'Theoretical models indicate that if Jupiter had much more mass than it does at present, it '
    'would shrink.',
    'This process causes Jupiter to shrink by about 2 cm each year.',
    'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium',
    'The Great Red Spot is large enough to accommodate Earth within its boundaries.'
]

# One label per text, numbered from 1 to len(content).
labels = list(range(1, len(content) + 1))

# Generate MinHash signatures.
minhash = UniMinHash(n_gram=9, permutations=100, hash_bits=64, seed=3)
signatures = minhash.transform(content)

# Create LSH model.
lsh = LSH(permutations=minhash.permutations)
lsh.update(signatures, labels)

# Query to find near duplicates for text 1.
print(lsh.query(1, min_jaccard=0.5))
# Output: [8, 4]

# Generate minhash signature and add new texts to LSH model.
new_text = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of '
    'inner planets.'
]

new_labels = ['doc1', 'doc2']

# Reuse the same UniMinHash instance so the new signatures share its n_gram,
# permutations, hash_bits and seed settings.
new_signatures = minhash.transform(new_text)

lsh.update(new_signatures, new_labels)

# Remove text and label from model.
lsh.remove(5)

# Return adjacency list for all similar texts.
adjacency_list = lsh.adjacency_list(min_jaccard=0.55)
print(adjacency_list)
# Output:
# {
#     1: ['doc1', 4], 2: ['doc2'], 3: [], 4: [1, 'doc1'], 6: [],
#     7: [], 8: [], 9: [], 'doc1': [8, 1, 4], 'doc2': [2]
# }
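
As background on how the number of permutations relates to a similarity threshold such as `min_jaccard`, the following sketch evaluates the banding formula from chapter three of Mining Massive Datasets. It is not a description of akin's internals: the function names are hypothetical, and the split of 100 permutations into 20 bands of 5 rows is only an illustrative assumption.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two texts with the given Jaccard similarity
    agree on every row of at least one band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


# Illustrative split: 100 permutations = 20 bands x 5 rows.
bands, rows = 20, 5
print(f"threshold is roughly {approximate_threshold(bands, rows):.2f}")
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"similarity {s:.1f}: candidate probability {candidate_probability(s, bands, rows):.4f}")
```

More rows per band push the threshold higher and reduce false positives, while more bands lower it and reduce false negatives.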
