Akin is a Python library for detecting near-duplicate texts using min-hashing and locality sensitive hashing.

Akin

License: MIT

Python library for detecting near-duplicate texts in a corpus at scale using locality sensitive hashing (LSH), adapted from the algorithm described in chapter three of [Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).

The algorithm identifies similar texts in a corpus by efficiently estimating their Jaccard similarity with sub-linear time complexity. It can be used to detect near-duplicate texts at scale or to locate different versions of a document.
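To make the idea of estimating Jaccard similarity concrete, here is a minimal, standard-library-only sketch of MinHash over character n-gram shingles. It is an illustration of the general technique, not akin's implementation: the helper names (`shingles`, `jaccard`, `minhash_signature`, `estimate_jaccard`) and the use of salted BLAKE2b as the family of hash functions are assumptions made for this example.

```python
import hashlib


def shingles(text: str, n: int = 9) -> set[str]:
    """Return the set of overlapping character n-grams in `text`.

    Texts shorter than `n` characters produce an empty set.
    """
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set[str], permutations: int = 100) -> list[int]:
    """Keep the minimum hash value of `items` under each salted hash function.

    Hypothetical helper: each salt stands in for one random permutation.
    `items` must be non-empty.
    """
    signature = []
    for seed in range(permutations):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


text_a = "Jupiter is primarily composed of hydrogen with a quarter of its mass being helium"
text_b = "Jupiter is primarily composed of hydrogen and a quarter of its mass being helium"

set_a, set_b = shingles(text_a), shingles(text_b)
exact = jaccard(set_a, set_b)
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact:.2f} estimate={estimate:.2f}")
```

Comparing fixed-length signatures instead of full shingle sets is what keeps pairwise comparison cheap; more permutations reduce the variance of the estimate at the cost of larger signatures.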

Installation

Install from PyPI using pip:

python3 -m pip install akin

API Documentation

See the API documentation for the full API reference and usage guide.

Quick Start Example

from akin import UniMinHash, LSH

content = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of inner '
    'planets.',
    'A helium atom has about four times as much mass as a hydrogen atom, so the composition '
    'changes when described as the proportion of mass contributed by different atoms.',
    'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium',
    'A helium atom has about four times as much mass as a hydrogen atom and the composition '
    'changes when described as a proportion of mass contributed by different atoms.',
    'Theoretical models indicate that if Jupiter had much more mass than it does at present, it '
    'would shrink.',
    'This process causes Jupiter to shrink by about 2 cm each year.',
    'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium',
    'The Great Red Spot is large enough to accommodate Earth within its boundaries.'
]

# One label per text, numbered from 1 to len(content).
labels = list(range(1, len(content) + 1))

# Generate MinHash signatures.
minhash = UniMinHash(n_gram=9, permutations=100, hash_bits=64, seed=3)
signatures = minhash.transform(content)

# Create LSH model.
lsh = LSH(permutations=minhash.permutations)
lsh.update(signatures, labels)

# Query to find near duplicates for text 1.
print(lsh.query(1, min_jaccard=0.5))
# Output: [8, 4]

# Generate minhash signature and add new texts to LSH model.
new_text = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of '
    'inner planets.'
]

new_labels = ['doc1', 'doc2']

# Reuse the MinHash object created above so the new signatures share its settings.
new_signatures = minhash.transform(new_text)

lsh.update(new_signatures, new_labels)

# Remove text and label from model.
lsh.remove(5)

# Return adjacency list for all similar texts.
adjacency_list = lsh.adjacency_list(min_jaccard=0.55)
print(adjacency_list)
# Output:
# {
#     1: ['doc1', 4], 2: ['doc2'], 3: [], 4: [1, 'doc1'], 6: [],
#     7: [], 8: [], 9: [], 'doc1': [8, 1, 4], 'doc2': [2]
# }
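
The quick start hands the MinHash signatures to an LSH model and then filters results by `min_jaccard`. In the banding scheme from chapter three of Mining Massive Datasets, a signature is split into bands of several rows each, and two texts become candidates if they agree on every row of at least one band. The sketch below computes the resulting candidate probability. How akin actually splits its signatures is not stated here, so `bands` and `rows` are hypothetical values chosen only so that `bands * rows` equals the 100 permutations used above.

```python
def candidate_probability(jaccard: float, bands: int, rows: int) -> float:
    """Probability that two texts share at least one identical band.

    With Jaccard similarity s, a single band of `rows` rows matches with
    probability s**rows, so at least one of `bands` bands matches with
    probability 1 - (1 - s**rows)**bands.
    """
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Similarity near which the S-curve rises most steeply: (1/bands)**(1/rows)."""
    return (1.0 / bands) ** (1.0 / rows)


# Hypothetical split of a 100-value signature: 20 bands of 5 rows.
BANDS, ROWS = 20, 5

for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"similarity={s:.1f} -> candidate probability={candidate_probability(s, BANDS, ROWS):.3f}")

print(f"approximate threshold: {approximate_threshold(BANDS, ROWS):.3f}")
```

More bands with fewer rows each lower the threshold and catch more loosely similar pairs, at the cost of more false candidates to verify.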
