A simhash module for Python, written in C++

simhash

https://pypi.python.org/pypi/pysimhash
https://github.com/skiloop/simhash/actions?query=workflow%3ACodeQL

A C++ implementation of simhash, packaged as a Python module. It supports large fingerprint dimensions, such as 128 bits.

install

pip install pysimhash

Or install from source on GitHub:

git clone https://github.com/skiloop/simhash
cd simhash
python setup.py install

requirements

  • boost-python

how to use

example:

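import hashlib

import pysimhash

document = "google.com hybridtheory.com youtube.com reddit.com"
# hash each space-separated token with MD5, keeping the hex digest string
tokens = [hashlib.md5(s.encode('utf-8')).hexdigest() for s in document.split(" ")]
s2 = pysimhash.SimHash(128, 16)  # f=128, hash_bit=16
# the tokens are passed as base-16 strings
s2.build(tokens, base=16)
# print the fingerprint in hexadecimal
print(s2.hex())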
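
For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of how a simhash fingerprint is commonly built from hashed tokens. It is not the C++ code inside pysimhash and is not guaranteed to produce the same output; the function name simhash_sketch and the equal weighting of every token are assumptions made for illustration.

```python
import hashlib


def simhash_sketch(hex_tokens, f=128):
    """Build an f-bit simhash from hex-encoded token hashes (illustrative only)."""
    counts = [0] * f  # one signed vote counter per fingerprint bit
    mask = (1 << f) - 1
    for token in hex_tokens:
        value = int(token, 16) & mask  # parse the base-16 token hash
        for bit in range(f):
            # vote +1 when the token hash has this bit set, otherwise -1
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # a fingerprint bit is 1 only when its votes sum to a positive number
    fingerprint = 0
    for bit in range(f):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


document = "google.com hybridtheory.com youtube.com reddit.com"
tokens = [hashlib.md5(s.encode("utf-8")).hexdigest() for s in document.split(" ")]
print(format(simhash_sketch(tokens), "032x"))
```

Because each bit is decided by a majority vote over all tokens, documents that share most of their tokens tend to end up with fingerprints that differ in only a few bits.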

benchmark

With 10,000 fingerprint builds and 100,000 comparisons (using benchmark.py) on the same Linux machine, the results are as follows:

implementation   build time   comparison time
pure python      1.73s        222.99s
pysimhash        0.14s        49.89s
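
Simhash fingerprints are conventionally compared by Hamming distance, the number of bit positions in which they differ, and this is presumably what the comparison step measures. The sketch below works on hexadecimal strings such as the one printed by s2.hex() in the usage example; the helper name hamming_distance is hypothetical and is not part of the pysimhash API.

```python
def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two hex-encoded fingerprints."""
    # XOR leaves a 1 in every position where the two fingerprints disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(hamming_distance("ff00", "0f00"))  # 4 bits differ
```

A small distance relative to the fingerprint width suggests near-duplicate documents; the threshold to use depends on the data.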

Download files

Source Distribution

pysimhash-1.1.1.tar.gz (8.5 kB)
