Skip to main content

Fast Hamming distance calculation for hexadecimal strings

Project description

Pip Prs Github

What does it do?

This module performs a fast bitwise hamming distance of two hexadecimal strings.

This looks like:

DEADBEEF = 11011110101011011011111011101111
00000000 = 00000000000000000000000000000000
XOR      = 11011110101011011011111011101111
Hamming  = number of 1s in XOR = 24

This essentially amounts to

>>> import gmpy
>>> gmpy.popcount(0xdeadbeef ^ 0x00000000)
24

except with Python strings, so

>>> import gmpy
>>> gmpy.popcount(int('deadbeef', 16) ^ int('00000000', 16))
24

A few assumptions are made and enforced:

  • this is a valid hexadecimal string (i.e., [a-fA-F0-9]+)

  • the strings are the same length

  • the strings do not begin with '0x'

Why yet another Hamming distance library?

There are a lot of fantastic (python) libraries that offer methods to calculate various edit distances, including Hamming distances: Distance, textdistance, scipy, jellyfish, etc.

In this case, I needed a hamming distance library that worked on hexadecimal strings (i.e., a Python str) and performed blazingly fast. Furthermore, I often did not care about hex strings greater than 256 bits. That length constraint is different vs all the other libraries and enabled me to explore vectorization techniques via numba, numpy, and even SSE/AVX.

Lastly, I wanted to minimize dependencies, meaning you do not need to install numpy, gmpy, cython, pypy, pythran, etc.

Eventually, after playing around with gmpy.popcount, numba.jit, pythran.run, numpy, and AVX2, I decided to write what I wanted in a raw C header. At this point, I’m using raw char* and int*, so exploring re-writing this in Fortran makes little sense. Vectorization techniques also ended up adding more overhead due to converting the hex strings to vector-register-ingestible floats from char*.

Installation

To install, ensure you have Python 2.7 or 3.4+. Run:

pip install hexhamming

or to install from source:

git clone https://github.com/mrecachinas/hexhamming
cd hexhamming
python setup.py install # or pip install .

If you want to contribute to hexhamming, you should install the dev dependencies:

pip install -r requirements-dev.txt

and make sure the tests pass with:

pytest # or tox -e py27,...

Example

To use the base C++ extension, you can simply run

>>> from hexhamming import hamming_distance
>>> hamming_distance('deadbeef', '00000000')
24

Benchmark

Below is a benchmark using pytest-benchmark on my early 2016 1.2 GHz Intel m5 8 GB 1867 MHz LPDDR3 macOS Mojave (10.14.3) with Python 2.7.15 and clang-1000.11.45.5.

https://github.com/mrecachinas/hexhamming/blob/master/docs/benchmark.png?raw=true

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hexhamming-1.2.0.tar.gz (5.7 kB view hashes)

Uploaded Source

Built Distributions

hexhamming-1.2.0-cp37-cp37m-macosx_10_14_x86_64.whl (5.9 kB view hashes)

Uploaded CPython 3.7m macOS 10.14+ x86-64

hexhamming-1.2.0-cp27-cp27m-macosx_10_13_x86_64.whl (5.4 kB view hashes)

Uploaded CPython 2.7m macOS 10.13+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page