A Python wrapper for the BBHash Minimal Perfect Hash Function

Project description

pybbhash

This is a Python (Cython) wrapper for the BBHash codebase for building minimal perfect hash functions.

Right now, this is supporting k-mer-based hashing needs from spacegraphcats, using hash values generated (mostly) by murmurhash, e.g. from khmer's Nodetable and sourmash hashing. As such, I am focused on building MPHF for 64-bit hashes and am wrapping only that bit of the interface; the rest should be ~straightforward (hah!).

I've also added a Python-accessible "values table", BBHashTable, in the bbhash_table module. This is a table that supports a dictionary-like feature where you can associate a hash with a value, and then query the table with the hash to retrieve the value. The only tricky bit here is that unlike the bbhash module, this table supports queries with hashes that are not in the MPHF.

Thoughts for further improvement.

I would like to be able to use generic Python iterators in the PyMPHF construction. Right now there is a round of memory-inefficient copying of hashes, which is bad when you have a lot of k-mers!
I would like to be able to save to/load from strings, not just files.

I also need to investigate thread safety.

Usage

Usage of core bbhash functionality:

import bbhash

# some collection of 64-bit (or smaller) hashes
uint_hashes = [10, 20, 50, 80]

num_threads = 1 # hopefully self-explanatory :)
gamma = 1.0     # internal gamma parameter for BBHash

mph = bbhash.PyMPHF(uint_hashes, len(uint_hashes), num_threads, gamma)

for val in uint_hashes:
    print('{} now hashes to {}'.format(val, mph.lookup(val)))

# can also use 'mph.save(filename)' and 'mph = bbhash.load_mphf(filename)'.

Usage of BBHashTable

import random
from collections import defaultdict
from bbhash_table import BBHashTable

all_hashes = [ random.randint(100, 2**32) for i in range(200) ]
half_hashes = all_hashes[:100]

table = BBHashTable()

# hash the first 100 of the hashes
table.initialize(half_hashes)

# store associated values
for hashval, value in zip(half_hashes, [ 1, 2, 3, 4, 5 ] *20):
   table[hashval] = value
   
# retrieve & count for all (which will include hashes not in MPHF)
d = defaultdict(int)
for hashval in all_hashes:
   value = table[hashval]
   d[value] += 1

assert d[1] == 20
assert d[None] == 100

The last for loop can be done quickly, in Cython, using

d = table.get_unique_values(all_hashes)

Motivation: the table is a useful way to (just for one hypothetical example :) store a mapping from k-mers to compact De Bruijn graph node IDs. (We use this in several places in spacegraphcats!)

CTB Oct 2020

Project details

Release history Release notifications | RSS feed

This version

0.5.4

Oct 25, 2020

0.5.3

Oct 25, 2020

0.5.2

Oct 25, 2020

0.5.1

Oct 24, 2020

0.5

Oct 24, 2020

0.4.2

Oct 24, 2020

0.4.1

Oct 24, 2020

0.3

Oct 23, 2020

0.1dev2 pre-release

Jan 25, 2018

0.1dev pre-release

Jan 24, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bbhash-0.5.4.tar.gz (19.4 kB view details)

Uploaded Oct 25, 2020 Source

File details

Details for the file bbhash-0.5.4.tar.gz.

File metadata

Download URL: bbhash-0.5.4.tar.gz
Upload date: Oct 25, 2020
Size: 19.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.0 requests/2.24.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.7.6

File hashes

Hashes for bbhash-0.5.4.tar.gz
Algorithm	Hash digest
SHA256	`1b5e07cd99927c1517441b97bf625c5f4fb3a3bafa114c621ad83f014b4d9ea8`
MD5	`0ed687dd9edd595e7800386d6e73d65a`
BLAKE2b-256	`b2c5e5868524242e28fb56f7b0d0f8eafa061d6ade8202bd82fd723d1c62c0e8`

See more details on using hashes here.

bbhash 0.5.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

pybbhash

Thoughts for further improvement.

Usage

Usage of core bbhash functionality:

Usage of BBHashTable

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes