A simple, time-tested, family of random hash functions in Python, based on CRC32 and xxHash, affine transformations, and the Mersenne Twister.

These details have not been verified by PyPI

Project links

Project description

Python `randomhash` package

A simple, time-tested, family of random hash functions in Python, based on CRC32 and xxHash, affine transformations, and the Mersenne Twister.

This is a companion library to the identical Java version.

Installation and usage

The library is available on PyPI, and can be installed through normal means:

$ pip install randomhash

Once installed, it can be called either by instantiating a family of random hash functions, or using the default instantiated functions:

import randomhash

# Create a family of random hash functions, with 10 hash functions

rfh = randomhash.RandomHashFamily(count=10)
print(rfh.hashes("hello"))  # will compute the ten hashes for "hello"

# Use the default instantiated functions

print(randomhash.hashes("hello", count=10))

Features

This introduces a family of hash functions that can be used to implement probabilistic algorithms such as HyperLogLog. It is based on affine transformations of either the CRC32 hash functions, which have been empirically shown to provide good performance (for consistency with other versions of this library, such as the Java version), or the more complex xxHash hash functions that are made available through the xxhash Python bindings. The pseudo-random numbers are drawn according to the standard Python implementation of the Mersenne Twister.

Some history

In 1983, G. N. N. Martin and Philippe Flajolet introduced the algorithm known as Probabilistic Counting, designed to provide an extremely accurate and efficient estimate of the number of unique words from a document that may contain repetitions. This was an incredibly important algorithm, which introduced a revolutionary idea at the time:

The only assumption made is that records can be hashed in a suitably pseudo-uniform manner. This does not however appear to be a severe limitation since empirical studies on large industrial files [5] reveal that careful implementations of standard hashing techniques do achieve practically uniformity of hashed values.

The idea is that hash functions can "transform" data into pseudo-random variables. Then a text can be treated as a sequence of random variables drawn from a uniform distribution, where a given word will always occur as the same random value (i.e., a b c a a b c could be hashed as .00889 .31423 .70893 .00889 .00889 .31423 .70893 with every occurrence of a hashing to the same value). While this sounds strange, empirical evidence suggests it is true enough in practice, and eventually some theoretical basis has come to support the practice.

The original Probabilistic Counting (1983) algorithm gave way to LogLog (2004), and then eventually HyperLogLog (2007), one of the most famous algorithms in the world as described in this article. These algorithms and others all used the same idea of hashing inputs to treat them as random variables, and proved remarkably efficient and accurate.

But as highlighted in the above passage, it is important to be careful.

Hash functions in practice

In practice, it is easy to use poor quality hash functions, or to use cryptographic functions which will significantly slow down the speed (and relevance) of the probabilistic estimates. However, on most data, some the cyclic polynomial checksums (such as Adler32 or CRC32) provide good results---as do efficient, general-purpose non-cryptographic hash functions such as xxHash.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.6.0

May 31, 2022

0.5.0

May 31, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

randomhash-0.6.0.tar.gz (9.8 kB view details)

Uploaded May 31, 2022 Source

Built Distribution

randomhash-0.6.0-py3-none-any.whl (12.4 kB view details)

Uploaded May 31, 2022 Python 3

File details

Details for the file randomhash-0.6.0.tar.gz.

File metadata

Download URL: randomhash-0.6.0.tar.gz
Upload date: May 31, 2022
Size: 9.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.9.4 Darwin/21.4.0

File hashes

Hashes for randomhash-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`4f5ac31313c15aeac4ea8d57b9cde2ae362f29615292d0db22b4f23e8245b379`
MD5	`7cc4ece1629baae5223115175d6b1076`
BLAKE2b-256	`813e40102b847315ca7d8e6d5910b25a0570663c55ac7e7329d8348c800389b8`

See more details on using hashes here.

File details

Details for the file randomhash-0.6.0-py3-none-any.whl.

File metadata

Download URL: randomhash-0.6.0-py3-none-any.whl
Upload date: May 31, 2022
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.13 CPython/3.9.4 Darwin/21.4.0

File hashes

Hashes for randomhash-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7dbc2506774d2e814411e2c7400c5463119124a47eef64cd53a516a6874f7937`
MD5	`dc3a28f2be47b724be793f37b7f05cbb`
BLAKE2b-256	`8e1915aa9178557103636c35f79c762ffc9a48527a0546c3fcde4b272fe5edd3`

See more details on using hashes here.

randomhash 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Python `randomhash` package

Installation and usage

Features

Some history

Hash functions in practice

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

randomhash 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Python randomhash package

Installation and usage

Features

Some history

Hash functions in practice

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Python `randomhash` package