Skip to main content

Prehashing keys for memory saving.

Project description

Prehashed

Build Status

A python dictionary that hashes keys before they are inserted to the dict. This yields substantial memory savings when the keys of a dictionary are quite large (for example long stings).

The point of this is that we can store keys that are really large (long strings) cheaply. For example when storing the documents in the tokenized Dailymail dataset the keys take up 1.018 GB while this implementation takes 10.53 MB. A small space saving (7.79 MB vs 10.53 MB) can be found using the built in hash function but the results of this are not shareable across runs because python changes the seed across runs.

Collisions

Obviously we want to know What is the probability that I will have a hash collision?. We can figure this out by asking the reverse. What is the probability that all of keys are unique? Once we have this we can answer the original question by subtracting this from 1.

Given N possible hash values (2 ** 160 for sha1) we know the first hash will be unique, Then for the second one we know there are N - 1 hashes we could use and still be unique, this means our probability is N - 1 / N. This continues for N - 2, 3 etc. So if we are hashing k keys we can find the probability of them all being unique with PI i=1 -> k-1 (N - i) / N. This can be approximated with e^((-k * (k - 1)) / (2 * N)). This can be further approximated with (k * (k - 1) / (2 * N)) for small k because 1 - e^x ~= x and when N is 2 ** 160 most k is small.

Graph

To sum this up here is a table with some of the probabilities of your keys colliding.

k Odds
1.71x10^15 1 in 10 ^ 18
1.71x10^19 1 in 10 billion
1.71x10^21 1 in a million
1.71x10^23 1 in 100
1.42x10^24 1 in 2

So unless you plan use put 171,000,000,000,000,000,000,000 keys into this dict or people will die if your code has bugs I wouldn't worry about collisions.

There is a small chance that keys will collide. When this happens this dictionary cannot detect that and as a result these keys will overwrite each other. This is so rare that git doesn't have a mitigation strategy either.

While a collisions are super rare if you are worried about it I would suggest that all of your values are the same type so that you aren't expecting a string and get an int in the super unlikely case if a collision.

If you are still scared about collisions there is also a function initial_add(k, v) that will modify your key until it doesn't have a collision, adds it to the dictionary, and returns the new key to use. You need to keep this key to get the value later so this kind of breaks the point of this dict where you want to be able to throw away your keys.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prehashed-1.0.1.tar.gz (4.1 kB view details)

Uploaded Source

File details

Details for the file prehashed-1.0.1.tar.gz.

File metadata

  • Download URL: prehashed-1.0.1.tar.gz
  • Upload date:
  • Size: 4.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for prehashed-1.0.1.tar.gz
Algorithm Hash digest
SHA256 7fb85333e2ee61daefe8ae6ea12bd4de6bff9112dc8aafa0cf3900a0d8e289de
MD5 81b5287334a49e1eb08463ede7d3ae22
BLAKE2b-256 83865eccc452eb91a4a13b224ad54007a1b61df3f538830775e4337b1b4774e6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page