Skip to main content

Fast minimal perfect hash functions

Project description

phobic

Fast minimal perfect hash functions for Python.

A perfect hash function maps a known set of n keys to distinct integers in [0, m) with no collisions. Build it once from your key set, then every lookup is O(1). Implemented in C11 with no runtime dependencies.

Install

pip install phobic

Usage

import phobic

# Build a perfect hash function
keys = ["alice", "bob", "charlie", "diana"]
phf = phobic.build(keys)

# Query: returns a unique integer per key
phf["alice"]    # e.g., 2
phf["bob"]      # e.g., 0

# Properties
phf.num_keys      # 4
phf.range_size    # 8 (with default alpha=1.0)
phf.bits_per_key  # ~1.0
phf.collisions    # 0
phf.is_perfect    # True

# Persistence
data = phf.to_bytes()
phf2 = phobic.from_bytes(data)

Options

# Closer to minimal: 5% overhead instead of 100% (slower build, may need more retries)
phf = phobic.build(keys, alpha=0.05)

# Fixed seed for reproducibility
phf = phobic.build(keys, seed=42)

# Control retry budget (default 100)
phf = phobic.build(keys, max_retries=200)

Non-perfect builds

By default, build() raises RuntimeError if no perfect hash is found within the retry budget. With strict=False, it always returns the best result found:

phf = phobic.build(keys, alpha=0.05, strict=False)
phf.is_perfect   # False if no perfect build was found
phf.collisions   # number of keys placed in already-occupied slots

The builder tries multiple seeds and gradually widens alpha across retries, keeping the attempt with the fewest collisions.

Space Efficiency

PHOBIC achieves ~1 bit/key for the hash structure at 100k keys. When used as a filter (PHF + n-bit fingerprints), total space is bpk + log2(1/e) bits/key for false positive rate e. This beats Bloom filters (1.44 * log2(1/e)) whenever e is small enough that the PHF overhead is absorbed.

Performance

C11 core, single-threaded. The GIL is released during construction so other Python threads can run concurrently. Typical build times: ~0.5 us/key at n=1k, ~1.8 us/key at n=100k. Query throughput: ~2M queries/sec from Python (~500 ns/query including str-to-bytes encoding).

Demo

See demo.ipynb for an interactive walkthrough covering the API, alpha trade-offs, space efficiency, serialization, and a matplotlib plot of how collisions decrease with retry budget.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

phobic-0.1.0.tar.gz (12.6 kB view details)

Uploaded Source

File details

Details for the file phobic-0.1.0.tar.gz.

File metadata

  • Download URL: phobic-0.1.0.tar.gz
  • Upload date:
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for phobic-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3688dc8c5cbc0748f9917700d2f1afe6545c0406ca9b8a94f34718e846ec1b94
MD5 a6d5570c4e131df2cbcb4ac41b492e67
BLAKE2b-256 c3f90c2bbe92ab0621bcbfba7db14b8ec74fbd5bd94ee6ebc4b742e0fa005164

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page