Fast minimal perfect hash functions
Project description
phobic
Fast minimal perfect hash functions for Python.
A perfect hash function maps a known set of n keys to distinct integers in [0, m) with no collisions. Build it once from your key set, then every lookup is O(1). Implemented in C11 with no runtime dependencies.
Install
pip install phobic
Usage
import phobic
# Build a perfect hash function
keys = ["alice", "bob", "charlie", "diana"]
phf = phobic.build(keys)
# Query: returns a unique integer per key
phf["alice"] # e.g., 2
phf["bob"] # e.g., 0
# Properties
phf.num_keys # 4
phf.range_size # 8 (with default alpha=1.0)
phf.bits_per_key # ~1.0
phf.collisions # 0
phf.is_perfect # True
# Persistence
data = phf.to_bytes()
phf2 = phobic.from_bytes(data)
Build with slot map
When you'll immediately fill a slot array indexed by phf[k] for every key (a common pattern in PHF-backed retrieval and cipher-map structures), build_with_slots returns the slot list as a build by-product, saving a redundant pass:
phf, slots = phobic.build_with_slots(keys)
# slots[i] == phf[keys[i]] for every i
table = [None] * phf.range_size
for i, s in enumerate(slots):
table[s] = values[i]
Options
# Closer to minimal: 5% overhead instead of 100% (slower build, may need more retries)
phf = phobic.build(keys, alpha=0.05)
# Fixed seed for reproducibility
phf = phobic.build(keys, seed=42)
# Control retry budget (default 100)
phf = phobic.build(keys, max_retries=200)
# Average keys per bucket. None (default) picks ceil(log2 N), which
# minimises bits/key. Smaller values (e.g. 8) build dramatically
# faster at large N, at the cost of ~2x bits/key. At N=100K keys,
# bucket_size=8 is ~6x faster than the default for cipher-maps-style
# workloads. See .claude/SWEEP_BUCKET_SIZE.md for the full sweep.
phf = phobic.build(keys, bucket_size=8)
Non-perfect builds
By default, build() raises RuntimeError if no perfect hash is found within the retry budget. With strict=False, it always returns the best result found:
phf = phobic.build(keys, alpha=0.05, strict=False)
phf.is_perfect # False if no perfect build was found
phf.collisions # number of keys placed in already-occupied slots
The builder tries multiple seeds and gradually widens alpha across retries, keeping the attempt with the fewest collisions.
Space Efficiency
PHOBIC achieves ~1 bit/key for the hash structure at 100k keys. When used as a filter (PHF + n-bit fingerprints), total space is bpk + log2(1/e) bits/key for false positive rate e. This beats Bloom filters (1.44 * log2(1/e)) whenever e is small enough that the PHF overhead is absorbed.
Performance
C11 core, single-threaded. The GIL is released during construction so other Python threads can run concurrently. Typical build times: ~0.5 us/key at n=1k, ~1.8 us/key at n=100k. Query throughput: ~2M queries/sec from Python (~500 ns/query including str-to-bytes encoding).
Scale: partitioned builds
For n above ~500K, single-threaded pilot search becomes the bottleneck; at n=1M the default parameters take ~100 s and above ~2M the build can fail entirely. Use build_partitioned to shard across cores:
import phobic
keys = [f"key_{i}".encode() for i in range(10_000_000)]
# Uses ThreadPoolExecutor + the GIL-releasing C build to parallelize.
phf = phobic.build_partitioned(keys)
phf.num_keys # 10_000_000
phf.num_shards # 666 (auto: 15K keys per shard)
phf.bits_per_key # ~1.18 (per-shard headers + partition wrapper)
phf[b"key_42"] # unique slot in [0, phf.range_size)
# Wire-serializable with its own magic bytes.
data = phf.to_bytes()
phf2 = phobic.PartitionedPHF.from_bytes(data)
Measured speedup (8-core machine):
| n | serial | partitioned | speedup |
|---|---|---|---|
| 100K | 0.10 s | 0.18 s | 0.6x (threading overhead wins at small n) |
| 500K | 1.08 s | 0.94 s | 1.2x |
| 1M | 133.5 s | 2.0 s | 67.7x |
| 2M | fails | 4.0 s | serial unbuildable |
| 10M | - | 20.6 s | - |
At small n the per-shard metadata makes partitioning slightly slower. From roughly 500K up it wins decisively; at 1M+ it is often the only way to build at all. See src/phobic/partitioned.py for the implementation.
Demo
See demo.ipynb for an interactive walkthrough covering the API, alpha trade-offs, space efficiency, serialization, and a matplotlib plot of how collisions decrease with retry budget.
License
MIT
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file phobic-0.2.0.tar.gz.
File metadata
- Download URL: phobic-0.2.0.tar.gz
- Upload date:
- Size: 22.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6490c77a643a0555724efbfe5965b6a8cc1b30f9e2228b4e3caf12b895a2fc9a
|
|
| MD5 |
625f8a6ac5155e1ccaad501c0638779a
|
|
| BLAKE2b-256 |
3d65590c1fc8bf2d83ee6ccb7bdab56611a1fe1cfd3c1a009c86adfd355f7ec9
|