Skip to main content

Fast minimal perfect hash functions

Project description

phobic

Fast minimal perfect hash functions for Python.

A perfect hash function maps a known set of n keys to distinct integers in [0, m) with no collisions. Build it once from your key set, then every lookup is O(1). Implemented in C11 with no runtime dependencies.

Install

pip install phobic

Usage

import phobic

# Build a perfect hash function
keys = ["alice", "bob", "charlie", "diana"]
phf = phobic.build(keys)

# Query: returns a unique integer per key
phf["alice"]    # e.g., 2
phf["bob"]      # e.g., 0

# Properties
phf.num_keys      # 4
phf.range_size    # 8 (with default alpha=1.0)
phf.bits_per_key  # ~1.0
phf.collisions    # 0
phf.is_perfect    # True

# Persistence
data = phf.to_bytes()
phf2 = phobic.from_bytes(data)

Build with slot map

When you'll immediately fill a slot array indexed by phf[k] for every key (a common pattern in PHF-backed retrieval and cipher-map structures), build_with_slots returns the slot list as a build by-product, saving a redundant pass:

phf, slots = phobic.build_with_slots(keys)
# slots[i] == phf[keys[i]] for every i
table = [None] * phf.range_size
for i, s in enumerate(slots):
    table[s] = values[i]

Options

# Closer to minimal: 5% overhead instead of 100% (slower build, may need more retries)
phf = phobic.build(keys, alpha=0.05)

# Fixed seed for reproducibility
phf = phobic.build(keys, seed=42)

# Control retry budget (default 100)
phf = phobic.build(keys, max_retries=200)

# Average keys per bucket. None (default) picks ceil(log2 N), which
# minimises bits/key. Smaller values (e.g. 8) build dramatically
# faster at large N, at the cost of ~2x bits/key. At N=100K keys,
# bucket_size=8 is ~6x faster than the default for cipher-maps-style
# workloads. See .claude/SWEEP_BUCKET_SIZE.md for the full sweep.
phf = phobic.build(keys, bucket_size=8)

Non-perfect builds

By default, build() raises RuntimeError if no perfect hash is found within the retry budget. With strict=False, it always returns the best result found:

phf = phobic.build(keys, alpha=0.05, strict=False)
phf.is_perfect   # False if no perfect build was found
phf.collisions   # number of keys placed in already-occupied slots

The builder tries multiple seeds and gradually widens alpha across retries, keeping the attempt with the fewest collisions.

Space Efficiency

PHOBIC achieves ~1 bit/key for the hash structure at 100k keys. When used as a filter (PHF + n-bit fingerprints), total space is bpk + log2(1/e) bits/key for false positive rate e. This beats Bloom filters (1.44 * log2(1/e)) whenever e is small enough that the PHF overhead is absorbed.

Performance

C11 core, single-threaded. The GIL is released during construction so other Python threads can run concurrently. Typical build times: ~0.5 us/key at n=1k, ~1.8 us/key at n=100k. Query throughput: ~2M queries/sec from Python (~500 ns/query including str-to-bytes encoding).

Scale: partitioned builds

For n above ~500K, single-threaded pilot search becomes the bottleneck; at n=1M the default parameters take ~100 s and above ~2M the build can fail entirely. Use build_partitioned to shard across cores:

import phobic

keys = [f"key_{i}".encode() for i in range(10_000_000)]

# Uses ThreadPoolExecutor + the GIL-releasing C build to parallelize.
phf = phobic.build_partitioned(keys)

phf.num_keys      # 10_000_000
phf.num_shards    # 666 (auto: 15K keys per shard)
phf.bits_per_key  # ~1.18 (per-shard headers + partition wrapper)
phf[b"key_42"]    # unique slot in [0, phf.range_size)

# Wire-serializable with its own magic bytes.
data = phf.to_bytes()
phf2 = phobic.PartitionedPHF.from_bytes(data)

Measured speedup (8-core machine):

n serial partitioned speedup
100K 0.10 s 0.18 s 0.6x (threading overhead wins at small n)
500K 1.08 s 0.94 s 1.2x
1M 133.5 s 2.0 s 67.7x
2M fails 4.0 s serial unbuildable
10M - 20.6 s -

At small n the per-shard metadata makes partitioning slightly slower. From roughly 500K up it wins decisively; at 1M+ it is often the only way to build at all. See src/phobic/partitioned.py for the implementation.

Demo

See demo.ipynb for an interactive walkthrough covering the API, alpha trade-offs, space efficiency, serialization, and a matplotlib plot of how collisions decrease with retry budget.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

phobic-0.2.0.tar.gz (22.3 kB view details)

Uploaded Source

File details

Details for the file phobic-0.2.0.tar.gz.

File metadata

  • Download URL: phobic-0.2.0.tar.gz
  • Upload date:
  • Size: 22.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for phobic-0.2.0.tar.gz
Algorithm Hash digest
SHA256 6490c77a643a0555724efbfe5965b6a8cc1b30f9e2228b4e3caf12b895a2fc9a
MD5 625f8a6ac5155e1ccaad501c0638779a
BLAKE2b-256 3d65590c1fc8bf2d83ee6ccb7bdab56611a1fe1cfd3c1a009c86adfd355f7ec9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page