Skip to main content

Fast minimal perfect hash functions

Project description

phobic

Minimal-ish perfect hash functions for very large key sets, with a parameterised load factor.

A perfect hash function maps a known set of n keys to distinct integers in [0, m) with no collisions. Build it once from your key set, then every lookup is O(1). phobic builds in parallel C, queries in parallel C, and serialises to a portable wire format.

pip install phobic

Usage

import phobic

# Build
keys = [f"key_{i}".encode() for i in range(1_000_000)]
phf = phobic.build(keys)

# Query
phf[b"key_42"]                  # scalar -> int slot
phf.lookup(keys[:100])          # batch -> list[int] (parallel C above ~1K keys)

# Persist (wire format v3, mmap-friendly layout)
data = phf.to_bytes()
phf2 = phobic.from_bytes(data)
assert phf == phf2

# Inspect
len(phf)             # number of keys
phf.range_size       # output range, [0, range_size)
phf.num_shards       # internal sharding (1 at small N, more at scale)
phf.bits_per_key     # serialised size in bits, divided by num_keys

Knobs

phf = phobic.build(
    keys,
    load_factor=0.5,    # num_keys / range_size, in (0, 1]. 1.0 = MPHF.
    seed=None,          # one knob; reproducible builds
    num_shards=None,    # None = auto (1 below ~32K keys; scales above)
    num_threads=None,   # None = auto (CPU count)
    bucket_size=None,   # None = ceil(log2(per_shard_N))
    max_retries=100,    # per-shard retry budget on the pilot search
    assume_unique=False, # skip the O(N) Python-side uniqueness check (opt-in)
)

load_factor (the central trade-off)

load_factor is the canonical hash-table sense: keys per slot, in (0, 1].

load_factor meaning build cost space
0.1 very loose, ~10x range trivial wasteful
0.5 (default) 2x range fast moderate
0.95 nearly minimal slow tight
1.0 MPHF (range == num_keys) hardest optimal

If a build can't find a perfect hash within max_retries, phobic raises a RuntimeError that names the failing shard and suggests how to relax the constraints.

Parallelism

Both the build and lookup() release the GIL and fan out across pthreads:

  • Build parallelism is per-shard. At num_shards=1 the build is single-threaded; above that it scales nearly linearly to num_threads.
  • Batch query is chunked across num_threads above an internal threshold (~1024 keys). Below the threshold the call is serial.

Determinism is preserved: same seed and same num_shards produce byte-identical to_bytes() regardless of num_threads.

Performance

Measured on a 12-core machine, default load_factor=0.5, 16-byte uniform keys (cipher-maps shape):

N num_shards bpk build @ 12 threads with assume_unique=True
100K 7 1.17 26 ms 26 ms
1M 63 1.17 306 ms ~250 ms
10M 625 1.16 2.37 s 1.18 s

assume_unique=True skips the Python-side uniqueness check (a 2-second set(raw) hash at 10M keys). For callers whose keys are unique by construction (e.g. cipher-maps, HMAC outputs), this is pure win.

vs phobic 0.2.0: 1M parallel went from 2.0 s to 0.34 s (5.9x). 10M went from "would not finish" serially in 0.2.0 to ~1.2 s parallel in 0.3.1 with assume_unique=True.

vs maph's partitioned_phf<phobic4> (the reference in the sibling research repo): at 10M, phobic 0.3.1 is 6x faster and uses 2.4x fewer bits per key. See .claude/SWEEP_0_3_0.md for the full sweep.

Wire format

phf.to_bytes() produces a versioned, mmap-friendly v3 blob (magic b"PHF3"):

56 bytes  global header   (num_keys, total_range, num_shards, seed, shard_seed)
40 bytes  per shard       (descriptor table, fixed-size, indexable)
variable  per shard       (uint16 pilot blocks, 8-byte aligned)

Fixed-size descriptor table + absolute pilot offsets means a future phobic.from_file(path) can mmap the blob directly without parsing variable-size sections. (Not in 0.3.0; the format pre-pays for it.)

Cross-process transport

PHF instances support equality (==), copy.deepcopy, and the __reduce__ protocol used by multiprocessing.Pool:

import multiprocessing as mp

phf = phobic.build(keys)
with mp.Pool(8) as p:
    p.map(worker_using_phf, work_items)   # phf serialises through to_bytes / from_bytes

What phobic explicitly is not

phobic is a pure PHF builder. The following live elsewhere:

  • Membership testing / Bloom filters / fingerprinting: out of scope. Phobic does not promise that a key not in the build set will be detected; it returns an arbitrary slot for unknown keys. See maph for membership oracles.
  • Value retrieval / Bloomier maps / cipher maps: out of scope. Build an external value array indexed by phf[key].
  • Approximate / non-perfect modes: a PHF is always perfect. Build failure raises with a diagnostic; non-perfect output is a category error.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

phobic-0.3.1.tar.gz (26.9 kB view details)

Uploaded Source

File details

Details for the file phobic-0.3.1.tar.gz.

File metadata

  • Download URL: phobic-0.3.1.tar.gz
  • Upload date:
  • Size: 26.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for phobic-0.3.1.tar.gz
Algorithm Hash digest
SHA256 b9eb0435f793960b31b46d351952f03545b5731192ab6724dd31f81fd8a468e7
MD5 af41a6c99f0fffb9eb4de0958ffdb485
BLAKE2b-256 dc4e91493539b786f7ecd1681af422f03b5ed4a2e9592fb1a02f8ff947bc4fc6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page