Benchmark harness for black-box optimizers that speak an ask/tell JSON Lines protocol
hypara
A benchmark harness for measuring how well an optimizer searches an unknown black-box evaluation function.
hypara is deliberately not about solving famous problems (TSP, knapsack, bin packing) where a strong off-the-shelf solver wins. Each problem ships a natural-language description, a mixed search space, and a hidden evaluator whose shape changes with the instance seed. To score well an optimizer has to read the description, reason about the space, and adapt its strategy from the evaluation history within a limited budget.
Optimizers are language-agnostic external processes: they talk to the runner over a stdin/stdout JSON Lines protocol, so an optimizer can be written in Python, Rust, Go, TypeScript, or any executable.
Install
pip install hypara
For development (tests + build tooling):
pip install -e .[dev]
python -m pytest
Quickstart
List the built-in problems:
hypara list
Write a minimal optimizer. Create my_opt/manifest.json:
{"name": "my_opt", "command": ["python", "main.py"]}
and my_opt/main.py:
import json, random, sys

space = []
rng = random.Random()

def send(msg):
    # stdout carries protocol messages only; flush so the runner sees each line
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()

for line in sys.stdin:
    msg = json.loads(line)
    t = msg.get("type")
    if t == "init":
        space = msg["problem"]["space"]
        rng = random.Random(msg.get("optimizer_seed"))
        send({"type": "ready"})
    elif t == "ask":
        # propose a candidate; here, a trivial random pick over unconditional params
        # (conditional params are skipped, so the candidate is invalid whenever
        # one of them is active; the "log" hint is ignored as well)
        cand = {}
        for p in space:
            if p.get("condition") is not None:
                continue
            if p["type"] == "categorical":
                cand[p["name"]] = rng.choice(p["choices"])
            elif p["type"] == "bool":
                cand[p["name"]] = rng.random() < 0.5
            else:
                lo, hi = p["low"], p["high"]
                v = rng.uniform(lo, hi)
                cand[p["name"]] = int(round(v)) if p["type"] == "int" else v
        send({"type": "propose", "candidate": cand})
    elif t == "tell":
        pass  # inspect msg["score"], msg["valid"], msg["remaining"] to adapt
    elif t == "finish":
        break
Run it against a single problem:
hypara run --problem smooth_hill --optimizer ./my_opt --seed 1
The source repository also includes two reference optimizers
(optimizers/random_search, optimizers/hill_climb) and ready-made suite
configs (configs/smoke.json, configs/full.json):
hypara suite --config configs/smoke.json
hypara report --dir results/smoke-YYYYmmdd-HHMMSS
Built-in problems
All problems are single-objective, maximize, with an achievable maximum near 1.0. The hidden landscape is reseeded per run, so memorizing an instance does not help.
| Problem | What it tests |
|---|---|
| smooth_hill | Smooth unimodal surface; local search should win. |
| rugged_trap | Multimodal with a decoy hill; needs restarts / exploration. |
| conditional_knobs | A categorical choice switches which knobs exist. |
| noisy_lab | Additive gaussian noise; beware chasing lucky readings. |
| multi_fidelity | Cheap biased low-fidelity vs. expensive true high-fidelity. |
| sparse_needle | One hidden combination scores high; weak partial-match signal. |
| cost_aware | The candidate's own samples knob drives its evaluation cost. |
| rag_pipeline | Surrogate RAG tuning (chunking, top_k, reranker interactions). |
| image_pipeline | Surrogate diffusion tuning; steps drive quality and cost. |
| dispatch_policy | Surrogate delivery policy; balance, batching, mild noise. |
Protocol
The runner launches the optimizer as a child process (working directory = the
optimizer's directory; if command[0] is "python" it is replaced with the
runner's own interpreter). Messages are one JSON object per line: runner →
optimizer on stdin, optimizer → runner on stdout. Optimizer stdout is
protocol-only; write debug output to stderr (the runner saves it to
optimizer.stderr.log). Receivers ignore unknown keys. NaN/Infinity must
not be sent. Current protocol_version is 1.
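The stdout/stderr split and the NaN/Infinity rule can both be enforced in one place on the optimizer side. The following is a minimal sketch for a Python optimizer; `encode`, `send`, and `log` are illustrative helper names, not part of hypara.

```python
import json
import sys


def encode(msg: dict) -> str:
    """Serialize one protocol message as a single JSON line.

    allow_nan=False makes json.dumps raise ValueError on NaN/Infinity
    instead of emitting tokens the protocol forbids.
    """
    return json.dumps(msg, allow_nan=False) + "\n"


def send(msg: dict, out=sys.stdout) -> None:
    """Write a protocol message to stdout and flush so the runner sees it."""
    out.write(encode(msg))
    out.flush()


def log(text: str, err=sys.stderr) -> None:
    """Debug output goes to stderr only; stdout is reserved for the protocol."""
    err.write(text + "\n")
    err.flush()
```

Routing every outgoing message through one function like `send` also makes it harder for a stray `print` to corrupt the protocol stream.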
Messages and turn-taking
| Direction | type | Reply |
|---|---|---|
| runner → optimizer | init | ready (once) |
| runner → optimizer | ask | propose (once) |
| runner → optimizer | tell | none |
| runner → optimizer | finish | none; exit promptly |
Only one ask is outstanding at a time. The init reply may take up to 30s,
each ask reply up to 60s by default; overruns end the run as
optimizer_timeout. A crash, an unparseable line, or an out-of-order message
ends the run as failed. The best-so-far is recorded in every case.
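One way to read the turn-taking rules is as a small state machine: each runner message either expects exactly one reply or none, and the optimizer never speaks unprompted. The sketch below checks a recorded transcript against that reading. It is an illustration of the table above, not the runner's actual validation code, and `first_violation` is a hypothetical name.

```python
# Reply the optimizer owes for each runner message type (None = no reply).
EXPECTED_REPLY = {"init": "ready", "ask": "propose", "tell": None, "finish": None}


def first_violation(transcript):
    """Return the index of the first out-of-order message, or None.

    transcript is a list of (sender, type) tuples, where sender is
    "runner" or "optimizer".
    """
    pending = None  # reply type the optimizer currently owes, if any
    for i, (sender, mtype) in enumerate(transcript):
        if sender == "runner":
            # The runner must wait for the owed reply before sending more.
            if pending is not None or mtype not in EXPECTED_REPLY:
                return i
            pending = EXPECTED_REPLY[mtype]
        else:
            # The optimizer may only send the reply that is owed.
            if pending is None or mtype != pending:
                return i
            pending = None
    return None
```

For example, an optimizer that answers `init` with `propose` instead of `ready` is flagged at index 1.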
init (runner → optimizer):
{"type": "init", "protocol_version": 1, "run_id": "smooth_hill--my_opt--s1",
 "problem": {
   "description": "natural-language prompt",
   "space": [ ...param specs (below)... ],
   "objective": "maximize",
   "budget": {"evaluations": 100, "cost_limit": null, "time_limit_sec": 300.0},
   "fidelities": null
 },
 "optimizer_seed": 12345}
budget always has at least one of evaluations or cost_limit non-null.
fidelities, when non-null, is ordered low→high (last entry = top fidelity).
ready / propose (optimizer → runner):
{"type": "ready"}
{"type": "propose", "candidate": {"x0": 0.5, "algo": "alpha"}, "fidelity": "low"}
fidelity is optional; omitted/null means top fidelity. Sending a non-null
fidelity to a problem with no fidelities is invalid.
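The fidelity rules can be captured in two small helpers: one that reads the ordered `fidelities` list from `init`, and one that builds a `propose` message without ever attaching a fidelity to a problem that has none. This is a sketch under the assumption that fidelity entries are plain strings such as "low", as the `propose` example suggests; `fidelity_levels` and `make_propose` are hypothetical names.

```python
def fidelity_levels(init_msg):
    """Return (lowest, top) fidelity from an init message, or (None, None).

    The list is ordered low to high, so the last entry is the top fidelity.
    """
    fidelities = init_msg["problem"].get("fidelities")
    if not fidelities:
        return None, None
    return fidelities[0], fidelities[-1]


def make_propose(candidate, fidelity=None, has_fidelities=False):
    """Build a propose message; omit the fidelity key to request top fidelity."""
    msg = {"type": "propose", "candidate": candidate}
    if fidelity is not None:
        if not has_fidelities:
            # A non-null fidelity on a problem without fidelities is invalid.
            raise ValueError("problem has no fidelities; omit the fidelity key")
        msg["fidelity"] = fidelity
    return msg
```

A typical use would be to spend early proposals at the lowest level and switch to omitting `fidelity` once promising regions are found, since only top-fidelity evaluations are scored.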
tell (runner → optimizer):
{"type": "tell", "candidate_id": "c-0007", "candidate": {"x0": 0.5},
"valid": true, "score": 0.73, "cost": 1.0, "fidelity": null, "error": null,
"remaining": {"evaluations": 92, "cost": null, "time_sec": 291.3}}
When invalid: valid: false, score: null, and error gives the reason.
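An optimizer that wants to adapt usually keeps its own record of the best valid result seen in `tell` messages. A minimal sketch, assuming a maximize objective and ignoring fidelity for brevity; `update_best` is an illustrative name.

```python
def update_best(best, tell):
    """Fold one tell message into the running best.

    best is a (score, candidate) tuple or None; invalid tells carry
    score null and are skipped.
    """
    if not tell.get("valid") or tell.get("score") is None:
        return best
    if best is None or tell["score"] > best[0]:
        return (tell["score"], tell["candidate"])
    return best
```

Calling it once per `tell` keeps the current incumbent available when the next `ask` arrives.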
finish (runner → optimizer): {"type": "finish", "reason": "budget_exhausted"}
(reason is budget_exhausted or time_limit).
Search space
[
{"name": "lr", "type": "float", "low": 1e-4, "high": 1.0, "log": true},
{"name": "layers", "type": "int", "low": 1, "high": 12},
{"name": "opt", "type": "categorical", "choices": ["sgd", "adam"]},
{"name": "warmup", "type": "bool"},
{"name": "warmup_steps", "type": "int", "low": 10, "high": 1000,
"condition": {"param": "warmup", "equals": [true]}}
]
- Types: float, int, categorical, bool. Bounds low/high are inclusive; log: true hints a log scale.
- A param with condition is active only when candidate[condition.param] is in equals. Conditioning is one level deep (the parent must be unconditional).
A candidate is validated by the runner: it must be a JSON object containing exactly the active params (no unknown keys, no inactive params, none missing), each of the right type and within range.
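Because a candidate must contain exactly the active params, a sampler has to pick the unconditional params first and then add only those conditional params whose condition holds. The sketch below does that and also honors the `log` hint for floats. It is one possible sampler, not hypara's reference implementation, and the function names are illustrative.

```python
import math
import random


def sample_param(p, rng):
    """Draw one value for a param spec, respecting inclusive bounds."""
    t = p["type"]
    if t == "categorical":
        return rng.choice(p["choices"])
    if t == "bool":
        return rng.random() < 0.5
    lo, hi = p["low"], p["high"]
    if t == "int":
        return rng.randint(lo, hi)  # randint is inclusive on both ends
    if p.get("log"):
        v = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        return min(max(v, lo), hi)  # clamp away floating-point drift
    return rng.uniform(lo, hi)


def is_active(p, cand):
    """A conditional param is active when its parent's value is in equals."""
    cond = p.get("condition")
    if cond is None:
        return True
    return cond["param"] in cand and cand[cond["param"]] in cond["equals"]


def sample_candidate(space, rng):
    """Sample parents first, then the children whose condition holds."""
    cand = {}
    for p in space:
        if p.get("condition") is None:
            cand[p["name"]] = sample_param(p, rng)
    for p in space:
        if p.get("condition") is not None and is_active(p, cand):
            cand[p["name"]] = sample_param(p, rng)
    return cand
```

Two passes are enough here only because conditioning is one level deep; a deeper hierarchy would need a topological order.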
Budget rules
- A valid evaluation consumes the evaluator's cost (may depend on the candidate/fidelity); the evaluations axis always consumes 1.
- An invalid proposal still consumes budget (1 evaluation, cost 1.0), so spamming invalid candidates cannot mine the space for free.
- The stop check runs before each ask, so the final evaluation may slightly overshoot cost_limit.
- For problems with fidelities, only top-fidelity evaluations count toward best_score; lower fidelities are available as history but not scored.
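The accounting rules above can be replayed with a few lines of arithmetic. The sketch below is a toy model, not the runner's code: it assumes the stop check fires once a consumed axis reaches its limit, and it uses `None` in the cost list to mark an invalid proposal.

```python
def simulate_budget(costs, evaluations=None, cost_limit=None):
    """Replay proposals against the budget; return (evaluations, cost) used.

    costs holds the evaluator cost of each proposal in order; None marks
    an invalid proposal, which is charged 1 evaluation and cost 1.0.
    """
    used_evals, used_cost = 0, 0.0
    for c in costs:
        # The stop check runs before each ask, never mid-evaluation.
        if evaluations is not None and used_evals >= evaluations:
            break
        if cost_limit is not None and used_cost >= cost_limit:
            break
        used_evals += 1
        used_cost += 1.0 if c is None else c
    return used_evals, used_cost
```

With `cost_limit=10.0` and proposals costing 4.0 each, the third evaluation starts at a cumulative cost of 8.0 and finishes at 12.0, which is the overshoot the third rule describes.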
Metrics
hypara report recomputes everything from the saved logs. Per run: best
score, best candidate, best-so-far curve (over evaluations or cumulative
cost), valid rate, status, wall time. Aggregated per (problem, optimizer):
mean best, a baseline-relative normalized best and normalized anytime AUC
(0 = baseline median, 1 = best observed for that problem), and an overall
mean across problems.
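Two of these quantities are easy to illustrate. The sketch below computes a best-so-far curve from a list of scores and applies a linear rescaling in which the baseline median maps to 0 and the best observed score maps to 1. The linear form is an assumption inferred from those two anchor points; the exact formula used by hypara report may differ, and both function names are illustrative.

```python
def best_so_far(scores):
    """Running maximum over a score sequence; None marks an invalid evaluation."""
    curve, best = [], None
    for s in scores:
        if s is not None and (best is None or s > best):
            best = s
        curve.append(best)
    return curve


def normalize(value, baseline_median, best_observed):
    """Map baseline_median to 0 and best_observed to 1 (assumed linear)."""
    span = best_observed - baseline_median
    if span <= 0:
        return 0.0  # degenerate case: the baseline already matches the best
    return (value - baseline_median) / span
```

Under this reading, a run whose best score is 0.75 on a problem with baseline median 0.5 and best observed 1.0 gets a normalized best of 0.5.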
Adding a problem
Implement Problem under src/hypara/problems/ and register it in
src/hypara/registry.py. Keep the description and the evaluator's actual
behavior in sync: the point of the benchmark is that reading the description
helps. The shared invariants in tests/test_problems.py (finite scores,
determinism given a seed, instance-seed sensitivity) apply automatically.
License
MIT. See LICENSE.