Skip to main content

Benchmark harness for black-box optimizers that speak an ask/tell JSON Lines protocol

Project description

hypara

A benchmark harness for measuring how well an optimizer searches an unknown black-box evaluation function.

hypara is deliberately not about solving famous problems (TSP, knapsack, bin packing) where a strong off-the-shelf solver wins. Each problem ships a natural-language description, a mixed search space, and a hidden evaluator whose shape changes with the instance seed. To score well an optimizer has to read the description, reason about the space, and adapt its strategy from the evaluation history within a limited budget.

Optimizers are language-agnostic external processes: they talk to the runner over a stdin/stdout JSON Lines protocol, so an optimizer can be written in Python, Rust, Go, TypeScript, or any executable.

Install

pip install hypara

For development (tests + build tooling):

pip install -e .[dev]
python -m pytest

Quickstart

List the built-in problems:

hypara list

Write a minimal optimizer. Create my_opt/manifest.json:

{"name": "my_opt", "command": ["python", "main.py"]}

and my_opt/main.py:

import json, random, sys

space = []
rng = random.Random()

def send(msg):
    sys.stdout.write(json.dumps(msg) + "\n")
    sys.stdout.flush()

for line in sys.stdin:
    msg = json.loads(line)
    t = msg.get("type")
    if t == "init":
        space = msg["problem"]["space"]
        rng = random.Random(msg.get("optimizer_seed"))
        send({"type": "ready"})
    elif t == "ask":
        # propose a candidate; here, a trivial random pick over numeric params
        cand = {}
        for p in space:
            if p.get("condition") is not None:
                continue
            if p["type"] == "categorical":
                cand[p["name"]] = rng.choice(p["choices"])
            elif p["type"] == "bool":
                cand[p["name"]] = rng.random() < 0.5
            else:
                lo, hi = p["low"], p["high"]
                v = rng.uniform(lo, hi)
                cand[p["name"]] = int(round(v)) if p["type"] == "int" else v
        send({"type": "propose", "candidate": cand})
    elif t == "tell":
        pass  # inspect msg["score"], msg["valid"], msg["remaining"] to adapt
    elif t == "finish":
        break

Run it against one problem, then aggregate:

hypara run --problem smooth_hill --optimizer ./my_opt --seed 1

The source repository also includes two reference optimizers (optimizers/random_search, optimizers/hill_climb) and ready-made suite configs (configs/smoke.json, configs/full.json):

hypara suite --config configs/smoke.json
hypara report --dir results/smoke-YYYYmmdd-HHMMSS

Built-in problems

All problems are single-objective, maximize, with an achievable maximum near 1.0. The hidden landscape is reseeded per run, so memorizing an instance does not help.

Problem What it tests
smooth_hill Smooth unimodal surface; local search should win.
rugged_trap Multimodal with a decoy hill; needs restarts / exploration.
conditional_knobs A categorical choice switches which knobs exist.
noisy_lab Additive gaussian noise; beware chasing lucky readings.
multi_fidelity Cheap biased low-fidelity vs. expensive true high-fidelity.
sparse_needle One hidden combination scores high; weak partial-match signal.
cost_aware The candidate's own samples knob drives its evaluation cost.
rag_pipeline Surrogate RAG tuning (chunking, top_k, reranker interactions).
image_pipeline Surrogate diffusion tuning; steps drive quality and cost.
dispatch_policy Surrogate delivery policy; balance, batching, mild noise.

Protocol

The runner launches the optimizer as a child process (working directory = the optimizer's directory; if command[0] is "python" it is replaced with the runner's own interpreter). Messages are one JSON object per line: runner → optimizer on stdin, optimizer → runner on stdout. Optimizer stdout is protocol-only; write debug output to stderr (the runner saves it to optimizer.stderr.log). Receivers ignore unknown keys. NaN/Infinity must not be sent. Current protocol_version is 1.

Messages and turn-taking

Direction type Reply
runner → optimizer init ready (once)
runner → optimizer ask propose (once)
runner → optimizer tell none
runner → optimizer finish none; exit promptly

Only one ask is outstanding at a time. The init reply may take up to 30s, each ask reply up to 60s by default; overruns end the run as optimizer_timeout. A crash, an unparseable line, or an out-of-order message ends the run as failed. The best-so-far is recorded in every case.

init (runner → optimizer):

{"type": "init", "protocol_version": 1, "run_id": "smooth_hill--my_opt--s1",
 "problem": {
   "description": "natural-language prompt",
   "space": [ ...param specs (below)... ],
   "objective": "maximize",
   "budget": {"evaluations": 100, "cost_limit": null, "time_limit_sec": 300.0},
   "fidelities": null
 },
 "optimizer_seed": 12345}

budget always has at least one of evaluations or cost_limit non-null. fidelities, when non-null, is ordered low→high (last entry = top fidelity).

ready / propose (optimizer → runner):

{"type": "ready"}
{"type": "propose", "candidate": {"x0": 0.5, "algo": "alpha"}, "fidelity": "low"}

fidelity is optional; omitted/null means top fidelity. Sending a non-null fidelity to a problem with no fidelities is invalid.

tell (runner → optimizer):

{"type": "tell", "candidate_id": "c-0007", "candidate": {"x0": 0.5},
 "valid": true, "score": 0.73, "cost": 1.0, "fidelity": null, "error": null,
 "remaining": {"evaluations": 92, "cost": null, "time_sec": 291.3}}

When invalid: valid: false, score: null, and error gives the reason.

finish (runner → optimizer): {"type": "finish", "reason": "budget_exhausted"} (reason is budget_exhausted or time_limit).

Search space

[
  {"name": "lr", "type": "float", "low": 1e-4, "high": 1.0, "log": true},
  {"name": "layers", "type": "int", "low": 1, "high": 12},
  {"name": "opt", "type": "categorical", "choices": ["sgd", "adam"]},
  {"name": "warmup", "type": "bool"},
  {"name": "warmup_steps", "type": "int", "low": 10, "high": 1000,
   "condition": {"param": "warmup", "equals": [true]}}
]
  • Types: float, int, categorical, bool. Bounds low/high are inclusive; log: true hints a log scale.
  • A param with condition is active only when candidate[condition.param] is in equals. Conditioning is one level deep (the parent must be unconditional).

A candidate is validated by the runner: it must be a JSON object containing exactly the active params (no unknown keys, no inactive params, none missing), each of the right type and within range.

Budget rules

  • A valid evaluation consumes the evaluator's cost (may depend on the candidate/fidelity); the evaluations axis always consumes 1.
  • An invalid proposal still consumes budget (1 evaluation, cost 1.0), so spamming invalid candidates cannot mine the space for free.
  • The stop check runs before each ask, so the final evaluation may slightly overshoot cost_limit.
  • For problems with fidelities, only top-fidelity evaluations count toward best_score; lower fidelities are available as history but not scored.

Metrics

hypara report recomputes everything from the saved logs. Per run: best score, best candidate, best-so-far curve (over evaluations or cumulative cost), valid rate, status, wall time. Aggregated per (problem, optimizer): mean best, a baseline-relative normalized best and normalized anytime AUC (0 = baseline median, 1 = best observed for that problem), and an overall mean across problems.

Adding a problem

Implement Problem under src/hypara/problems/ and register it in src/hypara/registry.py. Keep the description and the evaluator's actual behavior in sync — the point of the benchmark is that reading the description helps. The shared invariants in tests/test_problems.py (finite scores, determinism given a seed, instance-seed sensitivity) apply automatically.

License

MIT. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hypara-0.1.0.tar.gz (37.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hypara-0.1.0-py3-none-any.whl (36.1 kB view details)

Uploaded Python 3

File details

Details for the file hypara-0.1.0.tar.gz.

File metadata

  • Download URL: hypara-0.1.0.tar.gz
  • Upload date:
  • Size: 37.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for hypara-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7e71d6b29ee1c21b1701afa6e7b1453b85ae7b9dfa7fa966de88d1b7c9c65ed4
MD5 63578c02a52e7c7a7fec1dcb220e080e
BLAKE2b-256 27b35436a2282d4f7c362d2a57f296033ac04ce4012e9b22e527b7f8ba470ce9

See more details on using hashes here.

File details

Details for the file hypara-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: hypara-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 36.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for hypara-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4ada38a4fb9dcbaba8121b5bde35fa95c3c0384eccedde0226a9e7f3c53e313e
MD5 2134e0b15ca6ce1d1ee4b56dceda173d
BLAKE2b-256 54fa0600effee4a77312817b3e17607fb71f322caa050e7cdd6339c1a176125c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page