SDK for the HUD platform.

HUD

HUD is a platform for building RL environments for AI agents across coding, browser, computer-use, and robotics. Define an environment, write tasks, and run them for evaluation and training against any model, at any scale.

To learn more, see the documentation and API reference.

Install

# Install the CLI (recommended)
uv tool install hud-python --python 3.12

# …or as a library
pip install hud-python

Get your API key at hud.ai/project/api-keys and set it:

hud set HUD_API_KEY=your-key-here
# or: export HUD_API_KEY=your-key-here

Then scaffold your first environment:

hud init my-env

The protocol

HUD is protocol-first. An agent and an environment exchange just three things: a manifest (the environment's capabilities and tasks), tasks.start that returns the prompt, and tasks.grade that returns the reward. In between, the agent just works, driving the capabilities itself. HUD owns only that thin envelope, so any model or harness plugs into any environment.

sequenceDiagram
    participant Agent
    participant Env as Environment
    participant Caps as Capabilities (ssh · mcp · cdp · rfb · robot)
    Agent->>Env: manifest exchange
    Env-->>Agent: capabilities + tasks
    Agent->>Env: tasks.start
    Env-->>Agent: prompt
    rect rgb(238,238,238)
    Note over Agent,Caps: the agent works, driving capabilities directly
    Agent->>Caps: shell · browser · GUI · tools · robot
    Caps-->>Agent: observations
    end
    Agent->>Env: tasks.grade
    Env-->>Agent: reward

Because the protocol only exposes capabilities (never a fixed agent), an environment outlives any single harness: new harnesses and models keep running against the same environments, benchmarks, and tasks.
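
The three-message envelope described above can be illustrated with a toy, in-memory model. This is only a sketch of the shape of the exchange: `ToyEnvironment`, `rollout`, and `solve` are hypothetical names invented for illustration and are not part of the HUD SDK or its wire format.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical in-memory stand-in for an environment (not the HUD SDK)."""

    capabilities: list = field(default_factory=list)
    # task name -> (prompt, expected answer)
    tasks: dict = field(default_factory=dict)

    def manifest(self) -> dict:
        # Message 1: advertise capabilities and available tasks.
        return {"capabilities": list(self.capabilities), "tasks": sorted(self.tasks)}

    def start(self, task: str) -> str:
        # Message 2: starting a task returns its prompt.
        prompt, _expected = self.tasks[task]
        return prompt

    def grade(self, task: str, answer: str) -> float:
        # Message 3: grading returns a scalar reward.
        _prompt, expected = self.tasks[task]
        return 1.0 if answer.strip() == expected else 0.0


def rollout(env: ToyEnvironment, task: str, solve) -> float:
    """Run one manifest -> start -> (agent works) -> grade cycle."""
    manifest = env.manifest()
    if task not in manifest["tasks"]:
        raise KeyError(f"unknown task: {task}")
    prompt = env.start(task)
    # The agent works on its own here; the envelope does not constrain how.
    answer = solve(prompt, manifest["capabilities"])
    return env.grade(task, answer)


env = ToyEnvironment(tasks={"add": ("What is 2 + 2?", "4")})
reward = rollout(env, "add", lambda prompt, caps: "4")
```

Because `rollout` only touches the three envelope calls, any `solve` callable can be swapped in without changing the environment, which is the property the protocol is designed around.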

Package & run anywhere

A built image is the end product for your tasks: one build packs every task from a single definition. The recommended path is hud deploy, which builds and registers your environment on HUD in one step; then sync a taskset and run remotely:

hud deploy
hud sync tasks my-taskset
hud eval my-taskset --remote

For local iteration, the same protocol works against a container on your laptop:

hud build .
docker run -d --name run1 my-env
docker exec run1 hud task start fix_bug
docker exec run1 hud task grade fix_bug --answer "…"
docker rm -f run1
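
The local loop above can also be scripted. The sketch below only wraps the exact commands shown, using the standard library; `local_rollout_commands` and `run_local_rollout` are hypothetical helper names, the answer string is a placeholder, and the image is assumed to have been built already with `hud build .`.

```python
import shlex
import subprocess


def local_rollout_commands(image, task, answer, container="run1"):
    """Build the docker command sequence from the text (hypothetical helper)."""
    return [
        ["docker", "run", "-d", "--name", container, image],
        ["docker", "exec", container, "hud", "task", "start", task],
        ["docker", "exec", container, "hud", "task", "grade", task, "--answer", answer],
        ["docker", "rm", "-f", container],
    ]


def run_local_rollout(image, task, answer, container="run1"):
    """Run start and grade in a container, always removing it afterwards."""
    *steps, cleanup = local_rollout_commands(image, task, answer, container)
    try:
        for cmd in steps:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    finally:
        subprocess.run(cleanup, check=False)  # best-effort cleanup


# Inspect the commands without running Docker.
commands = local_rollout_commands("my-env", "fix_bug", "my answer")
printable = [shlex.join(cmd) for cmd in commands]
```

Passing argument lists to `subprocess.run` avoids shell quoting problems when the answer contains spaces or quotes.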

Package & deploy

Environments & templates

A template is an async generator registered with @env.template(): yield a prompt, receive the agent's answer, yield a reward. Calling the template mints a runnable Task; one function spans a whole dataset of variants. The simplest needs no capabilities — just a prompt and a grader:

from hud import Environment

env = Environment(name="letter-count")

@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0

tasks = [count_letter(word=w) for w in ("strawberry", "raspberry", "blueberry")]

Run it immediately against any model:

hud eval tasks.py claude --group 3

Each graded evaluation is a trace (the SDK's live handle is a Run). With HUD_API_KEY set, every rollout is recorded on hud.ai. Tasks that need a shell, browser, GUI, or robot declare capabilities (below); everything else — variants, grading, batching — stays identical.
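
The yield-a-prompt, receive-an-answer, yield-a-reward flow of a template is ordinary Python async-generator behavior. The sketch below drives the same `count_letter` body by hand, without the `@env.template()` decorator; `drive` and `answer_fn` are hypothetical names, and this is not how the SDK itself runs a task.

```python
import asyncio


async def count_letter(word: str = "strawberry", letter: str = "r"):
    # Same body as the template above, minus the decorator.
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0


async def drive(gen, answer_fn):
    """Hypothetical driver: read the prompt, send an answer, read the reward."""
    prompt = await gen.__anext__()  # runs the generator up to the first yield
    reward = await gen.asend(answer_fn(prompt))  # resumes it with the answer
    await gen.aclose()  # finish the generator cleanly
    return prompt, reward


prompt, reward = asyncio.run(drive(count_letter(), lambda p: "3"))
_, wrong = asyncio.run(drive(count_letter(word="blueberry"), lambda p: "3"))
```

Note that the grader uses a substring check, so for "strawberry" an answer such as "13" would also score 1.0; a stricter grader could parse the reply as an integer before comparing.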

Quickstart · Tasks & tasksets

Capabilities & harnesses

A capability is a connection the environment exposes; a harness attaches its own tools to it. The same environment serves a one-shot Q&A or a full computer-use rollout, depending on which capabilities the harness opens.

Protocol      What it exposes
ssh           Shell + files in a sandboxed workspace (env.workspace(root))
mcp           Tools over the Model Context Protocol
cdp           Browser control over the Chrome DevTools Protocol
rfb           Full computer-use over VNC: screen + keyboard/mouse
robot (beta)  Schema-driven robot observation/action loop over WebSocket

Ships natively: Claude, OpenAI (Responses), OpenAI-compatible endpoints, and Gemini via create_agent("claude-sonnet-4-5") (or gpt-…, gemini-…). The harness wires capability-backed tools for the model you choose at run time.

Bring your own: a harness attaches to a capability and defines a tool spec — wrap browser-use on cdp, a VLA policy on robot, or your own agent on ssh / mcp. No protocol work required.

Capabilities · Models · Robots

Deploy on the platform

From the platform UI you can run batches, compare models on the same taskset, and inspect every trace.

Deploy · Leaderboards

Train on rewards

Every rollout returns a Run carrying a trace_id and a reward, so the tasks you evaluate are already training data. Run a group per task and turn the rewards into GRPO advantages with group_relative():

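from hud.agents import create_agent
from hud.eval import Taskset, group_relative

# count_letter is the template defined earlier; words is any iterable of variants.
words = ("strawberry", "raspberry", "blueberry")

agent = create_agent("claude-sonnet-4-5")
# Uses top-level await: run inside an async function or a notebook.
job = await Taskset(count_letter(word=w) for w in words).run(agent, group=16)
for runs in job.results.values():
    advantages = group_relative([r.reward for r in runs], normalize_std=True)
    ...  # feed (run.trace_id, adv) into your optimizer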

HUD is the environment-and-reward source for your own GRPO/PPO loop — the same environment trains any model, text or multimodal, unchanged.
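
For intuition, group-relative advantages in GRPO are usually computed by centering each reward on its group mean and, optionally, dividing by the group standard deviation. The sketch below shows that standard formula; `group_relative_sketch` and its `eps` parameter are assumptions for illustration, and the SDK's `group_relative()` may differ in details such as the epsilon or the variance estimator.

```python
from statistics import fmean, pstdev


def group_relative_sketch(rewards, normalize_std=True, eps=1e-6):
    """Illustrative GRPO-style advantages; not the SDK's group_relative()."""
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every reward in the group is equal.
    return [c / (std + eps) for c in centered]


advantages = group_relative_sketch([1.0, 0.0, 0.0, 1.0])
flat = group_relative_sketch([1.0, 1.0, 1.0])
```

A group in which every rollout earns the same reward yields all-zero advantages and therefore no learning signal, which is why tasks that models sometimes pass and sometimes fail are the most useful for training.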

Training · Designing tasks for signal

Enterprise

Building agents at scale? We work with teams on custom environments, benchmarks, and training.

📅 Book a call · 📧 founders@hud.ai

Contributing

We welcome contributions! See CONTRIBUTING.md.

Key areas: Agents · Environments · Capabilities · Eval

Citation

@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Govind Pimpale and Dylan Bowman and Jaideep and Nguyen Nhat Minh},
  title  = {HUD: An Evaluation and RL Environments Platform for Agents},
  date   = {2025-04},
  url    = {https://github.com/hud-evals/hud-python},
  langid = {en}
}

MIT License · LICENSE
