hud-python

SDK for the HUD platform.

OSS RL environment + evals toolkit. Wrap software as environments, run benchmarks, and train with RL – locally or at scale.

Are you a startup building agents?

📅 Hop on a call or 📧 founders@hud.so

Highlights

🚀 MCP environment skeleton – any agent can call any environment.
⚡️ Live telemetry – inspect every tool call, observation, and reward in real time.
🗂️ Public benchmarks – OSWorld-Verified, SheetBench-50, and more.
🌐 Cloud browsers – AnchorBrowser, Steel, BrowserBase integrations for browser automation.
🛠️ Hot-reload dev loop – hud dev for iterating on environments without rebuilds.
🎓 One-click RL – Run hud rl to get a trained model on any environment.

We welcome contributors and feature requests – open an issue or hop on a call to discuss improvements!

Installation

# SDK - MCP servers, telemetry, evaluation
pip install hud-python

# CLI - RL pipeline, environment design
uv tool install hud-python
# uv tool update-shell

See docs.hud.so, or add docs to any MCP client: claude mcp add --transport http docs-hud https://docs.hud.so/mcp

Before starting, get your HUD_API_KEY at hud.so.
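
The examples in this README read the key either through `settings.api_key` or through `os.getenv('HUD_API_KEY')`. A small pre-flight check can fail fast with a clearer message when the variable is missing. This is a minimal sketch: `require_api_key` is a hypothetical helper written for this example and is not part of hud-python.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return HUD_API_KEY from `env` (defaults to os.environ) or raise.

    Hypothetical helper for illustration; not part of the hud-python SDK.
    """
    env = os.environ if env is None else env
    key = env.get("HUD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("HUD_API_KEY is not set; get a key at hud.so and export it.")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any agent or environment is started.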

Quickstart: Evals

For a tutorial that explains the agent and evaluation design, run:

uvx hud-python quickstart

Or just write your own agent loop (more examples here).

import asyncio, hud, os
from hud.settings import settings
from hud.clients import MCPClient
from hud.agents import ClaudeAgent
from hud.datasets import Task  # See docs: https://docs.hud.so/reference/tasks

async def main() -> None:
    with hud.trace("Quick Start 2048"): # All telemetry works for any MCP-based agent (see https://hud.so)
        task = {
            "prompt": "Reach 64 in 2048.",
            "mcp_config": {
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",  # HUD's cloud MCP server (see https://docs.hud.so/core-concepts/architecture)
                    "headers": {
                        "Authorization": f"Bearer {settings.api_key}",  # Get your key at https://hud.so
                        "Mcp-Image": "hudpython/hud-text-2048:v1.2"  # Docker image from https://hub.docker.com/u/hudpython
                    }
                }
            },
            "evaluate_tool": {"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}},
        }
        task = Task(**task)

        # 1. Define the client explicitly:
        client = MCPClient(mcp_config=task.mcp_config)
        agent = ClaudeAgent(
            mcp_client=client,
            model="claude-sonnet-4-20250514",  # requires ANTHROPIC_API_KEY
        )

        result = await agent.run(task)

        # 2. Or just:
        # result = await ClaudeAgent().run(task)

        print(f"Reward: {result.reward}")
        await client.shutdown()

asyncio.run(main())

The above example lets the agent play 2048 (see replay).

Agent playing 2048
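
When running several variations of the same task, the nested dictionary from the Quickstart can be produced by a small factory. The sketch below only rebuilds the dictionary shape shown above (same URL, header names, and `evaluate_tool` layout); `build_task` and its parameters are hypothetical names chosen for this example, and the result is a plain dict that could be passed to `Task(**task)` as in the Quickstart.

```python
from typing import Any, Dict

HUD_MCP_URL = "https://mcp.hud.so/v3/mcp"  # URL taken from the Quickstart above


def build_task(prompt: str, image: str, api_key: str, target: int) -> Dict[str, Any]:
    """Build a task dict shaped like the Quickstart example (hypothetical helper)."""
    return {
        "prompt": prompt,
        "mcp_config": {
            "hud": {
                "url": HUD_MCP_URL,
                "headers": {
                    "Authorization": f"Bearer {api_key}",
                    "Mcp-Image": image,
                },
            }
        },
        # Same evaluator layout as the Quickstart: evaluate -> max_number -> target
        "evaluate_tool": {
            "name": "evaluate",
            "arguments": {"name": "max_number", "arguments": {"target": target}},
        },
    }


# Example: three difficulty levels of the same 2048 task
tasks = [
    build_task(f"Reach {n} in 2048.", "hudpython/hud-text-2048:v1.2", "YOUR_KEY", n)
    for n in (64, 128, 256)
]
```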

Quickstart: Training

Train a Qwen2.5-VL model with GRPO on any HUD dataset:

hud get hud-evals/2048-basic # from HF
hud rl 2048-basic.json

See agent training docs
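
For readers new to GRPO (Group Relative Policy Optimization): it scores each rollout relative to the other rollouts sampled for the same prompt, instead of relying on a learned value function. The sketch below shows only that normalization step in plain Python. It illustrates the general technique and is not the code that `hud rl` runs; `group_relative_advantages` and `eps` are names chosen for this example, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four rollouts of one task
advantages = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one; if every rollout earns the same reward, all advantages are zero and the group contributes no learning signal.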

Or make your own environment and dataset:

hud init my-env && cd my-env
hud dev --interactive
# When ready to run:
hud rl

See environment design docs

Benchmarking Agents

This is Claude Computer Use running on our proprietary financial analyst benchmark SheetBench-50:

Trace screenshot

See this trace on hud.so

This example runs the full dataset (only takes ~20 minutes) using run_evaluation.py:

python examples/run_evaluation.py hud-evals/SheetBench-50 --full --agent claude

Or in code:

import asyncio
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

async def main() -> None:
    results = await run_dataset(
        name="My SheetBench-50 Evaluation",
        dataset="hud-evals/SheetBench-50",      # <-- HuggingFace dataset
        agent_class=ClaudeAgent,                # <-- Your custom agent can replace this (see https://docs.hud.so/evaluate-agents/create-agents)
        agent_config={"model": "claude-sonnet-4-20250514"},
        max_concurrent=50,
        max_steps=30,
    )
    print(f"Average reward: {sum(r.reward for r in results) / len(results):.2f}")

asyncio.run(main())  # top-level await only works in notebooks/REPLs

Running a dataset creates a job and streams results to the hud.so platform for analysis and leaderboard submission.
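
The one-line average above raises `ZeroDivisionError` on an empty result list. A slightly more defensive summary might look like the sketch below. It assumes only that each result exposes a `.reward` attribute, as in the example above; `FakeResult`, `summarize`, the `threshold` parameter, and the possibility of a `None` reward are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeResult:
    """Stand-in for the objects returned by run_dataset; only `.reward` is assumed."""
    reward: Optional[float]


def summarize(results: Iterable, threshold: float = 1.0) -> dict:
    """Mean reward and pass rate, skipping results without a reward."""
    rewards = [r.reward for r in results if r.reward is not None]
    if not rewards:
        return {"n": 0, "mean": None, "pass_rate": None}
    return {
        "n": len(rewards),
        "mean": sum(rewards) / len(rewards),
        # Fraction of tasks whose reward reaches the (assumed) pass threshold
        "pass_rate": sum(r >= threshold for r in rewards) / len(rewards),
    }


summary = summarize([FakeResult(1.0), FakeResult(0.0), FakeResult(None)])
```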

Building Environments (MCP)

Wrap any piece of software as an interactive MCP environment in 5 steps:

Define the MCP server layer using MCPServer:

from hud.server import MCPServer
from hud.tools import HudComputerTool

mcp = MCPServer("My Environment")

# Add hud tools (see all tools: https://docs.hud.so/reference/tools)
mcp.tool(HudComputerTool())

# Or custom tools (see https://docs.hud.so/build-environments/adapting-software)
@mcp.tool("launch_app")
def launch_app(name: str = "Gmail"):
    ...  # launch the named app inside the environment

if __name__ == "__main__":
    mcp.run()

Write a simple Dockerfile that installs packages and runs:

CMD ["python", "-m", "hud_controller.server"]

And build the image:

hud build # runs docker build under the hood

Or run it in interactive mode:

hud dev

Debug it with the CLI to see if it launches:

$ hud debug my-name/my-environment:latest

✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize 
✓ Phase 3: Tools are discoverable
✓ Phase 4: Basic tool execution works
✓ Phase 5: Parallel performance is good

Progress: [█████████████████████] 5/5 phases (100%)
✅ All phases completed successfully!
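
In CI it can be handy to turn this report into structured data. The sketch below parses phase lines shaped like the sample output above. The line format is inferred from that sample only and may differ between versions (real terminal output may also contain color codes); `parse_phases` is a hypothetical helper, and treating any marker other than `✓` as a failure is an assumption.

```python
import re
from typing import List, Tuple

# One marker character, then "Phase N: title" (format inferred from the sample above)
PHASE_RE = re.compile(r"^\s*(\S)\s+Phase (\d+):\s*(.+?)\s*$")


def parse_phases(output: str) -> List[Tuple[int, bool, str]]:
    """Return (phase number, passed, title) for every phase line in `output`."""
    phases = []
    for line in output.splitlines():
        match = PHASE_RE.match(line)
        if match:
            mark, number, title = match.groups()
            phases.append((int(number), mark == "✓", title))
    return phases


sample = """$ hud debug my-name/my-environment:latest

✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize 
✓ Phase 3: Tools are discoverable
"""
all_passed = all(passed for _, passed, _ in parse_phases(sample))
```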

Analyze it to see if all tools appear:

$ hud analyze hudpython/hud-remote-browser:latest
⠏ ✓ Analysis complete
...
Tools
├── Regular Tools
│   ├── computer
│   │   └── Control computer with mouse, keyboard, and screenshots
...
└── Hub Tools
    ├── setup
    │   ├── navigate_to_url
    │   ├── set_cookies
    │   ├── ...
    └── evaluate
        ├── url_match
        ├── page_contains
        ├── cookie_exists
        ├── ...

📡 Telemetry Data
 Live URL  https://live.anchorbrowser.io?sessionId=abc123def456

When the tests pass, push it up to the docker registry:

hud push # needs docker login, hud api key

Now you can use mcp.hud.so to launch 100s of instances of this environment in parallel with any agent, and see everything live on hud.so:

import asyncio
import os

from hud.agents import ClaudeAgent

async def main() -> None:
    result = await ClaudeAgent().run({  # See all agents: https://docs.hud.so/reference/agents
        "prompt": "Please explore this environment",
        "mcp_config": {
            "my-environment": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "my-name/my-environment:latest"
                }
            }
            # "my-environment": { # or use hud run which wraps local and remote running
            #     "cmd": "hud",
            #     "args": [
            #         "run",
            #         "my-name/my-environment:latest",
            #     ]
            # }
        }
    })
    print(f"Reward: {result.reward}")

asyncio.run(main())
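
To fan out many episodes from your own code, a common pattern is to bound concurrency with a semaphore. The sketch below shows that pattern with a placeholder coroutine so it runs without network access; `run_bounded` and `fake_episode` are hypothetical names, and in real use the worker would await something like `ClaudeAgent().run(task)`. Note that `run_dataset` already accepts `max_concurrent`, so this is only relevant for custom loops.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int = 50,
) -> List[R]:
    """Run `worker` over `items` with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list
    return list(await asyncio.gather(*(guarded(item) for item in items)))


async def fake_episode(task_id: int) -> float:
    """Placeholder for an agent run; returns a dummy reward."""
    await asyncio.sleep(0)
    return float(task_id % 2)


rewards = asyncio.run(run_bounded(list(range(6)), fake_episode, max_concurrent=2))
```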

See the full environment design guide and common pitfalls in environments/README.md

Leaderboards & benchmarks

All leaderboards are publicly available on hud.so/leaderboards (see docs)

Leaderboard

We strongly recommend running 3-5 evaluations per dataset so that reported results are consistent across jobs.
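
One way to report repeated evaluations is the mean and spread of the per-job average rewards. This is a minimal sketch using only the standard library; `aggregate_runs` is a hypothetical helper and the numbers are placeholders, not measured results.

```python
from statistics import mean, stdev
from typing import Sequence


def aggregate_runs(run_means: Sequence[float]) -> dict:
    """Mean and sample standard deviation across repeated evaluation jobs."""
    if not run_means:
        raise ValueError("need at least one run")
    # stdev needs two or more points; report 0.0 for a single run
    spread = stdev(run_means) if len(run_means) > 1 else 0.0
    return {"runs": len(run_means), "mean": mean(run_means), "stdev": spread}


# Placeholder per-job average rewards from three jobs on the same dataset
report = aggregate_runs([0.4, 0.6, 0.5])
```

A large standard deviation relative to the mean suggests that more runs are needed before comparing agents.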

Using the run_dataset function with a HuggingFace dataset automatically assigns your job to that leaderboard page and lets you create a scorecard from it.

Reinforcement Learning with GRPO

This is a Qwen‑2.5‑VL‑3B agent training a policy on the 2048-basic browser environment:

RL curve

Train with the new interactive hud rl flow:

# Install CLI
uv tool install hud-python

# Option A: Run directly from a HuggingFace dataset
hud rl hud-evals/2048-basic

# Option B: Download first, modify, then train
hud get hud-evals/2048-basic
hud rl 2048-basic.json

# Optional: baseline evaluation
hud eval 2048-basic.json
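
For Option B, the "modify" step can be scripted. The sketch below assumes the downloaded file is a JSON array of task objects with a `prompt` field, like the task dict in Quickstart: Evals; that layout is an assumption, so inspect the file before relying on it. `rewrite_tasks` and `normalize_prompt` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Callable


def rewrite_tasks(path: Path, transform: Callable[[dict], dict]) -> int:
    """Apply `transform` to every task in a JSON array file; return the task count."""
    tasks = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(tasks, list):
        raise ValueError("expected a JSON array of task objects")
    updated = [transform(dict(task)) for task in tasks]
    path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
    return len(updated)


def normalize_prompt(task: dict) -> dict:
    # Hypothetical edit: collapse stray whitespace in the prompt field
    task["prompt"] = " ".join(task.get("prompt", "").split())
    return task


# Usage (after `hud get hud-evals/2048-basic`):
# rewrite_tasks(Path("2048-basic.json"), normalize_prompt)
```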

Supports multi‑turn RL for both:

Language‑only models (e.g., Qwen/Qwen2.5-7B-Instruct)
Vision‑Language models (e.g., Qwen/Qwen2.5-VL-3B-Instruct)

By default, hud rl provisions a persistent server and trainer in the cloud, streams telemetry to hud.so, and lets you monitor/manage models at hud.so/models. Use --local to run entirely on your machines (typically 2+ GPUs: one for vLLM, the rest for training).
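
The "one GPU for vLLM, the rest for training" split can be pictured with a few lines of Python. This sketch only illustrates the division described above; it is not how `hud rl --local` assigns devices, and `split_gpus` and the comma-joined device strings are assumptions made for this example.

```python
from typing import List, Tuple


def split_gpus(gpu_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Reserve the first GPU for inference and the remainder for training."""
    if len(gpu_ids) < 2:
        raise ValueError("local training typically needs 2+ GPUs")
    return gpu_ids[:1], gpu_ids[1:]


inference, training = split_gpus([0, 1, 2, 3])
# Device lists in the comma-separated form commonly used for CUDA_VISIBLE_DEVICES
inference_devices = ",".join(map(str, inference))
training_devices = ",".join(map(str, training))
```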

Any HUD MCP environment and evaluation works with our RL pipeline (including remote configurations). See the guided docs: https://docs.hud.so/train-agents/quickstart.

Pricing: Hosted vLLM and training GPU rates are listed in the Training Quickstart → Pricing. Manage billing at the HUD billing dashboard.

Architecture

%%{init: {"theme": "neutral", "themeVariables": {"fontSize": "14px"}} }%%
graph LR
    subgraph "Platform"
        Dashboard["📊 hud.so"]
        API["🔌 mcp.hud.so"]
    end
  
    subgraph "hud"
        Agent["🤖 Agent"]
        Task["📋 Task"]
        SDK["📦 SDK"]
    end
  
    subgraph "Environments"
        LocalEnv["🖥️ Local Docker<br/>(Development)"]
        RemoteEnv["☁️ Remote Docker<br/>(100s Parallel)"]
    end
  
    subgraph "otel"
        Trace["📡 Traces & Metrics"]
    end
  
    Dataset["📚 Dataset<br/>(HuggingFace)"]
  
    AnyMCP["🔗 Any MCP Client<br/>(Cursor, Claude, Custom)"]
  
    Agent <--> SDK
    Task --> SDK
    Dataset <-.-> Task
    SDK <-->|"MCP"| LocalEnv
    SDK <-->|"MCP"| API
    API  <-->|"MCP"| RemoteEnv
    SDK  --> Trace
    Trace --> Dashboard
    AnyMCP -->|"MCP"| API

CLI reference

Command	Purpose	Docs
`hud init`	Create new environment with boilerplate.	📖
`hud dev`	Hot-reload development with Docker.	📖
`hud build`	Build image and generate lock file.	📖
`hud push`	Share environment to registry.	📖
`hud pull <target>`	Get environment from registry.	📖
`hud analyze <image>`	Discover tools, resources, and metadata.	📖
`hud debug <image>`	Five-phase health check of an environment.	📖
`hud run <image>`	Run MCP server locally or remotely.	📖

Roadmap

Merging our forks into the main mcp and mcp_use repositories
Helpers for building new environments (see current guide)
Integrations with every major agent framework
Evaluation environment registry
MCP opentelemetry standard

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Key areas:

Environment examples - Add new MCP environments
Agent implementations - Add support for new LLM providers
Tool library - Extend the built-in tool collection
RL training - Improve reinforcement learning pipelines

Thanks to all our contributors!

Citation

@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title  = {HUD: An Evaluation Platform for Agents},
  date   = {2025-04},
  url    = {https://github.com/hud-evals/hud-python},
  langid = {en}
}

License: HUD is released under the MIT License – see the LICENSE file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hud_python-0.4.61.tar.gz (465.8 kB view details)

Uploaded Nov 7, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hud_python-0.4.61-py3-none-any.whl (574.2 kB view details)

Uploaded Nov 7, 2025 Python 3

File details

Details for the file hud_python-0.4.61.tar.gz.

File metadata

Download URL: hud_python-0.4.61.tar.gz
Upload date: Nov 7, 2025
Size: 465.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.7

File hashes

Hashes for hud_python-0.4.61.tar.gz
Algorithm	Hash digest
SHA256	`fa40108b415bbb5437e6d9587024ea88ea69b62d649bb75583e94cc2f3e92f41`
MD5	`a5c4393f7ed334af8ca6db1420bf7428`
BLAKE2b-256	`af731285eeaed67d0db7d9775d5575c128e956edb266762022bc56a85cbb643f`

See more details on using hashes here.

File details

Details for the file hud_python-0.4.61-py3-none-any.whl.

File metadata

Download URL: hud_python-0.4.61-py3-none-any.whl
Upload date: Nov 7, 2025
Size: 574.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.7

File hashes

Hashes for hud_python-0.4.61-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5c651766a7d54d9dc4a98035e03ac7b1355a7cf24585db21dd401a660ef5d4ec`
MD5	`b522a97d26f0554f8a35eda0514c5dc1`
BLAKE2b-256	`30539a9c8ac9cfe4d0d68c7fcdde842416c4db30cb73894b8760c4d03a5b6a00`

See more details on using hashes here.

hud-python 0.4.61
