Benchmark SDK for instrumenting AI components

zhanla-sdk-py

zhanla-sdk-py is the Python SDK for defining Benchmark components in code.

You use it to declare tools, skills, agents, orchestrations, and evals as Python objects, then run them with zhanla.

Installation

Install the SDK:

pip install zhanla-sdk-py

Requires Python >=3.10.

The SDK itself has no runtime dependencies. Provider packages (anthropic, openai, google-genai) are optional and only required if you use bench.wrap().

If you want to execute components from the command line, install the CLI too:

pip install zhanla

Quick Start

Create a Python file with module-level component instances:

import json

import anthropic
import zhanla as bench

# Wrap the client so LLM calls made through it are recorded.
client = bench.wrap(anthropic.Anthropic())


def _classify(message: str, customer_tier: str = "standard", **_) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system='Reply with JSON: {"priority": "high|normal|low"}.',
        messages=[{"role": "user", "content": message}],
    )
    # The model is asked to reply with a JSON object, so parse it directly.
    result = json.loads(response.content[0].text)
    result["customer_tier"] = customer_tier
    return result


priority_tool = bench.Tool(
    name="priority_tool",
    description="Classify the priority of a support message.",
    fn=_classify,
    input_schema={"message": str, "customer_tier": str},
    output_schema={
        "priority": str,
        "customer_tier": str,
    },
)


def priority_eval(model_response, expected_output, **_) -> dict:
    return {
        "score": 1.0 if model_response["priority"] == expected_output["priority"] else 0.0
    }


priority_eval_component = bench.CodeEval(
    name="priority_eval",
    description="Check whether the predicted priority matches the expected value.",
    input_schema={},
    fn=priority_eval,
)

Run it with the CLI:

bench run components.py:priority_tool --dataset tickets.json --eval components.py:priority_eval

If a file contains exactly one runnable component, :component_name is optional.
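Because the eval function in the Quick Start is a plain Python function, it can be exercised directly before involving the CLI. The snippet below repeats that function and calls it with hypothetical sample values; it does not import the SDK and says nothing about how the CLI itself passes arguments.

```python
def priority_eval(model_response, expected_output, **_) -> dict:
    """Same scoring function as in the Quick Start."""
    return {
        "score": 1.0 if model_response["priority"] == expected_output["priority"] else 0.0
    }


# Hypothetical sample values, for illustration only.
match = priority_eval({"priority": "high"}, {"priority": "high"})
# Extra keyword arguments are absorbed by **_ and ignored.
mismatch = priority_eval({"priority": "low"}, {"priority": "high"}, extra="ignored")
print(match, mismatch)
```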

Core Concepts

The SDK is class-based. The public import is:

import zhanla as bench

Define components as module-level objects so the CLI can discover them when it imports your file.

Runnable Components

`Tool`

Use a Tool for deterministic Python logic.

lookup_customer = bench.Tool(
    name="lookup_customer",
    description="Fetch a customer record by ID.",
    fn=get_customer,
    input_schema={"customer_id": str},
    output_schema={"id": str, "email": str},
)

Requirements:

name
description
fn
input_schema
output_schema

Notes:

fn can be sync or async.
input_schema can be a simple dict like {"field": str}, a JSON-Schema-shaped dict, or a Pydantic model class.
If fn returns a non-dict value at runtime, the CLI wraps it as {"result": value}.
The CLI validates the first produced output against output_schema.
output_schema can be a simple dict like {"field": str} or a Pydantic model class.
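The wrapping and validation notes above can be pictured with a small stand-alone sketch. It is not the CLI's implementation: wrap_tool_output and check_simple_schema are hypothetical helpers, and the check only covers the simple {"field": type} schema form, not JSON Schema or Pydantic models.

```python
from typing import Any


def wrap_tool_output(value: Any) -> dict:
    """Pass dicts through; wrap anything else as {"result": value}."""
    return value if isinstance(value, dict) else {"result": value}


def check_simple_schema(output: dict, schema: dict) -> list:
    """Compare an output dict to a {"field": type} schema and list the problems."""
    problems = []
    for field, expected_type in schema.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            actual = type(output[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


wrapped = wrap_tool_output(42)
clean = check_simple_schema({"id": "c1", "email": "a@example.com"}, {"id": str, "email": str})
broken = check_simple_schema({"id": "c1", "email": 5}, {"id": str, "email": str})
print(wrapped, clean, broken)
```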

`Skill`

Use a Skill for reusable instructions, optionally with tools attached.

summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
)

With tools and an output schema:

summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
    tools=[lookup_customer],
    output_schema={"summary": str},
)

Requirements:

name
description
instructions

Notes:

tools and output_schema are optional.
Skills are prompt-only definitions. They cannot be executed directly in the local CLI; they are composed into Agents and Orchestrations.

`Agent`

Use an Agent to define an LLM-backed component with instructions and references to other components.

import zhanla as bench

support_agent = bench.Agent(
    name="support_agent",
    description="Respond to support requests.",
    instructions="Answer clearly and use available tools when needed.",
    model="claude-sonnet-4-6",
    tools=[lookup_customer],
    skills=[summarize_ticket],
    output_schema={"answer": str},
)

Requirements:

name
description
instructions
model

Notes:

tools, skills, agents, and output_schema are optional.
Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.

`LLMProcessor`

Use an LLMProcessor when you want a prompt-defined LLM transformation step.

import zhanla as bench

intent_classifier = bench.LLMProcessor(
    name="intent_classifier",
    description="Classify the user's intent.",
    instructions="Return the intent as billing, technical, or other.",
    model="claude-haiku-4-5",
    output_schema={"intent": str},
)

Requirements:

name
description
instructions
model

Notes:

output_schema is optional.
Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.

`Orchestration`

Use an Orchestration to compose multiple steps into a DAG.

support_pipeline = bench.Orchestration(
    name="support_pipeline",
    description="Classify intent, then draft a reply.",
    steps=[
        bench.Step(component=intent_classifier, name="classify", next=["reply"]),
        bench.Step(component=support_agent, name="reply"),
    ],
)

Requirements:

name
description
steps

Notes:

bench.Step is an alias for bench.OrchestrationStep.
Step names must be unique.
next targets must point to existing steps.
Cycles are rejected by CLI validation.
During execution, each step receives the accumulated state dictionary.
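The three structural rules above (unique names, valid next targets, no cycles) can be sketched without the SDK. This is an illustration under assumptions, not the CLI's validator: steps are represented as hypothetical (name, next_names) pairs, and cycles are found with a depth-first search.

```python
from typing import Optional


def validate_steps(steps: list) -> None:
    """Raise ValueError for duplicate names, unknown targets, or cycles."""
    names = [name for name, _ in steps]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate step names: {duplicates}")

    graph = dict(steps)
    for name, targets in graph.items():
        for target in targets:
            if target not in graph:
                raise ValueError(f"step {name!r} points to unknown step {target!r}")

    # Depth-first search: reaching a step that is still in progress means a cycle.
    in_progress, done = set(), set()

    def visit(name: str) -> None:
        in_progress.add(name)
        for target in graph[name]:
            if target in in_progress:
                raise ValueError(f"cycle detected at step {target!r}")
            if target not in done:
                visit(target)
        in_progress.discard(name)
        done.add(name)

    for name in graph:
        if name not in done:
            visit(name)


def first_problem(steps: list) -> Optional[str]:
    """Return the validation error message, or None if the steps are valid."""
    try:
        validate_steps(steps)
    except ValueError as exc:
        return str(exc)
    return None


ok = first_problem([("classify", ["reply"]), ("reply", [])])
cycle = first_problem([("classify", ["reply"]), ("reply", ["classify"])])
dangling = first_problem([("classify", ["missing"])])
duplicate = first_problem([("reply", []), ("reply", [])])
print(ok, cycle, dangling, duplicate, sep="\n")
```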

`Conditional`

Use Conditional inside an orchestration to route between steps.

bench.Step(
    component=bench.Conditional(
        condition=lambda state: state["classify"]["intent"] == "billing",
        if_true="billing_reply",
        if_false="general_reply",
    ),
    name="route",
)

Conditional does not emit output. It only chooses the next step.
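To make the routing behavior concrete, here is a stand-alone sketch of a runner that accumulates state and treats a conditional step as routing only. It assumes each step's output is stored in the state under the step's name, which is what the state["classify"]["intent"] lookup above suggests; run_pipeline and the dict-based step layout are hypothetical and not part of the SDK.

```python
def run_pipeline(steps: dict, start: str, state: dict) -> dict:
    """Walk the steps from `start`, storing each output under its step name."""
    current = start
    while current is not None:
        step = steps[current]
        if "condition" in step:
            # Routing step: writes nothing to the state, only picks the next step.
            current = step["if_true"] if step["condition"](state) else step["if_false"]
        else:
            state[current] = step["fn"](state)
            current = step.get("next")
    return state


# Hypothetical steps standing in for real components.
steps = {
    "classify": {
        "fn": lambda s: {"intent": "billing" if "invoice" in s["message"] else "other"},
        "next": "route",
    },
    "route": {
        "condition": lambda s: s["classify"]["intent"] == "billing",
        "if_true": "billing_reply",
        "if_false": "general_reply",
    },
    "billing_reply": {"fn": lambda s: {"answer": "Routing to billing."}},
    "general_reply": {"fn": lambda s: {"answer": "Routing to general support."}},
}

final = run_pipeline(steps, "classify", {"message": "My invoice is wrong"})
print(sorted(final))
```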

Eval Components

`CodeEval`

Use a CodeEval for Python-based scoring logic.

quality_eval = bench.CodeEval(
    name="quality_eval",
    description="Score whether the answer is acceptable.",
    input_schema={},
    fn=score_answer,
)

Requirements:

name
description
fn

Notes:

fn can be sync or async.
If the eval returns a non-dict value, runtime wraps it as {"score": value}.
model_response_format defaults to "JSON" and can also be set to "TEXT" or "YAML".
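The first two notes can be illustrated with a small sketch that calls either a sync or an async scoring function and normalizes the result. run_eval is a hypothetical helper written for this example; how the real runtime schedules async functions is not described here and may differ.

```python
import asyncio
import inspect


def run_eval(fn, model_response, expected_output) -> dict:
    """Call a sync or async scoring function and normalize the result to a dict."""
    if inspect.iscoroutinefunction(fn):
        # asyncio.run starts a fresh event loop; it cannot be nested inside a running one.
        result = asyncio.run(fn(model_response, expected_output))
    else:
        result = fn(model_response, expected_output)
    return result if isinstance(result, dict) else {"score": result}


def exact_match(model_response, expected_output):
    """Sync eval returning a bare float."""
    return 1.0 if model_response == expected_output else 0.0


async def async_length_check(model_response, expected_output):
    """Async eval returning a dict."""
    return {"score": 1.0 if len(model_response) <= len(expected_output) else 0.0}


sync_result = run_eval(exact_match, "high", "high")
async_result = run_eval(async_length_check, "hi", "hello")
print(sync_result, async_result)
```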

`LLMEval`

Use an LLMEval for prompt-defined scoring.

tone_eval = bench.LLMEval(
    name="tone_eval",
    description="Check response tone.",
    instructions="Return a score from 0.0 to 1.0 and a short reason.",
    model="your-model-id",
    output_schema={"score": float, "reason": str},
)

Requirements:

name
description
instructions
model

Notes:

output_schema is optional.
Local CLI execution is currently placeholder-based.

`Checklist`

Use a Checklist to combine multiple evals with optional weights.

answer_quality = bench.Checklist(
    name="answer_quality",
    description="Combine correctness and tone scores.",
    evals=[quality_eval, tone_eval],
    weights=[0.8, 0.2],
)

Notes:

If weights is omitted, each eval gets weight 1.0.
Weights must be positive and must match the number of evals.
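One plausible way to combine weighted scores is a weighted mean, sketched below. Whether Checklist normalizes by the sum of the weights is an assumption of this example; combine_scores is a hypothetical helper, and only the default weight of 1.0 and the two validation rules come from the notes above.

```python
from typing import Optional


def combine_scores(scores: list, weights: Optional[list] = None) -> float:
    """Weighted mean of eval scores; every weight defaults to 1.0."""
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("weights must match the number of evals")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


weighted = combine_scores([1.0, 0.5], [0.8, 0.2])
unweighted = combine_scores([1.0, 0.0])
print(weighted, unweighted)
```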

`EvalTree`

Use an EvalTree for score-based branching.

adaptive_eval = bench.EvalTree(
    name="adaptive_eval",
    description="Route to different evals based on an initial score.",
    root=bench.Branch(
        eval=quality_eval,
        threshold=0.8,
        if_pass=[bench.Edge(weight=1.0, node=bench.Leaf(eval=quality_eval))],
        if_fail=[bench.Edge(weight=1.0, node=bench.Leaf(eval=tone_eval))],
    ),
)

Notes:

Branch thresholds must be between 0.0 and 1.0.
Edge weights must be positive.
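Score-based branching can be sketched with a few dataclasses. The semantics here are assumptions made for illustration: a branch passes when its score is greater than or equal to the threshold, and the scores of the chosen edges are combined as a weighted mean. The Leaf, Edge, and Branch classes below are local stand-ins, not the SDK's classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Leaf:
    eval: Callable[[str], float]


@dataclass
class Edge:
    weight: float
    node: object  # a Leaf or a Branch


@dataclass
class Branch:
    eval: Callable[[str], float]
    threshold: float
    if_pass: list
    if_fail: list


def score_tree(node, response: str) -> float:
    """Score a response by walking the tree from `node`."""
    if isinstance(node, Leaf):
        return node.eval(response)
    gate = node.eval(response)
    edges = node.if_pass if gate >= node.threshold else node.if_fail
    total_weight = sum(edge.weight for edge in edges)
    return sum(edge.weight * score_tree(edge.node, response) for edge in edges) / total_weight


# Hypothetical scoring functions standing in for real evals.
def is_long(response: str) -> float:
    return 1.0 if len(response) > 10 else 0.0


def is_polite(response: str) -> float:
    return 1.0 if "please" in response else 0.0


tree = Branch(
    eval=is_long,
    threshold=0.8,
    if_pass=[Edge(1.0, Leaf(is_polite))],
    if_fail=[Edge(1.0, Leaf(lambda response: 0.0))],
)

passed = score_tree(tree, "please restart the router")
failed = score_tree(tree, "no")
print(passed, failed)
```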

Discovery And CLI Usage

The CLI discovers components by importing your Python file and scanning module-level attributes for bench component instances.

That means:

your file is executed at import time
module-level side effects will run during discovery
components should usually be defined at module scope
if a file contains multiple runnable components, use file.py:component_name
evals are referenced separately with --eval file.py:eval_name
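The import-and-scan behavior described above can be approximated with the standard library. This sketch is not the CLI's discovery code: discover and FakeTool are hypothetical, and components are recognized by a caller-supplied predicate rather than by the SDK's own classes. It shows why objects created inside functions are not found and why module-level code runs during discovery.

```python
import importlib.util
import tempfile
from pathlib import Path

SOURCE = '''
class FakeTool:
    def __init__(self, name):
        self.name = name

priority_tool = FakeTool("priority_tool")
helper_constant = 42

def build_later():
    return FakeTool("hidden")  # created inside a function, so never discovered
'''


def discover(path, is_component) -> dict:
    """Import a Python file and return its public module-level components."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file, including side effects
    return {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_") and is_component(value)
    }


with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "components.py"
    file_path.write_text(SOURCE)
    found = discover(file_path, lambda value: type(value).__name__ == "FakeTool")

print(sorted(found))
```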

Example:

bench run workflow.py:support_pipeline --dataset tickets.json --eval evals.py:answer_quality

Validation Rules

Before execution, the CLI validates component structure.

Tool must provide a callable fn and a non-None output_schema
CodeEval must provide a callable fn
Skill, Agent, LLMProcessor, and LLMEval must provide instructions
Agent, LLMProcessor, and LLMEval must provide model
Orchestration steps must reference valid targets and must not contain cycles
Checklist weights must match the eval count and all be positive
EvalTree branch thresholds must stay in [0.0, 1.0] and edge weights must be positive

Local Execution Caveats

The SDK defines the component model. The current local CLI runtime dispatches as follows:

Tool — executes fn
CodeEval — executes fn
Skill — raises an error; Skills are prompt-only and cannot be executed directly
Agent — requires a configured runner and model; calls the runner to generate a response
LLMProcessor — requires a configured runner and model; calls the runner to generate a response
LLMEval — requires a configured runner and model; calls the runner to score the response
Orchestration — executes its steps locally and returns the last step output

When a runner is set and a wrapped client is used, actual LLM calls are made and traced.

Version Hashing

Every component exposes version_hash() for stable content-based versioning.

lookup_customer.version_hash()
support_agent.version_hash()
answer_quality.version_hash()

Highlights:

descriptions do not affect the hash
Tool hashes function source and output_schema
Agent hashes instructions, model, referenced component names, and output_schema
Checklist hashes referenced eval names and weights
EvalTree hashes its tree structure
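Content-based versioning of this kind is commonly built by serializing the relevant fields in a canonical form and hashing the result. The sketch below follows the Agent rule listed above; the choice of SHA-256, the JSON serialization, and the sorting of tool names are assumptions of this example, and agent_hash is a hypothetical helper, not the SDK's version_hash().

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Hash a dict in a canonical (sorted-key) JSON form."""
    canonical = json.dumps(payload, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def agent_hash(instructions, model, tool_names, output_schema, description="") -> str:
    """Hash the fields that define behavior; the description is left out on purpose."""
    return content_hash(
        {
            "instructions": instructions,
            "model": model,
            "tools": sorted(tool_names),
            "output_schema": output_schema,
        }
    )


base = agent_hash("Answer clearly.", "model-x", ["lookup_customer"], {"answer": str},
                  description="Respond to support requests.")
reworded = agent_hash("Answer clearly.", "model-x", ["lookup_customer"], {"answer": str},
                      description="Handles support tickets.")
changed = agent_hash("Answer briefly.", "model-x", ["lookup_customer"], {"answer": str})
print(base == reworded, base == changed)
```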

Parsing Model Output

Use bench.parse_json_response(text) to extract JSON from raw model text, including fenced code blocks.

import zhanla as bench

text = client.messages.create(...).content[0].text
result = bench.parse_json_response(text)

This handles responses wrapped in ```json fences as well as bare JSON strings.
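For readers curious how this kind of extraction can work, here is a minimal stand-alone sketch using a regular expression. It is not the SDK's implementation: extract_json is a hypothetical function that handles only the first fenced block or a bare JSON string.

```python
import json
import re

TICKS = "`" * 3  # three backticks, the Markdown fence marker
_FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(text: str):
    """Parse JSON from the first fenced block if there is one, else from the whole text."""
    match = _FENCE.search(text)
    candidate = match.group(1) if match else text
    return json.loads(candidate.strip())


fenced_text = "Here you go:\n" + TICKS + 'json\n{"priority": "high"}\n' + TICKS
from_fence = extract_json(fenced_text)
from_bare = extract_json('  {"priority": "low"}  ')
print(from_fence, from_bare)
```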

LLM Call Observability

`bench.wrap(client)`

Wrap an LLM client so every call made through it is recorded against the current eval run.

import anthropic
import openai
import zhanla as bench

# Anthropic
client = bench.wrap(anthropic.Anthropic())

# OpenAI (also covers OpenRouter via base_url)
client = bench.wrap(openai.OpenAI())

The wrapped client exposes the same interface as the original. bench.wrap() only observes calls; it does not re-implement any LLM logic.

Supported clients:

Client	Install
`anthropic.Anthropic`	`pip install anthropic`
`anthropic.AsyncAnthropic`	`pip install anthropic`
`openai.OpenAI`	`pip install openai`
`openai.AsyncOpenAI`	`pip install openai`
`google.genai.Client`	`pip install google-genai`

When bench.wrap() is active and llm_function is called by the CLI, each LLM call captures:

provider, model
input messages, output, tool calls, raw response
input/output token counts
latency
stop reason
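The observe-only idea can be sketched as a proxy that forwards every attribute access and records each call. This is an illustration under assumptions, not how bench.wrap() is implemented: ObservingProxy, FakeClient, and the record fields are hypothetical, and only the call path, keyword arguments, output, and latency are captured.

```python
import time


class ObservingProxy:
    """Forward attribute access to a target object and record every method call."""

    def __init__(self, target, records, path=""):
        self._target = target
        self._records = records
        self._path = path

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        path = f"{self._path}.{name}" if self._path else name
        if callable(attr):
            def observed(*args, **kwargs):
                start = time.perf_counter()
                result = attr(*args, **kwargs)  # the real call, unchanged
                self._records.append({
                    "call": path,
                    "kwargs": kwargs,
                    "output": result,
                    "latency_s": time.perf_counter() - start,
                })
                return result
            return observed
        if hasattr(attr, "__dict__"):
            # Nested namespace such as client.messages: keep observing below it.
            return ObservingProxy(attr, self._records, path)
        return attr


class FakeMessages:
    def create(self, model, messages):
        return {"model": model, "text": "ok"}


class FakeClient:
    """Stand-in for a provider client; makes no network calls."""

    def __init__(self):
        self.messages = FakeMessages()


records = []
client = ObservingProxy(FakeClient(), records)
reply = client.messages.create(model="demo-model", messages=[{"role": "user", "content": "hi"}])
print(reply, records[0]["call"])
```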

Trace context

The CLI sets a TraceContext before calling llm_function. The wrapped client reads the active context automatically. You do not need to manage the context directly.

If you need to access the trace context in your own code:

from zhanla.trace_store import get_trace_context

ctx = get_trace_context()  # None outside a CLI run
if ctx:
    print(ctx.trace_id)
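An ambient context that is None outside a run and set during one is often built on the standard contextvars module. The sketch below shows that pattern only; whether the SDK uses contextvars is an assumption, and DemoTraceContext, current_trace, and run_with_trace are hypothetical names that do not exist in zhanla.

```python
import contextvars
import uuid
from dataclasses import dataclass


@dataclass
class DemoTraceContext:
    trace_id: str


_current = contextvars.ContextVar("demo_trace_context", default=None)


def current_trace():
    """Return the active context, or None when no run is in progress."""
    return _current.get()


def run_with_trace(fn):
    """Set a fresh context for the duration of fn(), then restore the previous one."""
    token = _current.set(DemoTraceContext(trace_id=uuid.uuid4().hex))
    try:
        return fn()
    finally:
        _current.reset(token)


before = current_trace()
during = run_with_trace(current_trace)
after = current_trace()
print(before, during is not None, after)
```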

Advanced Utilities

Most users only need the component classes above. The SDK also exposes a few lower-level helpers:

import zhanla as bench
from zhanla.registry import registry

bench.ComponentType enum for component categories
bench.EvalTrace for runtime trace records
bench.parse_json_response(text) for extracting JSON from model text responses
bench.get_all() and bench.clear() for the execution-local trace store
registry.register(...), registry.get(...), registry.get_by_name(...), registry.discover(), and registry.clear() for the global component registry

In normal CLI usage, you do not need to register components manually.

Release history

0.1.2.3, May 16, 2026
0.1.2.2, May 14, 2026
0.1.2.1, May 9, 2026 (this version)
0.1.2, May 7, 2026
0.1.1, May 3, 2026
0.1.0, May 2, 2026

Download files

Source Distribution

zhanla_sdk_py-0.1.2.1.tar.gz (21.6 kB view details)

Uploaded May 9, 2026 Source

Built Distribution

zhanla_sdk_py-0.1.2.1-py3-none-any.whl (25.4 kB view details)

Uploaded May 9, 2026 Python 3

File details

Details for the file zhanla_sdk_py-0.1.2.1.tar.gz.

File metadata

Download URL: zhanla_sdk_py-0.1.2.1.tar.gz
Upload date: May 9, 2026
Size: 21.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for zhanla_sdk_py-0.1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`d93a0db3c719964de8d926b704bda4e500b61b3f7fa558c6d89089c8a0a7a3d9`
MD5	`12f6685fe42f2565b0d0c09d4cef5f6e`
BLAKE2b-256	`d0dfd9a42cc01e36a84f408815d53ff8995526e9185e5112363efe193b86f278`

File details

Details for the file zhanla_sdk_py-0.1.2.1-py3-none-any.whl.

File metadata

Download URL: zhanla_sdk_py-0.1.2.1-py3-none-any.whl
Upload date: May 9, 2026
Size: 25.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for zhanla_sdk_py-0.1.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be829edcf522ab426e532df290eebc596215651ac9058664110485d6fc0d68c0`
MD5	`ebf420898d40597dc268774aa631f47e`
BLAKE2b-256	`383d693688082e96ff124eb6fe659e36d8be16bd1b27c9b12cd568bac5d301e8`
