Benchmark SDK for instrumenting AI components

zhanla-sdk-py

zhanla-sdk-py is the Python SDK for defining Benchmark components in code.

You use it to declare tools, skills, agents, orchestrations, and evals as Python objects, then run them with zhanla.

Installation

Install the SDK:

pip install zhanla-sdk-py

Requires Python >=3.10.

The SDK itself has no runtime dependencies. Provider packages (anthropic, openai, google-genai) are optional and only required if you use bench.wrap().

If you want to execute components from the command line, install the CLI too:

pip install zhanla

Quick Start

Create a Python file with module-level component instances:

import json

import anthropic
import zhanla as bench

# Wrap the client so LLM calls made through it are recorded.
client = bench.wrap(anthropic.Anthropic())


def _classify(message: str, customer_tier: str = "standard", **_) -> dict:
    # Ask the model for a JSON object containing the priority.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system='Reply with JSON: {"priority": "high|normal|low"}.',
        messages=[{"role": "user", "content": message}],
    )
    # Parse the model's JSON reply and attach the customer tier.
    result = json.loads(response.content[0].text)
    result["customer_tier"] = customer_tier
    return result


priority_tool = bench.Tool(
    name="priority_tool",
    description="Classify the priority of a support message.",
    fn=_classify,
    input_schema={"message": str, "customer_tier": str},
    output_schema={
        "priority": str,
        "customer_tier": str,
    },
)


def priority_eval(model_response, expected_output, **_) -> dict:
    return {
        "score": 1.0 if model_response["priority"] == expected_output["priority"] else 0.0
    }


priority_eval_component = bench.CodeEval(
    name="priority_eval",
    description="Check whether the predicted priority matches the expected value.",
    input_schema={},
    fn=priority_eval,
)

Run it with the CLI:

bench run components.py:priority_tool --dataset tickets.json --eval components.py:priority_eval

If a file contains exactly one runnable component, :component_name is optional.

Core Concepts

The SDK is class-based. The public import is:

import zhanla as bench

Define components as module-level objects so the CLI can discover them when it imports your file.

Runnable Components

Tool

Use a Tool for deterministic Python logic.

lookup_customer = bench.Tool(
    name="lookup_customer",
    description="Fetch a customer record by ID.",
    fn=get_customer,  # a function defined elsewhere in your module
    input_schema={"customer_id": str},
    output_schema={"id": str, "email": str},
)

Requirements:

  • name
  • description
  • fn
  • input_schema
  • output_schema

Notes:

  • fn can be sync or async.
  • input_schema can be a simple dict like {"field": str}, a JSON-Schema-shaped dict, or a Pydantic model class.
  • If fn returns a non-dict value at runtime, the CLI wraps it as {"result": value}.
  • The CLI validates the first produced output against output_schema.
  • output_schema can be a simple dict like {"field": str} or a Pydantic model class.
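
The wrapping and validation notes above can be illustrated with a minimal, stdlib-only sketch. The helper names wrap_output and check_simple_schema are hypothetical and are not part of the SDK or CLI; the sketch only mirrors the described behavior for the simple {"field": type} schema form.

```python
from typing import Any


def wrap_output(value: Any) -> dict:
    """Pass dicts through; wrap anything else as {"result": value}."""
    return value if isinstance(value, dict) else {"result": value}


def check_simple_schema(output: dict, schema: dict[str, type]) -> list[str]:
    """Check output against a simple {"field": type} schema.

    Returns a list of problems; an empty list means the output conforms.
    (Hypothetical helper, not the CLI's actual validator.)
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(output[field]).__name__}"
            )
    return problems


print(wrap_output(42))
print(check_simple_schema({"id": "c_1"}, {"id": str, "email": str}))
```

Returning a list of problems rather than raising on the first one makes it easier to report every mismatch at once; the real CLI may behave differently.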

Skill

Use a Skill for reusable instructions, optionally backed by Python code.

summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
)

With tools and an output schema:

summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
    tools=[lookup_customer],
    output_schema={"summary": str},
)

Requirements:

  • name
  • description
  • instructions

Notes:

  • tools and output_schema are optional.
  • Skills are prompt-only definitions. They cannot be executed directly in the local CLI; they are composed into Agents and Orchestrations.

Agent

Use an Agent to define an LLM-backed component with instructions and references to other components.

import zhanla as bench

support_agent = bench.Agent(
    name="support_agent",
    description="Respond to support requests.",
    instructions="Answer clearly and use available tools when needed.",
    model="claude-sonnet-4-6",
    tools=[lookup_customer],
    skills=[summarize_ticket],
    output_schema={"answer": str},
)

Requirements:

  • name
  • description
  • instructions
  • model

Notes:

  • tools, skills, agents, and output_schema are optional.
  • Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.

LLMProcessor

Use an LLMProcessor when you want a prompt-defined LLM transformation step.

import zhanla as bench

intent_classifier = bench.LLMProcessor(
    name="intent_classifier",
    description="Classify the user's intent.",
    instructions="Return the intent as billing, technical, or other.",
    model="claude-haiku-4-5",
    output_schema={"intent": str},
)

Requirements:

  • name
  • description
  • instructions
  • model

Notes:

  • output_schema is optional.
  • Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.

Orchestration

Use an Orchestration to compose multiple steps into a DAG.

support_pipeline = bench.Orchestration(
    name="support_pipeline",
    description="Classify intent, then draft a reply.",
    steps=[
        bench.Step(component=intent_classifier, name="classify", next=["reply"]),
        bench.Step(component=support_agent, name="reply"),
    ],
)

Requirements:

  • name
  • description
  • steps

Notes:

  • bench.Step is an alias for bench.OrchestrationStep.
  • Step names must be unique.
  • next targets must point to existing steps.
  • Cycles are rejected by CLI validation.
  • During execution, each step receives the accumulated state dictionary.
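
The structural rules above (unique step names, valid next targets, no cycles) can be checked with a depth-first search. The following is a sketch under stated assumptions: validate_steps is a hypothetical function, and steps are modeled as plain (name, next_targets) tuples rather than bench.Step objects.

```python
def validate_steps(steps: list[tuple[str, list[str]]]) -> list[str]:
    """Return a list of structural errors for (name, next_targets) pairs."""
    errors = []
    names = [name for name, _ in steps]

    # Rule 1: step names must be unique.
    for name in sorted({n for n in names if names.count(n) > 1}):
        errors.append(f"duplicate step name: {name}")

    # Rule 2: every next target must point to an existing step.
    graph: dict[str, list[str]] = {}
    for name, targets in steps:
        graph.setdefault(name, [])
        for target in targets:
            if target not in names:
                errors.append(f"{name}: unknown next target {target}")
            else:
                graph[name].append(target)

    # Rule 3: no cycles. Three-color DFS: a gray node reached again
    # means we followed a back edge, which closes a cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in graph}

    def has_cycle(node: str) -> bool:
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True
            if color[nxt] == WHITE and has_cycle(nxt):
                return True
        color[node] = BLACK
        return False

    if any(color[n] == WHITE and has_cycle(n) for n in graph):
        errors.append("cycle detected")
    return errors


print(validate_steps([("classify", ["reply"]), ("reply", [])]))
print(validate_steps([("a", ["b"]), ("b", ["a"])]))
```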

Conditional

Use Conditional inside an orchestration to route between steps.

bench.Step(
    component=bench.Conditional(
        condition=lambda state: state["classify"]["intent"] == "billing",
        if_true="billing_reply",
        if_false="general_reply",
    ),
    name="route",
)

Conditional does not emit output. It only chooses the next step.
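
As a rough illustration of the routing described above, the sketch below picks a next step from the accumulated state without adding anything to it. The route function is hypothetical; the state layout (step name mapped to that step's output) is inferred from the lambda in the example.

```python
from typing import Callable


def route(
    state: dict,
    condition: Callable[[dict], bool],
    if_true: str,
    if_false: str,
) -> str:
    """Return the name of the next step; the state itself is not modified."""
    return if_true if condition(state) else if_false


# State as it might look after a "classify" step has run.
state = {"classify": {"intent": "billing"}}

next_step = route(
    state,
    condition=lambda s: s["classify"]["intent"] == "billing",
    if_true="billing_reply",
    if_false="general_reply",
)
print(next_step)
```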

Eval Components

CodeEval

Use a CodeEval for Python-based scoring logic.

quality_eval = bench.CodeEval(
    name="quality_eval",
    description="Score whether the answer is acceptable.",
    input_schema={},
    fn=score_answer,
)

Requirements:

  • name
  • description
  • fn

Notes:

  • fn can be sync or async.
  • If the eval returns a non-dict value, runtime wraps it as {"score": value}.
  • model_response_format defaults to "JSON" and can also be set to "TEXT" or "YAML".
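
For reference, here is one way a scoring function such as score_answer might look. The keyword names model_response and expected_output follow the Quick Start example; the "answer" field and the as_score_dict helper are assumptions made for illustration only.

```python
def score_answer(model_response: dict, expected_output: dict, **_) -> dict:
    """Score 1.0 when the answers match after trimming and lowercasing."""
    predicted = str(model_response.get("answer", "")).strip().lower()
    expected = str(expected_output.get("answer", "")).strip().lower()
    return {"score": 1.0 if predicted == expected else 0.0}


def as_score_dict(value):
    """Mirror the documented wrapping: non-dict results become {"score": value}."""
    return value if isinstance(value, dict) else {"score": value}


print(score_answer({"answer": " Paris "}, {"answer": "paris"}))
print(as_score_dict(0.5))
```

Accepting **_ keeps the function tolerant of extra keyword arguments the runtime may pass.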

LLMEval

Use an LLMEval for prompt-defined scoring.

tone_eval = bench.LLMEval(
    name="tone_eval",
    description="Check response tone.",
    instructions="Return a score from 0.0 to 1.0 and a short reason.",
    model="your-model-id",
    output_schema={"score": float, "reason": str},
)

Requirements:

  • name
  • description
  • instructions
  • model

Notes:

  • output_schema is optional.
  • Local CLI execution is currently placeholder-based.

Checklist

Use a Checklist to combine multiple evals with optional weights.

answer_quality = bench.Checklist(
    name="answer_quality",
    description="Combine correctness and tone scores.",
    evals=[quality_eval, tone_eval],
    weights=[0.8, 0.2],
)

Notes:

  • If weights is omitted, each eval gets weight 1.0.
  • Weights must be positive and must match the number of evals.

EvalTree

Use an EvalTree for score-based branching.

adaptive_eval = bench.EvalTree(
    name="adaptive_eval",
    description="Route to different evals based on an initial score.",
    root=bench.Branch(
        eval=quality_eval,
        threshold=0.8,
        if_pass=[bench.Edge(weight=1.0, node=bench.Leaf(eval=quality_eval))],
        if_fail=[bench.Edge(weight=1.0, node=bench.Leaf(eval=tone_eval))],
    ),
)

Notes:

  • Branch thresholds must be between 0.0 and 1.0.
  • Edge weights must be positive.
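
To make score-based branching concrete, the sketch below walks a small tree using stand-in dataclasses. Everything here is an assumption: the stand-in classes (DemoLeaf, DemoEdge, DemoBranch), the "score >= threshold passes" rule, and the weighted mean over child edges are illustrative choices, not the SDK's documented semantics.

```python
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class DemoLeaf:
    score_fn: Callable[[], float]  # stand-in for running an eval


@dataclass
class DemoEdge:
    weight: float
    node: "DemoNode"


@dataclass
class DemoBranch:
    score_fn: Callable[[], float]
    threshold: float
    if_pass: list[DemoEdge] = field(default_factory=list)
    if_fail: list[DemoEdge] = field(default_factory=list)


DemoNode = Union[DemoLeaf, DemoBranch]


def evaluate(node: DemoNode) -> float:
    """Evaluate a tree: leaves score directly, branches pick a side."""
    if isinstance(node, DemoLeaf):
        return node.score_fn()
    score = node.score_fn()
    # Assumed rule: meeting the threshold counts as a pass.
    edges = node.if_pass if score >= node.threshold else node.if_fail
    if not edges:
        return score
    total = sum(e.weight for e in edges)
    return sum(e.weight * evaluate(e.node) for e in edges) / total


tree = DemoBranch(
    score_fn=lambda: 0.9,
    threshold=0.8,
    if_pass=[DemoEdge(1.0, DemoLeaf(lambda: 0.7))],
    if_fail=[DemoEdge(1.0, DemoLeaf(lambda: 0.2))],
)
print(evaluate(tree))
```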

Discovery And CLI Usage

The CLI discovers components by importing your Python file and scanning module-level attributes for bench component instances.

That means:

  • your file is executed at import time
  • module-level side effects will run during discovery
  • components should usually be defined at module scope
  • if a file contains multiple runnable components, use file.py:component_name
  • evals are referenced separately with --eval file.py:eval_name

Example:

bench run workflow.py:support_pipeline --dataset tickets.json --eval evals.py:answer_quality

Validation Rules

Before execution, the CLI validates component structure.

  • Tool must provide a callable fn and a non-None output_schema
  • CodeEval must provide a callable fn
  • Skill, Agent, LLMProcessor, and LLMEval must provide instructions
  • Agent, LLMProcessor, and LLMEval must provide model
  • Orchestration steps must reference valid targets and must not contain cycles
  • Checklist weights must match the eval count and all be positive
  • EvalTree branch thresholds must stay in [0.0, 1.0] and edge weights must be positive

Local Execution Caveats

The SDK defines the component model. The current local CLI runtime dispatches as follows:

  • Tool — executes fn
  • CodeEval — executes fn
  • Skill — raises an error; Skills are prompt-only and cannot be executed directly
  • Agent — requires a configured runner and model; calls the runner to generate a response
  • LLMProcessor — requires a configured runner and model; calls the runner to generate a response
  • LLMEval — requires a configured runner and model; calls the runner to score the response
  • Orchestration — executes its steps locally and returns the last step output

When a runner is set and a wrapped client is used, actual LLM calls are made and traced.

Version Hashing

Every component exposes version_hash() for stable content-based versioning.

lookup_customer.version_hash()
support_agent.version_hash()
answer_quality.version_hash()

Highlights:

  • descriptions do not affect the hash
  • Tool hashes function source and output_schema
  • Agent hashes instructions, model, referenced component names, and output_schema
  • Checklist hashes referenced eval names and weights
  • EvalTree hashes its tree structure
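
A content-based hash of this kind is typically built by serializing the relevant fields in a canonical form and hashing the result. The sketch below shows the idea for the Agent fields listed above; content_hash is a hypothetical function, and the canonical JSON encoding, the sorting of component names, and the use of SHA-256 are assumptions rather than the SDK's actual scheme.

```python
import hashlib
import json


def content_hash(
    instructions: str,
    model: str,
    component_names: list[str],
    output_schema: dict[str, type],
) -> str:
    """Hash the fields that define behavior; the description is left out."""
    payload = {
        "instructions": instructions,
        "model": model,
        "components": sorted(component_names),
        "output_schema": {key: t.__name__ for key, t in output_schema.items()},
    }
    # sort_keys and fixed separators give a stable byte representation.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h1 = content_hash("Answer clearly.", "model-a", ["lookup_customer"], {"answer": str})
h2 = content_hash("Answer briefly.", "model-a", ["lookup_customer"], {"answer": str})
print(h1 == h2)
```

Because the description never enters the payload, editing it leaves the hash unchanged, while any change to instructions, model, references, or schema produces a new one.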

Parsing Model Output

Use bench.parse_json_response(text) to extract JSON from raw model text, including fenced code blocks.

import zhanla as bench

text = client.messages.create(...).content[0].text
result = bench.parse_json_response(text)

This handles responses wrapped in ```json fences as well as bare JSON strings.
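
For intuition, fence-aware extraction can be sketched with a regular expression. The extract_json function below is a hypothetical stand-in written for illustration; it is not the implementation of bench.parse_json_response and may handle edge cases differently.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable

# Match an optional "json" language tag, then capture the fenced body.
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Parse JSON from a fenced block if present, else from the raw text."""
    match = _FENCED.search(text)
    candidate = match.group(1) if match else text
    return json.loads(candidate.strip())


fenced = "Here is the result:\n" + FENCE + 'json\n{"priority": "high"}\n' + FENCE
print(extract_json(fenced))
print(extract_json('{"priority": "low"}'))
```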

LLM Call Observability

bench.wrap(client)

Wrap an LLM client so every call made through it is recorded against the current eval run.

import anthropic
import openai
import zhanla as bench

# Anthropic
client = bench.wrap(anthropic.Anthropic())

# OpenAI (also covers OpenRouter via base_url)
client = bench.wrap(openai.OpenAI())

The wrapped client exposes the same interface as the original, so existing calls work unchanged. bench.wrap() only observes calls; it does not re-implement any LLM logic.

Supported clients:

Client                      Install
anthropic.Anthropic         pip install anthropic
anthropic.AsyncAnthropic    pip install anthropic
openai.OpenAI               pip install openai
openai.AsyncOpenAI          pip install openai
google.genai.Client         pip install google-genai

When bench.wrap() is active and llm_function is called by the CLI, each LLM call captures:

  • provider, model
  • input messages, output, tool calls, raw response
  • input/output token counts
  • latency
  • stop reason
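
The observe-only pattern can be sketched as a transparent proxy that forwards every call and records metadata on the side. This is an illustrative sketch, not how bench.wrap() is implemented: ObservingProxy, FakeClient, and FakeMessages are hypothetical, and only the call name, keyword arguments, and latency are recorded here.

```python
import time


class ObservingProxy:
    """Forward attribute access to a target and record each method call."""

    def __init__(self, target, records: list):
        self._target = target
        self._records = records

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if callable(attr):
            def observed(*args, **kwargs):
                start = time.perf_counter()
                result = attr(*args, **kwargs)  # the real call, unchanged
                self._records.append({
                    "call": name,
                    "kwargs": kwargs,
                    "latency_s": time.perf_counter() - start,
                })
                return result
            return observed
        if hasattr(attr, "__dict__"):
            # Nested namespace such as client.messages: keep observing.
            return ObservingProxy(attr, self._records)
        return attr


class FakeMessages:  # hypothetical stand-in for a provider client namespace
    def create(self, **kwargs):
        return {"text": "ok", "model": kwargs["model"]}


class FakeClient:
    def __init__(self):
        self.messages = FakeMessages()


records: list = []
client = ObservingProxy(FakeClient(), records)
response = client.messages.create(model="demo-model")
print(response, records[0]["call"])
```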

Trace context

The CLI sets a TraceContext before calling llm_function. The wrapped client reads the active context automatically. You do not need to manage the context directly.

If you need to access the trace context in your own code:

from zhanla.trace_store import get_trace_context

ctx = get_trace_context()  # None outside a CLI run
if ctx:
    print(ctx.trace_id)
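
An ambient context that is set by a runner and read elsewhere is commonly built on contextvars. The sketch below shows that pattern under stated assumptions: DemoTraceContext, current_trace, and run_with_trace are hypothetical names, and nothing here claims to match the internals of zhanla.trace_store.

```python
import contextvars
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DemoTraceContext:
    trace_id: str


_current: contextvars.ContextVar[Optional[DemoTraceContext]] = contextvars.ContextVar(
    "demo_trace_context", default=None
)


def current_trace() -> Optional[DemoTraceContext]:
    """Return the active context, or None outside a traced run."""
    return _current.get()


def run_with_trace(trace_id: str, fn: Callable[[], object]) -> object:
    """Set the context for the duration of fn, then restore the previous value."""
    token = _current.set(DemoTraceContext(trace_id))
    try:
        return fn()
    finally:
        _current.reset(token)


print(current_trace())
print(run_with_trace("trace-001", lambda: current_trace().trace_id))
```

Resetting the token in a finally block ensures the context does not leak into later code even if the traced function raises.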

Advanced Utilities

Most users only need the component classes above. The SDK also exposes a few lower-level helpers:

import zhanla as bench
from zhanla.registry import registry

  • bench.ComponentType enum for component categories
  • bench.EvalTrace for runtime trace records
  • bench.parse_json_response(text) for extracting JSON from model text responses
  • bench.get_all() and bench.clear() for the execution-local trace store
  • registry.register(...), registry.get(...), registry.get_by_name(...), registry.discover(), and registry.clear() for the global component registry

In normal CLI usage, you do not need to register components manually.
