Benchmark SDK for instrumenting AI components
zhanla-sdk-py
zhanla-sdk-py is the Python SDK for defining Benchmark components in code.
You use it to declare tools, skills, agents, orchestrations, and evals as Python objects, then run them with zhanla.
Installation
Install the SDK:
pip install zhanla-sdk-py
Requires Python >=3.10.
The SDK itself has no runtime dependencies. Provider packages (anthropic, openai, google-genai) are optional and only required if you use bench.wrap().
If you want to execute components from the command line, install the CLI too:
pip install zhanla
Quick Start
Create a Python file with module-level component instances:
import json

import anthropic
import zhanla as bench

# Wrap the client so LLM calls made through it are recorded.
client = bench.wrap(anthropic.Anthropic())

def _classify(message: str, customer_tier: str = "standard", **_) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system='Reply with JSON: {"priority": "high|normal|low"}.',
        messages=[{"role": "user", "content": message}],
    )
    # The model is asked to reply with a JSON object; parse it and add the tier.
    result = json.loads(response.content[0].text)
    result["customer_tier"] = customer_tier
    return result
priority_tool = bench.Tool(
    name="priority_tool",
    description="Classify the priority of a support message.",
    fn=_classify,
    input_schema={"message": str, "customer_tier": str},
    output_schema={
        "priority": str,
        "customer_tier": str,
    },
)
def priority_eval(model_response, expected_output, **_) -> dict:
    return {
        "score": 1.0 if model_response["priority"] == expected_output["priority"] else 0.0
    }

priority_eval_component = bench.CodeEval(
    name="priority_eval",
    description="Check whether the predicted priority matches the expected value.",
    input_schema={},
    fn=priority_eval,
)
Run it with the CLI:
bench run components.py:priority_tool --dataset tickets.json --eval components.py:priority_eval
If a file contains exactly one runnable component, :component_name is optional.
Core Concepts
The SDK is class-based. The public import is:
import zhanla as bench
Define components as module-level objects so the CLI can discover them when it imports your file.
Runnable Components
Tool
Use a Tool for deterministic Python logic.
lookup_customer = bench.Tool(
    name="lookup_customer",
    description="Fetch a customer record by ID.",
    fn=get_customer,
    input_schema={"customer_id": str},
    output_schema={"id": str, "email": str},
)
Requirements:
- name
- description
- fn
- input_schema
- output_schema
Notes:
- fn can be sync or async.
- input_schema can be a simple dict like {"field": str}, a JSON-Schema-shaped dict, or a Pydantic model class.
- If fn returns a non-dict value at runtime, the CLI wraps it as {"result": value}.
- The CLI validates the first produced output against output_schema.
- output_schema can be a simple dict like {"field": str} or a Pydantic model class.
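The wrapping and validation notes above can be illustrated with a small standalone sketch. It does not use the SDK: wrap_output and check_simple_schema are hypothetical helpers, and the type check shown is only one plausible reading of how a simple {"field": type} schema might be enforced.

```python
from typing import Any


def wrap_output(value: Any) -> dict:
    """Wrap a non-dict return value as {"result": value}; pass dicts through."""
    return value if isinstance(value, dict) else {"result": value}


def check_simple_schema(output: dict, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the output matches."""
    problems = []
    for field, expected in schema.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


# A bare string gets wrapped; a dict is left alone and then checked.
wrapped = wrap_output("high")
record = wrap_output({"id": "c_1", "email": "a@example.com"})
errors = check_simple_schema(record, {"id": str, "email": str})
```

Returning a list of problems rather than raising on the first one makes it easier to report every schema mismatch in a single pass.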
Skill
Use a Skill for reusable instructions, optionally backed by Python code.
summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
)
With tools and an output schema:
summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
    tools=[lookup_customer],
    output_schema={"summary": str},
)
Requirements:
- name
- description
- instructions
Notes:
- tools and output_schema are optional.
- Skills are prompt-only definitions. They cannot be executed directly in the local CLI; they are composed into Agents and Orchestrations.
Agent
Use an Agent to define an LLM-backed component with instructions and references to other components.
import zhanla as bench

support_agent = bench.Agent(
    name="support_agent",
    description="Respond to support requests.",
    instructions="Answer clearly and use available tools when needed.",
    model="claude-sonnet-4-6",
    tools=[lookup_customer],
    skills=[summarize_ticket],
    output_schema={"answer": str},
)
Requirements:
- name
- description
- instructions
- model
Notes:
- tools, skills, agents, and output_schema are optional.
- Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.
LLMProcessor
Use an LLMProcessor when you want a prompt-defined LLM transformation step.
import zhanla as bench

intent_classifier = bench.LLMProcessor(
    name="intent_classifier",
    description="Classify the user's intent.",
    instructions="Return the intent as billing, technical, or other.",
    model="claude-haiku-4-5",
    output_schema={"intent": str},
)
Requirements:
- name
- description
- instructions
- model
Notes:
- output_schema is optional.
- Local CLI execution requires a configured runner on the component. Without a runner, the CLI raises an error.
Orchestration
Use an Orchestration to compose multiple steps into a DAG.
support_pipeline = bench.Orchestration(
    name="support_pipeline",
    description="Classify intent, then draft a reply.",
    steps=[
        bench.Step(component=intent_classifier, name="classify", next=["reply"]),
        bench.Step(component=support_agent, name="reply"),
    ],
)
Requirements:
- name
- description
- steps
Notes:
- bench.Step is an alias for bench.OrchestrationStep.
- Step names must be unique.
- next targets must point to existing steps.
- Cycles are rejected by CLI validation.
- During execution, each step receives the accumulated state dictionary.
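The structural rules in these notes (unique names, valid next targets, no cycles) can be sketched without the SDK. The function below is a hypothetical illustration, not the CLI's validator; it takes plain (name, next) pairs in place of real step objects and uses a depth-first search to find cycles.

```python
def validate_steps(steps: list[tuple[str, list[str]]]) -> list[str]:
    """Check (name, next_targets) pairs and return a list of error strings."""
    names = [name for name, _ in steps]
    errors = [
        f"duplicate step name: {n}" for n in sorted(set(names)) if names.count(n) > 1
    ]
    graph = dict(steps)
    for name, targets in steps:
        for target in targets:
            if target not in graph:
                errors.append(f"{name}: unknown next target {target!r}")
    if errors:
        return errors

    # Depth-first search: reaching a step that is still "visiting" means a cycle.
    status: dict[str, str] = {}

    def visit(name: str) -> bool:
        if status.get(name) == "visiting":
            return True
        if status.get(name) == "done":
            return False
        status[name] = "visiting"
        has_cycle = any(visit(target) for target in graph[name])
        status[name] = "done"
        return has_cycle

    if any(visit(name) for name in graph):
        errors.append("cycle detected")
    return errors


ok = validate_steps([("classify", ["reply"]), ("reply", [])])
looped = validate_steps([("a", ["b"]), ("b", ["a"])])
```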
Conditional
Use Conditional inside an orchestration to route between steps.
bench.Step(
    component=bench.Conditional(
        condition=lambda state: state["classify"]["intent"] == "billing",
        if_true="billing_reply",
        if_false="general_reply",
    ),
    name="route",
)
Conditional does not emit output. It only chooses the next step.
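To make the routing behaviour concrete, here is a minimal standalone sketch of a step loop. It is not the CLI's executor: the run function and the dict-based step format are hypothetical, each step has at most one next target for brevity, and it assumes each step's output is stored in the state under the step's name, which is what the state["classify"]["intent"] lookup above suggests.

```python
def run(steps: dict[str, dict], start: str, state: dict) -> dict:
    """Walk the steps from `start`, storing each output under its step name."""
    current = start
    while current is not None:
        step = steps[current]
        if "condition" in step:
            # A conditional writes nothing to the state; it only picks the next step.
            current = step["if_true"] if step["condition"](state) else step["if_false"]
            continue
        state[current] = step["fn"](state)
        current = step.get("next")
    return state


steps = {
    "classify": {
        "fn": lambda s: {"intent": "billing" if "invoice" in s["message"] else "other"},
        "next": "route",
    },
    "route": {
        "condition": lambda s: s["classify"]["intent"] == "billing",
        "if_true": "billing_reply",
        "if_false": "general_reply",
    },
    "billing_reply": {"fn": lambda s: {"answer": "Routing to billing."}},
    "general_reply": {"fn": lambda s: {"answer": "Routing to general support."}},
}

final = run(steps, "classify", {"message": "My invoice is wrong"})
```

After the run, the state holds entries for classify and billing_reply but none for route, since the conditional produced no output.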
Eval Components
CodeEval
Use a CodeEval for Python-based scoring logic.
quality_eval = bench.CodeEval(
    name="quality_eval",
    description="Score whether the answer is acceptable.",
    input_schema={},
    fn=score_answer,
)
Requirements:
- name
- description
- fn
Notes:
- fn can be sync or async.
- If the eval returns a non-dict value, the runtime wraps it as {"score": value}.
- model_response_format defaults to "JSON" and can also be set to "TEXT" or "YAML".
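The sync/async and wrapping notes above can be sketched as follows. run_eval is a hypothetical helper, not the runtime's implementation; it assumes eval functions receive keyword arguments such as model_response and expected_output, as in the Quick Start example.

```python
import asyncio
import inspect
from typing import Any, Callable


def run_eval(fn: Callable[..., Any], **kwargs: Any) -> dict:
    """Call a sync or async eval function and normalise the result to a dict."""
    result = fn(**kwargs)
    if inspect.iscoroutine(result):
        # Async evals return a coroutine; run it to completion.
        result = asyncio.run(result)
    return result if isinstance(result, dict) else {"score": result}


def exact_match(model_response, expected_output, **_):
    # Returns a bare float, so run_eval wraps it as {"score": ...}.
    return 1.0 if model_response == expected_output else 0.0


async def length_check(model_response, **_):
    # Already returns a dict, so run_eval passes it through unchanged.
    return {"score": 1.0 if len(model_response) <= 20 else 0.0}


sync_result = run_eval(exact_match, model_response="high", expected_output="high")
async_result = run_eval(length_check, model_response="short answer", expected_output=None)
```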
LLMEval
Use an LLMEval for prompt-defined scoring.
tone_eval = bench.LLMEval(
    name="tone_eval",
    description="Check response tone.",
    instructions="Return a score from 0.0 to 1.0 and a short reason.",
    model="your-model-id",
    output_schema={"score": float, "reason": str},
)
Requirements:
- name
- description
- instructions
- model
Notes:
- output_schema is optional.
- Local CLI execution is currently placeholder-based.
Checklist
Use a Checklist to combine multiple evals with optional weights.
answer_quality = bench.Checklist(
    name="answer_quality",
    description="Combine correctness and tone scores.",
    evals=[quality_eval, tone_eval],
    weights=[0.8, 0.2],
)
Notes:
- If weights is omitted, each eval gets weight 1.0.
- Weights must be positive and must match the number of evals.
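As a rough illustration of these weight rules, the sketch below combines scores with a weighted mean. The README does not state how a Checklist aggregates its evals, so the weighted mean is an assumption, and combine_scores is a hypothetical helper rather than SDK code.

```python
from typing import Optional


def combine_scores(scores: list[float], weights: Optional[list[float]] = None) -> float:
    """Weighted mean of eval scores (assumed aggregation, not confirmed by the SDK)."""
    if not scores:
        raise ValueError("at least one score is required")
    if weights is None:
        # Omitted weights default to 1.0 per eval.
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("weights must match the number of evals")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


weighted = combine_scores([1.0, 0.5], [0.8, 0.2])
unweighted = combine_scores([1.0, 0.0])
```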
EvalTree
Use an EvalTree for score-based branching.
adaptive_eval = bench.EvalTree(
    name="adaptive_eval",
    description="Route to different evals based on an initial score.",
    root=bench.Branch(
        eval=quality_eval,
        threshold=0.8,
        if_pass=[bench.Edge(weight=1.0, node=bench.Leaf(eval=quality_eval))],
        if_fail=[bench.Edge(weight=1.0, node=bench.Leaf(eval=tone_eval))],
    ),
)
Notes:
- Branch thresholds must be between 0.0 and 1.0.
- Edge weights must be positive.
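One way to picture score-based branching is the standalone sketch below. The Leaf, Edge, and Branch dataclasses are local stand-ins that mirror the names above, not the SDK classes. Two behaviours are assumptions: a branch passes when its score is greater than or equal to the threshold, and the followed edges are combined by weighted mean.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    eval: Callable[[str], float]


@dataclass
class Edge:
    weight: float
    node: "Node"


@dataclass
class Branch:
    eval: Callable[[str], float]
    threshold: float
    if_pass: list[Edge]
    if_fail: list[Edge]


Node = Union[Leaf, Branch]


def score(node: Node, response: str) -> float:
    """Score a response by walking the tree from `node`."""
    if isinstance(node, Leaf):
        return node.eval(response)
    gate = node.eval(response)
    # Assumed semantics: >= threshold follows if_pass, otherwise if_fail.
    edges = node.if_pass if gate >= node.threshold else node.if_fail
    total = sum(edge.weight for edge in edges)
    return sum(edge.weight * score(edge.node, response) for edge in edges) / total


def is_short(response: str) -> float:
    return 1.0 if len(response) < 40 else 0.0


def is_polite(response: str) -> float:
    return 1.0 if "please" in response.lower() else 0.0


tree = Branch(
    eval=is_short,
    threshold=0.8,
    if_pass=[Edge(1.0, Leaf(is_polite))],
    if_fail=[Edge(1.0, Leaf(lambda response: 0.0))],
)

short_polite = score(tree, "Please restart the router.")
too_long = score(tree, "Please " + "restart the router again and again " * 3)
```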
Discovery And CLI Usage
The CLI discovers components by importing your Python file and scanning module-level attributes for bench component instances.
That means:
- your file is executed at import time
- module-level side effects will run during discovery
- components should usually be defined at module scope
- if a file contains multiple runnable components, use file.py:component_name
- evals are referenced separately with --eval file.py:eval_name
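The discovery behaviour described above can be approximated with the standard library. This is a sketch under assumptions, not the CLI's code: discover is a hypothetical function, and sdk_stub.Component is a throwaway stand-in for the SDK's component base class, written to a temporary directory so the example is self-contained.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def discover(path: Path, base: type) -> dict[str, object]:
    """Import the file at `path` and collect module-level instances of `base`."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file, side effects included
    return {name: obj for name, obj in vars(module).items() if isinstance(obj, base)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sdk_stub.py").write_text(
        "class Component:\n"
        "    def __init__(self, name):\n"
        "        self.name = name\n"
    )
    (root / "components.py").write_text(
        "from sdk_stub import Component\n"
        "LOADED = True  # module-level side effect, runs on import\n"
        "priority_tool = Component('priority_tool')\n"
        "def helper():\n"
        "    return Component('not_discovered')  # not at module scope\n"
    )
    sys.path.insert(0, tmp)
    try:
        from sdk_stub import Component
        found = discover(root / "components.py", Component)
    finally:
        sys.path.remove(tmp)

discovered_names = sorted(found)
```

Only priority_tool is found: the instance created inside helper never exists at module scope, which is why components should usually be defined at the top level of the file.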
Example:
bench run workflow.py:support_pipeline --dataset tickets.json --eval evals.py:answer_quality
Validation Rules
Before execution, the CLI validates component structure.
- Tool must provide a callable fn and a non-None output_schema
- CodeEval must provide a callable fn
- Skill, Agent, LLMProcessor, and LLMEval must provide instructions
- Agent, LLMProcessor, and LLMEval must provide model
- Orchestration steps must reference valid targets and must not contain cycles
- Checklist weights must match the eval count and all be positive
- EvalTree branch thresholds must stay in [0.0, 1.0] and edge weights must be positive
Local Execution Caveats
The SDK defines the component model. The current local CLI runtime dispatches as follows:
- Tool: executes fn
- CodeEval: executes fn
- Skill: raises an error; Skills are prompt-only and cannot be executed directly
- Agent: requires a configured runner and model; calls the runner to generate a response
- LLMProcessor: requires a configured runner and model; calls the runner to generate a response
- LLMEval: requires a configured runner and model; calls the runner to score the response
- Orchestration: executes its steps locally and returns the last step output
When a runner is set and a wrapped client is used, actual LLM calls are made and traced.
Version Hashing
Every component exposes version_hash() for stable content-based versioning.
tool.version_hash()
support_agent.version_hash()
answer_quality.version_hash()
Highlights:
- descriptions do not affect the hash
- Tool hashes function source and output_schema
- Agent hashes instructions, model, referenced component names, and output_schema
- Checklist hashes referenced eval names and weights
- EvalTree hashes its tree structure
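The idea of content-based versioning can be sketched with hashlib. The SDK's actual hashing scheme is not documented here, so the SHA-256 over canonical JSON below is an assumption, and content_hash and tool_hash are hypothetical names. The schema is written with type names as strings so the payload stays JSON-serialisable.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Hash a JSON-serialisable payload; sorted keys keep the digest stable."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tool_hash(fn_source: str, output_schema: dict, description: str = "") -> str:
    # The description is accepted but deliberately left out of the payload,
    # so editing it does not change the hash.
    return content_hash({"fn_source": fn_source, "output_schema": output_schema})


source = "def get_customer(customer_id):\n    return {'id': customer_id}\n"
schema = {"id": "str", "email": "str"}

first = tool_hash(source, schema, description="Fetch a customer record by ID.")
reworded = tool_hash(source, schema, description="Look up a customer.")
changed = tool_hash(source + "# edited\n", schema)
```

Here first and reworded are equal while changed differs, matching the rule that descriptions do not affect the hash but function source does.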
Parsing Model Output
Use bench.parse_json_response(text) to extract JSON from raw model text, including fenced code blocks.
import zhanla as bench
text = client.messages.create(...).content[0].text
result = bench.parse_json_response(text)
This handles responses wrapped in ```json fences as well as bare JSON strings.
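For intuition, a fence-aware parser along these lines could look like the sketch below. It is not the SDK's implementation: extract_json is a hypothetical function, and it only handles the first fenced block or a bare JSON string.

```python
import json
import re

# Built from single backticks so the example does not contain a literal fence.
FENCE = "`" * 3
FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Parse JSON from the first fenced block if present, else from the bare text."""
    match = FENCED.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


fenced_reply = "Here you go:\n" + FENCE + 'json\n{"priority": "high"}\n' + FENCE
bare_reply = '{"priority": "low"}'

from_fenced = extract_json(fenced_reply)
from_bare = extract_json(bare_reply)
```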
LLM Call Observability
bench.wrap(client)
Wrap an LLM client so every call made through it is recorded against the current eval run.
import anthropic
import openai
import zhanla as bench
# Anthropic
client = bench.wrap(anthropic.Anthropic())
# OpenAI (also covers OpenRouter via base_url)
client = bench.wrap(openai.OpenAI())
The wrapped client behaves identically to the original. bench.wrap() only observes calls; it does not re-implement any LLM logic.
Supported clients:
| Client | Install |
|---|---|
| anthropic.Anthropic | pip install anthropic |
| anthropic.AsyncAnthropic | pip install anthropic |
| openai.OpenAI | pip install openai |
| openai.AsyncOpenAI | pip install openai |
| google.genai.Client | pip install google-genai |
When bench.wrap() is active and llm_function is called by the CLI, each LLM call captures:
- provider, model
- input messages, output, tool calls, raw response
- input/output token counts
- latency
- stop reason
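The observe-only idea can be shown with a plain function wrapper. This sketch does not reflect how bench.wrap() is built: observe, the calls list, and fake_create are hypothetical, and only a few of the fields listed above are recorded.

```python
import functools
import time
from typing import Any, Callable

calls: list[dict] = []  # stand-in for a trace store


def observe(fn: Callable[..., Any], provider: str) -> Callable[..., Any]:
    """Return a wrapper that records each call and passes the response through."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        response = fn(*args, **kwargs)
        calls.append(
            {
                "provider": provider,
                "model": kwargs.get("model"),
                "input": kwargs.get("messages"),
                "output": response,
                "latency_s": time.perf_counter() - start,
            }
        )
        return response  # returned untouched: the wrapper only observes

    return wrapper


def fake_create(*, model: str, messages: list[dict]) -> dict:
    # Placeholder for a real provider call.
    return {"text": "ok", "stop_reason": "end_turn"}


create = observe(fake_create, provider="fake")
reply = create(model="demo-model", messages=[{"role": "user", "content": "hi"}])
```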
Trace context
The CLI sets a TraceContext before calling llm_function. The wrapped client reads the active context automatically. You do not need to manage the context directly.
If you need to access the trace context in your own code:
from zhanla.trace_store import get_trace_context

ctx = get_trace_context()  # None outside a CLI run
if ctx:
    print(ctx.trace_id)
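A context that is set by a runner and read implicitly by wrapped code is commonly built on contextvars. The sketch below shows that pattern with local stand-ins (DemoTraceContext, current_context, run_traced); whether the SDK uses contextvars internally is an assumption.

```python
import contextvars
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DemoTraceContext:
    trace_id: str


_current: contextvars.ContextVar[Optional[DemoTraceContext]] = contextvars.ContextVar(
    "demo_trace_context", default=None
)


def current_context() -> Optional[DemoTraceContext]:
    """Return the active context, or None when no run is in progress."""
    return _current.get()


def run_traced(fn: Callable[[], Any]) -> Any:
    """Set a fresh context for the duration of `fn`, then restore the old one."""
    token = _current.set(DemoTraceContext(trace_id=uuid.uuid4().hex))
    try:
        return fn()
    finally:
        _current.reset(token)


outside = current_context()
inside = run_traced(current_context)
after = current_context()
```

Outside run_traced the context is None, mirroring the comment in the snippet above that get_trace_context() returns None outside a CLI run.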
Advanced Utilities
Most users only need the component classes above. The SDK also exposes a few lower-level helpers:
import zhanla as bench
from zhanla.registry import registry
- bench.ComponentType enum for component categories
- bench.EvalTrace for runtime trace records
- bench.parse_json_response(text) for extracting JSON from model text responses
- bench.get_all() and bench.clear() for the execution-local trace store
- registry.register(...), registry.get(...), registry.get_by_name(...), registry.discover(), and registry.clear() for the global component registry
In normal CLI usage, you do not need to register components manually.