CLI from Zhanla for running and uploading AI component evaluations
zhanla
Command-line interface for discovering Benchmark SDK components, running them against datasets, and syncing results to the Benchmark web app.
The CLI can be installed on its own for login, listing web resources, and fully web-backed runs. Local component execution requires the matching language SDK.
Installation
pip install zhanla
Requires Python >=3.10.
Runtime dependencies:
- `typer`
- `rich`
- `supabase`
- `httpx`
After installation, the zhanla command is available in your shell.
Optional local execution dependencies:
- Python components: `pip install zhanla-sdk-py`
- TypeScript components: `npm install @zhanla/sdk-ts`
Authentication
zhanla login
Authenticate with your SDK API key:
zhanla login
The CLI prompts for a composite key in this format:
bm_kid_XXXX.bm_sec_XXXX
On success, the CLI exchanges that key for token metadata and stores credentials at ~/.benchmark/credentials.json.
Saved fields:
| Field | Description |
|---|---|
| `key_id` | The `bm_kid_...` portion of the API key |
| `secret` | The `bm_sec_...` portion of the API key |
| `supabase_url` | Supabase project URL |
| `supabase_anon_key` | Public anon key used to initialize the client |
| `org_id` | Organization ID |
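The composite key maps onto the first two saved fields. A minimal sketch of that split, assuming the key is divided at its first `.`; `split_composite_key` is a hypothetical helper for illustration, not a function the CLI is known to expose:

```python
def split_composite_key(composite: str) -> tuple[str, str]:
    """Split 'bm_kid_XXXX.bm_sec_XXXX' into (key_id, secret)."""
    key_id, sep, secret = composite.strip().partition(".")
    # Reject anything that does not match the documented prefix layout.
    if not sep or not key_id.startswith("bm_kid_") or not secret.startswith("bm_sec_"):
        raise ValueError("expected a key of the form bm_kid_XXXX.bm_sec_XXXX")
    return key_id, secret


key_id, secret = split_composite_key("bm_kid_abc123.bm_sec_def456")
print(key_id)  # bm_kid_abc123
print(secret)  # bm_sec_def456
```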
zhanla logout
Remove saved credentials:
zhanla logout
Token Refresh
When a command needs authenticated access, the CLI exchanges the saved API key for a fresh short-lived token. If you are not logged in, authenticated commands exit with:
Not logged in. Run `zhanla login` first.
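The not-logged-in check can be pictured as a lookup of the saved credentials file. This is only a sketch: `load_credentials` is a hypothetical name, and the exit status of 1 and the JSON file layout are assumptions rather than documented behavior.

```python
import json
import sys
from pathlib import Path

CREDENTIALS_PATH = Path.home() / ".benchmark" / "credentials.json"


def load_credentials(path: Path = CREDENTIALS_PATH) -> dict:
    """Return saved credentials, or exit with the CLI's message if none exist."""
    if not path.is_file():
        print("Not logged in. Run `zhanla login` first.", file=sys.stderr)
        raise SystemExit(1)  # assumed exit status
    return json.loads(path.read_text(encoding="utf-8"))
```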
zhanla run
zhanla run supports three execution modes:
- Local component + local eval
- Local component + web autorater
- Fully web-backed run
The command enforces exactly one source for each required input:
- exactly one component source: local target or `--web-config`
- exactly one dataset source: `--dataset` or `--web-dataset`
- exactly one evaluation source: `--eval` or `--web-eval`
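The exactly-one rule can be sketched as a small check applied to each pair of options. The function names and error wording below are hypothetical; only the rule itself comes from the list above.

```python
from typing import Optional


def check_exactly_one(label: str, **sources: Optional[str]) -> str:
    """Return the name of the single provided source, or raise ValueError."""
    provided = [name for name, value in sources.items() if value is not None]
    if len(provided) != 1:
        options = " or ".join(sources)
        raise ValueError(f"Provide exactly one {label} source: {options}")
    return provided[0]


def classify_run(target=None, web_config=None, dataset=None,
                 web_dataset=None, eval_spec=None, web_eval=None) -> tuple[str, str, str]:
    """Validate the three inputs and report which source each one uses."""
    return (
        check_exactly_one("component", target=target, web_config=web_config),
        check_exactly_one("dataset", dataset=dataset, web_dataset=web_dataset),
        check_exactly_one("evaluation", eval_spec=eval_spec, web_eval=web_eval),
    )


print(classify_run(target="components.py:my_tool", dataset="data.json",
                   web_eval="autorater-uuid"))
# ('target', 'dataset', 'web_eval')
```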
Local component + local eval
Run a component discovered from a Python file and score it with a local eval component:
zhanla run components.py:my_tool --dataset data.json --eval evals.py:my_eval
Notes:
- `component.py:name` targets a runnable component defined as a module-level `zhanla.*` instance
- `--eval eval.py:name` targets an eval component such as `CodeEval`, `LLMEval`, `Checklist`, or `EvalTree`
- If the file contains exactly one matching component, `:name` is optional
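A target spec of this shape splits into a file path and an optional name. The sketch below assumes POSIX-style paths without a colon in them; `parse_target` is a hypothetical helper, not the CLI's own parser.

```python
from typing import Optional


def parse_target(spec: str) -> tuple[str, Optional[str]]:
    """Split 'file.py:name' into (path, name); name is None when omitted."""
    path, sep, name = spec.rpartition(":")
    if not sep:
        return spec, None      # no colon: the whole spec is the path
    return path, name or None  # a trailing colon counts as no name


print(parse_target("components.py:my_tool"))  # ('components.py', 'my_tool')
print(parse_target("components.py"))          # ('components.py', None)
```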
Local component + web autorater
Run a local component, upload the dataset and outputs, then start a managed web autorater run:
zhanla run components.py:my_tool --dataset data.json --web-eval autorater-uuid
This flow:
- executes the local component over your dataset
- uploads dataset rows and local outputs
- starts a remote evaluate-only run for the specified autorater
- polls until the remote run completes or fails
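The final polling step might look roughly like the loop below. The status names, interval, and timeout are assumptions, and `fetch_status` stands in for whatever request the CLI actually makes.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status names


def poll_until_done(fetch_status: Callable[[], str], interval: float = 2.0,
                    timeout: float = 600.0) -> str:
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status request: "running" twice, then "completed".
fake_statuses = iter(["running", "running", "completed"])
print(poll_until_done(lambda: next(fake_statuses), interval=0.01))  # completed
```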
Eval contract note:
- the CLI sends `model_input`, `model_response`, and `expected_output` to managed evals as serialized strings
- before your eval function runs, the platform parses those strings according to the eval item's `model_response_format`
- `CodeEval` defaults to `model_response_format="JSON"`, so your function receives a pre-parsed dict/object
- set `model_response_format="TEXT"` to receive the raw string instead (useful for keyword or regex checks)
- `LLMEval` always receives plain text regardless of `model_response_format`
model_response_format defaults
| Eval type | Default | Received by function |
|---|---|---|
| `CodeEval` | `"JSON"` | Parsed dict/object |
| `CodeEval(model_response_format="TEXT")` | `"TEXT"` | Raw string |
| `CodeEval(model_response_format="YAML")` | `"YAML"` | Parsed dict/object |
| `LLMEval` | `"TEXT"` (fixed) | Raw string |
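To make the table concrete, here is a rough sketch of how one serialized response could become the value an eval function receives. `parse_model_response` is a hypothetical function; the platform's real parsing code is not shown in this document, and YAML is left out because it needs a third-party parser.

```python
import json
from typing import Any


def parse_model_response(raw: str, model_response_format: str = "JSON") -> Any:
    """Turn the serialized string into the value an eval function would receive."""
    if model_response_format == "TEXT":
        return raw              # raw string, untouched
    if model_response_format == "JSON":
        return json.loads(raw)  # parsed dict/list/scalar
    # "YAML" would need a YAML parser, which this sketch leaves out.
    raise ValueError(f"unsupported format in this sketch: {model_response_format}")


raw = '{"priority": "high"}'
print(parse_model_response(raw))          # {'priority': 'high'}
print(parse_model_response(raw, "TEXT"))  # {"priority": "high"}
```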
Example of a structured eval (default JSON):
# model_response and expected_output arrive as parsed dicts (default JSON format).
bench.CodeEval(
    name="check_priority",
    description="Verify the priority field is correct.",
    fn=lambda model_response, expected_output, **_: {
        "score": 1.0 if model_response.get("priority") == expected_output.get("expected_priority") else 0.0
    },
)
Example of a text-based eval (explicit TEXT):
# model_response arrives as the raw string because the format is set to TEXT.
bench.CodeEval(
    name="contains_priority_label",
    description="Check that the response mentions a priority level.",
    fn=lambda model_response, **_: {
        "score": 1.0 if any(label in model_response.lower() for label in ["critical", "high", "medium", "low"]) else 0.0
    },
    model_response_format="TEXT",
)
`--dry-run` is not supported with `--web-eval`.
Fully web-backed run
Start a run entirely against web-managed resources:
zhanla run --web-config prompt-uuid --web-dataset dataset-uuid --web-eval autorater-uuid
Optional model override:
zhanla run \
  --web-config prompt-uuid \
  --web-dataset dataset-uuid \
  --web-eval autorater-uuid \
  --model-endpoint openai:gpt-4.1-mini
Notes:
- `--web-config` requires both `--web-dataset` and `--web-eval`
- `--model-endpoint` is only valid with `--web-config`
- `--dry-run` is not supported for fully web-backed runs
Options
| Flag | Short | Description |
|---|---|---|
| `--eval <spec>` | `-e` | Local eval target in `file.py:name` form |
| `--dataset <path>` | `-d` | Local dataset file (`.json` or `.csv`) |
| `--web-eval <id>` | | Managed autorater ID |
| `--web-dataset <id>` | | Dataset ID from the web app |
| `--web-config <id>` | | Prompt ID for fully web-backed runs |
| `--model-endpoint <value>` | | Model endpoint override for fully web-backed runs |
| `--dry-run` | | Execute locally and skip sync/upload |
What zhanla run does
For local component runs, the CLI:
- Discovers the component from the target file.
- Prints a discovered-components table.
- Validates the component structure.
- Resolves a local eval or web autorater target.
- Checks auth when the chosen mode needs it.
- Loads the dataset from disk or from the web app.
- Executes each dataset row in sequence with a progress bar.
- Validates the first component output against `output_schema` when present.
- Uploads local definitions and run results unless `--dry-run` is set.
- Prints either a local score summary or a remote evaluation summary.
If sync/upload fails, the run exits with status 1.
Dataset Formats
JSON
JSON datasets should be a top-level array.
The preferred format is a leading schema row followed by ordinary data rows:
[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]
The first items may optionally be metadata rows:
[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]
Leading rows with _schema or _config are treated as metadata and are not executed.
Legacy object-shaped JSON datasets with schema and rows are still accepted.
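The metadata rule for array-shaped datasets can be sketched as follows. `split_dataset` is a hypothetical helper; it handles only the top-level array form and does not cover the legacy object shape.

```python
import json
from typing import Any


def split_dataset(rows: list[dict[str, Any]]) -> tuple[dict, dict, list[dict]]:
    """Separate leading _schema/_config rows from the rows that get executed."""
    schema: dict = {}
    config: dict = {}
    index = 0
    for row in rows:
        if "_schema" in row:
            schema = row["_schema"]
        elif "_config" in row:
            config = row["_config"]
        else:
            break  # first ordinary row: the metadata section is over
        index += 1
    return schema, config, rows[index:]


text = """[
  {"_schema": {"revenue": {"type": "integer"}, "cost": {"type": "integer"}}},
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]"""
schema, config, data = split_dataset(json.loads(text))
print(config["name"])  # finance
print(data)            # [{'revenue': 100, 'cost': 40}, {'revenue': 200, 'cost': 120}]
```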
CSV
CSV datasets use the header row as field names:
revenue,cost
100,40
200,120
CSV values are loaded as strings.
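Because every CSV value arrives as a string, a component that does arithmetic has to convert its inputs itself. The snippet below illustrates this with the standard `csv` module; whether the CLI uses `csv.DictReader` internally is an assumption, and `coerce_ints` is a hypothetical helper.

```python
import csv
import io

csv_text = "revenue,cost\n100,40\n200,120\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'revenue': '100', 'cost': '40'}


def coerce_ints(row: dict[str, str]) -> dict[str, int]:
    """Convert every field of a row to int before doing arithmetic on it."""
    return {key: int(value) for key, value in row.items()}


print(coerce_ints(rows[0]))  # {'revenue': 100, 'cost': 40}
```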
Empty datasets
If the resolved dataset has no rows, the CLI exits with:
Dataset is empty.
Local Execution Semantics
The CLI executes SDK components according to their current implementation:
- `Tool`: runs `fn(**row)` and normalizes non-dict output to `{"result": value}`
- `CodeEval`: runs `fn(**row_and_component_output)` and normalizes non-dict output to `{"score": value}`
- `Skill`: raises an error, because Skills are prompt-only definitions and cannot be executed directly in the CLI
- `Agent`: requires a configured `runner` and `model`; calls the runner to generate a response
- `LLMProcessor`: requires a configured `runner` and `model`; calls the runner to generate a response
- `LLMEval`: requires a configured `runner` and `model`; calls the runner to score the response
- `Checklist`: runs all child evals and computes a weighted average
- `EvalTree`: routes through branches and computes weighted leaf scores
- `Orchestration`: executes the DAG and returns the final executed step output
For local evals, dataset row fields and component output fields are merged into the eval kwargs.
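The `Tool`, `CodeEval`, and `Checklist` rules can be sketched in plain Python. All function names here are hypothetical. Two details are assumptions rather than documented behavior: that output fields win over row fields on a key clash, and that the weighted average divides by the sum of the weights.

```python
from typing import Any, Callable


def run_tool(fn: Callable[..., Any], row: dict) -> dict:
    """Call fn(**row); wrap a non-dict return value as {"result": value}."""
    output = fn(**row)
    return output if isinstance(output, dict) else {"result": output}


def run_code_eval(fn: Callable[..., Any], row: dict, component_output: dict) -> dict:
    """Merge row and output fields into kwargs; wrap a non-dict score."""
    kwargs = {**row, **component_output}  # assumed: output fields win on clashes
    result = fn(**kwargs)
    return result if isinstance(result, dict) else {"score": result}


def checklist_score(scores: list[float], weights: list[float]) -> float:
    """Weighted average of child scores (assumed to normalize by total weight)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def margin(revenue: int, cost: int) -> int:
    return revenue - cost


def margin_is_positive(result: int, **_: Any) -> float:
    return 1.0 if result > 0 else 0.0


row = {"revenue": 100, "cost": 40}
output = run_tool(margin, row)
print(output)                                          # {'result': 60}
print(run_code_eval(margin_is_positive, row, output))  # {'score': 1.0}
print(checklist_score([1.0, 0.5], [3, 1]))             # 0.875
```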
Discovery
The CLI discovers local components by importing your Python file and scanning module-level attributes for zhanla component instances.
That means:
- your file is executed during discovery
- module-level side effects will run
- components should usually be defined at module scope
- local imports from the same directory are supported
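The sketch below shows why discovery executes your file: importing a module by path runs its top-level code before any attributes can be scanned. It matches components by class name only so that it stays self-contained; that matching rule and the stand-in `Tool` class are assumptions, and a real implementation would more likely use `isinstance` against the SDK's classes.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType

COMPONENT_CLASS_NAMES = {"Tool", "CodeEval"}  # stand-ins for the SDK's classes


def load_module(path: Path) -> ModuleType:
    """Import a Python file by path; its top-level code runs here."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def discover(path: Path) -> dict[str, object]:
    """Return module-level attributes whose class name marks them as components."""
    module = load_module(path)
    return {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_") and type(value).__name__ in COMPONENT_CLASS_NAMES
    }


source = '''
class Tool:                      # stand-in class, not the SDK's Tool
    def __init__(self, name):
        self.name = name

print("module executed")         # side effect that runs during discovery
my_tool = Tool("my_tool")
helper = 42
'''

with tempfile.TemporaryDirectory() as tmp:
    file = Path(tmp) / "components.py"
    file.write_text(source, encoding="utf-8")
    print(sorted(discover(file)))  # ['my_tool'], printed after "module executed"
```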
Name resolution rules:
- if a file has exactly one runnable component, `zhanla run file.py` auto-selects it
- if a file has multiple runnable components, use `file.py:component_name`
- if a file has exactly one eval, `--eval evals.py` auto-selects it
- if a referenced name is ambiguous or missing, the CLI exits with an error
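These resolution rules reduce to a short decision function. `resolve_name` and its error messages are hypothetical; the CLI's actual wording is not given here.

```python
from typing import Optional


def resolve_name(available: list[str], requested: Optional[str]) -> str:
    """Pick a component name from those discovered in a file."""
    if requested is not None:
        if requested not in available:
            raise LookupError(f"no component named {requested!r}")
        return requested
    if len(available) == 1:
        return available[0]  # a single candidate is auto-selected
    if not available:
        raise LookupError("no components found")
    raise LookupError("multiple components found; use file.py:component_name")


print(resolve_name(["my_tool"], None))             # my_tool
print(resolve_name(["my_tool", "other"], "other")) # other
```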
Validation
The CLI validates component structure before execution.
- `Tool` must provide a callable `fn`
- `Tool` must declare a non-`None` `output_schema`
- `CodeEval` must provide a callable `fn`
- `Skill`, `Agent`, `LLMProcessor`, and `LLMEval` must define `instructions`
- `Agent`, `LLMProcessor`, and `LLMEval` must define `model`
- `Orchestration` step targets must exist and the graph must be acyclic
- `Checklist` weights must match the number of evals and be positive
- `EvalTree` branch thresholds must be in `[0.0, 1.0]`
- `EvalTree` edge weights must be positive
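Two of these checks, the `Orchestration` graph rule and the `Checklist` weight rule, can be sketched with the standard library. The data shapes are assumptions: steps are modeled as a dict mapping each step name to the steps it depends on, and both function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def check_orchestration(steps: dict[str, list[str]]) -> list[str]:
    """Verify every target exists and the graph is acyclic; return a run order."""
    missing = {dep for deps in steps.values() for dep in deps} - set(steps)
    if missing:
        raise ValueError(f"unknown step targets: {sorted(missing)}")
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"orchestration graph has a cycle: {exc.args[1]}") from exc


def check_checklist(weights: list[float], eval_count: int) -> None:
    """Weights must match the number of evals and all be positive."""
    if len(weights) != eval_count:
        raise ValueError("number of weights must match number of evals")
    if any(weight <= 0 for weight in weights):
        raise ValueError("weights must be positive")


print(check_orchestration({"fetch": [], "score": ["fetch"]}))  # ['fetch', 'score']
```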
For local component runs with an output_schema, the CLI also validates the first produced output for:
- missing keys
- extra keys
- simple `isinstance` type mismatches
Schema mismatches stop the run immediately.
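The three checks can be sketched as one function. The schema shape used here, a mapping from field name to Python type, is an assumption made for illustration; the SDK's real `output_schema` format may differ, and `check_output` is a hypothetical name.

```python
def check_output(output: dict, schema: dict[str, type]) -> list[str]:
    """Report missing keys, extra keys, and isinstance mismatches."""
    problems = []
    for key in sorted(schema.keys() - output.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(output.keys() - schema.keys()):
        problems.append(f"extra key: {key}")
    for key in sorted(schema.keys() & output.keys()):
        if not isinstance(output[key], schema[key]):
            problems.append(
                f"type mismatch for {key}: expected {schema[key].__name__}, "
                f"got {type(output[key]).__name__}"
            )
    return problems


print(check_output({"margin": "60", "note": "x"}, {"margin": int, "status": str}))
# ['missing key: status', 'extra key: note', 'type mismatch for margin: expected int, got str']
```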
Output And Summaries
Local eval summary
For local evals, the CLI prints a score table and a mean score summary.
Without upload:
Mean score: 0.900
Dry run — nothing uploaded
With upload:
- prints `Results uploaded`
- prints `Autorater ID` and `Autorater Run ID` when available
- prints a result URL when available
Remote evaluation summary
For --web-eval and --web-config flows, the CLI prints a remote summary table including:
- run status
- overall score
- items completed vs total
- remote error, if any
zhanla list datasets
List datasets available in the web app:
zhanla list datasets
Optional filters:
zhanla list datasets --component-type tool
zhanla list datasets --component-id component-uuid
zhanla list datasets --name support
Supported --component-type values:
- `agent`
- `skill`
- `tool`
- `orchestration`
The command requires login and prints dataset IDs that can be used with --web-dataset.
zhanla list autoraters
List managed autoraters available in the web app:
zhanla list autoraters
Optional filters:
zhanla list autoraters --component-type tool
zhanla list autoraters --component-id component-uuid
zhanla list autoraters --name quality
This command also requires login and prints autorater IDs that can be used with --web-eval.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `BENCH_BASE_URL` | `https://benchmark-black.vercel.app` | Base URL for auth, evaluation APIs, and dashboard links |
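Resolving the base URL presumably follows the usual environment-with-default pattern. A minimal sketch, in which `base_url` is a hypothetical helper:

```python
import os

DEFAULT_BASE_URL = "https://benchmark-black.vercel.app"


def base_url() -> str:
    """Return BENCH_BASE_URL if set, otherwise the documented default."""
    return os.environ.get("BENCH_BASE_URL", DEFAULT_BASE_URL)
```

Setting `BENCH_BASE_URL` in the shell before invoking the CLI would then point auth, evaluation, and dashboard links at another deployment.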
File Locations
| Path | Purpose |
|---|---|
| `~/.benchmark/credentials.json` | Saved SDK credentials and org metadata |
Upload Behavior
For authenticated local runs, the CLI syncs component definitions and run results to Supabase using the short-lived access token obtained from your API key.
At a high level it:
- creates an authenticated Supabase client
- syncs or updates component definitions using content-based version hashes
- creates or reuses datasets and dataset items for local datasets
- links datasets back to the component definition
- writes component and eval run records
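A content-based version hash means that identical definitions map to the same version and any change produces a new one. The document does not say which scheme the CLI uses; the sketch below assumes SHA-256 over canonical JSON, and `version_hash` is a hypothetical name.

```python
import hashlib
import json


def version_hash(definition: dict) -> str:
    """Hash a definition so that key order does not affect the result."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = version_hash({"name": "my_tool", "description": "Compute margin"})
b = version_hash({"description": "Compute margin", "name": "my_tool"})
c = version_hash({"name": "my_tool", "description": "Compute gross margin"})
print(a == b)  # True: same content, different key order
print(a == c)  # False: the description changed
```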
For local component + web autorater runs, the CLI additionally starts a remote evaluate-only run after upload.
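Structured component outputs are still preserved in traces and uploaded run data. Only the values passed to managed evals are serialized to strings, and the platform parses them again according to model_response_format before the eval function runs.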
Examples
Run a local tool against a local eval and skip upload:
zhanla run component.py:my_tool --dataset data.json --eval eval.py:my_eval --dry-run
Run a local component and send outputs to a managed autorater:
zhanla run component.py:my_tool --dataset data.json --web-eval 123e4567-e89b-12d3-a456-426614174000
Run fully against web-managed prompt, dataset, and autorater resources:
zhanla run --web-config prompt-123 --web-dataset dataset-123 --web-eval autorater-123
List available datasets:
zhanla list datasets
List available autoraters:
zhanla list autoraters
Running Tests
python3 -m pytest packages/cli -v
The test suite covers:
- auth credential persistence
- component and eval discovery
- component validation
- dataset loading and schema validation
- local execution and eval execution
- run command behavior for dry runs, uploads, and web-backed flows
- writer upload behavior