CLI from Zhanla for running and uploading AI component evaluations

zhanla

Command-line interface for discovering Benchmark SDK components, running them against datasets, and syncing results to the Benchmark web app.

The CLI can be installed on its own for login, listing web resources, and fully web-backed runs. Local component execution requires the matching language SDK.

Installation

pip install zhanla

Requires Python >=3.10.

Runtime dependencies:

typer
rich
supabase
httpx

After installation, the zhanla command is available in your shell.

Optional local execution dependencies:

Python components: pip install zhanla-sdk-py
TypeScript components: npm install @zhanla/sdk-ts

Authentication

`zhanla login`

Authenticate with your SDK API key:

zhanla login

The CLI prompts for a composite key in this format:

bm_kid_XXXX.bm_sec_XXXX

On success, the CLI exchanges that key for token metadata and stores credentials at ~/.benchmark/credentials.json.

Saved fields:

Field	Description
`key_id`	The `bm_kid_...` portion of the API key
`secret`	The `bm_sec_...` portion of the API key
`supabase_url`	Supabase project URL
`supabase_anon_key`	Public anon key used to initialize the client
`org_id`	Organization ID
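
As a rough illustration of how the composite key maps onto the saved `key_id` and `secret` fields, the sketch below splits a key at the first dot. The function name `split_api_key` is hypothetical, and the assumption that the key ID never contains a dot is ours; the CLI's actual parsing may differ.

```python
def split_api_key(composite: str) -> dict[str, str]:
    """Split 'bm_kid_XXXX.bm_sec_XXXX' into its key-ID and secret parts.

    Hypothetical helper: assumes the two parts are joined by the first dot.
    """
    key_id, sep, secret = composite.strip().partition(".")
    if not sep or not key_id.startswith("bm_kid_") or not secret.startswith("bm_sec_"):
        raise ValueError("expected a key of the form bm_kid_XXXX.bm_sec_XXXX")
    return {"key_id": key_id, "secret": secret}


parts = split_api_key("bm_kid_XXXX.bm_sec_XXXX")
print(parts["key_id"])  # bm_kid_XXXX
```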

`zhanla logout`

Remove saved credentials:

zhanla logout

Token Refresh

When a command needs authenticated access, the CLI exchanges the saved API key for a fresh short-lived token. If you are not logged in, authenticated commands exit with:

Not logged in. Run `zhanla login` first.

`zhanla run`

zhanla run supports three execution modes:

Local component + local eval
Local component + web autorater
Fully web-backed run

The command enforces exactly one source for each required input:

exactly one component source: local target or --web-config
exactly one dataset source: --dataset or --web-dataset
exactly one evaluation source: --eval or --web-eval
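
The three rules above amount to an "exactly one of" check per input group. A minimal sketch of that check follows; the function `check_sources` and its parameter names are illustrative and not taken from the CLI source, and the sketch does not cover the additional constraints that apply to --web-config.

```python
def check_sources(target=None, web_config=None, dataset=None,
                  web_dataset=None, eval_spec=None, web_eval=None) -> list[str]:
    """Return a list of problems; an empty list means the combination is valid."""
    groups = {
        "component": {"local target": target, "--web-config": web_config},
        "dataset": {"--dataset": dataset, "--web-dataset": web_dataset},
        "evaluation": {"--eval": eval_spec, "--web-eval": web_eval},
    }
    problems = []
    for label, options in groups.items():
        # Count how many of the mutually exclusive options were supplied.
        given = [name for name, value in options.items() if value is not None]
        if len(given) != 1:
            names = " or ".join(options)
            problems.append(f"exactly one {label} source required ({names}); got {len(given)}")
    return problems


print(check_sources(target="components.py:my_tool",
                    dataset="data.json",
                    eval_spec="evals.py:my_eval"))  # []
```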

Local component + local eval

Run a component discovered from a Python file and score it with a local eval component:

zhanla run components.py:my_tool --dataset data.json --eval evals.py:my_eval

Notes:

component.py:name targets a runnable component defined as a module-level zhanla.* instance
--eval eval.py:name targets an eval component such as CodeEval, LLMEval, Checklist, or EvalTree
If the file contains exactly one matching component, :name is optional

Local component + web autorater

Run a local component, upload the dataset and outputs, then start a managed web autorater run:

zhanla run components.py:my_tool --dataset data.json --web-eval autorater-uuid

This flow:

executes the local component over your dataset
uploads dataset rows and local outputs
starts a remote evaluate-only run for the specified autorater
polls until the remote run completes or fails
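
The final polling step can be pictured as a loop that stops on a terminal status. This is only a sketch: the status names, the interval, the timeout, and the `fetch_status` callable are assumptions made for illustration, not documented CLI behavior.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status names


def poll_until_done(fetch_status: Callable[[], str],
                    interval: float = 2.0,
                    timeout: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status() repeatedly until it reports a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout} seconds")
        sleep(interval)


# Usage with a stand-in status source instead of a real HTTP call.
statuses = iter(["queued", "running", "completed"])
final = poll_until_done(lambda: next(statuses), sleep=lambda _seconds: None)
print(final)  # completed
```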

Eval contract note:

the CLI sends model_input, model_response, and expected_output to managed evals as serialized strings
before your eval function runs, the platform parses those strings according to the eval item's model_response_format
CodeEval defaults to model_response_format="JSON" — your function receives a pre-parsed dict/object
set model_response_format="TEXT" to receive the raw string instead (useful for keyword or regex checks)
LLMEval always receives plain text regardless of model_response_format

`model_response_format` defaults

Eval declaration	Effective format	Received by function
`CodeEval`	`"JSON"` (default)	Parsed dict/object
`CodeEval(model_response_format="TEXT")`	`"TEXT"`	Raw string
`CodeEval(model_response_format="YAML")`	`"YAML"`	Parsed dict/object
`LLMEval`	`"TEXT"` (fixed)	Raw string
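
To make the table concrete, here is a rough sketch of how a serialized field could be turned into the value an eval function receives. The helper `parse_field` is hypothetical and stands in for platform-side logic that this page does not show; YAML is left out because the Python standard library has no YAML parser.

```python
import json


def parse_field(raw: str, model_response_format: str = "JSON"):
    """Return the value an eval function would see for a serialized field."""
    fmt = model_response_format.upper()
    if fmt == "TEXT":
        return raw              # raw string, suited to keyword or regex checks
    if fmt == "JSON":
        return json.loads(raw)  # parsed dict/list/scalar
    raise ValueError(f"format not handled in this sketch: {model_response_format}")


parsed = parse_field('{"priority": "high"}')
print(parsed["priority"])                            # high
print(parse_field('{"priority": "high"}', "TEXT"))   # {"priority": "high"}
```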

Example — structured eval (default JSON):

zhanla.CodeEval(
    name="check_priority",
    description="Verify the priority field is correct.",
    fn=lambda model_response, expected_output, **_: {
        "score": 1.0 if model_response.get("priority") == expected_output.get("expected_priority") else 0.0
    },
)

Example — text-based eval (explicit TEXT):

zhanla.CodeEval(
    name="contains_priority_label",
    description="Check that the response mentions a priority level.",
    fn=lambda model_response, **_: {
        "score": 1.0 if any(label in model_response.lower() for label in ["critical", "high", "medium", "low"]) else 0.0
    },
    model_response_format="TEXT",
)

`--dry-run` is not supported with `--web-eval`.

Fully web-backed run

Start a run entirely against web-managed resources:

zhanla run --web-config prompt-uuid --web-dataset dataset-uuid --web-eval autorater-uuid

Optional model override:

zhanla run \
  --web-config prompt-uuid \
  --web-dataset dataset-uuid \
  --web-eval autorater-uuid \
  --model-endpoint openai:gpt-4.1-mini

Notes:

--web-config requires both --web-dataset and --web-eval
--model-endpoint is only valid with --web-config
--dry-run is not supported for fully web-backed runs

Options

Flag	Short	Description
`--eval <spec>`	`-e`	Local eval target in `file.py:name` form
`--dataset <path>`	`-d`	Local dataset file (`.json` or `.csv`)
`--web-eval <id>`		Managed autorater ID
`--web-dataset <id>`		Dataset ID from the web app
`--web-config <id>`		Prompt ID for fully web-backed runs
`--model-endpoint <value>`		Model endpoint override for fully web-backed runs
`--dry-run`		Execute locally and skip sync/upload

What `zhanla run` does

For local component runs, the CLI:

Discovers the component from the target file.
Prints a discovered-components table.
Validates the component structure.
Resolves a local eval or web autorater target.
Checks auth when the chosen mode needs it.
Loads the dataset from disk or from the web app.
Executes each dataset row in sequence with a progress bar.
Validates the first component output against output_schema when present.
Uploads local definitions and run results unless --dry-run is set.
Prints either a local score summary or a remote evaluation summary.

If sync/upload fails, the run exits with status 1.

Dataset Formats

JSON

JSON datasets should be a top-level array.

The preferred format is a leading schema row followed by ordinary data rows:

[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]

A _config row may follow the _schema row as additional leading metadata:

[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]

Leading rows with _schema or _config are treated as metadata and are not executed. Legacy object-shaped JSON datasets with schema and rows are still accepted.
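
One way to picture the metadata handling is a loader that peels leading `_schema` and `_config` rows off the array before execution. The sketch below assumes the array form only; the function name `split_dataset` is hypothetical and the legacy object-shaped form is not handled.

```python
import json

METADATA_KEYS = ("_schema", "_config")


def split_dataset(items: list[dict]) -> tuple[dict, list[dict]]:
    """Separate leading metadata rows from the rows that would be executed."""
    metadata: dict = {}
    index = 0
    # Only rows at the start of the array are treated as metadata.
    while index < len(items) and any(key in items[index] for key in METADATA_KEYS):
        for key in METADATA_KEYS:
            if key in items[index]:
                metadata[key] = items[index][key]
        index += 1
    return metadata, items[index:]


raw = """[
  {"_schema": {"revenue": {"type": "integer"}, "cost": {"type": "integer"}}},
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]"""
metadata, rows = split_dataset(json.loads(raw))
print(len(rows))                    # 2
print(metadata["_config"]["name"])  # finance
```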

CSV

CSV datasets use the header row as field names:

revenue,cost
100,40
200,120

CSV values are loaded as strings.
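
Because values arrive as strings, a component that does arithmetic on CSV fields needs to convert them itself. The snippet below uses Python's standard `csv` module to show the effect; whether the CLI uses this exact reader internally is an assumption.

```python
import csv
import io

text = "revenue,cost\n100,40\n200,120\n"
rows = list(csv.DictReader(io.StringIO(text)))

print(rows[0])  # {'revenue': '100', 'cost': '40'}

# Convert explicitly before doing arithmetic on CSV-sourced fields.
margin = int(rows[0]["revenue"]) - int(rows[0]["cost"])
print(margin)  # 60
```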

Empty datasets

If the resolved dataset has no rows, the CLI exits with:

Dataset is empty.

Local Execution Semantics

How the CLI executes a component locally depends on the component type:

Tool: runs fn(**row) and normalizes non-dict output to {"result": value}
CodeEval: runs fn(**row_and_component_output) and normalizes non-dict output to {"score": value}
Skill: raises an error — Skills are prompt-only definitions and cannot be executed directly in the CLI
Agent: requires a configured runner and model; calls the runner to generate a response
LLMProcessor: requires a configured runner and model; calls the runner to generate a response
LLMEval: requires a configured runner and model; calls the runner to score the response
Checklist: runs all child evals and computes a weighted average
EvalTree: routes through branches and computes weighted leaf scores
Orchestration: executes the DAG and returns the final executed step output

For local evals, dataset row fields and component output fields are merged into the eval kwargs.
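
A minimal sketch of the Tool and CodeEval behavior described above, using plain functions in place of SDK components. The helper names `normalize` and `run_row` are hypothetical, and letting output fields win on a key collision is an assumption of this sketch.

```python
def normalize(value, key: str) -> dict:
    """Wrap a non-dict return value as {key: value}; pass dicts through."""
    return value if isinstance(value, dict) else {key: value}


def run_row(tool_fn, eval_fn, row: dict) -> tuple[dict, dict]:
    """Run one dataset row through a tool function and then an eval function."""
    output = normalize(tool_fn(**row), "result")
    # Row fields and output fields are merged into the eval kwargs.
    # Assumption: output fields take precedence if a name appears in both.
    eval_kwargs = {**row, **output}
    score = normalize(eval_fn(**eval_kwargs), "score")
    return output, score


def margin(revenue, cost):
    return revenue - cost


def positive_margin(result, **_):
    return 1.0 if result > 0 else 0.0


output, score = run_row(margin, positive_margin, {"revenue": 100, "cost": 40})
print(output)  # {'result': 60}
print(score)   # {'score': 1.0}
```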

Discovery

The CLI discovers local components by importing your Python file and scanning module-level attributes for zhanla component instances.

That means:

your file is executed during discovery
module-level side effects will run
components should usually be defined at module scope
local imports from the same directory are supported

Name resolution rules:

if a file has exactly one runnable component, zhanla run file.py auto-selects it
if a file has multiple runnable components, use file.py:component_name
if a file has exactly one eval, --eval evals.py auto-selects it
if a referenced name is ambiguous or missing, the CLI exits with an error

Validation

The CLI validates component structure before execution.

Tool must provide a callable fn
Tool must declare a non-None output_schema
CodeEval must provide a callable fn
Skill, Agent, LLMProcessor, and LLMEval must define instructions
Agent, LLMProcessor, and LLMEval must define model
Orchestration step targets must exist and the graph must be acyclic
Checklist weights must match the number of evals and be positive
EvalTree branch thresholds must be in [0.0, 1.0]
EvalTree edge weights must be positive
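
As an illustration of the Checklist weight rule and the weighted average mentioned under Local Execution Semantics, the sketch below checks the weights and then combines child scores. Dividing by the sum of the weights is an assumption here; the SDK may normalize differently.

```python
def weighted_average(scores: list[float], weights: list[float]) -> float:
    """Combine child eval scores, enforcing the Checklist weight rules."""
    if len(weights) != len(scores):
        raise ValueError("weights must match the number of evals")
    if any(weight <= 0 for weight in weights):
        raise ValueError("weights must be positive")
    total = sum(score * weight for score, weight in zip(scores, weights))
    return total / sum(weights)


print(weighted_average([1.0, 0.0], [3, 1]))  # 0.75
```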

For local component runs with an output_schema, the CLI also validates the first produced output for:

missing keys
extra keys
simple isinstance type mismatches

Schema mismatches stop the run immediately.
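
The three checks can be sketched as set differences plus an `isinstance` test. The schema shape used here, a mapping from field name to Python type, is an assumption made for illustration; the SDK's actual `output_schema` format is not shown on this page, and `check_output` is a hypothetical name.

```python
def check_output(output: dict, schema: dict[str, type]) -> list[str]:
    """Report missing keys, extra keys, and simple type mismatches."""
    problems = []
    for key in schema.keys() - output.keys():
        problems.append(f"missing key: {key}")
    for key in output.keys() - schema.keys():
        problems.append(f"extra key: {key}")
    for key in schema.keys() & output.keys():
        if not isinstance(output[key], schema[key]):
            problems.append(
                f"type mismatch for {key}: expected {schema[key].__name__}, "
                f"got {type(output[key]).__name__}"
            )
    return sorted(problems)


print(check_output({"result": 60}, {"result": int}))  # []
for problem in check_output({"result": "60", "note": 1}, {"result": int, "margin": int}):
    print(problem)
```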

Output And Summaries

Local eval summary

For local evals, the CLI prints a score table and a mean score summary.

Without upload:

Mean score: 0.900
Dry run — nothing uploaded

With upload:

prints Results uploaded
prints Autorater ID and Autorater Run ID when available
prints a result URL when available

Remote evaluation summary

For --web-eval and --web-config flows, the CLI prints a remote summary table including:

run status
overall score
items completed vs total
remote error, if any

`zhanla list datasets`

List datasets available in the web app:

zhanla list datasets

Optional filters:

zhanla list datasets --component-type tool
zhanla list datasets --component-id component-uuid
zhanla list datasets --name support

Supported --component-type values:

agent
skill
tool
orchestration

The command requires login and prints dataset IDs that can be used with --web-dataset.

`zhanla list autoraters`

List managed autoraters available in the web app:

zhanla list autoraters

Optional filters:

zhanla list autoraters --component-type tool
zhanla list autoraters --component-id component-uuid
zhanla list autoraters --name quality

This command also requires login and prints autorater IDs that can be used with --web-eval.

Environment Variables

Variable	Default	Description
`BENCH_BASE_URL`	`https://benchmark-black.vercel.app`	Base URL for auth, evaluation APIs, and dashboard links
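
For scripts that need the same base URL as the CLI, the lookup can be mirrored as below. The helper name `base_url` is hypothetical, and trimming a trailing slash is an assumption of this sketch rather than documented behavior.

```python
import os

DEFAULT_BASE_URL = "https://benchmark-black.vercel.app"


def base_url(env=os.environ) -> str:
    """Return BENCH_BASE_URL if set, otherwise the documented default."""
    return env.get("BENCH_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


print(base_url({}))  # https://benchmark-black.vercel.app
print(base_url({"BENCH_BASE_URL": "http://localhost:3000/"}))  # http://localhost:3000
```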

File Locations

Path	Purpose
`~/.benchmark/credentials.json`	Saved SDK credentials and org metadata

Upload Behavior

For authenticated local runs, the CLI syncs component definitions and run results to Supabase using the short-lived access token obtained from your API key.

At a high level it:

creates an authenticated Supabase client
syncs or updates component definitions using content-based version hashes
creates or reuses datasets and dataset items for local datasets
links datasets back to the component definition
writes component and eval run records
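
A content-based version hash is typically derived from a canonical serialization of the definition, so that identical content yields the same version. The sketch below shows one common approach with SHA-256 over sorted-key JSON; the CLI's actual hashing scheme and the fields it covers are not documented here, so treat `version_hash` and the example definition as assumptions.

```python
import hashlib
import json


def version_hash(definition: dict) -> str:
    """Hash a definition so that equal content gives an equal version."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = version_hash({"name": "my_tool", "description": "Compute margin"})
second = version_hash({"description": "Compute margin", "name": "my_tool"})
changed = version_hash({"name": "my_tool", "description": "Compute gross margin"})

print(first == second)   # True: key order does not matter
print(first == changed)  # False: a content change produces a new hash
```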

For local component + web autorater runs, the CLI additionally starts a remote evaluate-only run after upload.

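Structured component outputs are preserved in traces and uploaded run data. Only the values passed to managed evals are serialized to strings, and the platform parses them back according to model_response_format before the eval function runs.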

Examples

Run a local tool against a local eval and skip upload:

zhanla run component.py:my_tool --dataset data.json --eval eval.py:my_eval --dry-run

Run a local component and send outputs to a managed autorater:

zhanla run component.py:my_tool --dataset data.json --web-eval 123e4567-e89b-12d3-a456-426614174000

Run fully against web-managed prompt, dataset, and autorater resources:

zhanla run --web-config prompt-123 --web-dataset dataset-123 --web-eval autorater-123

List available datasets:

zhanla list datasets

List available autoraters:

zhanla list autoraters

Running Tests

python3 -m pytest packages/cli -v

The test suite covers:

auth credential persistence
component and eval discovery
component validation
dataset loading and schema validation
local execution and eval execution
run command behavior for dry runs, uploads, and web-backed flows
writer upload behavior

Release history

Version	Date
0.1.2.5	May 16, 2026
0.1.2.4	May 14, 2026
0.1.2.3	May 9, 2026
0.1.2.2 (this version)	May 7, 2026
0.1.2.1	May 4, 2026
0.1.2	May 4, 2026
0.1.1	May 3, 2026
0.1.0	May 2, 2026

