CLI from Zhanla for running and uploading AI component evaluations

zhanla

Command-line interface for discovering Benchmark SDK components, running them against datasets, and syncing results to the Benchmark web app.

The CLI can be installed on its own for login, listing web resources, and fully web-backed runs. Local component execution requires the matching language SDK.

Installation

pip install zhanla

Requires Python >=3.10.

Runtime dependencies:

  • typer
  • rich
  • supabase
  • httpx

After installation, the zhanla command is available in your shell.

Optional local execution dependencies:

  • Python components: pip install zhanla-sdk-py
  • TypeScript components: npm install @zhanla/sdk-ts

Authentication

zhanla login

Authenticate with your SDK API key:

zhanla login

The CLI prompts for a composite key in this format:

bm_kid_XXXX.bm_sec_XXXX

On success, the CLI exchanges that key for token metadata and stores credentials at ~/.benchmark/credentials.json.

Saved fields:

Field               Description
key_id              The bm_kid_... portion of the API key
secret              The bm_sec_... portion of the API key
supabase_url        Supabase project URL
supabase_anon_key   Public anon key used to initialize the client
org_id              Organization ID
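
As an illustration of how the composite key maps onto the key_id and secret fields, here is a minimal sketch. The helper name split_api_key is hypothetical and is not part of the CLI; only the bm_kid_ and bm_sec_ prefixes and the dot separator come from the format shown above.

```python
def split_api_key(composite: str) -> tuple[str, str]:
    """Split 'bm_kid_XXXX.bm_sec_XXXX' into (key_id, secret).

    Hypothetical helper: the real CLI may validate the key differently.
    """
    key_id, sep, secret = composite.strip().partition(".")
    if not sep or not key_id.startswith("bm_kid_") or not secret.startswith("bm_sec_"):
        raise ValueError("expected a key of the form bm_kid_XXXX.bm_sec_XXXX")
    return key_id, secret


key_id, secret = split_api_key("bm_kid_XXXX.bm_sec_XXXX")
```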

zhanla logout

Remove saved credentials:

zhanla logout

Token Refresh

When a command needs authenticated access, the CLI exchanges the saved API key for a fresh short-lived token. If you are not logged in, authenticated commands exit with:

Not logged in. Run `zhanla login` first.
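
The not-logged-in check can be pictured roughly as follows. This is a sketch under the assumption that the CLI simply looks for the saved credentials file; load_credentials is a hypothetical name, and the token exchange itself is not shown.

```python
import json
from pathlib import Path

# Location documented above; the helper below is illustrative only.
CREDENTIALS_PATH = Path.home() / ".benchmark" / "credentials.json"


def load_credentials(path: Path = CREDENTIALS_PATH) -> dict:
    """Return the saved credential fields, or exit if none are saved."""
    if not path.is_file():
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit("Not logged in. Run `zhanla login` first.")
    return json.loads(path.read_text(encoding="utf-8"))
```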

zhanla run

zhanla run supports three execution modes:

  1. Local component + local eval
  2. Local component + web autorater
  3. Fully web-backed run

The command enforces exactly one source for each required input:

  • exactly one component source: local target or --web-config
  • exactly one dataset source: --dataset or --web-dataset
  • exactly one evaluation source: --eval or --web-eval
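
The exactly-one rule can be sketched as a small check over the three input pairs. The function and parameter names here are hypothetical and only mirror the flags listed above; the CLI's actual validation and error wording may differ.

```python
def check_sources(target=None, web_config=None, dataset=None,
                  web_dataset=None, eval_spec=None, web_eval=None) -> list[str]:
    """Return a list of problems; an empty list means the inputs are valid."""
    pairs = {
        "component": (target, web_config),
        "dataset": (dataset, web_dataset),
        "evaluation": (eval_spec, web_eval),
    }
    problems = []
    for label, options in pairs.items():
        given = sum(value is not None for value in options)
        if given != 1:
            problems.append(f"expected exactly one {label} source, got {given}")
    return problems


# Local component + local eval: one source per input, so no problems.
ok = check_sources(target="components.py:my_tool",
                   dataset="data.json",
                   eval_spec="evals.py:my_eval")
```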

Local component + local eval

Run a component discovered from a Python file and score it with a local eval component:

zhanla run components.py:my_tool --dataset data.json --eval evals.py:my_eval

Notes:

  • component.py:name targets a runnable component defined as a module-level zhanla.* instance
  • --eval eval.py:name targets an eval component such as CodeEval, LLMEval, Checklist, or EvalTree
  • If the file contains exactly one matching component, :name is optional

Local component + web autorater

Run a local component, upload the dataset and outputs, then start a managed web autorater run:

zhanla run components.py:my_tool --dataset data.json --web-eval autorater-uuid

This flow:

  • executes the local component over your dataset
  • uploads dataset rows and local outputs
  • starts a remote evaluate-only run for the specified autorater
  • polls until the remote run completes or fails
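
The final polling step can be sketched as a simple loop. Everything here is an assumption made for illustration: the fetch_status callable, the status strings "completed" and "failed", and the interval and timeout defaults are not taken from the CLI.

```python
import time
from typing import Callable


def wait_for_run(fetch_status: Callable[[], str],
                 interval: float = 2.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status until it reports a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in {"completed", "failed"}:  # hypothetical terminal states
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        sleep(interval)
```

Passing sleep as a parameter keeps the loop easy to exercise in tests without real delays.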

Eval contract note:

  • the CLI sends model_input, model_response, and expected_output to managed evals as serialized strings
  • before your eval function runs, the platform parses those strings according to the eval item's model_response_format
  • CodeEval defaults to model_response_format="JSON" — your function receives a pre-parsed dict/object
  • set model_response_format="TEXT" to receive the raw string instead (useful for keyword or regex checks)
  • LLMEval always receives plain text regardless of model_response_format

model_response_format defaults

Eval type                                model_response_format   Received by function
CodeEval                                 "JSON" (default)        Parsed dict/object
CodeEval(model_response_format="TEXT")   "TEXT"                  Raw string
CodeEval(model_response_format="YAML")   "YAML"                  Parsed dict/object
LLMEval                                  "TEXT" (fixed)          Raw string

Example — structured eval (default JSON):

bench.CodeEval(
    name="check_priority",
    description="Verify the priority field is correct.",
    fn=lambda model_response, expected_output, **_: {
        "score": 1.0 if model_response.get("priority") == expected_output.get("expected_priority") else 0.0
    },
)

Example — text-based eval (explicit TEXT):

bench.CodeEval(
    name="contains_priority_label",
    description="Check that the response mentions a priority level.",
    fn=lambda model_response, **_: {
        "score": 1.0 if any(label in model_response.lower() for label in ["critical", "high", "medium", "low"]) else 0.0
    },
    model_response_format="TEXT",
)
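
To make the parsing step concrete, the sketch below mimics what the platform is described as doing before an eval function runs. parse_for_eval is a hypothetical stand-in, not platform code, and it covers only JSON and TEXT because YAML parsing would need a third-party library.

```python
import json


def parse_for_eval(raw: str, model_response_format: str = "JSON"):
    """Turn a serialized string into the value an eval function would receive."""
    if model_response_format == "TEXT":
        return raw  # raw string, suitable for keyword or regex checks
    if model_response_format == "JSON":
        return json.loads(raw)  # parsed dict/list/scalar
    raise ValueError(f"format not handled in this sketch: {model_response_format}")


def check_priority(model_response, expected_output, **_):
    # Same logic as the structured example above, written as a plain function.
    match = model_response.get("priority") == expected_output.get("expected_priority")
    return {"score": 1.0 if match else 0.0}


structured = check_priority(
    model_response=parse_for_eval('{"priority": "high"}'),
    expected_output=parse_for_eval('{"expected_priority": "high"}'),
)
as_text = parse_for_eval('{"priority": "high"}', "TEXT")
```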

--dry-run is not supported with --web-eval.

Fully web-backed run

Start a run entirely against web-managed resources:

zhanla run --web-config prompt-uuid --web-dataset dataset-uuid --web-eval autorater-uuid

Optional model override:

zhanla run \
  --web-config prompt-uuid \
  --web-dataset dataset-uuid \
  --web-eval autorater-uuid \
  --model-endpoint openai:gpt-4.1-mini

Notes:

  • --web-config requires both --web-dataset and --web-eval
  • --model-endpoint is only valid with --web-config
  • --dry-run is not supported for fully web-backed runs

Options

Flag                       Short   Description
--eval <spec>              -e      Local eval target in file.py:name form
--dataset <path>           -d      Local dataset file (.json or .csv)
--web-eval <id>                    Managed autorater ID
--web-dataset <id>                 Dataset ID from the web app
--web-config <id>                  Prompt ID for fully web-backed runs
--model-endpoint <value>           Model endpoint override for fully web-backed runs
--dry-run                          Execute locally and skip sync/upload

What zhanla run does

For local component runs, the CLI:

  1. Discovers the component from the target file.
  2. Prints a discovered-components table.
  3. Validates the component structure.
  4. Resolves a local eval or web autorater target.
  5. Checks auth when the chosen mode needs it.
  6. Loads the dataset from disk or from the web app.
  7. Executes each dataset row in sequence with a progress bar.
  8. Validates the first component output against output_schema when present.
  9. Uploads local definitions and run results unless --dry-run is set.
  10. Prints either a local score summary or a remote evaluation summary.

If sync/upload fails, the run exits with status 1.

Dataset Formats

JSON

JSON datasets should be a top-level array.

The preferred format is a leading schema row followed by ordinary data rows:

[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]

The leading items may also include a _config metadata row:

[
  {
    "_schema": {
      "revenue": {"type": "integer"},
      "cost": {"type": "integer"}
    }
  },
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]

Leading rows with _schema or _config are treated as metadata and are not executed. Legacy object-shaped JSON datasets with schema and rows are still accepted.
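
One way to picture the metadata handling is the sketch below, which separates leading _schema and _config rows from executable rows. split_dataset is a hypothetical helper, and the legacy branch assumes the object-shaped form uses top-level keys named schema and rows, which the text implies but does not spell out.

```python
import json

METADATA_KEYS = {"_schema", "_config"}


def split_dataset(data):
    """Return (metadata, rows) for a parsed JSON dataset."""
    if isinstance(data, dict):
        # Assumed legacy shape: {"schema": {...}, "rows": [...]}
        return {"_schema": data.get("schema")}, list(data.get("rows", []))
    metadata = {}
    index = 0
    # Only leading rows made up solely of metadata keys are skipped.
    while (index < len(data)
           and isinstance(data[index], dict)
           and data[index]
           and set(data[index]) <= METADATA_KEYS):
        metadata.update(data[index])
        index += 1
    return metadata, data[index:]


items = json.loads("""
[
  {"_schema": {"revenue": {"type": "integer"}, "cost": {"type": "integer"}}},
  {"_config": {"name": "finance", "description": "Margin checks"}},
  {"revenue": 100, "cost": 40},
  {"revenue": 200, "cost": 120}
]
""")
metadata, rows = split_dataset(items)
```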

CSV

CSV datasets use the header row as field names:

revenue,cost
100,40
200,120

CSV values are loaded as strings.
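
Because every CSV value arrives as a string, a component that does arithmetic needs to convert its inputs. The sketch below uses the standard csv module to show the effect; the margin function is a hypothetical component body, and whether the CLI itself uses csv.DictReader is an assumption.

```python
import csv
import io

CSV_TEXT = "revenue,cost\n100,40\n200,120\n"

# DictReader uses the header row as field names and yields string values.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))


def margin(revenue, cost):
    # Convert explicitly: with a CSV dataset both arguments are strings.
    return {"result": int(revenue) - int(cost)}


first = margin(**rows[0])
```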

Empty datasets

If the resolved dataset has no rows, the CLI exits with:

Dataset is empty.

Local Execution Semantics

The CLI executes SDK components according to their current implementation:

  • Tool: runs fn(**row) and normalizes non-dict output to {"result": value}
  • CodeEval: runs fn(**row_and_component_output) and normalizes non-dict output to {"score": value}
  • Skill: raises an error — Skills are prompt-only definitions and cannot be executed directly in the CLI
  • Agent: requires a configured runner and model; calls the runner to generate a response
  • LLMProcessor: requires a configured runner and model; calls the runner to generate a response
  • LLMEval: requires a configured runner and model; calls the runner to score the response
  • Checklist: runs all child evals and computes a weighted average
  • EvalTree: routes through branches and computes weighted leaf scores
  • Orchestration: executes the DAG and returns the final executed step output

For local evals, dataset row fields and component output fields are merged into the eval kwargs.
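
The Tool and CodeEval rules, together with the kwargs merge, can be sketched as follows. run_tool and run_code_eval are hypothetical names, and letting component output win when a key appears in both the row and the output is an assumption; the text does not say which side takes precedence.

```python
def run_tool(fn, row: dict) -> dict:
    """Call fn(**row) and wrap non-dict output as {"result": value}."""
    output = fn(**row)
    return output if isinstance(output, dict) else {"result": output}


def run_code_eval(fn, row: dict, component_output: dict) -> dict:
    """Call fn with merged kwargs and wrap non-dict output as {"score": value}."""
    kwargs = {**row, **component_output}  # assumed precedence: output wins
    verdict = fn(**kwargs)
    return verdict if isinstance(verdict, dict) else {"score": verdict}


row = {"revenue": 100, "cost": 40, "expected": 60}
output = run_tool(lambda revenue, cost, **_: revenue - cost, row)
score = run_code_eval(
    lambda result, expected, **_: 1.0 if result == expected else 0.0,
    row,
    output,
)
```

Accepting **_ in each function lets it ignore row fields it does not use, as the CodeEval examples earlier in this document do.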

Discovery

The CLI discovers local components by importing your Python file and scanning module-level attributes for zhanla component instances.

That means:

  • your file is executed during discovery
  • module-level side effects will run
  • components should usually be defined at module scope
  • local imports from the same directory are supported

Name resolution rules:

  • if a file has exactly one runnable component, zhanla run file.py auto-selects it
  • if a file has multiple runnable components, use file.py:component_name
  • if a file has exactly one eval, --eval evals.py auto-selects it
  • if a referenced name is ambiguous or missing, the CLI exits with an error

Validation

The CLI validates component structure before execution.

  • Tool must provide a callable fn
  • Tool must declare a non-None output_schema
  • CodeEval must provide a callable fn
  • Skill, Agent, LLMProcessor, and LLMEval must define instructions
  • Agent, LLMProcessor, and LLMEval must define model
  • Orchestration step targets must exist and the graph must be acyclic
  • Checklist weights must match the number of evals and be positive
  • EvalTree branch thresholds must be in [0.0, 1.0]
  • EvalTree edge weights must be positive
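
As one example of these rules, the Checklist weight check can be sketched alongside the weighted average it protects. Both function names are hypothetical, and dividing by the sum of the weights is an assumption about how the average is normalized.

```python
def validate_checklist(evals: list, weights: list[float]) -> list[str]:
    """Return validation errors for a Checklist-like definition."""
    errors = []
    if len(weights) != len(evals):
        errors.append(f"got {len(weights)} weights for {len(evals)} evals")
    if any(weight <= 0 for weight in weights):
        errors.append("weights must be positive")
    return errors


def weighted_average(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of child scores; positive weights keep the divisor non-zero."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


errors = validate_checklist(["eval_a", "eval_b"], [3, 1])
combined = weighted_average([1.0, 0.0], [3, 1])
```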

For local component runs with an output_schema, the CLI also validates the first produced output for:

  • missing keys
  • extra keys
  • simple isinstance type mismatches

Schema mismatches stop the run immediately.
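
The three output checks can be sketched as below. The shape of output_schema is not described in this document, so the sketch assumes a plain mapping from field name to Python type; check_output is a hypothetical helper.

```python
def check_output(output: dict, schema: dict[str, type]) -> list[str]:
    """Compare one component output against an assumed name -> type schema."""
    problems = []
    for key in sorted(schema.keys() - output.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(output.keys() - schema.keys()):
        problems.append(f"extra key: {key}")
    for key in sorted(schema.keys() & output.keys()):
        if not isinstance(output[key], schema[key]):
            problems.append(
                f"type mismatch for {key}: expected {schema[key].__name__}, "
                f"got {type(output[key]).__name__}"
            )
    return problems


clean = check_output({"result": 60}, {"result": int})
broken = check_output({"result": "60", "note": "x"}, {"result": int, "unit": str})
```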

Output And Summaries

Local eval summary

For local evals, the CLI prints a score table and a mean score summary.

Without upload:

Mean score: 0.900
Dry run — nothing uploaded

With upload:

  • prints Results uploaded
  • prints Autorater ID and Autorater Run ID when available
  • prints a result URL when available

Remote evaluation summary

For --web-eval and --web-config flows, the CLI prints a remote summary table including:

  • run status
  • overall score
  • items completed vs total
  • remote error, if any

zhanla list datasets

List datasets available in the web app:

zhanla list datasets

Optional filters:

zhanla list datasets --component-type tool
zhanla list datasets --component-id component-uuid
zhanla list datasets --name support

Supported --component-type values:

  • agent
  • skill
  • tool
  • orchestration

The command requires login and prints dataset IDs that can be used with --web-dataset.

zhanla list autoraters

List managed autoraters available in the web app:

zhanla list autoraters

Optional filters:

zhanla list autoraters --component-type tool
zhanla list autoraters --component-id component-uuid
zhanla list autoraters --name quality

This command also requires login and prints autorater IDs that can be used with --web-eval.

Environment Variables

Variable         Default                              Description
BENCH_BASE_URL   https://benchmark-black.vercel.app   Base URL for auth, evaluation APIs, and dashboard links
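
Reading the override with its default might look like the sketch below. base_url is a hypothetical helper, and stripping a trailing slash is a convenience added here, not documented CLI behavior.

```python
import os

DEFAULT_BASE_URL = "https://benchmark-black.vercel.app"


def base_url(environ=os.environ) -> str:
    """Return BENCH_BASE_URL if set, otherwise the documented default."""
    # Trailing slash removed so paths can be appended as f"{base}/path".
    return environ.get("BENCH_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
```

To point the CLI at another deployment, set the variable before invoking zhanla, for example in the shell session that runs it.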

File Locations

Path                            Purpose
~/.benchmark/credentials.json   Saved SDK credentials and org metadata

Upload Behavior

For authenticated local runs, the CLI syncs component definitions and run results to Supabase using the short-lived access token obtained from your API key.

At a high level it:

  • creates an authenticated Supabase client
  • syncs or updates component definitions using content-based version hashes
  • creates or reuses datasets and dataset items for local datasets
  • links datasets back to the component definition
  • writes component and eval run records
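
A content-based version hash can be illustrated with a digest over a canonical serialization of the definition. This is only a sketch of the general idea: the CLI's actual hash inputs and algorithm are not documented here, and version_hash and the example fields are hypothetical.

```python
import hashlib
import json


def version_hash(definition: dict) -> str:
    """SHA-256 over canonical JSON, so equal content yields an equal hash."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Key order does not matter; any change in content produces a new hash.
a = version_hash({"name": "my_tool", "description": "Computes margin"})
b = version_hash({"description": "Computes margin", "name": "my_tool"})
c = version_hash({"name": "my_tool", "description": "Computes gross margin"})
```

With a scheme like this, an unchanged definition can be recognized and reused instead of being uploaded as a new version.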

For local component + web autorater runs, the CLI additionally starts a remote evaluate-only run after upload.

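Structured component outputs are still preserved in traces and uploaded run data. Only the values passed across the managed eval boundary are serialized to strings, and the platform parses them again according to model_response_format before the eval function runs.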

Examples

Run a local tool against a local eval and skip upload:

zhanla run component.py:my_tool --dataset data.json --eval eval.py:my_eval --dry-run

Run a local component and send outputs to a managed autorater:

zhanla run component.py:my_tool --dataset data.json --web-eval 123e4567-e89b-12d3-a456-426614174000

Run fully against web-managed prompt, dataset, and autorater resources:

zhanla run --web-config prompt-123 --web-dataset dataset-123 --web-eval autorater-123

List available datasets:

zhanla list datasets

List available autoraters:

zhanla list autoraters

Running Tests

python3 -m pytest packages/cli -v

The test suite covers:

  • auth credential persistence
  • component and eval discovery
  • component validation
  • dataset loading and schema validation
  • local execution and eval execution
  • run command behavior for dry runs, uploads, and web-backed flows
  • writer upload behavior
