Find the cheapest LLM that gets your task 100% right

SkillEval

License: MIT | Python 3.11+

SkillEval is a CLI tool that automates LLM evaluation for deterministic tasks. It runs your task across multiple models in parallel, compares outputs against expected results, and recommends the most cost-effective option.

Quick Start

# Run from a clone of the repository (the package is also published as skilleval)
pip install -e .

# Set at least one provider API key
export DASHSCOPE_API_KEY="sk-..."   # Qwen (Alibaba DashScope)

# Create a task folder
skilleval init my-task

# Add your input files, expected output, and skill prompt
# Then run the evaluation
skilleval run my-task/

# Machine-readable output
skilleval run my-task/ --json | jq '.recommendation'
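The same field can be read from Python instead of jq. The sketch below only assumes what the command above shows, namely that the JSON output is an object with a top-level `recommendation` key. The helper name `extract_recommendation` and the sample payload are hypothetical, and the shape of the value under `recommendation` is not specified here.

```python
import json


def extract_recommendation(raw: str):
    """Parse JSON text and return its top-level 'recommendation' value (None if absent)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data.get("recommendation")


# Hypothetical payload for illustration; real output may contain other fields.
sample = '{"recommendation": {"model": "example-model"}, "results": []}'
print(extract_recommendation(sample))
```

In practice the input string would be the stdout of `skilleval run my-task/ --json`. Logs go to stderr, so stdout can be parsed directly.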

Supported Providers

Provider   Platform                    Env Variable
Qwen       Alibaba Cloud / DashScope   DASHSCOPE_API_KEY
GLM        Zhipu AI / BigModel         ZHIPU_API_KEY
MiniMax    MiniMax                     MINIMAX_API_KEY

Evaluation Modes

  • Mode 1 (run) — You write the prompt (skill), SkillEval tests it across models.
  • Mode 2 (matrix) — One model writes the prompt, another executes it. Tests all creator x executor combinations.
  • Mode 3 (chain) — A meta-skill guides prompt creation, then another model executes it. Full pipeline evaluation.
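To give a sense of how many runs Mode 2 implies, the creator x executor grid can be enumerated as a Cartesian product. This is a sketch under assumptions: the model names are placeholders, and same-model pairs are included because the description says all combinations are tested.

```python
from itertools import product


def creator_executor_pairs(creators: list[str], executors: list[str]) -> list[tuple[str, str]]:
    """Return every (creator, executor) pair, including same-model pairs."""
    return list(product(creators, executors))


# Placeholder model names, not entries from the real catalog.
creators = ["model-a", "model-b"]
executors = ["model-a", "model-b", "model-c"]

pairs = creator_executor_pairs(creators, executors)
for creator, executor in pairs:
    print(f"{creator} writes the prompt -> {executor} executes it")
print(f"{len(pairs)} combinations")
```

The grid grows multiplicatively (2 creators and 3 executors give 6 pairs), so trimming either list is the simplest way to limit cost.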

Additional Features

  • Ad-hoc endpoints — Use any OpenAI-compatible API without editing the catalog: --endpoint, --api-key, --model-name.
  • Skill linting (lint) — Validate Claude Code skill structure (frontmatter, phases, references, code blocks).
  • Skill testing (skill-test) — Test a skill's core prompt logic against expected outputs.
  • Run comparison (compare) — Diff two runs to detect improvements or regressions.
  • HTML reports (report --html) — Generate self-contained HTML reports for sharing.
  • JSON output (--json) — Machine-readable JSON on run, matrix, chain, catalog, and report commands for piping into other tools.
  • Verbose logging (-v / -vv) — -v for INFO, -vv for DEBUG. Logs go to stderr so they don't interfere with --json output.
  • Auto-confirm (--yes / -y) — Skip the confirmation prompt on chain (replaces the old --confirm flag).
  • Config validation — Warns on unknown keys in config.yaml and validates comparator names at load time.
  • Circuit breaker — Automatically skips a provider after 5 consecutive failures, avoiding wasted time and cost.
  • Ctrl+C handling — Saves partial results on interrupt so you never lose a half-finished run.
  • Friendly errors — No raw tracebacks by default; use -vv to see full stack traces when debugging.
  • Progress bar — Shows elapsed time and ETA alongside the completion percentage.
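The circuit breaker in the list above can be pictured as a per-provider counter of consecutive failures. The sketch below is an illustration of the general pattern, not SkillEval's implementation: the class, the `run_jobs` helper, and the simulated provider are hypothetical, and only the threshold of 5 comes from the description.

```python
class CircuitBreaker:
    """Skip a provider once it has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self._consecutive_failures: dict[str, int] = {}

    def is_open(self, provider: str) -> bool:
        """True when the provider should be skipped."""
        return self._consecutive_failures.get(provider, 0) >= self.threshold

    def record_success(self, provider: str) -> None:
        # Any success resets the streak.
        self._consecutive_failures[provider] = 0

    def record_failure(self, provider: str) -> None:
        self._consecutive_failures[provider] = self._consecutive_failures.get(provider, 0) + 1


def run_jobs(jobs, call, breaker):
    """Run (provider, payload) jobs in order, skipping providers whose breaker is open."""
    completed, skipped = [], []
    for provider, payload in jobs:
        if breaker.is_open(provider):
            skipped.append((provider, payload))
            continue
        try:
            completed.append((provider, call(provider, payload)))
            breaker.record_success(provider)
        except Exception:
            breaker.record_failure(provider)
    return completed, skipped


def fake_call(provider, payload):
    """Simulated provider call: 'flaky' always fails, everything else succeeds."""
    if provider == "flaky":
        raise RuntimeError("simulated provider outage")
    return f"{provider}:{payload}"


jobs = [("flaky", i) for i in range(8)] + [("stable", 0)]
completed, skipped = run_jobs(jobs, fake_call, CircuitBreaker(threshold=5))
print(len(completed), "completed;", len(skipped), "skipped")
```

Counting consecutive failures rather than total failures means an occasional error does not disable a provider that is otherwise healthy.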

Documentation

See the User Manual (中文) for detailed setup instructions, configuration options, comparator reference, and walkthroughs.

Development

pip install -e ".[dev,docs]"
pytest
ruff check src/ tests/

See CONTRIBUTING.md (中文) for full contributor guidelines.

License

MIT
