Find the cheapest LLM that gets your task 100% right
SkillEval
SkillEval is a CLI tool that automates LLM evaluation for deterministic tasks. It runs your task across multiple models in parallel, compares outputs against expected results, and recommends the most cost-effective option.
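The selection rule described above can be sketched roughly as follows. This is an illustration only, not SkillEval's actual implementation: the `ModelResult` type, its field names, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    """Hypothetical per-model summary; not SkillEval's real data model."""
    model: str
    passed: int       # cases whose output matched the expected result
    total: int        # cases attempted
    cost_usd: float   # total cost of those cases


def cheapest_perfect(results: list[ModelResult]) -> Optional[ModelResult]:
    """Return the lowest-cost model that passed every case, or None."""
    perfect = [r for r in results if r.total > 0 and r.passed == r.total]
    return min(perfect, key=lambda r: r.cost_usd, default=None)


# Made-up numbers, purely for illustration.
sample = [
    ModelResult("model-a", passed=10, total=10, cost_usd=0.042),
    ModelResult("model-b", passed=9, total=10, cost_usd=0.008),
    ModelResult("model-c", passed=10, total=10, cost_usd=0.015),
]
best = cheapest_perfect(sample)
print(best.model if best else "no model reached 100%")
```

Note that `model-b` is the cheapest overall but is excluded because it missed one case; only models with a perfect score compete on price.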
Quick Start
```shell
pip install -e .

# Set at least one provider API key
export DASHSCOPE_API_KEY="sk-..."  # Qwen (Alibaba DashScope)

# Create a task folder
skilleval init my-task

# Add your input files, expected output, and skill prompt
# Then run the evaluation
skilleval run my-task/

# Machine-readable output
skilleval run my-task/ --json | jq '.recommendation'
```
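If you would rather consume the JSON output from a script than from `jq`, something like the sketch below may work. It relies only on what the commands above show (the `run` subcommand, the `--json` flag, and a top-level `recommendation` key). The helper names are hypothetical, the shape of the recommendation value is not documented here, and the use of a non-zero exit code on failure is an assumption.

```python
import json
import subprocess
from typing import Any


def parse_recommendation(stdout_text: str) -> Any:
    """Return the top-level "recommendation" value from the JSON output."""
    report = json.loads(stdout_text)
    return report.get("recommendation")


def run_and_recommend(task_dir: str) -> Any:
    """Run `skilleval run <task_dir> --json` and return its recommendation."""
    proc = subprocess.run(
        ["skilleval", "run", task_dir, "--json"],
        capture_output=True,  # keep stdout (JSON) separate from stderr
        text=True,
        check=True,           # assumption: failures exit non-zero
    )
    return parse_recommendation(proc.stdout)
```

Calling `run_and_recommend("my-task/")` requires `skilleval` on your `PATH` and a configured provider key.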
Supported Providers
| Provider | Platform | Env Variable |
|---|---|---|
| Qwen | Alibaba Cloud / DashScope | DASHSCOPE_API_KEY |
| GLM | Zhipu AI / BigModel | ZHIPU_API_KEY |
| MiniMax | MiniMax | MINIMAX_API_KEY |
Evaluation Modes
- Mode 1 (`run`): You write the prompt (skill), SkillEval tests it across models.
- Mode 2 (`matrix`): One model writes the prompt, another executes it. Tests all creator x executor combinations.
- Mode 3 (`chain`): A meta-skill guides prompt creation, then another model executes it. Full pipeline evaluation.
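To get a feel for how quickly Mode 2 grows, the creator x executor grid can be enumerated as a Cartesian product. This is only a sketch: the model names are placeholders, and whether SkillEval pairs a model with itself is an assumption made here.

```python
from itertools import product

# Placeholder model names; substitute whatever your catalog provides.
creators = ["model-a", "model-b"]
executors = ["model-a", "model-b", "model-c"]

# Every (creator, executor) pair, including a model paired with itself.
pairs = list(product(creators, executors))

for creator, executor in pairs:
    print(f"{creator} writes the prompt -> {executor} runs it")

print(len(pairs), "combinations")
```

The number of runs is `len(creators) * len(executors)`, so adding one model to both lists raises the cost noticeably.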
Additional Features
- Ad-hoc endpoints: Use any OpenAI-compatible API without editing the catalog: `--endpoint`, `--api-key`, `--model-name`.
- Skill linting (`lint`): Validate Claude Code skill structure (frontmatter, phases, references, code blocks).
- Skill testing (`skill-test`): Test a skill's core prompt logic against expected outputs.
- Run comparison (`compare`): Diff two runs to detect improvements or regressions.
- HTML reports (`report --html`): Generate self-contained HTML reports for sharing.
- JSON output (`--json`): Machine-readable JSON on the `run`, `matrix`, `chain`, `catalog`, and `report` commands for piping into other tools.
- Verbose logging (`-v` / `-vv`): `-v` for INFO, `-vv` for DEBUG. Logs go to stderr so they don't interfere with `--json` output.
- Auto-confirm (`--yes` / `-y`): Skip the confirmation prompt on `chain` (replaces the old `--confirm` flag).
- Config validation: Warns on unknown keys in `config.yaml` and validates comparator names at load time.
- Circuit breaker: Automatically skips a provider after 5 consecutive failures, avoiding wasted time and cost.
- Ctrl+C handling: Saves partial results on interrupt so you never lose a half-finished run.
- Friendly errors: No raw tracebacks by default; use `-vv` to see full stack traces when debugging.
- Progress bar: Now shows elapsed time and ETA alongside the completion percentage.
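The circuit-breaker behaviour in the list above (skip a provider after 5 consecutive failures) could look something like this minimal sketch. The class and method names are hypothetical and do not reflect SkillEval's internals; in particular, resetting the count on success is an assumption drawn from the word "consecutive".

```python
class CircuitBreaker:
    """Skip a provider after `threshold` consecutive failures (illustrative)."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self._failures: dict[str, int] = {}

    def is_open(self, provider: str) -> bool:
        """True when the provider should be skipped."""
        return self._failures.get(provider, 0) >= self.threshold

    def record_success(self, provider: str) -> None:
        # A success breaks the streak, so the count starts over.
        self._failures[provider] = 0

    def record_failure(self, provider: str) -> None:
        self._failures[provider] = self._failures.get(provider, 0) + 1


breaker = CircuitBreaker()
for _ in range(4):
    breaker.record_failure("qwen")
breaker.record_success("qwen")   # streak reset after four failures
for _ in range(5):
    breaker.record_failure("qwen")

print(breaker.is_open("qwen"))   # True: five failures in a row
print(breaker.is_open("glm"))    # False: no failures recorded
```

A caller would check `is_open(provider)` before each request and skip the call when it returns `True`.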
Documentation
See the User Manual (also available in Chinese) for detailed setup instructions, configuration options, comparator reference, and walkthroughs.
Development
```shell
pip install -e ".[dev,docs]"
pytest
ruff check src/ tests/
```
See CONTRIBUTING.md (also available in Chinese) for full contributor guidelines.
License
File details
Details for the file skilleval-0.1.0.tar.gz.
File metadata
- Download URL: skilleval-0.1.0.tar.gz
- Upload date:
- Size: 109.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba48757b19300bb1da97b21845e1eb409aada23f528015ed53ae92765f428ca2 |
| MD5 | 00eb39b64b43aac573205bd6091c7a92 |
| BLAKE2b-256 | 0e4b23f530483b5ec9898f4db32ddfe894d61d37ec6d5d0bb519bb2e75ed117b |
File details
Details for the file skilleval-0.1.0-py3-none-any.whl.
File metadata
- Download URL: skilleval-0.1.0-py3-none-any.whl
- Upload date:
- Size: 53.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d7188d498e8ef4190554ac062c0448f0b0585110aaca5a20742cf2260b0b20b |
| MD5 | 01a21de3e63b62b5af02a249fae61f5f |
| BLAKE2b-256 | 2cce0fc7f373760b319a0707e208de8326e799e2348750b6cd3a7954ec698dfe |
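To check a downloaded file against the SHA256 digests listed above, a short script using Python's standard `hashlib` module is enough. This is a minimal sketch; the helper names `sha256_of` and `matches` are hypothetical, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches(Path("skilleval-0.1.0.tar.gz"), "ba48757b19300bb1da97b21845e1eb409aada23f528015ed53ae92765f428ca2")` should return `True` for an intact source distribution.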