separates precious LLMs from base LLMs. works with any OpenAI- or Anthropic-compatible API

cupel

separates precious LLMs from base LLMs

score local and cloud LLMs with custom prompts and a configurable judge

cupel leaderboard — models ranked by score
a cupel is the small dish used in a fire assay to separate precious metal from base metal

install

curl -fsSL https://cupel.run/install | bash

or

pip install cupel

the UI is bundled in the package

quick start

cupel

opens a browser at localhost:8042

first time cupel is started

ships with example data (8 models scored by Claude Opus 4.6 on 8 prompts) — the dashboard is populated on first launch

  • LLM-assisted authoring — describe what you want to test, an LLM drafts the prompt and 0–3 rubric
  • local + cloud — oMLX, Ollama, LM Studio, SGLang, OpenRouter, Anthropic, OpenAI
  • configurable judge — any model can score responses on a 0–3 rubric with reasoning
  • thinking model support — separates <think> blocks from answers, only judges the response
  • multi-turn + tool calling — multi-step conversations with injected tool results
  • speed tracking — tok/s and response times per model
  • auto-discovery — probes known ports for local inference servers

leaderboard

leaderboard with score vs. speed, overall accuracy, and per-category breakdowns

cupel dashboard: score vs. speed scatter plot with leaderboard

overall accuracy: horizontal bar chart of all models

category fingerprint — radar chart comparing models across categories

score models

select models from discovered providers, filter by prompt category, choose a judge model, and start the run

bench it — run evals from the browser

progress updates via SSE as each prompt completes
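
For readers who want to consume such a stream outside the browser, here is a minimal sketch of parsing Server-Sent Events in Python. It follows the general SSE wire format (`data:` lines, blank line ends an event). The JSON payload fields `prompt_id` and `score` are hypothetical, since cupel's actual endpoint and event shape are not described here.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE event.

    An event is a run of "field: value" lines ended by a blank line;
    several data lines in one event are joined with a newline.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)


# Hypothetical progress events, one per completed prompt.
stream = [
    'data: {"prompt_id": 1, "score": 3}\n',
    "\n",
    ": ping\n",
    'data: {"prompt_id": 2, "score": 2}\n',
    "\n",
]
events = list(parse_sse(stream))
print(events)
```

The parser takes any iterable of lines, so in practice it could be fed from a streaming HTTP response rather than a list.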

running multiple models

author prompts

describe what to test, select a category and difficulty — an LLM generates the title, prompt text, and 0–3 rubric. edit before saving

authoring a prompt

authored prompt with rubric

results

each run is saved as JSON with the model, judge, timestamp, and per-prompt scores with judge reasoning
results can be sorted, tagged, muted and expanded to inspect individual evaluations
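
As a rough illustration of working with a saved run, the sketch below reduces one run to a mean score. The key names (`model`, `judge`, `timestamp`, `scores`, `prompt_id`, `score`, `reasoning`) are assumptions chosen to mirror the description above; the real result files may be laid out differently.

```python
import json
from statistics import mean

MAX_SCORE = 3  # top of the 0-3 rubric

# Hypothetical run record; real cupel result files may use different keys.
run_json = """
{
  "model": "example-model",
  "judge": "claude-opus-4-6",
  "timestamp": "2026-01-01T00:00:00Z",
  "scores": [
    {"prompt_id": 14, "score": 3, "reasoning": "shows the math"},
    {"prompt_id": 21, "score": 1, "reasoning": "partially correct"}
  ]
}
"""


def summarize(run: dict) -> dict:
    """Reduce a run to its mean score and that mean as a fraction of the maximum."""
    avg = mean(item["score"] for item in run["scores"])
    return {
        "model": run["model"],
        "mean_score": avg,
        "fraction_of_max": avg / MAX_SCORE,
    }


summary = summarize(json.loads(run_json))
print(summary)
```

This is only one possible aggregation; how the leaderboard itself computes overall accuracy is not specified here.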

results — browse, tag, and manage all scored runs

judge

set a default judge in UI settings or in config.yml:

judge:
  model: claude-opus-4-6

scores are 0–3:

| score | meaning |
|-------|---------|
| 3 | correct and insightful |
| 2 | correct but shallow |
| 1 | partially correct |
| 0 | wrong or hallucinated |

prompt format

{
  "id": 14,
  "category": "math_estimation",
  "title": "Model Memory from Quantization",
  "prompt": "A model has 70B parameters. Estimate memory for FP16, 8-bit, and 4-bit.",
  "rubric": {
    "3": "FP16: ~140GB, 8-bit: ~70GB, 4-bit: ~35GB. Shows the math.",
    "2": "Correct for 2 of 3, or all correct but no explanation.",
    "1": "Gets the direction right but wrong numbers.",
    "0": "Wrong math or doesn't understand quantization."
  }
}
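
When writing prompt files by hand, a small check can catch missing rubric levels early. The sketch below is a hypothetical validator, not part of cupel. Its rules (exactly one of `prompt` or `turns`, all four rubric levels present, `id` and `title` required) are assumptions inferred from the examples in this README.

```python
def validate_prompt(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = []
    # Assumed rule: single-turn entries use "prompt", multi-turn entries use "turns".
    if ("prompt" in entry) == ("turns" in entry):
        problems.append("expected exactly one of 'prompt' or 'turns'")
    # Assumed rule: the rubric describes every level of the 0-3 scale.
    rubric = entry.get("rubric", {})
    missing = [level for level in ("0", "1", "2", "3") if level not in rubric]
    if missing:
        problems.append(f"rubric is missing levels: {', '.join(missing)}")
    for key in ("id", "title"):
        if key not in entry:
            problems.append(f"missing '{key}'")
    return problems


entry = {
    "id": 14,
    "category": "math_estimation",
    "title": "Model Memory from Quantization",
    "prompt": "A model has 70B parameters. Estimate memory for FP16, 8-bit, and 4-bit.",
    "rubric": {
        "3": "FP16: ~140GB, 8-bit: ~70GB, 4-bit: ~35GB. Shows the math.",
        "2": "Correct for 2 of 3, or all correct but no explanation.",
        "1": "Gets the direction right but wrong numbers.",
        "0": "Wrong math or doesn't understand quantization.",
    },
}
print(validate_prompt(entry))
```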

multi-turn prompts

for tool calling and conversations, use turns instead of prompt:

{
  "id": 21,
  "title": "Tool Calling — School Status Check",
  "turns": [
    {
      "messages": [
        {"role": "system", "content": "You have tools: get_grades(name), ..."},
        {"role": "user", "content": "How are both kids doing?"}
      ]
    },
    {
      "inject_after": [
        {"role": "user", "content": "Tool results: get_grades(\"phoebe\") => ..."}
      ]
    }
  ],
  "rubric": { "3": "Emits correct tool calls, synthesizes results...", "..." : "..." }
}
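
One way to picture how such a `turns` list could be played back is sketched below. The semantics are an assumption, not confirmed behavior: every turn's `messages` or `inject_after` entries are appended to the running conversation, and the model is asked for a reply once per turn. `run_turns` and `fake_model` are hypothetical names used only for illustration.

```python
from typing import Callable

Message = dict[str, str]


def run_turns(turns: list[dict], ask: Callable[[list[Message]], str]) -> list[Message]:
    """Play the turns in order, asking the model once after each turn (assumed semantics)."""
    history: list[Message] = []
    for turn in turns:
        history.extend(turn.get("messages", []))
        history.extend(turn.get("inject_after", []))
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
    return history


def fake_model(messages: list[Message]) -> str:
    # Stand-in for a real API call: report how many messages were seen.
    return f"reply after {len(messages)} messages"


turns = [
    {"messages": [
        {"role": "system", "content": "You have tools: get_grades(name)"},
        {"role": "user", "content": "How are both kids doing?"},
    ]},
    {"inject_after": [
        {"role": "user", "content": "Tool results: ..."},
    ]},
]
transcript = run_turns(turns, fake_model)
for message in transcript:
    print(message["role"], "|", message["content"])
```

Swapping `fake_model` for a function that calls a real chat endpoint would leave the loop unchanged, which is the main reason to pass the model in as a callable.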

thinking models

cupel handles <think> blocks automatically — separates thinking from the answer, only judges the response:

thinking: null   # model default (recommended)
thinking: 0      # disable
thinking: 4096   # explicit budget
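
The separation of `<think>` blocks from the answer can be sketched with a regular expression. This is an illustrative approach under the assumption that thinking is wrapped in literal `<think>...</think>` tags; `split_thinking` is a hypothetical helper, and cupel's own implementation may differ.

```python
import re

# Non-greedy match so several <think> blocks are handled separately;
# DOTALL lets a block span multiple lines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer): the contents of all <think> blocks, and the text without them."""
    thinking = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return thinking, answer


raw = "<think>70B params * 2 bytes = 140GB</think>\nFP16 needs about 140GB."
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:", answer)
```

Only `answer` would be passed to the judge in this sketch; `thinking` is kept so it can still be inspected.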

providers

cloud providers can be added from presets (Anthropic, OpenRouter, OpenAI) or as custom endpoints. the settings page fetches model lists from a provider's API (includes per-token pricing for OpenRouter), validates API keys, and tests connections

settings — add cloud providers, fetch models with pricing

cupel auto-discovers local servers on known ports:

| port | server |
|------|--------|
| 8000 | oMLX / vLLM |
| 11434 | Ollama |
| 1234 | LM Studio |
| 30000 | SGLang |
| 8080 | llama.cpp |
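
A minimal sketch of what probing these ports might look like, using only the standard library. The port list comes from the table above; the probing logic itself (a plain TCP connect to `127.0.0.1` with a short timeout) is an assumption, and `is_listening` and `discover` are hypothetical names.

```python
import socket

# Ports from the table above; the probing logic itself is an assumption.
KNOWN_PORTS = {
    8000: "oMLX / vLLM",
    11434: "Ollama",
    1234: "LM Studio",
    30000: "SGLang",
    8080: "llama.cpp",
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover(ports: dict[int, str] = KNOWN_PORTS) -> dict[int, str]:
    """Return the subset of ports that currently accept connections."""
    return {port: name for port, name in ports.items() if is_listening(port)}


if __name__ == "__main__":
    for port, name in discover().items():
        print(f"{port}: {name}")
```

A successful connect only shows that something is listening on the port. Confirming which server it is would take a follow-up request to that server's API.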

API keys

each provider gets its own env var. put them in .env or ~/.cupel/.env:

OMLX_API_KEY=4242
ANTHROPIC_API_KEY=sk-ant-...
OPENROUTER_API_KEY=sk-or-...
OPENAI_API_KEY=sk-proj-...

or configure in config.yml:

providers:
  - name: openrouter
    api_url: https://openrouter.ai/api/v1/chat/completions
    api_key_env: OPENROUTER_API_KEY
    models: [google/gemini-2.5-pro, deepseek/deepseek-r1]

  - name: anthropic
    api_url: https://api.anthropic.com/v1/messages
    api_key_env: ANTHROPIC_API_KEY
    models: [claude-opus-4-6, claude-sonnet-4-6]
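
To make the relationship between `api_key_env` and the `.env` file concrete, here is a small sketch of resolving a provider's key. The lookup order (real environment first, then `.env` values) is an assumption, and `parse_env` and `resolve_api_key` are hypothetical helpers rather than cupel functions.

```python
import os
from typing import Optional


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(provider: dict, dotenv: dict[str, str]) -> Optional[str]:
    """Look up the variable named by `api_key_env`, preferring the real environment (assumed order)."""
    name = provider.get("api_key_env")
    if not name:
        return None  # e.g. a local server that needs no key
    return os.environ.get(name, dotenv.get(name))


dotenv = parse_env("# keys\nCUPEL_SKETCH_KEY=abc123\n\nOTHER='x y'\n")
provider = {"name": "demo", "api_key_env": "CUPEL_SKETCH_KEY"}
print(resolve_api_key(provider, dotenv) is not None)
```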

CLI

cupel                                  # open dashboard
cupel run                              # collect responses
cupel run --models "Qwen3.5-27B-8bit"  # specific model
cupel run --prompts 18-22              # specific prompts
cupel judge eval-results/*.json        # score with judge
cupel judge eval-results/*.json --judge-model gemma-4-26b-a4b-it-4bit
cupel init                             # create config.yml + eval-set

development

git clone https://github.com/tolitius/cupel.git && cd cupel
pip install -e .
uvicorn cupel.server:app --reload --port 8042

vanilla JS frontend (Preact + HTM from CDN). no build step.

license

Copyright © 2026 tolitius

Distributed under the Apache 2.0 License.
