separates precious LLMs from base LLMs. works with any OpenAI/Anthropic compatible API

cupel

separates precious LLMs from base LLMs

score local and cloud LLMs with custom prompts and a configurable judge

cupel leaderboard — models ranked by score
a cupel is the small dish used in a fire assay to separate precious metal from base metal

install

curl -fsSL https://cupel.run/install | bash

or

pip install cupel

the UI is bundled in the package

quick start

cupel

opens a browser at localhost:8042

first time cupel is started

ships with example data (8 models scored by Claude Opus 4.6 on 8 prompts) — the dashboard is populated on first launch

  • LLM-assisted authoring — describe what you want to test, an LLM drafts the prompt and 0–3 rubric
  • local + cloud — oMLX, Ollama, LM Studio, SGLang, OpenRouter, Anthropic, OpenAI
  • configurable judge — any model can score responses on a 0–3 rubric with reasoning
  • thinking model support — separates <think> blocks from answers, only judges the response
  • multi-turn + tool calling — multi-step conversations with injected tool results
  • speed tracking — tok/s and response times per model
  • auto-discovery — probes known ports for local inference servers

leaderboard

leaderboard with score vs. speed, overall accuracy, and per-category breakdowns

cupel dashboard: score vs. speed scatter plot with leaderboard

overall accuracy: horizontal bar chart of all models

category fingerprint — radar chart comparing models across categories

score models

select models from discovered providers, filter by prompt category, choose a judge model, and start the run

bench it — run evals from the browser

progress updates via SSE as each prompt completes
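
a client can follow a run by reading the event stream line by line. a minimal sketch of parsing server-sent events, assuming each event carries a JSON payload in its `data:` field. the payload fields (`prompt_id`, `score`) are made up for illustration, and cupel's actual endpoint and event shape are not documented here:

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # the SSE format drops a single leading space after the colon
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # a blank line ends the event
            yield "\n".join(buffer)
            buffer = []
        # anything else (comments such as ": keep-alive", other fields) is ignored


# hypothetical stream, one event per completed prompt
stream = [
    'data: {"prompt_id": 14, "score": 3}\n',
    "\n",
    ": keep-alive\n",
    'data: {"prompt_id": 21, "score": 2}\n',
    "\n",
]
events = [json.loads(payload) for payload in iter_sse_data(stream)]
print(events)
```

the same generator works on any iterable of text lines, for example the decoded lines of a streaming HTTP response.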

running multiple models

author prompts

describe what to test, select a category and difficulty — an LLM generates the title, prompt text, and 0–3 rubric. edit before saving

authoring a prompt

authored prompt with rubric

results

each run is saved as JSON with the model, judge, timestamp, and per-prompt scores with judge reasoning
results can be sorted, tagged, muted and expanded to inspect individual evaluations
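
since runs are plain JSON, they are easy to post-process outside the UI. a minimal sketch that averages the per-prompt scores of one run. the field names (`model`, `judge`, `results`, `score`) are assumptions for illustration, not cupel's documented schema, so check them against a real file in eval-results/ first:

```python
import json

MAX_SCORE = 3  # the rubric tops out at 3


def summarize_run(run: dict) -> dict:
    """Return the mean score of a run and that mean as a percent of the maximum."""
    # assumed layout: run["results"] is a list of {"prompt_id", "score", "reasoning"}
    scores = [r["score"] for r in run["results"] if r.get("score") is not None]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {
        "model": run.get("model"),
        "judge": run.get("judge"),
        "prompts": len(scores),
        "mean_score": round(mean, 2),
        "percent": round(100 * mean / MAX_SCORE, 1),
    }


# hypothetical run, shaped like the description above
example_run = json.loads("""
{
  "model": "example-model",
  "judge": "claude-opus-4-6",
  "timestamp": "2026-01-01T00:00:00Z",
  "results": [
    {"prompt_id": 14, "score": 3, "reasoning": "shows the math"},
    {"prompt_id": 21, "score": 2, "reasoning": "correct but shallow"}
  ]
}
""")

summary = summarize_run(example_run)
print(summary)
```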

results — browse, tag, and manage all scored runs

judge

set a default judge in UI settings or in config.yml:

judge:
  model: claude-opus-4-6

scores are 0–3:

score  meaning
3      correct and insightful
2      correct but shallow
1      partially correct
0      wrong or hallucinated

prompt format

{
  "id": 14,
  "category": "math_estimation",
  "title": "Model Memory from Quantization",
  "prompt": "A model has 70B parameters. Estimate memory for FP16, 8-bit, and 4-bit.",
  "rubric": {
    "3": "FP16: ~140GB, 8-bit: ~70GB, 4-bit: ~35GB. Shows the math.",
    "2": "Correct for 2 of 3, or all correct but no explanation.",
    "1": "Gets the direction right but wrong numbers.",
    "0": "Wrong math or doesn't understand quantization."
  }
}
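
the reference numbers in this rubric follow from parameters × bits per weight / 8. a quick check, counting weights only (no KV cache, activations, or runtime overhead) and using decimal gigabytes. `model_memory_gb` is a helper named here for illustration:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{model_memory_gb(70, bits):.0f}GB")
```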

multi-turn prompts

for tool calling and conversations, use turns instead of prompt:

{
  "id": 21,
  "title": "Tool Calling — School Status Check",
  "turns": [
    {
      "messages": [
        {"role": "system", "content": "You have tools: get_grades(name), ..."},
        {"role": "user", "content": "How are both kids doing?"}
      ]
    },
    {
      "inject_after": [
        {"role": "user", "content": "Tool results: get_grades(\"phoebe\") => ..."}
      ]
    }
  ],
  "rubric": { "3": "Emits correct tool calls, synthesizes results...", "..." : "..." }
}
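
one way to read this format: each turn appends its messages to the conversation, the model answers, and the answer is kept in the history for the next turn. a minimal sketch of that loop, not cupel's actual runner. it assumes one model call per turn, and `complete` is a hypothetical callable that takes the message list and returns the reply text:

```python
from typing import Callable

Message = dict  # {"role": ..., "content": ...}


def run_turns(turns: list, complete: Callable[[list], str]) -> list:
    """Play the turns against a model and return the full transcript."""
    history = []
    for turn in turns:
        # a turn contributes opening "messages" or "inject_after" messages (e.g. tool results)
        history.extend(turn.get("messages", []))
        history.extend(turn.get("inject_after", []))
        reply = complete(history)
        history.append({"role": "assistant", "content": reply})
    return history


turns = [
    {"messages": [
        {"role": "system", "content": "You have tools: get_grades(name), ..."},
        {"role": "user", "content": "How are both kids doing?"},
    ]},
    {"inject_after": [
        {"role": "user", "content": "Tool results: get_grades(\"phoebe\") => ..."},
    ]},
]

# stand-in for a real model call: reports how many messages it was shown
transcript = run_turns(turns, lambda history: f"reply after {len(history)} messages")
for message in transcript:
    print(message["role"], "|", message["content"])
```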

thinking models

cupel handles <think> blocks automatically — separates thinking from the answer, only judges the response:

thinking: null   # model default (recommended)
thinking: 0      # disable
thinking: 4096   # explicit budget
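
the separation step can be pictured as a regex split. a minimal sketch of the idea, not cupel's implementation. `split_thinking` is a name chosen here, and the sketch only handles properly closed <think>...</think> blocks:

```python
import re

# non-greedy match so several think blocks in one response are handled separately
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple:
    """Return (thinking, answer): think-block contents and the remaining text."""
    thinking = "\n".join(block.strip() for block in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return thinking, answer


raw = "<think>70B * 2 bytes = 140GB</think>\nFP16 needs about 140GB."
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:  ", answer)  # only this part would go to the judge
```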

providers

cloud providers can be added from presets (Anthropic, OpenRouter, OpenAI) or as custom endpoints. the settings page fetches model lists from a provider's API (includes per-token pricing for OpenRouter), validates API keys, and tests connections

settings — add cloud providers, fetch models with pricing

cupel auto-discovers local servers on known ports:

port   server
8000   oMLX / vLLM
11434  Ollama
1234   LM Studio
30000  SGLang
8080   llama.cpp
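
a port probe of this kind can be sketched with plain TCP connects. this illustrates the idea and is not cupel's discovery code. an open port only shows that something is listening, so a real probe would likely follow up with an HTTP request to confirm which server it is (port 8000, for instance, is shared by oMLX and vLLM):

```python
import socket

# ports from the table above
KNOWN_PORTS = {
    8000: "oMLX / vLLM",
    11434: "Ollama",
    1234: "LM Studio",
    30000: "SGLang",
    8080: "llama.cpp",
}


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, timed out, or unreachable
        return False


def discover(ports: dict = KNOWN_PORTS) -> dict:
    """Return the subset of ports that currently accept connections."""
    return {port: name for port, name in ports.items() if port_is_open(port)}


if __name__ == "__main__":
    for port, name in discover().items():
        print(f"{port}: {name}")
```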

API keys

each provider gets its own env var. put them in .env or ~/.cupel/.env:

OMLX_API_KEY=4242
ANTHROPIC_API_KEY=sk-ant-...
OPENROUTER_API_KEY=sk-or-...
OPENAI_API_KEY=sk-proj-...

or configure in config.yml:

providers:
  - name: openrouter
    api_url: https://openrouter.ai/api/v1/chat/completions
    api_key_env: OPENROUTER_API_KEY
    models: [google/gemini-2.5-pro, deepseek/deepseek-r1]

  - name: anthropic
    api_url: https://api.anthropic.com/v1/messages
    api_key_env: ANTHROPIC_API_KEY
    models: [claude-opus-4-6, claude-sonnet-4-6]
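
the `api_key_env` indirection keeps secrets out of config.yml: the config names a variable and the key itself lives in the environment. a minimal sketch of resolving those names, using a plain list of dicts that mirrors the YAML above (loading the file is left out, and `resolve_api_keys` and `missing_keys` are illustrative names, not cupel functions):

```python
import os

# mirrors the providers section above
providers = [
    {"name": "openrouter", "api_key_env": "OPENROUTER_API_KEY"},
    {"name": "anthropic", "api_key_env": "ANTHROPIC_API_KEY"},
]


def resolve_api_keys(providers: list, env=os.environ) -> dict:
    """Map provider name -> API key, or None when the variable is unset."""
    return {p["name"]: env.get(p["api_key_env"]) for p in providers}


def missing_keys(providers: list, env=os.environ) -> list:
    """Return the env var names that still need to be set."""
    return [p["api_key_env"] for p in providers if not env.get(p["api_key_env"])]


# example with a fake environment instead of the real one
fake_env = {"OPENROUTER_API_KEY": "sk-or-example"}
print(resolve_api_keys(providers, fake_env))
print(missing_keys(providers, fake_env))
```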

CLI

cupel                                  # open dashboard
cupel run                              # collect responses
cupel run --models "Qwen3.5-27B-8bit"  # specific model
cupel run --prompts 18-22              # specific prompts
cupel judge eval-results/*.json        # score with judge
cupel judge eval-results/*.json --judge-model gemma-4-26b-a4b-it-4bit
cupel init                             # create config.yml + eval-set

development

git clone https://github.com/tolitius/cupel.git && cd cupel
pip install -e .
uvicorn cupel.server:app --reload --port 8042

vanilla JS frontend (Preact + HTM from CDN). no build step.

license

Copyright © 2026 tolitius

Distributed under the Apache 2.0 License.
