LLM benchmark that proves optimization gains — correctness validation + performance measurement for any OpenAI-compatible endpoint
Artemis LLM Benchmark
A Python CLI for correctness validation and performance benchmarking of LLM serving endpoints. Works with any OpenAI-compatible server — vLLM, Ollama, llama.cpp, and more.
Install
pip install artemisllmbench # core
pip install "artemisllmbench[dashboard]" # + Streamlit dashboard
pip install "artemisllmbench[full]" # + dashboard + semantic similarity
`[full]` adds `sentence-transformers` for semantic similarity checks (Layer 3 validity).
What It Does
Artemis answers two questions after you optimize an LLM endpoint:
- Did the optimization preserve correctness? — multi-layer validity checks on every response
- What is the publishable performance number? — reproducible latency, throughput, and goodput metrics
Quick Start
Validate a single endpoint — correctness + performance in one command:
artemisllmbench validate \
--endpoint http://localhost:9000 \
--model Qwen/Qwen2.5-7B-Instruct \
--hardware a100
Compare stock vs optimized — sequential benchmarking with full GPU resources for each:
artemisllmbench compare \
--endpoint-a http://localhost:9000 \
--endpoint-b http://localhost:9001 \
--model Qwen/Qwen2.5-7B-Instruct \
--hardware a100
Split-session compare — when the stock endpoint is already torn down:
# Session 1: save the stock baseline
artemisllmbench baseline --endpoint http://localhost:9000 --model <model> --hardware a100
# Session 2: run the optimized candidate
artemisllmbench candidate --endpoint http://localhost:9000 --model <model> --hardware a100
The dashboard launches automatically after each run. Open it at http://<your-ip>:8501.
Key Features
- Multi-layer validity: sanity, structural, semantic (embedding similarity ≥ 0.92), and exact-match checks catch regressions that latency numbers alone miss
- Reproducible metrics: TTFT, P95/P99 latency, ITL (inter-token latency), throughput, CV, drift, and spike detection
- SLO / goodput tracking: set `--slo-ttft` and `--slo-latency` thresholds; get the % of requests that met them
- Streamlit dashboard: live progress, side-by-side results, analytics charts, and a live response comparison panel
- Fast mode: `--fast` cuts runtime by ~75% for quick iteration checks
- Cross-machine support: endpoints can be on different hosts or different hardware
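To make the metric names above concrete, here is a minimal sketch of how TTFT, mean ITL, tail percentiles, and CV can be derived from raw timestamps. It is not Artemis's own implementation: the function names, the nearest-rank percentile method, and the reading of CV as coefficient of variation are assumptions made for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed method; tools differ here)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_and_itl(request_start, token_times):
    """Time to first token and mean inter-token latency, in seconds.

    token_times holds the arrival time of each streamed token.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = statistics.fmean(gaps) if gaps else 0.0
    return ttft, itl


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    return statistics.stdev(samples) / statistics.fmean(samples)


latencies_ms = list(range(1, 101))           # 1, 2, ..., 100 (made-up data)
print(percentile(latencies_ms, 95))          # 95
print(percentile(latencies_ms, 99))          # 99
print(ttft_and_itl(0.0, [0.5, 0.75, 1.0]))   # (0.5, 0.25)
```

A low CV across repeated runs is one simple signal that a reported number is reproducible rather than a lucky run.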
Common Flags
| Flag | Description |
|---|---|
| `--fast` | Reduced runs (~75% faster). For quick checks only. |
| `--production` | Full 50-run sequential + concurrent load (default). |
| `--slo-ttft <ms>` | TTFT SLO threshold; enables goodput reporting. |
| `--slo-latency <ms>` | End-to-end latency SLO threshold. |
| `--plots` | ASCII charts inline in terminal output. |
| `--live` | Rich terminal live view during concurrent phases. |
| `--no-dashboard` | Skip auto-launching Streamlit. |
| `--port N` | Streamlit port (default: 8501). |
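The goodput figure enabled by the SLO flags is the share of requests that met the thresholds. A minimal sketch of that calculation follows, assuming each request is recorded as a dict with hypothetical `ttft_ms` and `latency_ms` keys and that a request counts only if it meets every threshold that is set. Artemis's actual record format and rules may differ.

```python
def goodput(requests, slo_ttft_ms=None, slo_latency_ms=None):
    """Percentage of requests that met every SLO threshold that is set."""
    if not requests:
        return 0.0
    met = 0
    for r in requests:
        ok_ttft = slo_ttft_ms is None or r["ttft_ms"] <= slo_ttft_ms
        ok_latency = slo_latency_ms is None or r["latency_ms"] <= slo_latency_ms
        if ok_ttft and ok_latency:
            met += 1
    return 100.0 * met / len(requests)


# Made-up measurements for illustration only.
requests = [
    {"ttft_ms": 100, "latency_ms": 900},
    {"ttft_ms": 250, "latency_ms": 900},    # misses the TTFT SLO
    {"ttft_ms": 100, "latency_ms": 2100},   # misses the latency SLO
    {"ttft_ms": 150, "latency_ms": 1000},
]
print(goodput(requests, slo_ttft_ms=200, slo_latency_ms=2000))  # 50.0
print(goodput(requests, slo_ttft_ms=200))                       # 75.0
```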
Validity Layers
| Layer | Check | On failure |
|---|---|---|
| 1 Sanity | Non-empty, complete sentence, token bounds | Hard fail |
| 2 Structural | JSON/Python syntax where required | Hard fail |
| 3 Semantic | Cosine similarity ≥ 0.92 vs. reference | Hard fail / warning |
| 4 Exact match | String equality (`control_prompt_v1` only) | Warning |
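The first three layers can be pictured as a short pipeline that stops at the first hard fail. The sketch below is illustrative only: the function names, the word-count stand-in for token bounds, and the treatment of every layer as a hard fail are assumptions, and embeddings are passed in as plain lists rather than computed with `sentence-transformers`. Only the 0.92 threshold comes from the table above.

```python
import ast
import json
import math

SEMANTIC_THRESHOLD = 0.92  # cosine similarity threshold from the table


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def passes_sanity(text, expect_sentence=True, max_words=512):
    """Layer 1: non-empty, ends like a sentence, within a size bound.

    max_words is a stand-in for real token bounds (assumption).
    """
    stripped = text.strip()
    if not stripped or len(stripped.split()) > max_words:
        return False
    return stripped[-1] in ".!?" if expect_sentence else True


def passes_structural(text, syntax):
    """Layer 2: parse as JSON or Python when the prompt requires it."""
    try:
        if syntax == "json":
            json.loads(text)
        elif syntax == "python":
            ast.parse(text)
    except (ValueError, SyntaxError):
        return False
    return True


def validate(text, embedding, reference_embedding, syntax=None):
    """Return (layer, passed) pairs, stopping at the first failure."""
    checks = [
        ("sanity", lambda: passes_sanity(text, expect_sentence=syntax is None)),
        ("structural", lambda: passes_structural(text, syntax)),
        ("semantic", lambda: cosine_similarity(embedding, reference_embedding)
            >= SEMANTIC_THRESHOLD),
    ]
    results = []
    for name, check in checks:
        ok = check()
        results.append((name, ok))
        if not ok:
            break
    return results


print(validate('{"a": 1', [1.0, 0.0], [1.0, 0.0], syntax="json"))
# [('sanity', True), ('structural', False)]
```

Ordering the cheap checks first means the embedding comparison runs only for responses that are already well-formed.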
Pre-flight Conformance Check
artemisllmbench check-conformance --endpoint http://localhost:9000
Verifies your endpoint speaks the required OpenAI-compatible SSE format before a full benchmark run.
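For a sense of what such a check looks for, the sketch below inspects the lines of a recorded streaming response in the OpenAI chat-completions style: `data:` lines carrying JSON chunks with a `choices` list, terminated by `data: [DONE]`. The function name and the exact set of rules are assumptions for illustration, not the checks that `check-conformance` actually runs.

```python
import json


def find_sse_problems(lines):
    """Return a list of problems found in a recorded SSE response body."""
    problems = []
    chunks = 0
    saw_done = False
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Blank separators, SSE comments, and other SSE fields are skipped.
        if not line or line.startswith((":", "event:", "id:", "retry:")):
            continue
        if not line.startswith("data:"):
            problems.append(f"line {number}: not an SSE data line")
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            continue
        if saw_done:
            problems.append(f"line {number}: data after [DONE]")
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            problems.append(f"line {number}: payload is not valid JSON")
            continue
        choices = chunk.get("choices") if isinstance(chunk, dict) else None
        if not isinstance(choices, list):
            problems.append(f"line {number}: missing 'choices' list")
        elif not all(isinstance(c, dict) and "delta" in c for c in choices):
            problems.append(f"line {number}: choice without 'delta'")
        chunks += 1
    if chunks == 0:
        problems.append("no data chunks received")
    if not saw_done:
        problems.append("missing [DONE] terminator")
    return problems


recorded = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    "data: [DONE]",
]
print(find_sse_problems(recorded))      # []
print(find_sse_problems(recorded[:1]))  # ['missing [DONE] terminator']
```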
Full documentation and source: artemisllmbench --help or artemisllmbench <command> --help
File details
Details for the file artemisllmbench-0.1.0.tar.gz.
File metadata
- Download URL: artemisllmbench-0.1.0.tar.gz
- Upload date:
- Size: 112.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e1e1203f6c9ea7d55e1a60bfea2462a971b1d239be62e3b1c3608759b1d767e |
| MD5 | 20d12ee42a4f6492bfdb3aa1fb9aacc5 |
| BLAKE2b-256 | 87b65c75705e90e8cf589618f6f74a04eeb56293f2c273938659352bf50043dd |
File details
Details for the file artemisllmbench-0.1.0-py3-none-any.whl.
File metadata
- Download URL: artemisllmbench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 118.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b0e634b479182c2a180e22600b8728dbf7dbce5715e868c3bd3605807444ae0 |
| MD5 | df10a4b30e6061fd7ac1107199de12ad |
| BLAKE2b-256 | 376144c6508ed417ccaa7f3c5fa576951873e3c48127b8212a300f5ebc7ae75c |