Skip to main content

LLM benchmark that proves optimization gains — correctness validation + performance measurement for any OpenAI-compatible endpoint

Project description

Artemis LLM Benchmark

A Python CLI for correctness validation and performance benchmarking of LLM serving endpoints. Works with any OpenAI-compatible server — vLLM, Ollama, llama.cpp, and more.


Install

pip install artemisllmbench                    # core
pip install "artemisllmbench[dashboard]"       # + Streamlit dashboard
pip install "artemisllmbench[full]"            # + dashboard + semantic similarity

[full] adds sentence-transformers for semantic similarity checks (Layer 3 validity).


What It Does

Artemis answers two questions after you optimize an LLM endpoint:

  • Did the optimization preserve correctness? — multi-layer validity checks on every response
  • What is the publishable performance number? — reproducible latency, throughput, and goodput metrics

Quick Start

Validate a single endpoint — correctness + performance in one command:

artemisllmbench validate \
  --endpoint http://localhost:9000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100

Compare stock vs optimized — sequential benchmarking with full GPU resources for each:

artemisllmbench compare \
  --endpoint-a http://localhost:9000 \
  --endpoint-b http://localhost:9001 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100

Split-session compare — when the stock endpoint is already torn down:

# Session 1: save the stock baseline
artemisllmbench baseline --endpoint http://localhost:9000 --model <model> --hardware a100

# Session 2: run the optimized candidate
artemisllmbench candidate --endpoint http://localhost:9000 --model <model> --hardware a100

The dashboard launches automatically after each run. Open it at http://<your-ip>:8501.


Key Features

  • Multi-layer validity — sanity, structural, semantic (embedding similarity ≥ 0.92), and exact-match checks catch regressions that latency numbers alone miss
  • Reproducible metrics — TTFT, P95/P99 latency, ITL (inter-token latency), throughput, CV, drift, and spike detection
  • SLO / goodput tracking — set --slo-ttft and --slo-latency thresholds; get the % of requests that met them
  • Streamlit dashboard — live progress, side-by-side results, analytics charts, and a live response comparison panel
  • Fast mode--fast cuts runtime by ~75% for quick iteration checks
  • Cross-machine support — endpoints can be on different hosts or different hardware

Common Flags

Flag Description
--fast Reduced runs (~75% faster). For quick checks only.
--production Full 50-run sequential + concurrent load (default).
--slo-ttft <ms> TTFT SLO threshold — enables goodput reporting.
--slo-latency <ms> End-to-end latency SLO threshold.
--plots ASCII charts inline in terminal output.
--live Rich terminal live view during concurrent phases.
--no-dashboard Skip auto-launching Streamlit.
--port N Streamlit port (default: 8501).

Validity Layers

Layer Check On failure
1 Sanity Non-empty, complete sentence, token bounds Hard fail
2 Structural JSON/Python syntax where required Hard fail
3 Semantic Cosine similarity ≥ 0.92 vs. reference Hard fail / warning
4 Exact match String equality (control_prompt_v1 only) Warning

Pre-flight Conformance Check

artemisllmbench check-conformance --endpoint http://localhost:9000

Verifies your endpoint speaks the required OpenAI-compatible SSE format before a full benchmark run.


Full documentation and source: artemisllmbench --help or artemisllmbench <command> --help

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

artemisllmbench-0.1.0.tar.gz (112.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

artemisllmbench-0.1.0-py3-none-any.whl (118.7 kB view details)

Uploaded Python 3

File details

Details for the file artemisllmbench-0.1.0.tar.gz.

File metadata

  • Download URL: artemisllmbench-0.1.0.tar.gz
  • Upload date:
  • Size: 112.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for artemisllmbench-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9e1e1203f6c9ea7d55e1a60bfea2462a971b1d239be62e3b1c3608759b1d767e
MD5 20d12ee42a4f6492bfdb3aa1fb9aacc5
BLAKE2b-256 87b65c75705e90e8cf589618f6f74a04eeb56293f2c273938659352bf50043dd

See more details on using hashes here.

File details

Details for the file artemisllmbench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: artemisllmbench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 118.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for artemisllmbench-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1b0e634b479182c2a180e22600b8728dbf7dbce5715e868c3bd3605807444ae0
MD5 df10a4b30e6061fd7ac1107199de12ad
BLAKE2b-256 376144c6508ed417ccaa7f3c5fa576951873e3c48127b8212a300f5ebc7ae75c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page