Artemis LLM Benchmark — correctness validation and performance benchmarking for any OpenAI-compatible LLM serving endpoint

These details have not been verified by PyPI

Project links

Bug Tracker

Project description

Artemis LLM Benchmark

A Python CLI for correctness validation and performance benchmarking of LLM serving endpoints. Works with any OpenAI-compatible server — vLLM, Ollama, llama.cpp, and more.

Install

pip install artemisllmbench                    # core
pip install "artemisllmbench[dashboard]"       # + Streamlit dashboard
pip install "artemisllmbench[full]"            # + dashboard + semantic similarity

[full] adds sentence-transformers for semantic similarity checks (Layer 3 validity).

What It Does

Artemis answers two questions after you optimize an LLM endpoint:

Did the optimization preserve correctness? — multi-layer validity checks on every response
What is the publishable performance number? — reproducible latency, throughput, and goodput metrics

Quick Start

Validate a single endpoint — correctness + performance in one command:

artemisllmbench validate \
  --endpoint http://localhost:9000 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100

Compare stock vs optimized — sequential benchmarking with full GPU resources for each:

artemisllmbench compare \
  --endpoint-a http://localhost:9000 \
  --endpoint-b http://localhost:9001 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --hardware a100

Split-session compare — when the stock endpoint is already torn down:

# Session 1: save the stock baseline
artemisllmbench baseline --endpoint http://localhost:9000 --model <model> --hardware a100

# Session 2: run the optimized candidate
artemisllmbench candidate --endpoint http://localhost:9000 --model <model> --hardware a100

The dashboard launches automatically after each run. Open it at http://<your-ip>:8501.

Key Features

Multi-layer validity — sanity, structural, semantic (embedding similarity ≥ 0.92), and exact-match checks catch regressions that latency numbers alone miss
Reproducible metrics — TTFT, P95/P99 latency, ITL (inter-token latency), throughput, CV, drift, and spike detection
SLO / goodput tracking — set --slo-ttft and --slo-latency thresholds; get the % of requests that met them
Streamlit dashboard — live progress, side-by-side results, analytics charts, and a live response comparison panel
Fast mode — --fast cuts runtime by ~75% for quick iteration checks
Cross-machine support — endpoints can be on different hosts or different hardware

Common Flags

Flag	Description
`--fast`	Reduced runs (~75% faster). For quick checks only.
`--production`	Full 50-run sequential + concurrent load (default).
`--slo-ttft <ms>`	TTFT SLO threshold — enables goodput reporting.
`--slo-latency <ms>`	End-to-end latency SLO threshold.
`--plots`	ASCII charts inline in terminal output.
`--live`	Rich terminal live view during concurrent phases.
`--no-dashboard`	Skip auto-launching Streamlit.
`--port N`	Streamlit port (default: 8501).

Validity Layers

Layer	Check	On failure
1 Sanity	Non-empty, complete sentence, token bounds	Hard fail
2 Structural	JSON/Python syntax where required	Hard fail
3 Semantic	Cosine similarity ≥ 0.92 vs. reference	Hard fail / warning
4 Exact match	String equality (`control_prompt_v1` only)	Warning

Pre-flight Conformance Check

artemisllmbench check-conformance --endpoint http://localhost:9000

Verifies your endpoint speaks the required OpenAI-compatible SSE format before a full benchmark run.

Full documentation and source: artemisllmbench --help or artemisllmbench <command> --help

Project details

These details have not been verified by PyPI

Project links

Bug Tracker

Release history Release notifications | RSS feed

0.1.4

Jun 16, 2026

0.1.3

Jun 16, 2026

This version

0.1.2

Jun 16, 2026

0.1.1

Jun 15, 2026

0.1.0

Jun 15, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

artemisllmbench-0.1.2.tar.gz (112.4 kB view details)

Uploaded Jun 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

artemisllmbench-0.1.2-py3-none-any.whl (118.1 kB view details)

Uploaded Jun 16, 2026 Python 3

File details

Details for the file artemisllmbench-0.1.2.tar.gz.

File metadata

Download URL: artemisllmbench-0.1.2.tar.gz
Upload date: Jun 16, 2026
Size: 112.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for artemisllmbench-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`128f5c5ae7c3d81f4872634086903ee01aeb1080839780ecde3709eb2eeeca18`
MD5	`4bd3354fe9ee2140c4ad20e9ec1d14e1`
BLAKE2b-256	`eecf8ade19a9c35a9ca6af8b9eca744114949999d4403c79a1bf7992b543103e`

See more details on using hashes here.

File details

Details for the file artemisllmbench-0.1.2-py3-none-any.whl.

File metadata

Download URL: artemisllmbench-0.1.2-py3-none-any.whl
Upload date: Jun 16, 2026
Size: 118.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for artemisllmbench-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0107216b4cd8c2989450cabf4b1c3889f8561d049209c749d13078e2c715d683`
MD5	`dfcd97a12a871599a53ded2e1a6ed484`
BLAKE2b-256	`b307c3a139c54eb91343b6ac5a2f45e5a7757a0a7ad55afa0031998855d28df4`

See more details on using hashes here.

artemisllmbench 0.1.2

Navigation

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Artemis LLM Benchmark

Install

What It Does

Quick Start

Key Features

Common Flags

Validity Layers

Pre-flight Conformance Check

Project details

Verified details

Project links

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes