LLM benchmarking tools for the LLM CLI
LLM Benchmarking Plugin
This is a plugin for the llm tool that adds a benchmark command to compare the performance of different language models.
The command runs a prompt, with an optional system prompt, against several models and compares their performance.
Installation
You can install the plugin using pip:
pip install llm-profile
or using llm
llm install llm-profile
Metrics
- Total time - The time taken from the request to the end of the final chunk
- Time to First Chunk - The time taken from the request to the first chunk of the response
- Length of Response - The length of the response text
- Number of Chunks - The number of chunks in the response
- Chunks per Second - The number of chunks divided by the total time taken
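For example, a response delivered in 30 chunks over a total time of 7.79 seconds works out to 30 / 7.79 ≈ 3.85 chunks per second.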
Benchmark Usage
To run a benchmark, provide the prompt along with any number of models using their llm aliases (from llm models):
$ llm benchmark -m azure/ant-grok-3-mini -m azure/ants-gpt-4.1-mini -s "Respond in emoji" "Give me a friendly hello message" --markdown
For a single pass (no repeats) you will get a summary table:
| Benchmark | Total Time (s) | Time to First Chunk (s) | Length of Response | Number of Chunks | Chunks per Second |
|---|---|---|---|---|---|
| azure/ant-grok-3-mini | 7.79 | 7.76 | 112 | 30 | 3.85 |
| azure/ants-gpt-4.1-mini | 2.99 | 2.80 | 78 | 19 | 6.36 |
To repeat each benchmark and get an average of times, use the --repeat argument:
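For example (the model aliases and repeat count here are illustrative; --repeat combines with the other flags shown above):
$ llm benchmark -m azure/ant-grok-3-mini -m azure/ants-gpt-4.1-mini -s "Respond in emoji" "Give me a friendly hello message" --repeat 5 --markdown
Each metric is then reported as a range across the repeats: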
| Benchmark | Total Time (s) | Time to First Chunk (s) | Length of Response | Number of Chunks | Chunks per Second |
|---|---|---|---|---|---|
| azure/ant-grok-3-mini | 2.59 <-> 8.39 (x̄=5.49) | 2.57 <-> 8.36 (x̄=5.47) | 65 <-> 109 (x̄=87.00) | 18 <-> 30 (x̄=24.00) | 2.15 <-> 11.58 (x̄=6.86) |
| azure/ants-gpt-4.1-mini | 0.54 <-> 2.88 (x̄=1.71) | 0.26 <-> 2.69 (x̄=1.47) | 76 <-> 78 (x̄=77.00) | 19 <-> 19 (x̄=19.00) | 6.60 <-> 35.17 (x̄=20.89) |
Each cell shows a range: min <-> max (x̄ = mean).
Providing options
You can provide key/value options for all models using the --option flag. This can be useful for setting parameters like temperature, max tokens, etc.
Example:
$ llm benchmark -m gpt-4.1-mini -m gpt-4.1-nano --option temperature 0.7 --option max_tokens 100 "Give me a friendly hello message"
This is also helpful for setting the seed option for reproducibility, so that variance in time to first chunk and total time can be isolated while the prompt and result stay the same.
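For example, a sketch assuming the chosen models accept a seed option (OpenAI-style models do; the values are illustrative):
$ llm benchmark -m gpt-4.1-mini -m gpt-4.1-nano --option seed 1234 --option temperature 0 "Give me a friendly hello message"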
Markdown formatted results
By default, tables are printed in color, highlighting the fastest and slowest value for each metric.
If you want to customize the output, you can use the --markdown flag to get the results in a Markdown-friendly format.
Non-Streaming models
If you want to benchmark models that do not support streaming, you can use the --no-stream flag. This will disable streaming and provide a single response time.
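For example (model alias illustrative):
$ llm benchmark -m azure/ants-gpt-4.1-mini --no-stream "Give me a friendly hello message"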
Warmup
By default, the first test is run twice and only the second run's timing data is used. This gives the model a chance to set up authentication or anything else that would otherwise skew the performance metrics.
You can disable this behavior by using the --no-repeat-first flag.
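For example:
$ llm benchmark -m azure/ants-gpt-4.1-mini --no-repeat-first "Give me a friendly hello message"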
Graphs
The benchmark tool can also produce a PNG graph of the results.
To get a graph, add --graph file.png, where file.png is the path to write the graph to. You will need to install matplotlib to generate the graph:
$ pip install llm-profile[graph]
matplotlib isn't installed by default, to keep this plugin's dependencies small.
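For example (output filename illustrative):
$ llm benchmark -m azure/ant-grok-3-mini -m azure/ants-gpt-4.1-mini --repeat 3 --graph benchmark.png "Give me a friendly hello message"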
Saving results
You can save the benchmark results, including the response text for each run, to a YAML file using the --output option:
$ llm benchmark --output results.yaml
All of the timing data for each request as well as the text response will be saved in the YAML file:
GPT-5-chat (Azerbaijan):
- chunks_per_sec: 19.549590468140675
  length_of_response: 34
  model_name: GPT-5-chat (Azerbaijan)
  n_chunks: 9
  text: The capital of Azerbaijan is Baku.
  time_to_first_chunk: 0.36127520000445656
  total_time: 0.46036770001228433
- chunks_per_sec: 20.64352283058553
  length_of_response: 34
  model_name: GPT-5-chat (Azerbaijan)
  n_chunks: 9
  text: The capital of Azerbaijan is Baku.
  time_to_first_chunk: 0.3545127999968827
  total_time: 0.43597210000734776
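The file is plain YAML keyed by benchmark name, so it can be post-processed with any YAML library. A minimal sketch using PyYAML (not installed by this plugin; field names taken from the example above):

import yaml  # PyYAML, install separately with: pip install pyyaml

with open("results.yaml") as f:
    results = yaml.safe_load(f)  # mapping of benchmark name -> list of runs

for name, runs in results.items():
    # Average the total_time field across all repeats for each benchmark
    mean_total = sum(run["total_time"] for run in runs) / len(runs)
    print(f"{name}: mean total time {mean_total:.2f}s over {len(runs)} run(s)")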
Test Plans
Instead of providing tests and scenarios on the command line, you can provide a YAML file describing the benchmark plan:
$ llm benchmark --plan plan.yaml
To get a printout of exactly what will be tested, use the --verbose option.
A single-plan YAML file (one plan mapping):
name: Capital Cities                    # required
repeat: 3                               # optional, defaults to 1
system: "You are a helpful assistant."  # optional
prompt: "Answer concisely."             # optional, defaults to model.prompt
options:                                # optional, must be options which apply to all models
  max_output_tokens: 50
models:
  - name: GPT-5-mini (Azerbaijan)       # required, used for graphs and tables
    model: azure/gpt-5-mini             # required, llm ID from `llm models`
    prompt: "What is the capital of Azerbaijan?"  # optional, defaults to plan prompt
  # Example of second model test in plan
  - name: GPT-5-mini (France, short)
    model: azure/gpt-5-mini
    prompt: "What is the capital of France?"
    options:
      max_output_tokens: 10             # overrides plan options by key
Field reference
- `name` (string, required): Plan label used in output.
- `models` (list of mappings, required): Each entry is a `TestModel` with the fields below.
- `repeat` (int, optional): Number of times to repeat the plan (default: 1).
- `system` (string, optional): Global system prompt for all models unless `model.system` is set.
- `prompt` (string, optional): Global prompt for all models unless `model.prompt` is set.
- `options` (mapping, optional): Plan-level option defaults (merged with each model's `options`).
model entries (per item in models):
- `name` (string, required): Label shown in tables/plots.
- `model` (string, required): Model identifier passed to `llm.get_model(...)`.
- `system` (string, optional): Model-specific system prompt (overrides plan-level `system`).
- `prompt` (string, optional): Model-specific prompt (overrides plan-level `prompt`).
- `options` (mapping, optional): Model-specific options (override plan-level `options`).
Behavior notes
- Per-model `options` override plan-level `options`. The code merges plan options, then model options, before prompting.
- If a model doesn't set `prompt`/`system`, the plan-level `prompt`/`system` are used.
- Option keys and types are validated at runtime against the model's `Options` schema (see `build_options` / `model.Options`).
Example 1: Comparing prompts
This plan tests the same model with different prompts:
name: Prompt Comparison
models:
  - name: Capital of Azerbaijan
    model: azure/gpt-5-chat
    prompt: "What is the capital of Azerbaijan?"
  - name: Capital of France
    model: azure/gpt-5-chat
    prompt: "What is the capital of France?"
repeat: 5
Example 2: Several Models, one with overrides on options
name: Capital Cities
models:
  - name: GPT-5-chat (Azerbaijan)
    model: azure/gpt-5-chat
    prompt: "What is the capital of Azerbaijan?"
  - name: GPT-5-chat (France)
    model: azure/gpt-5-chat
    prompt: "What is the capital of France?"
  - name: GPT-5-chat (one word)
    model: azure/gpt-5-chat
    system: "Only respond with 1 word answers"
    prompt: "What is the capital of France?"
  - name: GPT-5-chat (low temp)
    model: azure/gpt-5-chat
    prompt: "What is the capital of France?"
    options:
      temperature: 0.5
system: "You are a helpful little robot"
repeat: 3
options:
  temperature: 0.9
Embedding Benchmarks
As well as language models, you can benchmark embedding models using a similar approach. The main difference is in the input data and the model selection.
To see available embedding models, use the llm embed-models command.
Pass models from this list to the llm embed-benchmark command like so:
$ llm embed-benchmark "I'm on the red eye flight to nowhere. How about you?" -m azure/text-embedding-3-small-512 -m azure/text-embedding-3-small -m azure/text-embedding-ada-002 -m nomic-embed-text:latest --repeat 2 --markdown --graph embed-graph.png
This provides a comparison table:
| Benchmark | Total Time (s) |
|---|---|
| azure/text-embedding-3-small-512 | 0.26 <-> 0.93 (x̄=0.59) |
| azure/text-embedding-3-small | 0.91 <-> 3.45 (x̄=2.18) |
| azure/text-embedding-ada-002 | 0.92 <-> 3.49 (x̄=2.20) |
| nomic-embed-text:latest | 1.03 <-> 1.15 (x̄=1.09) |
As well as a graph:
Embedding Model Plans
You can provide a YAML file with the benchmark configuration for embedding models in a similar way to language models:
name: Embedding Benchmark
data: "The input to the embedding model"
models:
  - name: Azure Text Embedding 3 Small 512
    model: azure/text-embedding-3-small-512
  - name: Azure Text Embedding 3 Small
    model: azure/text-embedding-3-small
  - name: Azure Text Embedding ADA 002
    model: azure/text-embedding-ada-002
  - name: Nomic Embed Text Latest
    model: nomic-embed-text:latest
repeat: 2