vLLM performance testing actuator
This repository contains the vLLM ado actuator for benchmarking LLM inference
performance with vLLM. (For more about actuators, what they represent, how to
create them, etc., see the ado docs.)
The actuator deploys and runs serving benchmarks for different LLMs served by
vLLM. It deploys vLLM onto an OpenShift cluster to serve IBM Granite-3.3-8b
and runs an experiment that utilises the vLLM serving benchmark.
The actuator is named vllm_performance and provides two experiments:
performance-testing-full and performance-testing-endpoint.
Getting Started
This guide has two parts:
- Getting Started
- Exploring Further
After running the exercise, please feel free to explore further and try a larger experiment.
[!NOTE]
The following prerequisites must be fulfilled before you start using this actuator:
- Access to an OpenShift cluster with at least one node with one available NVIDIA GPU. You will need access to a namespace with permissions for GPU-based deployments.
- ado downloaded and installed according to this guide.
Installing and configuring the vLLM actuator
Installation
If you have not already done so, clone the ado repository:
git clone https://github.com/IBM/ado.git
Ensure the virtual environment you installed ado into is active. Then, from the
root of the ado source repository, run:
pip install -e plugins/actuators/vllm_performance
Confirm that the actuator is installed:
ado get actuators --details
You should see an output like below:
```
  ACTUATOR ID       CATALOG ID        EXPERIMENT ID                 SUPPORTED
0 mock              mock              test-experiment               True
1 mock              mock              test-experiment-two           True
2 vllm_performance  vllm_performance  performance-testing-full      True
3 vllm_performance  vllm_performance  performance-testing-endpoint  True
```
The last two lines show the new actuator and its experiments. To see the constitutive properties required by an experiment, and the target and observed properties it measures, run:
ado describe experiment performance-testing-full
The experiment protocol for the vLLM actuator is defined in this YAML file. You will need to update it if you want to change which values are accepted as valid for the input properties.
Configuring the actuator
Before using the vLLM actuator to execute experiments, you must configure its parameters. First, get the template for the configuration:
ado template actuatorconfiguration --actuator-identifier vllm_performance \
-o actuatorconfiguration.yaml
This will create the actuatorconfiguration.yaml file, which will look like:
```yaml
actuatorIdentifier: vllm_performance
metadata:
  description: null
  labels: null
  name: null
parameters:
  benchmark_retries: 3
  deployment_template: deployment.yaml
  hf_token: ""
  image_secret: ""
  in_cluster: true
  interpreter: python3
  max_environments: 1
  namespace: null
  node_selector: ""
  pvc_template: pvc.yaml
  retries_timeout: 5
  service_template: service.yaml
  verify_ssl: false
```
The three key parameters we have to set here are hf_token, namespace, and
node_selector.
- hf_token: Access token from HuggingFace.
- namespace: The namespace you have access to in your OpenShift cluster.
- node_selector: A JSON dictionary representing a Kubernetes selector for a node with available GPUs. Make sure it is formatted correctly, for example: node_selector: '{"kubernetes.io/hostname":"cpu16"}'
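A malformed node_selector is an easy configuration error to make. As a minimal sanity check, independent of the actuator itself, you can confirm the value parses as a JSON dictionary before pasting it into the configuration:

```python
import json

# The node_selector value must itself be valid JSON; a common failure
# mode is using single quotes inside the JSON rather than around it.
node_selector = '{"kubernetes.io/hostname":"cpu16"}'

parsed = json.loads(node_selector)  # raises json.JSONDecodeError if malformed
assert isinstance(parsed, dict)
print(parsed["kubernetes.io/hostname"])
```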
We will discuss the other parameters later. Once you have filled in the parameters, create the actuator configuration with:
ado create actuatorconfiguration -f actuatorconfiguration.yaml
Note: You can have multiple configurations for an actuator.
A Simple Benchmarking Exercise
To get started, we have provided an exercise to run a benchmarking experiment
for a single vLLM deployment configuration. The instructions for this exercise
assume you are running ado from a machine outside of the target
Kubernetes/OpenShift cluster.
Creating a Discovery Space to describe the vLLM configurations to test
[!NOTE]
Since this is an example exercise, we will use the local context and the default sample store.
Activating the local context
To ensure the local context is active, run:
ado context local
Defining a Discovery Space of vLLM configurations
ado uses the concept of
Discovery Spaces to
describe what to test (in this case vLLM workload configurations) and how to
test them (the vLLM benchmark(s) to run).
The set of configurations to test is defined by the entity space, and the set of experiments to perform by the measurement space.
An example discoveryspace for vLLM inference benchmarking can be found in
yamls/discoveryspace_override_defaults.yaml.
This defines a simple discovery space with a single entity.
Our sample space will benchmark vLLM serving the LLM specified by model_name,
on a node (determined through node_selector) with a specific GPU
(NVIDIA-A100-80GB-PCIe) specified in gpu_type.
[!NOTE]
Ensure that the GPU specified in gpu_type is present on the node. To find out the GPU model of your selected node, try the following command:
oc describe node <node name> | grep "nvidia.com/gpu.product"
If this returns a different GPU model, then you must update the experiment protocol.
Create the discoveryspace:
ado create space -f yamls/discoveryspace_override_defaults.yaml \
--use-default-sample-store
Querying the Discovery Space
Before we run any experiment, we can see that the discoveryspace is empty:
ado show entities space --use-latest
Will output:
Nothing was returned for entity type matching and property format observed in space space-c81773-df57a3.
To see all the entities (parameter combinations) that are waiting to be measured, try executing:
ado show entities space --include missing --use-latest
The output will look like:
```
  model                                image                                           n_cpus  memory  dtype  num_prompts  request_rate  max_concurrency  gpu_memory_utilization  cpu_offload  max_batch_tokens  max_num_seq  n_gpus  gpu_type
0 ibm-granite/granite-3.3-8b-instruct  quay.io/dataprep1/data-prep-kit/vllm_image:0.1  8.0     128Gi   auto   500.0        -1.0         -1.0             0.9                     0.0          16384.0           256.0        1.0     NVIDIA-A100-80GB-PCIe
```
Which is the entity we want to measure.
Exploring the vLLM workload configuration space
First, log in to your OpenShift cluster and select your assigned namespace
oc login <your OpenShift API endpoint>
oc project <your assigned namespace>
Next, we'll set up the operation to measure our entity defined above.
In ado parlance, measurements are executed through operations which
represent the executions of experiments on entities.
An example of an operation can be found in
yamls/random_walk_operation.yaml.
You can run the operation using the actuator configuration and space that we
have created earlier with:
ado create operation -f yamls/random_walk_operation.yaml \
--use-latest space --use-latest actuatorconfiguration
ado will initialise a local Ray cluster and start the measurement; you will see output like:
```
...
=========== Starting Discovery Operation ===========
(RandomWalk pid=2780) 'all' specified for number of entities to sample. This is 1 entities - the size of the entity space
...
```
The actuator uses the entity to create a vLLM deployment and then executes the benchmark script. This process will take some time, as it involves downloading the container image from Quay and the model from HuggingFace, both of which are network-intensive. You can monitor whether the deployment is ready by executing the following in another shell:
oc get deployments --watch
The experiment is successfully completed if the ado output is similar to the
following:
```
(RandomWalk pid=46852) Continuous Batching: EXPERIMENT COMPLETION. Received finished notification for experiment in measurement request in group 0: request-4332aa-experiment-performance-testing-entities-model.ibm-granite/granite-3.3-8b-instruct-image.quay.io/dataprep1/data-prep-kit/vllm_image:0.1-n_cpus.8-memory.128Gi-dtype.auto-num_prompts.500-request_rate.-1-max_concurrency.-1-gpu_memory_utilization.0.9-cpu_offload.0-max_batch_tokens.16384-max_num_seq.256-n_gpus.1-gpu_type.NVIDIA-A100-80GB-PCIe (explicit_grid_sample_generator)-requester-randomwalk-0.9.7.dev10+b7a010dd.dirty-42ad60-time-2025-08-11 15:53:54.137571+01:00
(RandomWalk pid=46852) Continuous batching: GET EXPERIMENT. No new experiments in queue. Requests made: 1. Experiments Completed: 1
```
If the output contains EXPERIMENT FAILURE, then something has gone wrong.
Verify that the entity has been measured by running:
ado show entities space --use-latest --output-format csv
The CSV file will have one line representing the entity, with values for all
its measured properties (performance-testing-output_throughput,
performance-testing-total_token_throughput, performance-testing-mean_ttft_ms,
etc.)
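As a sketch, the exported CSV can also be inspected programmatically. The sample data below is fabricated for illustration; only the "performance-testing-<metric>" column-name pattern comes from the actual output:

```python
import csv
import io

# Fabricated sample mirroring the exported CSV's shape:
# one row per entity, measured properties in "performance-testing-*" columns.
sample = io.StringIO(
    "model,performance-testing-output_throughput,performance-testing-mean_ttft_ms\n"
    "ibm-granite/granite-3.3-8b-instruct,123.4,56.7\n"
)

rows = list(csv.DictReader(sample))
# Pull one measured property out of the entity's row
throughput = float(rows[0]["performance-testing-output_throughput"])
```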
Congratulations! You have successfully executed the vLLM benchmark on a vLLM
workload configuration using ado!
Exploring Further
vLLM testing approach
The vLLM testing implementation is based on this
guide, which uses
benchmark_serving.py
to implement the actual benchmarking. The benchmarking is performed by sending
HTTP requests to the vLLM OpenAI-compatible API server.
To use this approach it is necessary to:
- Create a Docker image: existing Docker images for the vLLM project are not directly suitable for this purpose, as they are hard to use on OpenShift clusters and not directly extensible. We have provided a Docker image to get started, but if you want to customise it for your installation you will need to rebuild it. We provide a slightly different build, described here
- Create automation for deploying vLLM to run the experiments. A simple implementation of such automation is presented here
- Create a vLLM performance test. Here we directly reuse the performance test provided by the vLLM project. The required code is here
This figure shows the outline of the components and the parameters available for configuring each of them
The test results in the figure are the measurements recorded for the entity. The deployment parameters form the configuration space. Test parameters are partially inferred from the configuration space and partially from the context (Kubernetes endpoints, etc.)
The Actuator Package: Key Files
The actuator package is under ado_actuators/vllm_performance. Note that all
actuator packages should be placed under a directory called ado_actuators, as
this is the name of the package that contains all ado plugins.
The key files are:
- actuator_definitions.yaml
  - This defines which classes in which modules of your package contain actuators.
- actuators.py
  - Implementation of the actuator logic.
  - The name must match the one declared in actuator_definitions.yaml.
- experiments.yaml
  - This file contains the definitions, as YAML, of the experiments the actuator provides.
- experiment_executor.py (OPTIONAL)
  - This file contains the code that:
    - determines the values for the experiment parameters from the passed Entity and Experiment
    - executes the experiment and obtains the measured property values
    - sends the measured property values back to the orchestrator
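The three executor responsibilities can be sketched as follows. The function name, argument types and return shape are purely illustrative assumptions; they do not mirror the actuator's actual code:

```python
# Hypothetical sketch of an experiment executor's three responsibilities.
# All names here are illustrative, not the actuator's real API.

def run_experiment(entity: dict, experiment_defaults: dict) -> dict:
    # 1. Determine experiment parameter values: values from the Entity
    #    override the defaults defined by the Experiment
    parameters = {**experiment_defaults, **entity}

    # 2. Execute the experiment and obtain measured property values
    #    (stubbed here; the real executor deploys vLLM and runs the benchmark)
    measurements = {"output_throughput": 0.0}

    # 3. Return the measured values so they can be sent to the orchestrator
    return {"parameters": parameters, "measurements": measurements}
```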
Customising Actuator Configurations
The actuator is configured using the VLLMPerformanceTestParameters class.
You can customise deployment_template, service_template and pvc_template
for your OpenShift/K8s cluster. Refer to the
default YAMLs for the
templates referred to in Configuring the actuator
and modify them appropriately.
If you create a custom Docker image and upload it to a repository, please do not
forget to create a corresponding Image pull secret in your assigned namespace.
You must also update the value of the image_secret parameter of the actuator
configuration.
Customising Experiment Protocol
The values for the parameters in the entity space must be a subset of the acceptable values defined for the experiment (the experiment protocol). Therefore, depending on your environment and use case, you may need to update the set of values to expand the configuration space being studied.
For example, you may want to benchmark a different LLM or you may want to change
the GPU type to the one installed in your cluster. In the former case, you will
add values to model_name and in the latter case, you will have to modify the
domain of the gpu_type parameter to avoid validation errors.
To do this, open the
experiment definition YAML file
in a text editor, and add your GPU model to the list of values of gpu_type.
Then, reinstall this actuator by running:
pip install .
After that, you can use the new value of gpu_type in your experiments. For
example, in
the sample space definition file,
the location to update will be:
```yaml
- identifier: "gpu_type"
  propertyDomain:
    values: ["NVIDIA-A100-80GB-PCIe"]
```
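For instance, assuming your nodes carried NVIDIA-L40S GPUs (an illustrative value; substitute the model your cluster reports), the updated domain would read:

```yaml
- identifier: "gpu_type"
  propertyDomain:
    values: ["NVIDIA-A100-80GB-PCIe", "NVIDIA-L40S"]
```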
Notes on the Random walk operation
vLLM testing uses an external environment (a deployment plus a service) to run tests. Creating such an environment is resource-intensive, so to speed up experiment execution it is recommended to use group samplers. This allows an environment to be created once and reused by all experiments that can run against it. In this case the group definition looks as follows:
```yaml
grouping:
  - model
  - image
  - n_gpus
  - gpu_type
  - n_cpus
  - memory
  - max_batch_tokens
  - gpu_memory_utilization
  - dtype
  - cpu_offload
  - max_num_seq
```
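The idea behind grouping can be sketched in a few lines of Python: entities that agree on every deployment-defining key fall into the same group and can share one vLLM environment. This is an illustration of the concept, not the actuator's implementation, and it uses a shortened key list and made-up entity values:

```python
# Illustrative sketch: entities sharing all deployment-defining keys
# fall into one group, so a single vLLM deployment can serve them all.
GROUPING_KEYS = ("model", "image", "n_gpus", "gpu_type")

def group_entities(entities):
    groups = {}
    for entity in entities:
        key = tuple(entity[k] for k in GROUPING_KEYS)
        groups.setdefault(key, []).append(entity)
    return groups

# Two benchmark configurations that differ only in num_prompts,
# a parameter that does not affect the deployment itself
entities = [
    {"model": "granite-3.3-8b", "image": "vllm:0.1", "n_gpus": 1,
     "gpu_type": "NVIDIA-A100-80GB-PCIe", "num_prompts": 100},
    {"model": "granite-3.3-8b", "image": "vllm:0.1", "n_gpus": 1,
     "gpu_type": "NVIDIA-A100-80GB-PCIe", "num_prompts": 500},
]
groups = group_entities(entities)
# One group -> one deployment, two benchmark runs against it
```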
For a complete example of configuring a random walk operation with group samplers, look here
A few ideas for further exploration
Try:
- Testing throughput for different sequence lengths for multiple models using
this actuator (see
discoveryspace_override_defaults_small.yaml
for an example with multiple values for
max_batch_tokens)