vLLM performance testing actuator

This repository contains the vLLM ado actuator for benchmarking LLM inference performance with vLLM. (For more about Actuators, what they represent, how to create them etc., see the ado docs).

The actuator implements a set of functionalities to deploy and run serving benchmarks for different LLMs with vLLM. It deploys vLLM onto an OpenShift cluster to serve IBM Granite-3.3-8b and runs an experiment that utilises the vLLM serving benchmark. The actuator is named vllm_performance and provides experiments for benchmarking both a full vLLM deployment and an existing endpoint (see Installation below for the experiment identifiers).

Getting Started

This guide has two parts:

  1. Installing and configuring the vLLM actuator
  2. A Simple Benchmarking Exercise

After running the exercise, please feel free to explore further and try a larger experiment.

[!NOTE]

These prerequisites must be fulfilled before you start using this actuator:

  1. Access to an OpenShift cluster with at least 1 node with 1 available NVIDIA GPU. You will need access to a namespace with permissions for GPU-based deployments.
  2. You will need to have downloaded and installed ado according to this guide.

Installing and configuring the vLLM actuator

Installation

Ensure the virtual environment you installed ado into is active. Then, run:

pip install ado-vllm-performance

This will automatically install both the vLLM and GuideLLM benchmarking tools, enabling all experiments:

  • test-deployment-v1 and test-endpoint-v1 (vLLM benchmarks)
  • test-deployment-guidellm-v1 and test-endpoint-guidellm-v1 (GuideLLM benchmarks)

For development from source:

pip install -e plugins/actuators/vllm_performance

from the root of the ado source repository. You can clone the repository with:

git clone https://github.com/IBM/ado.git

Confirm that the actuator is installed:

ado get actuators --details

You should see output like the following:

        ACTUATOR ID        CATALOG ID                 EXPERIMENT ID  SUPPORTED
0              mock              mock               test-experiment       True
1              mock              mock           test-experiment-two       True
2  vllm_performance  vllm_performance            test-deployment-v1       True
3  vllm_performance  vllm_performance              test-endpoint-v1       True

The last two lines show the new actuator and its experiments. You can see the constitutive properties required for an experiment, and the target and observed properties it measures, by running:

ado describe experiment test-deployment-v1

The experiment protocol for the vLLM actuator is defined in this YAML file. You will need to update this file if you want to change which values are accepted as valid for the input properties.

Configuring the actuator

Before using the vLLM actuator to execute experiments, you must configure its parameters. First, get the template for the configuration:

ado template actuatorconfiguration --actuator-identifier vllm_performance \
                                   -o actuatorconfiguration.yaml

This will create the actuatorconfiguration.yaml file, which will look like:

actuatorIdentifier: vllm_performance
metadata:
  description: null
  labels: null
  name: null
parameters:
  benchmark_retries: 3
  hf_token: ""
  image_secret: ""
  in_cluster: true
  interpreter: python3
  max_environments: 1
  namespace: null
  node_selector: ""
  retries_timeout: 5
  verify_ssl: false

The three key parameters we have to set here are hf_token, namespace, and node_selector.

  • hf_token: Access token from HuggingFace.

  • namespace: The namespace you have access to in your OpenShift cluster.

  • node_selector: JSON dictionary representing a Kubernetes selector for a node with available GPUs. Make sure it is formatted correctly, for example:

    node_selector: '{"kubernetes.io/hostname":"cpu16"}'
    

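For illustration, after editing, the parameters section might look like the excerpt below. All values shown are placeholders; substitute your own token, namespace, and node label, and leave the remaining parameters at their defaults:

parameters:
  # ... other parameters left at their defaults ...
  hf_token: "hf_XXXXXXXXXXXXXXXXXXXX"                       # placeholder HuggingFace access token
  namespace: "my-gpu-namespace"                             # placeholder OpenShift namespace
  node_selector: '{"kubernetes.io/hostname":"gpu-node-1"}'  # placeholder selector for a node with free GPUs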
We will discuss the other parameters later. Once you have filled in the parameters, create the actuator configuration with:

ado create actuatorconfiguration -f actuatorconfiguration.yaml

Note: You can have multiple configurations for an actuator.

A Simple Benchmarking Exercise

To get started, we have provided an exercise to run a benchmarking experiment for a single vLLM deployment configuration. The instructions for this exercise assume you are running ado from a machine outside of the target Kubernetes/OpenShift cluster.

Creating a Discovery Space to describe the vLLM configurations to test

[!NOTE]

Since this is an example exercise, we will use the local context and the default sample store.

Activating the local context

To ensure the local context is active, run:

ado context local

Defining a Discovery Space of vLLM configurations

ado uses the concept of Discovery Spaces to describe what to test (in this case vLLM workload configurations) and how to test them (the vLLM benchmark(s) to run).

The set of configurations to test is defined by the entity space, and the set of experiments to perform by the measurement space.

An example discoveryspace for vLLM inference benchmarking can be found in yamls/discoveryspace_override_defaults.yaml. This defines a simple discovery space with a single entity.

Our sample space will benchmark vLLM serving the LLM specified by model_name, on a node (determined through node_selector) with a specific GPU (NVIDIA-A100-80GB-PCIe) specified in gpu_type.
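For illustration only, the corresponding entries in the entity space take roughly the form below; the complete, authoritative definition (including the remaining properties and their defaults) is in yamls/discoveryspace_override_defaults.yaml:

# Illustrative fragment only -- see yamls/discoveryspace_override_defaults.yaml
# for the full space definition.
- identifier: "model"
  propertyDomain:
    values: ["ibm-granite/granite-3.3-8b-instruct"]
- identifier: "gpu_type"
  propertyDomain:
    values: ["NVIDIA-A100-80GB-PCIe"]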

[!NOTE]

Ensure that the GPU specified in gpu_type is present on the node. To find out the GPU model of your selected node, try the following command:

oc describe node <node name> | grep "nvidia.com/gpu.product"

If this returns a different GPU model, then you must update the experiment protocol (see Customising Experiment Protocol below).

Create the discoveryspace:

ado create space -f yamls/discoveryspace_override_defaults.yaml \
                 --use-default-sample-store

Querying the Discovery Space

Before we run any experiment, we can see that the discoveryspace is empty:

ado show entities space --use-latest

Will output:

Nothing was returned for entity type matching and property format observed in space space-c81773-df57a3.

To see all the entities (parameter combinations) that are waiting to be measured, try executing:

ado show entities space --include missing --use-latest

The output will show a single entity with the following property values:

   model                   ibm-granite/granite-3.3-8b-instruct
   image                   quay.io/dataprep1/data-prep-kit/vllm_image:0.1
   n_cpus                  8.0
   memory                  128Gi
   dtype                   auto
   num_prompts             500.0
   request_rate            -1.0
   max_concurrency         -1.0
   gpu_memory_utilization  0.9
   cpu_offload             0.0
   max_batch_tokens        16384.0
   max_num_seq             256.0
   n_gpus                  1.0
   gpu_type                NVIDIA-A100-80GB-PCIe

This is the entity we want to measure.

Exploring the vLLM workload configuration space

First, log in to your OpenShift cluster and select your assigned namespace:

oc login <your OpenShift API endpoint>
oc project <your assigned namespace>

Next, we'll set up the operation to measure our entity defined above.

In ado parlance, measurements are executed through operations, which represent the execution of experiments on entities.

An example of an operation can be found in yamls/random_walk_operation.yaml. You can run the operation using the actuator configuration and space that we created earlier with:

ado create operation -f yamls/random_walk_operation.yaml \
                     --use-latest space --use-latest actuatorconfiguration

ado will initialise a local Ray cluster and start the measurement; the measurement has started once lines like the following appear:

...
=========== Starting Discovery Operation ===========

(RandomWalk pid=2780) 'all' specified for number of entities to sample. This is 1 entities - the size of the entity space
...

The actuator uses the entity to create a vLLM deployment and then executes the benchmark script. This process will take some time as it involves downloading the container image from Quay and the model from HuggingFace, both of which are network-intensive. You can monitor whether the deployment is ready by executing the following in another shell:

oc get deployments --watch

The experiment is successfully completed if the ado output is similar to the following:

(RandomWalk pid=46852) Continuous Batching: EXPERIMENT COMPLETION. Received finished notification for experiment in measurement request in group 0: request-4332aa-experiment-performance-testing-entities-model.ibm-granite/granite-3.3-8b-instruct-image.quay.io/dataprep1/data-prep-kit/vllm_image:0.1-n_cpus.8-memory.128Gi-dtype.auto-num_prompts.500-request_rate.-1-max_concurrency.-1-gpu_memory_utilization.0.9-cpu_offload.0-max_batch_tokens.16384-max_num_seq.256-n_gpus.1-gpu_type.NVIDIA-A100-80GB-PCIe (explicit_grid_sample_generator)-requester-randomwalk-0.9.7.dev10+b7a010dd.dirty-42ad60-time-2025-08-11 15:53:54.137571+01:00
(RandomWalk pid=46852) Continuous batching: GET EXPERIMENT. No new experiments in queue. Requests made: 1. Experiments Completed: 1

If the output contains EXPERIMENT FAILURE, then something has gone wrong.

Verify that the entity has been measured by running:

ado show entities space --use-latest --output-format csv

The CSV output will have one line representing the entity, with values for all its measured properties (performance-testing-output_throughput, performance-testing-total_token_throughput, performance-testing-mean_ttft_ms, etc.).

Congratulations! You have successfully executed the vLLM benchmark on a vLLM workload configuration using ado!

Exploring Further

vLLM testing approach

The vLLM testing implementation is based on this guide, which uses benchmark_serving.py to implement the actual benchmarking. The benchmarking is done by sending HTTP requests to the vLLM OpenAI-compatible API server.

To use this approach it is necessary to:

  • Create a Docker image: the existing Docker images for the vLLM project are not directly suitable for this purpose, as they are hard to use on OpenShift clusters and not directly extensible. We have provided a Docker image to get started, but if you want to customize it for your installation you will need to rebuild it. We provide a slightly different build, described here
  • Create automation for deploying vLLM to run experiments. A simple implementation of such automation is presented here
  • Create a vLLM performance test. Here we directly reuse the performance test provided by the vLLM project. The required code is here

This figure shows the outline of the components and the parameters available for configuring each of them:

[Figure: vLLM_testing]

The test results in the figure are the measurements recorded for the entity. The deployment parameters form the configuration space. Test parameters are partially inferred from the configuration space and partially from the context (Kubernetes endpoints, etc.).

The Actuator Package: Key Files

The actuator package is under ado_actuators/vllm_performance. Note that all actuator packages should be placed under a directory called ado_actuators, as this is the name of the package that contains all ado plugins.

The key files are:

  • actuator_definitions.yaml
    • This defines which classes in which modules of your package contain Actuators.
  • actuators.py
    • Implementation of the actuator logic.
    • The file just needs to have the same name as the module referenced in actuator_definitions.yaml
  • experiments.yaml
    • This file contains the YAML definitions of the experiments the actuator provides
  • experiment_executor.py (OPTIONAL)
    • This file contains the code that
      • determines the values for the experiment parameters from the passed Entity and Experiment
      • executes the experiment and gets the measured property values
      • sends the measured property values back to the orchestrator

Customising Actuator Configurations

The actuator is configured using the VLLMPerformanceTestParameters class.

You can customise deployment_template, service_template and pvc_template for your OpenShift/K8s cluster. Refer to the default YAMLs for the templates referred to in Configuring the actuator and modify them appropriately.

If you create a custom Docker image and upload it to a repository, please do not forget to create a corresponding image pull secret in your assigned namespace. You must also update the value of the image_secret parameter of the actuator configuration.

Customising Experiment Protocol

The values for the parameters in the entity space must be a subset of the acceptable values defined for the experiment (the experiment protocol). Therefore, depending on your environment and use case, you may need to update the set of values to expand the configuration space being studied.

For example, you may want to benchmark a different LLM or you may want to change the GPU type to the one installed in your cluster. In the former case, you will add values to model_name and in the latter case, you will have to modify the domain of the gpu_type parameter to avoid validation errors.

To do this, open the experiment definition YAML file in a text editor, and add your GPU model to the list of values of gpu_type.

Then, reinstall this actuator by running the following from the actuator package directory:

pip install .

After that, you can use the new value of gpu_type in your experiments. For example, in the sample space definition file, the location to update will be:

- identifier: "gpu_type"
  propertyDomain:
    values: ["NVIDIA-A100-80GB-PCIe"]
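For example, to target a node with a different GPU, the entry would become the following. The product string shown is only an illustrative placeholder; use whatever oc describe node reported for your node:

- identifier: "gpu_type"
  propertyDomain:
    # Replace with the GPU product string reported by your node;
    # "NVIDIA-H100-80GB-HBM3" is an illustrative placeholder.
    values: ["NVIDIA-H100-80GB-HBM3"]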

Notes on the Random walk operation

vLLM testing uses an external environment (deployment + service) to run tests. Creating such an environment is resource-intensive, so to speed up experiment execution it is recommended to use group samplers for vLLM testing. This allows an environment to be created once and reused for all experiments that can run against it. In this case the group definition looks as follows:

grouping:
  - model
  - image
  - n_gpus
  - gpu_type
  - n_cpus
  - memory
  - max_batch_tokens
  - gpu_memory_utilization
  - dtype
  - cpu_offload
  - max_num_seq

For a complete example of configuring the random walk operation with group samplers, look here

A few ideas for further exploration

Try:
