A library to benchmark LLMs via their API exposure
Benchmark LLM serving
benchmark_llm_serving is a script for benchmarking the serving API of LLMs. For now, two backends are implemented: mistral and vLLM (via happy-vllm, an API layer on top of vLLM that adds new endpoints and allows configuration via environment variables).
Installation
We recommend cloning the repository and building from source, since this also gives you the datasets used for the benchmarks (you can find them in src/benchmark_llm_serving/datasets):
git clone https://github.com/France-Travail/benchmark_llm_serving.git
cd benchmark_llm_serving
pip install -e .
You can also install benchmark_llm_serving using pip:
pip install benchmark_llm_serving
and then download the datasets directly from the repository (they are located in src/benchmark_llm_serving/datasets).
Quickstart
Launch the script bench_suite.py via the entrypoint bench-suite to run a complete benchmark of your API deployed via happy_vllm. This launches several individual benchmarks, whose results are aggregated to draw the graphs used to compare models. All the results are saved in the output folder (results by default).
You can specify the launch arguments either via the CLI or via a .env file (see .env.example for an example). If you cloned this repo and are benchmarking an API deployed via happy_vllm, you only need to specify either the base-url argument or the host/port pair. For example:
bench-suite --host 127.0.0.1 --port 5000
Be careful: with the default arguments (those written in .env.example), the whole bench suite can take a long time (around 15 hours).
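For scripted runs, the command line can be assembled programmatically. The following is a minimal sketch and not part of the library: build_bench_command is a hypothetical helper, and it assumes that the launch arguments listed in this README are exposed as CLI flags of the same name (only --host and --port appear in the example above; --base-url and --output-folder are inferred from the argument names).

```python
from typing import List, Optional


def build_bench_command(
    base_url: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
    output_folder: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the bench-suite entrypoint (hypothetical helper)."""
    # The README says either base-url or the host/port pair is enough.
    if base_url is None and (host is None or port is None):
        raise ValueError("Provide either base_url or both host and port")
    command = ["bench-suite"]
    if base_url is not None:
        # Assumption: the base-url argument is passed as --base-url.
        command += ["--base-url", base_url]
    else:
        command += ["--host", str(host), "--port", str(port)]
    if output_folder is not None:
        # Assumption: the output-folder argument is passed as --output-folder.
        command += ["--output-folder", output_folder]
    return command


command = build_bench_command(host="127.0.0.1", port=5000)
print(" ".join(command))  # bench-suite --host 127.0.0.1 --port 5000
```

Once the API is up, the resulting list can be handed to subprocess.run(command, check=True).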
Results
After the bench suite ends, you obtain a folder containing:
- The results of all the benchmarks (in the zip file raw_results.zip)
- A folder report containing the aggregation of all the individual benchmarks. More specifically:
  - parameters.json containing all the parameters for the bench, in particular the arguments used to launch the happy_vllm API
  - prompt_ingestion_graph.png containing the graph of the speed of prompt ingestion by the model. It is the time taken to produce the first token vs the length of the prompt. The speed is the slope of this line and is indicated in the title of the graph. The data used for this graph is contained in the data folder.
  - thresholds.csv is a .csv containing, for each pair of input length/output length, the number of parallel requests such that the kv cache usage is below 100% and the speed generation is above a specified threshold (by default, 20 tokens per second)
  - total_speed_generation_graph.png is a graph containing, for each pair of input length/output length, the total speed generation vs the number of parallel requests. So, for example, if the model can answer 10 parallel requests, each with a speed of 20 tokens per second, the value on the graph will be 200 tokens per second (20 x 10). The data used for this graph is contained in the data folder.
- If the backend is happy_vllm: a folder kv_cache_profile containing, for each pair of input length/output length, a graph showing the response of the LLM to n requests launched at the same time. On the y-axis, you have the kv cache usage, the number of requests running and the number of requests waiting. On the x-axis, you have the time. The graph is obtained by sending one request and watching the response of the LLM, then two requests, then three requests, ...
- A folder speed_generation containing, for each pair of input length/output length, a graph showing the speed generation (per request) in tokens per second vs the number of parallel requests. The graph also shows the time to the first generated token in milliseconds. If the backend is happy_vllm, it also shows the max kv cache usage for this number of parallel requests. The corresponding data is in the data folder
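The two aggregated quantities described above (the total speed generation and the threshold written to thresholds.csv) can be illustrated with a short sketch. This is not the library's code: the Measurement type, both function names and the sample values are hypothetical. It also assumes that the reported threshold is the largest number of parallel requests meeting both conditions, and that a speed equal to the threshold counts as acceptable.

```python
from typing import Dict, NamedTuple, Optional


class Measurement(NamedTuple):
    """Hypothetical summary of one run at a given number of parallel requests."""
    speed_per_request: float   # tokens per second, per request
    max_kv_cache_usage: float  # fraction of the KV cache used, 1.0 means 100%


def total_speed(parallel_requests: int, speed_per_request: float) -> float:
    """Total speed generation: per-request speed times the number of parallel requests."""
    return parallel_requests * speed_per_request


def threshold_parallel_requests(
    measurements: Dict[int, Measurement], speed_threshold: float = 20.0
) -> Optional[int]:
    """Largest number of parallel requests with KV cache usage below 100%
    and per-request speed at or above the threshold (assumed selection rule)."""
    valid = [
        n
        for n, m in measurements.items()
        if m.max_kv_cache_usage < 1.0 and m.speed_per_request >= speed_threshold
    ]
    return max(valid) if valid else None


# Illustrative values only, not real benchmark results.
measurements = {
    1: Measurement(speed_per_request=45.0, max_kv_cache_usage=0.1),
    10: Measurement(speed_per_request=20.0, max_kv_cache_usage=0.6),
    20: Measurement(speed_per_request=12.0, max_kv_cache_usage=1.0),
}

print(threshold_parallel_requests(measurements))  # 10
print(total_speed(10, 20.0))  # 200.0
```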
Note that the various input lengths are 32, 1024 and 4096, to simulate small, medium and long prompts. These lengths are approximate: the actual prompts are roughly this size, and generally a bit above it. The various output lengths are 16, 128 and 1024. Contrary to the input lengths, these are exact: the model produces exactly this number of tokens.
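To see how many input length/output length pairs these values give, here is a small sketch. It assumes that every input length is combined with every output length, which this README implies but does not state explicitly; the constant names are hypothetical.

```python
from itertools import product

# Lengths taken from the note above; the names are illustrative.
INPUT_LENGTHS = [32, 1024, 4096]   # approximate prompt sizes, in tokens
OUTPUT_LENGTHS = [16, 128, 1024]   # exact number of generated tokens

# Assumption: the suite covers the full cross product of both lists.
pairs = list(product(INPUT_LENGTHS, OUTPUT_LENGTHS))

for input_length, output_length in pairs:
    print(f"input ~{input_length} tokens, output {output_length} tokens")

print(len(pairs))  # 9
```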
Launch arguments
Here is a list of the arguments:
- model: The name of the model, needed to query the API. If you are using happy_vllm, you don't need to provide it since it is fetched automatically
- base-url: The base url for the API you want to benchmark
- host: The host of the API (if you specify a base-url, you don't need to specify a host)
- port: The port of the API (if you specify a base-url, you don't need to specify a port)
- dataset-folder: The folder where the datasets for querying the API are (by default, src/benchmark_llm_serving/datasets)
- output-folder: The folder where the results will be written (by default, the results folder)
- gpu-name: The name of the GPU on which the model is running (default None)
- step-live-metrics: The time, in seconds, between two queries of the /metrics/ endpoint of happy_vllm (default 0.01)
- max-queries: The maximal number of queries for each bench (default 1000)
- max-duration-prompt-ingestion: The max duration (in seconds) for the execution of an individual script benchmarking the prompt ingestion (default 900)
- max-duration-kv-cache-profile: The max duration (in seconds) for the execution of an individual script benchmarking the KV cache usage (default 900)
- max-duration-speed-generation: The max duration (in seconds) for the execution of an individual script benchmarking the speed generation (default 900). It is also the max duration permitted for the launch of all the scripts benchmarking the speed generation for a given pair of input length/output length.
- min-duration-speed-generation: For each individual script benchmarking the speed generation, if this min duration (in seconds) is reached and the target-queries-nb is also reached, the script will end (default 60)
- target-queries-nb-speed-generation: For each individual script benchmarking the speed generation, if this target-queries-nb is reached and the min-duration is also reached, the script will end (default 100)
- min-number-of-valid-queries: The minimal number of valid queries that should be present in a file for it to be considered for graph drawing (default 50)
- backend: Only happy_vllm and mistral are supported.
- completions-endpoint: The endpoint for completions (default /v1/completions)
- metrics-endpoint: The endpoint for the metrics (default /metrics/)
- info-endpoint: The info endpoint (default /v1/info)
- launch-arguments-endpoint: The endpoint for getting the launch arguments of the API (default /v1/launch_arguments)
- speed-threshold: The speed generation above which the model is considered ok (default 20). It is only used when writing thresholds.csv
- model-name: The name that should be displayed on the graphs (default None). If it is None, the name displayed will be the value of the model argument
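The interplay between min-duration-speed-generation, target-queries-nb-speed-generation and max-duration-speed-generation can be summarized as a stopping rule. The sketch below is one reading of the descriptions above, not the library's implementation: speed_generation_should_stop is a hypothetical function, and treating max-queries as an additional hard limit is an assumption.

```python
def speed_generation_should_stop(
    elapsed_seconds: float,
    completed_queries: int,
    min_duration: float = 60,
    target_queries_nb: int = 100,
    max_duration: float = 900,
    max_queries: int = 1000,
) -> bool:
    """Return True when an individual speed generation benchmark should end.

    Defaults mirror the documented defaults of the launch arguments.
    """
    # Hard limit: never run longer than the max duration.
    if elapsed_seconds >= max_duration:
        return True
    # Assumption: max-queries also acts as a hard limit on the run.
    if completed_queries >= max_queries:
        return True
    # Early exit only when BOTH the min duration and the target number
    # of queries have been reached.
    return elapsed_seconds >= min_duration and completed_queries >= target_queries_nb


print(speed_generation_should_stop(30, 200))   # False: min duration not reached
print(speed_generation_should_stop(60, 100))   # True: both conditions met
print(speed_generation_should_stop(900, 10))   # True: max duration reached
```

Requiring both conditions prevents a fast model from ending a run after only a few seconds, and prevents a slow one from ending with too few queries to draw a meaningful graph.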
File details
Details for the file benchmark_llm_serving-1.0.3.tar.gz.
File metadata
- Download URL: benchmark_llm_serving-1.0.3.tar.gz
- Upload date:
- Size: 12.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75ef46b506a7b5d107199362f49566d51de5eb03d41f6c4da8c871ded37e42e0 |
| MD5 | 9561132a1bf9e536ca2402d597a6cdfb |
| BLAKE2b-256 | 682d963035518e182d376f15554b2df93b18f7c654d9f4a7c8c876667a3dd893 |
File details
Details for the file benchmark_llm_serving-1.0.3-py3-none-any.whl.
File metadata
- Download URL: benchmark_llm_serving-1.0.3-py3-none-any.whl
- Upload date:
- Size: 12.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 220a3b48ba5adfd75b16b9b209575cbb7d51d6097b4915a49874579b3ff96103 |
| MD5 | 54a4f3953b29a30e2ec600d7661216f2 |
| BLAKE2b-256 | 5ef6ed174fe991e3f8a3300cbe08bb351f5cf1ad4c2f2e868ae96823d13cefcb |