ScaleBench_AI: LLM Inference Benchmarking Tool by Infobell IT
scalebench is a CLI-based tool designed to benchmark LLM (Large Language Model) inference endpoints. It helps evaluate performance using real-world prompts with configurable parameters and visualized results.
Features
- Easy-to-use CLI interface
- Benchmark LLM inference across multiple Inference Servers
- Measures key performance metrics: latency, throughput, and TTFT (Time to First Token)
- Support for varying input and output token lengths
- Simulate concurrent users to test scalability
- Determine the optimal number of concurrent users the server can handle while maintaining: TTFT < 2000 ms and Token latency < 200 ms
- Detailed logging and progress tracking
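The saturation criterion above (TTFT under 2000 ms and per-token latency under 200 ms) can be sketched as a simple check. This is a hypothetical helper for illustration, not part of scalebench's API:

```python
# Illustrative sketch of the saturation check described above:
# a concurrency level is acceptable only while both limits hold.
TTFT_LIMIT_MS = 2000.0
TOKEN_LATENCY_LIMIT_MS = 200.0

def within_thresholds(ttft_ms: float, token_latency_ms: float) -> bool:
    """Return True if the endpoint still meets both latency limits."""
    return ttft_ms < TTFT_LIMIT_MS and token_latency_ms < TOKEN_LATENCY_LIMIT_MS

print(within_thresholds(1500.0, 180.0))  # True: both limits respected
print(within_thresholds(2500.0, 180.0))  # False: TTFT limit exceeded
```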
Supported Inference Servers
- TGI
- vLLM
- Ollama
- Llamacpp
- NIMS
- SGLang
Performance Metrics
The following metrics are captured for each combination of input tokens, output tokens, and parallel users:
- Latency (ms/token)
- TTFT (ms)
- Throughput (tokens/sec)
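As a rough illustration (not scalebench's internal code), all three metrics can be derived from per-request timestamps:

```python
def compute_metrics(request_start: float, first_token_time: float,
                    request_end: float, output_tokens: int) -> dict:
    """Derive TTFT, per-token latency, and throughput from timings (seconds)."""
    ttft_ms = (first_token_time - request_start) * 1000.0
    generation_s = request_end - first_token_time
    latency_ms_per_token = (generation_s / output_tokens) * 1000.0
    throughput_tokens_per_sec = output_tokens / (request_end - request_start)
    return {
        "ttft_ms": ttft_ms,
        "latency_ms_per_token": latency_ms_per_token,
        "throughput_tokens_per_sec": throughput_tokens_per_sec,
    }

# 200 tokens, first token after 0.5 s, done at 10.5 s:
# TTFT = 500 ms, 10 s of generation -> 50 ms/token
m = compute_metrics(0.0, 0.5, 10.5, 200)
```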
Installation
You can install scalebench using pip:
pip install scalebench
Alternatively, you can install from source:
git clone https://github.com/Infobellit-Solutions-Pvt-Ltd/ScaleBench_AI
cd ScaleBench_AI
pip install -e .
Usage
scalebench provides a simple CLI interface for running LLM Inference benchmarks.
Below are the steps to run a sample test, assuming the generation endpoint is active.
1. Download the Dataset and create a default config.json
Before running a benchmark, you need to download and filter the dataset:
scalebench dataprep
This command will:
- Download the filtered ShareGPT dataset from Hugging Face
- Create a default config.json file in your working directory
2. Configure the Benchmark
Edit the generated config.json file to match your LLM server configuration. Below is a sample:
{
"_comment": "scalebench Configuration",
"out_dir": "Results",
"base_url": "http://localhost:8000/v1/completions",
"tokenizer_path": "/path/to/tokenizer/",
"inference_server": "vLLM",
"model": "/model",
"random_prompt": true,
"max_requests": 1,
"user_counts": [
10
],
"increment_user": [
100
],
"input_tokens": [
32
],
"output_tokens": [
256
]
}
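Before launching a run, you can sanity-check the edited file with a short script. The field names are taken from the sample above; the validation helper itself is just an illustration, not a scalebench command:

```python
import json

# Fields from the sample config above; adjust the list to your needs.
REQUIRED = ["out_dir", "base_url", "inference_server", "model",
            "random_prompt", "user_counts", "output_tokens"]

def load_config(path: str = "config.json") -> dict:
    """Load config.json and fail early if an expected field is missing."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED if key not in cfg]
    if missing:
        raise ValueError(f"config is missing fields: {missing}")
    return cfg
```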
Note: Modify base_url, tokenizer_path, model, and other fields according to your LLM deployment.
Prompt Configuration Modes
scalebench supports two input modes depending on your test requirements:
1. Fixed Input Tokens
If you want to run the benchmark with a fixed number of input tokens:
- Set "random_prompt": false
- Define both input_tokens and output_tokens explicitly
2. Random Input Length
If you prefer using randomized prompts from the dataset:
- Set "random_prompt": true
- Provide only output_tokens; scalebench will choose random input lengths from the dataset
User Load Configuration (for optimaluserrun)
To perform optimal user benchmarking:
- Use user_counts to set the initial number of concurrent users
- Use increment_user to define how many users to add per step
Example:
"user_counts": [10],
"increment_user": [100]
In this case, the benchmark starts with 10 concurrent users and adds 100 more in each iteration until the performance thresholds (TTFT < 2000 ms, token latency < 200 ms) are exceeded.
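Conceptually, the optimal-user search behaves like the loop below. This is a sketch of the described behavior with a placeholder run_benchmark callback; the actual measurement is performed by scalebench itself:

```python
def find_optimal_users(run_benchmark, start_users: int, step: int,
                       max_users: int = 10_000) -> int:
    """Raise concurrency until a threshold is breached; return the last
    user count that still met TTFT < 2000 ms and token latency < 200 ms.

    `run_benchmark(users)` is a placeholder that must return
    (ttft_ms, token_latency_ms) measured at that concurrency level.
    """
    best = 0
    users = start_users
    while users <= max_users:
        ttft_ms, token_latency_ms = run_benchmark(users)
        if ttft_ms >= 2000 or token_latency_ms >= 200:
            break  # thresholds exceeded; previous level was the optimum
        best = users
        users += step
    return best

# Toy model where latency grows linearly with users:
# 10 and 110 users pass, 210 users breaches the TTFT limit.
print(find_optimal_users(lambda u: (u * 15.0, u * 1.5), 10, 100))  # prints 110
```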
Tokenizer Configuration
scalebench allows two ways to configure the tokenizer used for benchmarking:
Option 1: Use a Custom Tokenizer
Set the TOKENIZER environment variable to the path of your desired tokenizer.
Option 2: Use Default Fallback
If TOKENIZER is not set or is empty, scalebench falls back to a built-in default tokenizer.
This ensures the tool remains functional, but the fallback tokenizer may not align with your model's behavior. Use it only for testing or when no tokenizer is specified.
💡 Best Practice: Always specify the correct tokenizer that matches your LLM model for accurate benchmarking results.
Combine these options as needed to benchmark your LLM endpoint effectively.
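The selection logic can be mimicked in a few lines. Note that DEFAULT_TOKENIZER below is a placeholder name for illustration, not scalebench's actual built-in default:

```python
import os

DEFAULT_TOKENIZER = "gpt2"  # placeholder; scalebench ships its own default

def resolve_tokenizer_path() -> str:
    """Use the TOKENIZER env var when set and non-empty, else fall back."""
    path = os.environ.get("TOKENIZER", "").strip()
    return path or DEFAULT_TOKENIZER
```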
3. Run the Benchmark
Option A: Standard Benchmarking
Use the start command to run a basic benchmark:
scalebench start --config path/to/config.json
Option B: Optimal User Load Benchmarking
To find the optimal number of concurrent users for your LLM endpoint:
scalebench optimaluserrun --config path/to/config.json
4. Plot the Results
Visualize the benchmark results using the built-in plotting tool:
scalebench plot --results-dir path/to/your/results_dir --config path/to/config.json
Note: The --config parameter is optional and defaults to config.json. It is used to read the random_prompt setting from the configuration file.
Output
scalebench will create a results directory (or the directory specified in out_dir) containing:
- CSV files with raw benchmark data
- Averaged results for each combination of users, input tokens, and output tokens
- Log files for each Locust run
Analyzing Results
After the benchmark completes, you can find CSV files in the output directory. These files contain information about latency, throughput, and TTFT for each test configuration.
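A quick way to inspect those CSVs is to average a column of interest. The column names here are assumptions; adjust them to the actual headers in your output files:

```python
import csv
from statistics import mean

def summarize(csv_path: str, column: str) -> float:
    """Average a numeric column (e.g. a TTFT or latency column) in a results CSV."""
    with open(csv_path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    return mean(values)
```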
File details
Details for the file scalebench-0.1.4.tar.gz.
File metadata
- Download URL: scalebench-0.1.4.tar.gz
- Upload date:
- Size: 25.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fe85012373df7f2189f54ed8a2d7ab895f9d5501050e7fda0f383cabae71544b |
| MD5 | dbfbe944894f8d40a15c4c820432326b |
| BLAKE2b-256 | a50d620495e1347d6650d8221e495459a336e3775d7f4ca82d08d028dc631049 |
File details
Details for the file scalebench-0.1.4-py3-none-any.whl.
File metadata
- Download URL: scalebench-0.1.4-py3-none-any.whl
- Upload date:
- Size: 23.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9989e55bfc100e08bc42bd29cd6c6762125d56b7728144e12c9adfcf46ccf07 |
| MD5 | 4c0d3bcab2233f6db98ddbc085aac782 |
| BLAKE2b-256 | 250e63c4c2b4e721a5b4650e474e13a6a02bf6ed5b0decb030f0d2ae2c312354 |