ScaleBench_AI: LLM Inference Benchmarking Tool by Infobell IT

scalebench is a CLI-based tool designed to benchmark LLM (Large Language Model) inference endpoints. It helps evaluate performance using real-world prompts with configurable parameters and visualized results.

Features

  • Easy-to-use CLI
  • Benchmark LLM inference across multiple Inference Servers
  • Measures key performance metrics: latency, throughput, and TTFT (Time to First Token)
  • Support for varying input and output token lengths
  • Simulate concurrent users to test scalability
  • Determine the highest number of concurrent users the server can sustain while keeping TTFT < 2000 ms and token latency < 200 ms
  • Detailed logging and progress tracking

Supported Inference Servers

  • TGI
  • vLLM
  • Ollama
  • Llamacpp
  • NIMS
  • SGLang

Performance Metrics

The following metrics are captured for each combination of input tokens, output tokens, and parallel users during a benchmark run:

  • Latency (ms/token)
  • TTFT (ms)
  • Throughput (tokens/sec)
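
To make the units concrete, here is a minimal sketch of how these three metrics can be derived from per-request timestamps. The RequestTiming structure and the exact definitions (for example, latency as total request time divided by output tokens) are assumptions for illustration; scalebench may compute them differently.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in seconds for one streamed request (hypothetical structure)."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def ttft_ms(t: RequestTiming) -> float:
    """Time to first token, in milliseconds."""
    return (t.first_token_at - t.sent_at) * 1000.0


def token_latency_ms(t: RequestTiming) -> float:
    """Average time per generated token, in ms/token (assumed definition)."""
    if t.output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return (t.finished_at - t.sent_at) * 1000.0 / t.output_tokens


def throughput_tokens_per_sec(t: RequestTiming) -> float:
    """Generated tokens per second for a single request (assumed definition)."""
    elapsed = t.finished_at - t.sent_at
    if elapsed <= 0:
        raise ValueError("finished_at must be later than sent_at")
    return t.output_tokens / elapsed


sample = RequestTiming(sent_at=0.0, first_token_at=0.25, finished_at=4.0, output_tokens=200)
print(ttft_ms(sample))                    # 250.0
print(token_latency_ms(sample))           # 20.0
print(throughput_tokens_per_sec(sample))  # 50.0
```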

Installation

You can install scalebench using pip:

pip install scalebench

Alternatively, you can install from source:

git clone https://github.com/Infobellit-Solutions-Pvt-Ltd/ScaleBench_AI
cd ScaleBench_AI
pip install -e .

Usage

scalebench provides a simple CLI for running LLM inference benchmarks.

Below are the steps to run a sample test, assuming the generation endpoint is active.

1. Download the Dataset and create a default config.json

Before running a benchmark, download the dataset and generate a default configuration:

scalebench dataprep

This command will:

  • Download the filtered ShareGPT dataset from Hugging Face
  • Create a default config.json file in your working directory

2. Configure the Benchmark

Edit the generated config.json file to match your LLM server configuration. Below is a sample:

{
    "_comment": "scalebench Configuration",
    "out_dir": "Results",
    "base_url": "http://localhost:8000/v1/completions",
    "tokenizer_path": "/path/to/tokenizer/",
    "inference_server": "vLLM",
    "model": "/model",
    "random_prompt": true,
    "max_requests": 1,
    "user_counts": [
        10
    ],
    "increment_user": [
        100
    ],
    "input_tokens": [
        32
    ],
    "output_tokens": [
        256
    ]
}

Note: Modify base_url, tokenizer_path, model, and other fields according to your LLM deployment.
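
If you prefer to generate the file from a script, the sketch below builds a config with the same keys as the sample above and rejects misspelled field names. The build_config helper and the example values are hypothetical and not part of scalebench.

```python
import copy
import json

# Mirrors the sample config.json shown above.
DEFAULT_CONFIG = {
    "_comment": "scalebench Configuration",
    "out_dir": "Results",
    "base_url": "http://localhost:8000/v1/completions",
    "tokenizer_path": "/path/to/tokenizer/",
    "inference_server": "vLLM",
    "model": "/model",
    "random_prompt": True,
    "max_requests": 1,
    "user_counts": [10],
    "increment_user": [100],
    "input_tokens": [32],
    "output_tokens": [256],
}


def build_config(**overrides):
    """Return a copy of the sample config with the given fields replaced."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unknown config fields: {sorted(unknown)}")
    config = copy.deepcopy(DEFAULT_CONFIG)
    config.update(overrides)
    return config


def save_config(config, path="config.json"):
    """Write the config as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=4)


# Example values only; adjust them to your own deployment.
config = build_config(output_tokens=[128, 256], user_counts=[1, 5, 10])
print(json.dumps(config, indent=4))
```

Call save_config(config) to write the result to config.json in the working directory.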

Prompt Configuration Modes

scalebench supports two input modes depending on your test requirements:

1. Fixed Input Tokens

If you want to run the benchmark with a fixed number of input tokens:

  • Set "random_prompt": false
  • Define both input_tokens and output_tokens explicitly

2. Random Input Length

If you prefer using randomized prompts from the dataset:

  • Set "random_prompt": true
  • Provide only output_tokens; scalebench will choose random input lengths from the dataset
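
The two modes can be checked before a run with a small helper like the one below. This is only a sketch of the rules described above; check_prompt_mode is a hypothetical function, and scalebench's own validation may differ.

```python
def check_prompt_mode(config):
    """Return "random" or "fixed", raising ValueError if required fields are missing."""
    if not config.get("output_tokens"):
        raise ValueError("output_tokens is required in both modes")
    if config.get("random_prompt"):
        # Random mode: input lengths come from the dataset, so input_tokens is not needed.
        return "random"
    if not config.get("input_tokens"):
        raise ValueError("input_tokens is required when random_prompt is false")
    return "fixed"


print(check_prompt_mode({"random_prompt": True, "output_tokens": [256]}))
print(check_prompt_mode({"random_prompt": False, "input_tokens": [32], "output_tokens": [256]}))
```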

User Load Configuration (For optimaluserrun)

To perform optimal user benchmarking:

  • Use user_counts to set the initial number of concurrent users
  • Use increment_user to define how many users to add per step

Example:

"user_counts": [10],
"increment_user": [100]

In this case, the benchmark starts with 10 users and adds 100 users in each iteration until a performance threshold (TTFT < 2000 ms, token latency < 200 ms) is no longer met.
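
The ramp-up can be pictured as a simple loop. The sketch below assumes a linear ramp that stops at the first failing step; find_optimal_users, the measure callback, and the max_users cap are hypothetical, and the actual search in scalebench may be more refined.

```python
from typing import Callable, Optional, Tuple

TTFT_LIMIT_MS = 2000.0           # threshold from the feature list
TOKEN_LATENCY_LIMIT_MS = 200.0   # threshold from the feature list


def find_optimal_users(
    start: int,
    step: int,
    measure: Callable[[int], Tuple[float, float]],
    max_users: int = 10_000,
) -> Optional[int]:
    """Return the largest tested user count that met both thresholds, or None.

    measure(users) must return (ttft_ms, token_latency_ms) for that load.
    """
    best = None
    users = start
    while users <= max_users:
        ttft, latency = measure(users)
        if ttft >= TTFT_LIMIT_MS or latency >= TOKEN_LATENCY_LIMIT_MS:
            break  # threshold violated; stop ramping
        best = users
        users += step
    return best


def fake_measure(users: int) -> Tuple[float, float]:
    """Stand-in for a real benchmark run: both metrics grow linearly with load."""
    return 5.0 * users, 0.5 * users


print(find_optimal_users(start=10, step=100, measure=fake_measure))  # 310
```

With the stand-in measurements, loads of 10, 110, 210, and 310 users pass, and 410 users breaches the TTFT limit, so the loop reports 310.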

Tokenizer Configuration

scalebench allows two ways to configure the tokenizer used for benchmarking:

Option 1: Use a Custom Tokenizer

Set the TOKENIZER environment variable to the path of your desired tokenizer.

Option 2: Use Default Fallback

If TOKENIZER is not set or is empty, scalebench falls back to a built-in default tokenizer.

This ensures the tool remains functional, but the fallback tokenizer may not align with your model's behavior. Use it only for testing or when no tokenizer is specified.


💡 Best Practice: Always specify the correct tokenizer that matches your LLM model for accurate benchmarking results.
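
The selection logic amounts to an environment lookup with a fallback. In the sketch below, resolve_tokenizer and the FALLBACK_TOKENIZER placeholder are hypothetical; the name of the real built-in default is not given here.

```python
import os

# Placeholder only: the actual built-in default tokenizer is not named in this document.
FALLBACK_TOKENIZER = "<built-in default>"


def resolve_tokenizer(env=None):
    """Return (tokenizer, used_fallback) based on the TOKENIZER variable."""
    env = os.environ if env is None else env
    value = env.get("TOKENIZER", "")
    if value:
        return value, False
    # Unset or empty: fall back, which may not match your model's tokenization.
    return FALLBACK_TOKENIZER, True


print(resolve_tokenizer({"TOKENIZER": "/path/to/tokenizer/"}))
print(resolve_tokenizer({}))
```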


Combine these settings as needed to benchmark your LLM endpoint.

3. Run the Benchmark

Option A: Standard Benchmarking

Use the start command to run a basic benchmark:

scalebench start --config path/to/config.json

Option B: Optimal User Load Benchmarking

To find the optimal number of concurrent users for your LLM endpoint:

scalebench optimaluserrun --config path/to/config.json

4. Plot the Results

Visualize the benchmark results using the built-in plotting tool:

scalebench plot --results-dir path/to/your/results_dir
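
To chain the benchmark and plotting steps from a script, one option is a thin wrapper around the CLI. The sketch below uses only the subcommands and flags shown above; the helper names are hypothetical, and run_pipeline assumes scalebench is installed and on your PATH.

```python
import subprocess
from typing import List


def start_cmd(config_path: str) -> List[str]:
    """Command line for a standard benchmark run."""
    return ["scalebench", "start", "--config", config_path]


def optimal_cmd(config_path: str) -> List[str]:
    """Command line for an optimal user load run."""
    return ["scalebench", "optimaluserrun", "--config", config_path]


def plot_cmd(results_dir: str) -> List[str]:
    """Command line for plotting a results directory."""
    return ["scalebench", "plot", "--results-dir", results_dir]


def run_pipeline(config_path: str, results_dir: str, optimal: bool = False) -> None:
    """Run the benchmark, then plot; raises CalledProcessError if a step fails."""
    first = optimal_cmd(config_path) if optimal else start_cmd(config_path)
    for cmd in (first, plot_cmd(results_dir)):
        subprocess.run(cmd, check=True)


print(" ".join(start_cmd("config.json")))
print(" ".join(plot_cmd("Results")))
```

For example, run_pipeline("config.json", "Results") runs a standard benchmark and then plots the Results directory.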

Output

scalebench will create a results directory (or the directory specified in out_dir) containing:

  • CSV files with raw benchmark data
  • Averaged results for each combination of users, input tokens, and output tokens
  • Log files for each Locust run

Analyzing Results

After the benchmark completes, you can find CSV files in the output directory. These files contain information about latency, throughput, and TTFT for each test configuration.
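
For a quick look at a results file without the plotting tool, the sketch below averages every numeric column of a CSV. The column names in the sample data are made up for illustration; check the headers of your own files, since the actual layout is not documented here.

```python
import csv
import io
from statistics import mean


def average_numeric_columns(lines):
    """Average each column whose values parse as numbers.

    lines is any iterable of CSV text lines, such as an open file.
    """
    columns = {}
    for row in csv.DictReader(lines):
        for name, value in row.items():
            try:
                columns.setdefault(name, []).append(float(value))
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
    return {name: mean(values) for name, values in columns.items() if values}


# Hypothetical column names, for illustration only.
sample = io.StringIO("users,ttft_ms,latency_ms\n10,120,20\n10,180,30\n")
print(average_numeric_columns(sample))
```

To analyze a real file, pass an open file object instead, for example `with open(path, newline="") as fh: average_numeric_columns(fh)`.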
