
ScaleBench_AI: LLM Inference Benchmarking Tool by Infobell IT

scalebench is a CLI-based tool for benchmarking LLM (Large Language Model) inference endpoints. It evaluates performance using real-world prompts with configurable parameters, and visualizes the results.

Features

  • Easy-to-use CLI interface
  • Benchmarks LLM inference across multiple inference servers
  • Measures key performance metrics: latency, throughput, and TTFT (Time to First Token)
  • Supports varying input and output token lengths
  • Simulates concurrent users to test scalability
  • Determines the optimal number of concurrent users the server can handle while keeping TTFT < 2000 ms and token latency < 200 ms
  • Detailed logging and progress tracking

Supported Inference Servers

  • TGI
  • vLLM
  • Ollama
  • Llamacpp
  • NIMS
  • SGLang

Performance Metrics

The following metrics are captured for each combination of input tokens, output tokens, and number of parallel users:

  • Latency (ms/token)
  • TTFT (ms)
  • Throughput (tokens/sec)

Installation

You can install scalebench using pip:

pip install scalebench

Alternatively, you can install from source:

git clone https://github.com/Infobellit-Solutions-Pvt-Ltd/ScaleBench_AI
cd ScaleBench_AI
pip install -e .
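
Either way, the scalebench command should be available on your PATH afterwards. A quick sanity check (assuming the CLI exposes the usual --help flag):

scalebench --help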

Usage

scalebench provides a simple CLI interface for running LLM inference benchmarks.

Below are the steps to run a sample test, assuming the generation endpoint is active.

1. Download the Dataset and Create a Default config.json

Before running a benchmark, you need to download and filter the dataset:

scalebench dataprep

This command will:

  • Download the filtered ShareGPT dataset from Hugging Face
  • Create a default config.json file in your working directory

2. Configure the Benchmark

Edit the generated config.json file to match your LLM server configuration. Below is a sample:

{
    "_comment": "scalebench Configuration",
    "out_dir": "Results",
    "base_url": "http://localhost:8000/v1/completions",
    "tokenizer_path": "/path/to/tokenizer/",
    "inference_server": "vLLM",
    "model": "/model",
    "random_prompt": true,
    "max_requests": 1,
    "user_counts": [
        10
    ],
    "increment_user": [
        100
    ],
    "input_tokens": [
        32
    ],
    "output_tokens": [
        256
    ]
}

Note: Modify base_url, tokenizer_path, model, and other fields according to your LLM deployment.

Prompt Configuration Modes

scalebench supports two input modes depending on your test requirements:

1. Fixed Input Tokens

If you want to run the benchmark with a fixed number of input tokens:

  • Set "random_prompt": false
  • Define both input_tokens and output_tokens explicitly
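
For example, the relevant config.json fields for a fixed-token run might look like this (token counts are illustrative; the other fields stay as generated):

{
    "random_prompt": false,
    "input_tokens": [32, 128],
    "output_tokens": [256]
}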
2. Random Input Length

If you prefer using randomized prompts from the dataset:

  • Set "random_prompt": true
  • Provide only output_tokens — scalebench will choose random input lengths from the dataset
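
For example (output lengths are illustrative; input_tokens is omitted because input lengths come from the dataset):

{
    "random_prompt": true,
    "output_tokens": [128, 256]
}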

User Load Configuration (For optimaluserrun)

To perform optimal user benchmarking:

  • Use user_counts to set the initial number of concurrent users
  • Use increment_user to define how many users to add per step

Example:

"user_counts": [10],
"increment_user": [100]

In this case, the benchmark will start with 10 users and increase by 100 in each iteration until performance thresholds are hit.
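
For example, with the values above, successive iterations run 10, 110, 210, 310, ... concurrent users, stopping once a threshold (TTFT > 2000 ms or token latency > 200 ms) is crossed.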

Tokenizer Configuration

scalebench allows two ways to configure the tokenizer used for benchmarking:

Option 1: Use a Custom Tokenizer

Set the TOKENIZER environment variable to the path of your desired tokenizer.
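
For example (the path is a placeholder for your model's tokenizer directory):

export TOKENIZER=/path/to/tokenizer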

Option 2: Use Default Fallback

If TOKENIZER is not set or is empty, scalebench falls back to a built-in default tokenizer.

This keeps the tool functional, but the fallback tokenizer may not match your model's tokenization, so use it only for quick tests.


💡 Best Practice: Always specify the correct tokenizer that matches your LLM model for accurate benchmarking results.


Pick the prompt mode and tokenizer option that match your requirements to benchmark your LLM endpoint effectively.

3. Run the Benchmark

Option A: Standard Benchmarking

Use the start command to run a basic benchmark:

scalebench start --config path/to/config.json

Option B: Optimal User Load Benchmarking

To find the optimal number of concurrent users for your LLM endpoint:

scalebench optimaluserrun --config path/to/config.json

4. Plot the Results

Visualize the benchmark results using the built-in plotting tool:

scalebench plot --results-dir path/to/your/results_dir --config path/to/config.json

Note: The --config parameter is optional and defaults to config.json. It is used to read the random_prompt setting from the configuration file.
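
For example, with the sample configuration above (out_dir set to Results and config.json in the working directory):

scalebench plot --results-dir Results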

Output

scalebench will create a results directory (or the directory specified in out_dir) containing:

  • CSV files with raw benchmark data
  • Averaged results for each combination of users, input tokens, and output tokens
  • Log files for each Locust run

Analyzing Results

After the benchmark completes, you can find CSV files in the output directory. These files contain information about latency, throughput, and TTFT for each test configuration.
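
One way to start analyzing them is to load everything into pandas. Below is a minimal sketch, assuming the results live under Results/ as in the sample config; the actual file layout and column headers come from your run, so inspect them before aggregating:

import glob

import pandas as pd

# Gather every CSV scalebench wrote under the results directory
# (the "Results" path matches the sample config's out_dir).
paths = glob.glob("Results/**/*.csv", recursive=True)
frames = [pd.read_csv(p) for p in paths]
df = pd.concat(frames, ignore_index=True)

# Print the real column names first; the headers for latency, TTFT,
# and throughput are produced by scalebench and may differ from the
# metric names used in this README.
print(df.columns.tolist())
print(df.describe())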
