# vLLM Semantic Router Benchmark Suite
A comprehensive benchmark suite for evaluating semantic router performance against direct vLLM across multiple reasoning datasets. Perfect for researchers and developers working on LLM routing, evaluation, and performance optimization.
## Key Features
- 6 Major Reasoning Datasets: MMLU-Pro, ARC, GPQA, TruthfulQA, CommonsenseQA, HellaSwag
- Router vs vLLM Comparison: Side-by-side performance evaluation
- Multiple Evaluation Modes: NR (neutral), XC (explicit CoT), NR_REASONING (auto-reasoning)
- Research-Ready Output: CSV files and publication-quality plots
- Dataset-Agnostic Architecture: Easy to extend with new datasets
- CLI Tools: Simple command-line interface for common operations
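To make the three evaluation modes concrete, here is a hypothetical sketch of how each mode could shape the prompt sent to a model. The `build_prompt` function and its templates are illustrative assumptions, not the package's actual implementation; in particular, NR_REASONING toggles reasoning on the model side rather than in the prompt text.

```python
# Hypothetical illustration of the three evaluation modes.
# These templates are NOT the package's real prompt templates.
def build_prompt(question: str, mode: str = "NR") -> str:
    if mode == "NR":  # neutral: the question is sent as-is
        return question
    if mode == "XC":  # explicit chain-of-thought instruction appended
        return question + "\nLet's think step by step."
    if mode == "NR_REASONING":  # neutral prompt; reasoning enabled via a model flag
        return question
    raise ValueError(f"unknown mode: {mode}")

print(build_prompt("What is 2 + 2?", "XC"))
```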
## Quick Start

### Installation

```shell
pip install vllm-semantic-router-bench
```
### Basic Usage

```shell
# Quick test on the MMLU dataset
vllm-semantic-router-bench test --dataset mmlu --samples 5

# Full comparison between router and vLLM
vllm-semantic-router-bench compare --dataset arc --samples 10

# List available datasets
vllm-semantic-router-bench list-datasets

# Run the comprehensive multi-dataset benchmark
vllm-semantic-router-bench comprehensive
```
### Python API

```python
from vllm_semantic_router_bench import DatasetFactory, list_available_datasets

# Load a dataset
factory = DatasetFactory()
dataset = factory.create_dataset("mmlu")
questions, info = dataset.load_dataset(samples_per_category=10)

print(f"Loaded {len(questions)} questions from {info.name}")
print(f"Categories: {info.categories}")
```
## Supported Datasets
| Dataset | Domain | Categories | Difficulty | CoT Support |
|---|---|---|---|---|
| MMLU-Pro | Academic Knowledge | 57 subjects | Undergraduate | ✅ |
| ARC | Scientific Reasoning | Science | Grade School | ✅ |
| GPQA | Graduate Q&A | Graduate-level | Graduate | ✅ |
| TruthfulQA | Truthfulness | Truthfulness | Hard | ✅ |
| CommonsenseQA | Common Sense | Common Sense | Hard | ✅ |
| HellaSwag | Commonsense NLI | ~50 activities | Moderate | ✅ |
## Advanced Usage

### Custom Evaluation Script
```python
import subprocess

# Run a detailed benchmark with custom parameters
cmd = [
    "router-bench",  # main benchmark script
    "--dataset", "mmlu",
    "--samples-per-category", "20",
    "--run-router", "--router-models", "auto",
    "--run-vllm", "--vllm-models", "openai/gpt-oss-20b",
    "--vllm-exec-modes", "NR", "NR_REASONING",
    "--output-dir", "results/custom_test",
]
subprocess.run(cmd, check=True)  # raise if the benchmark exits non-zero
```
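The same pattern extends to sweeping several datasets: build one `router-bench` invocation per dataset and launch them in sequence. This is a sketch under the assumption that the flags above are accepted unchanged for each dataset; verify against your installed version first.

```python
import subprocess  # used when you actually launch the runs

def sweep_commands(datasets, samples="20"):
    """Build one router-bench invocation per dataset (flags assumed as above)."""
    for name in datasets:
        yield [
            "router-bench",
            "--dataset", name,
            "--samples-per-category", samples,
            "--run-router", "--router-models", "auto",
            "--output-dir", f"results/sweep_{name}",
        ]

# To launch the sweep (stops on the first failing run):
#     for cmd in sweep_commands(["mmlu", "arc", "gpqa"]):
#         subprocess.run(cmd, check=True)
```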
### Plotting Results

```shell
# Generate plots from benchmark results
bench-plot --router-dir results/router_mmlu \
    --vllm-dir results/vllm_mmlu \
    --output-dir results/plots \
    --dataset-name "MMLU-Pro"
```
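For a quick custom figure beyond what `bench-plot` produces, a minimal matplotlib sketch follows. The accuracy numbers are placeholders for illustration only; substitute values read from your own `detailed_results.csv` files.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Placeholder accuracies -- NOT real benchmark results
modes = ["router", "vllm NR", "vllm NR_REASONING"]
accuracy = [0.71, 0.64, 0.69]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(modes, accuracy)
ax.set_ylabel("Accuracy")
ax.set_ylim(0, 1)
ax.set_title("Router vs direct vLLM (placeholder data)")
fig.tight_layout()
fig.savefig("accuracy_comparison.png")
```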
## Research Output
The benchmark generates research-ready outputs:
- CSV Files: Detailed per-question results and aggregated metrics
- Master CSV: Combined results across all test runs
- Plots: Accuracy and token usage comparisons
- Summary Reports: Markdown reports with key findings
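A master CSV of this shape can also be rebuilt from the per-run files with nothing but the standard library. The sketch below assumes each run directory contains a `detailed_results.csv` with a consistent header; the `run` column it adds is our own convention, not necessarily the package's.

```python
import csv
from pathlib import Path

def build_master_csv(results_root: str, out_path: str) -> int:
    """Concatenate every detailed_results.csv under results_root,
    tagging each row with the run directory it came from."""
    rows = []
    for f in sorted(Path(results_root).rglob("detailed_results.csv")):
        with f.open(newline="") as fh:
            for row in csv.DictReader(fh):
                row["run"] = f.parent.name  # e.g. router_mmlu / vllm_mmlu
                rows.append(row)
    if rows:
        with open(out_path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
    return len(rows)
```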
### Example Output Structure

```
results/
├── research_results_master.csv        # Main research data
└── comparison_20250115_143022/
    ├── router_mmlu/
    │   └── detailed_results.csv
    ├── vllm_mmlu/
    │   └── detailed_results.csv
    ├── plots/
    │   ├── accuracy_comparison.png
    │   └── token_usage_comparison.png
    └── RESEARCH_SUMMARY.md
```
## Development

### Local Installation

```shell
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/bench
pip install -e ".[dev]"
```
Adding New Datasets
- Create a new dataset implementation in
dataset_implementations/ - Inherit from
DatasetInterface - Register in
dataset_factory.py - Add tests and documentation
```python
from vllm_semantic_router_bench import DatasetInterface, Question, DatasetInfo

class MyDataset(DatasetInterface):
    def load_dataset(self, **kwargs):
        # Implementation here
        pass

    def format_prompt(self, question, style="plain"):
        # Implementation here
        pass
```
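To show the `load_dataset` / `format_prompt` contract end to end without installing the package, here is a self-contained toy version. The `Question` and `DatasetInfo` classes below are simplified local stand-ins, not the package's real types, and the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Simplified stand-ins for the package's Question / DatasetInfo types.
@dataclass
class Question:
    prompt: str
    options: List[str]
    answer: str
    category: str

@dataclass
class DatasetInfo:
    name: str
    categories: List[str] = field(default_factory=list)

class MyToyDataset:
    """Toy dataset illustrating the load_dataset / format_prompt contract."""

    def load_dataset(self, samples_per_category: int = 1, **kwargs):
        questions = [
            Question("What is 2 + 2?", ["3", "4", "5"], "B", "arithmetic"),
        ][:samples_per_category]
        return questions, DatasetInfo(name="toy", categories=["arithmetic"])

    def format_prompt(self, question: Question, style: str = "plain") -> str:
        letters = "ABCDE"
        opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(question.options))
        prompt = f"{question.prompt}\n{opts}\nAnswer:"
        if style == "cot":  # explicit chain-of-thought variant
            prompt = prompt.replace("Answer:", "Let's think step by step.\nAnswer:")
        return prompt
```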
## Requirements
- Python 3.8+
- OpenAI API access (for model evaluation)
- Hugging Face account (for dataset access)
- 4GB+ RAM (for larger datasets)
### Dependencies

- `openai>=1.0.0` - OpenAI API client
- `datasets>=2.14.0` - Hugging Face datasets
- `pandas>=1.5.0` - Data manipulation
- `matplotlib>=3.5.0` - Plotting
- `seaborn>=0.11.0` - Advanced plotting
- `tqdm>=4.64.0` - Progress bars
## Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
### Common Contributions
- Adding new datasets
- Improving evaluation metrics
- Enhancing visualization
- Performance optimizations
- Documentation improvements
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Links
- Documentation: https://vllm-semantic-router.com
- GitHub: https://github.com/vllm-project/semantic-router
- Issues: https://github.com/vllm-project/semantic-router/issues
- PyPI: https://pypi.org/project/vllm-semantic-router-bench/
## Support
- GitHub Issues: Bug reports and feature requests
- Documentation: Comprehensive guides and API reference
- Community: Join our discussions and get help from other users
Made with ❤️ by the vLLM Semantic Router Team
## File details

### vllm_semantic_router_bench-1.0.0.tar.gz

- Size: 53.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3e11ce635573814d1aebbc6a9f087edfdd044754dd8bc44b33ed17ad5d845f29` |
| MD5 | `04c17bc03a0a3c4cfd2dd15e730ebbc5` |
| BLAKE2b-256 | `7530268500ec3ea185d761f17e20504e8e2c6fcaa7d556e018a14a0c85611899` |
### vllm_semantic_router_bench-1.0.0-py3-none-any.whl

- Size: 45.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7ab7d6ac106ab1169fa880ca86a19d9f680cdadcb546db437c1c5a33c4467b7d` |
| MD5 | `24705d06f5ffb7aaa703188629e28902` |
| BLAKE2b-256 | `114a28f3aa49c315f39919c9d7ae4be0ea07f8f8bf72e1972c070717a67a7e4a` |