
vLLM Semantic Router Benchmark Suite

Python 3.8+ License: Apache 2.0

A comprehensive benchmark suite for evaluating semantic router performance against direct vLLM across multiple reasoning datasets. Perfect for researchers and developers working on LLM routing, evaluation, and performance optimization.

🎯 Key Features

  • 6 Major Reasoning Datasets: MMLU-Pro, ARC, GPQA, TruthfulQA, CommonsenseQA, HellaSwag
  • Router vs vLLM Comparison: Side-by-side performance evaluation
  • Multiple Evaluation Modes: NR (neutral), XC (explicit CoT), NR_REASONING (auto-reasoning)
  • Research-Ready Output: CSV files and publication-quality plots
  • Dataset-Agnostic Architecture: Easy to extend with new datasets
  • CLI Tools: Simple command-line interface for common operations
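The three evaluation modes differ only in how the request prompt is shaped. The sketch below illustrates the idea; the mode names come from this README, but the exact CoT suffix and the claim that NR_REASONING leaves the prompt text untouched are assumptions about the internals, not guaranteed behavior:

```python
def build_prompt(question: str, mode: str = "NR") -> str:
    """Shape a prompt for one evaluation mode (illustrative sketch).

    NR           - neutral: the question as-is, no reasoning instruction
    XC           - explicit CoT: append a step-by-step instruction
    NR_REASONING - neutral prompt text; reasoning is toggled on the
                   router/model side rather than in the prompt
    """
    if mode == "XC":
        # Assumed CoT suffix; the package may use different wording.
        return question + "\n\nLet's think step by step."
    return question  # NR and NR_REASONING leave the prompt unchanged

print(build_prompt("What is 2 + 2?", mode="XC"))
```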

🚀 Quick Start

Installation

pip install vllm-semantic-router-bench

Basic Usage

# Quick test on MMLU dataset
vllm-semantic-router-bench test --dataset mmlu --samples 5

# Full comparison between router and vLLM
vllm-semantic-router-bench compare --dataset arc --samples 10

# List available datasets
vllm-semantic-router-bench list-datasets

# Run comprehensive multi-dataset benchmark
vllm-semantic-router-bench comprehensive

Python API

from vllm_semantic_router_bench import DatasetFactory, list_available_datasets

# See which datasets are registered
print(list_available_datasets())

# Load a dataset
factory = DatasetFactory()
dataset = factory.create_dataset("mmlu")
questions, info = dataset.load_dataset(samples_per_category=10)

print(f"Loaded {len(questions)} questions from {info.name}")
print(f"Categories: {info.categories}")

📊 Supported Datasets

| Dataset | Domain | Categories | Difficulty | CoT Support |
|---|---|---|---|---|
| MMLU-Pro | Academic Knowledge | 57 subjects | Undergraduate | ✅ |
| ARC | Scientific Reasoning | Science | Grade School | ❌ |
| GPQA | Graduate Q&A | Graduate-level | Graduate | ❌ |
| TruthfulQA | Truthfulness | Truthfulness | Hard | ❌ |
| CommonsenseQA | Common Sense | Common Sense | Hard | ❌ |
| HellaSwag | Commonsense NLI | ~50 activities | Moderate | ❌ |

🔧 Advanced Usage

Custom Evaluation Script

import subprocess

# Run a detailed benchmark with custom parameters via the
# router-bench entry point
cmd = [
    "router-bench",
    "--dataset", "mmlu",
    "--samples-per-category", "20",
    "--run-router", "--router-models", "auto",
    "--run-vllm", "--vllm-models", "openai/gpt-oss-20b",
    "--vllm-exec-modes", "NR", "NR_REASONING",
    "--output-dir", "results/custom_test",
]

subprocess.run(cmd, check=True)
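The same pattern extends to sweeps over several datasets. This self-contained sketch only builds the command lists (the flag names are copied from the example above; the per-dataset output directory naming is an assumption, not a package convention):

```python
def make_bench_cmd(dataset, samples, out_root="results"):
    """Build one router-bench invocation as an argv list (sketch)."""
    return [
        "router-bench",
        "--dataset", dataset,
        "--samples-per-category", str(samples),
        "--run-router", "--router-models", "auto",
        "--output-dir", f"{out_root}/{dataset}_run",  # assumed layout
    ]

for cmd in (make_bench_cmd(d, 20) for d in ("mmlu", "arc", "gpqa")):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute
```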

Plotting Results

# Generate plots from benchmark results
bench-plot --router-dir results/router_mmlu \
           --vllm-dir results/vllm_mmlu \
           --output-dir results/plots \
           --dataset-name "MMLU-Pro"

📈 Research Output

The benchmark generates research-ready outputs:

  • CSV Files: Detailed per-question results and aggregated metrics
  • Master CSV: Combined results across all test runs
  • Plots: Accuracy and token usage comparisons
  • Summary Reports: Markdown reports with key findings
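The per-question CSVs can be post-processed with nothing but the standard library. A minimal sketch, assuming columns named `mode` and `correct` (with `True`/`False` values); check the header of your `detailed_results.csv` and adjust the names as needed:

```python
import csv
from collections import defaultdict

def accuracy_by_mode(path):
    """Aggregate per-question rows into accuracy per evaluation mode.

    Assumes 'mode' and 'correct' columns; these names are an
    assumption about the CSV schema, not documented guarantees.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["mode"]] += 1
            hits[row["mode"]] += row["correct"] == "True"
    return {mode: hits[mode] / totals[mode] for mode in totals}
```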

Example Output Structure

results/
├── research_results_master.csv          # Main research data
├── comparison_20250115_143022/
│   ├── router_mmlu/
│   │   └── detailed_results.csv
│   ├── vllm_mmlu/
│   │   └── detailed_results.csv
│   ├── plots/
│   │   ├── accuracy_comparison.png
│   │   └── token_usage_comparison.png
│   └── RESEARCH_SUMMARY.md

๐Ÿ› ๏ธ Development

Local Installation

git clone https://github.com/vllm-project/semantic-router
cd semantic-router/bench
pip install -e ".[dev]"

Adding New Datasets

  1. Create a new dataset implementation in dataset_implementations/
  2. Inherit from DatasetInterface
  3. Register in dataset_factory.py
  4. Add tests and documentation

from vllm_semantic_router_bench import DatasetInterface, Question, DatasetInfo

class MyDataset(DatasetInterface):
    def load_dataset(self, **kwargs):
        """Return (questions, DatasetInfo) for this dataset."""
        raise NotImplementedError

    def format_prompt(self, question, style="plain"):
        """Render a Question as a model prompt in the given style."""
        raise NotImplementedError
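Fleshed out, a toy implementation looks like the following. To keep the sketch runnable on its own it uses namedtuple stand-ins for `Question` and `DatasetInfo` with illustrative field names; in a real dataset you would import those types and inherit `DatasetInterface` as shown above:

```python
from collections import namedtuple

# Stand-ins so this sketch is self-contained; the real package
# provides DatasetInterface, Question, and DatasetInfo, whose
# actual fields may differ from these assumed ones.
Question = namedtuple("Question", ["text", "choices", "answer", "category"])
DatasetInfo = namedtuple("DatasetInfo", ["name", "categories"])

class ToyDataset:
    def load_dataset(self, **kwargs):
        """Return a fixed question list plus dataset metadata."""
        questions = [
            Question("What is 2 + 2?", ["3", "4", "5"], "4", "math"),
            Question("Largest planet?", ["Mars", "Jupiter"], "Jupiter", "science"),
        ]
        info = DatasetInfo("toy", sorted({q.category for q in questions}))
        return questions, info

    def format_prompt(self, question, style="plain"):
        """Render a question and its choices as a plain prompt."""
        choices = "\n".join(f"- {c}" for c in question.choices)
        return f"{question.text}\n{choices}\nAnswer:"

qs, info = ToyDataset().load_dataset()
print(info.name, len(qs))
```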

📋 Requirements

  • Python 3.8+
  • OpenAI API access (for model evaluation)
  • Hugging Face account (for dataset access)
  • 4GB+ RAM (for larger datasets)

Dependencies

  • openai>=1.0.0 - OpenAI API client
  • datasets>=2.14.0 - Hugging Face datasets
  • pandas>=1.5.0 - Data manipulation
  • matplotlib>=3.5.0 - Plotting
  • seaborn>=0.11.0 - Advanced plotting
  • tqdm>=4.64.0 - Progress bars

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Common Contributions

  • Adding new datasets
  • Improving evaluation metrics
  • Enhancing visualization
  • Performance optimizations
  • Documentation improvements

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📞 Support

  • GitHub Issues: Bug reports and feature requests
  • Documentation: Comprehensive guides and API reference
  • Community: Join our discussions and get help from other users

Made with ❤️ by the vLLM Semantic Router Team
