Skip to main content

A lightweight, high-performance benchmarking tool for NVIDIA NIM LLMs

Project description

nimbench cli

A lightweight, high-performance benchmarking tool for NVIDIA NIM LLMs.
Measure latency, throughput, and reliability with style.


🚀 Overview

nimbench is a surgical CLI tool designed to benchmark NVIDIA NIM (NVIDIA Inference Microservices) chat models. Powered by httpx for connection-pooled requests and rich for beautiful terminal presentation, it handles model discovery, intelligent filtering, and robust benchmarking, providing you with a clean, formatted performance report.

✨ Key Features

  • 🔍 Auto-Discovery: Automatically finds and ranks all available models from your NVIDIA NIM endpoint.
  • 📊 Precise Metrics: Measures Median, Min, Max latency and Tokens Per Second (TPS).
  • ⏱️ Progress & ETA: Live interactive progress bar with percentage and estimated time remaining.
  • 🌈 Rich Terminal UI: Beautiful, color-coded status tables and highlights using rich.
  • 🔌 Connection Pooling: Uses httpx to reuse TCP connections, minimizing handshake overhead for accurate latency comparisons.
  • 🛡️ Intelligent Retries: Automatically handles rate limits (429) by respecting Retry-After headers and applies temperature fallbacks when needed.
  • 📝 Failure Analysis: Detailed breakdown of failure reasons (Not Provisioned, Timeout, Unsupported, etc.).
  • 💾 Skip Cache: Remembers failed models to speed up subsequent runs.

🔬 What it measures

nimbench measures wall-clock request time for a minimal POST /v1/chat/completions call. It is designed to evaluate request/response latency rather than long-form output quality.

Default Request Shape:

  • Prompt: Reply with one short word.
  • Max Tokens: 8
  • Temperature: 0.0 (with automatic fallback to 0.1 if rejected).

The CLI reports tokens per second for each model. It uses server-provided metrics when available, or derives an approximate rate from completion_tokens / wall_time.


🛠️ How it behaves

  • Discovery: Fetches all models from GET /v1/models and filters for likely chat-capable IDs.
  • Sequential Execution: Benchmarking is performed sequentially to preserve the 40 RPM (Requests Per Minute) cap.
  • Intelligent Skipping: A local skip cache is maintained for models that are not provisioned, reject chat input, or repeatedly timeout.
  • Cap Logic: The --limit flag means "stop after N successful benchmarks", preventing your rate limit from being wasted on unavailable models.

📦 Installation

Requires Python 3.10+.

git clone https://github.com/your-username/nimbench.git
cd nimbench
pip install -e .

🚀 Quick Start

Benchmark the top 10 most likely chat models:

python3 -m nimbench --limit 10

Advanced Usage

# Benchmark everything (including non-chat) with 3 repeats each
python3 -m nimbench --all-models --repeats 3

# Filter for specific models using regex
python3 -m nimbench --pattern "llama|nemotron|mistral"

# Export results to JSON
python3 -m nimbench --limit 5 --json > results.json

⚙️ Configuration & Options

API Key Precedence

  1. --api-key command-line argument.
  2. NVIDIA_API_KEY environment variable.
  3. Interactive prompt.

Options

Option Description
--api-key KEY NVIDIA API key
--base-url URL API base URL (Default: https://integrate.api.nvidia.com/v1)
--limit N Stop after N successful benchmarks
--pattern REGEX Only consider model ids matching REGEX
--timeout SECONDS Request timeout for each HTTP call
--repeats N Requests per model
--json Emit JSON instead of a text table
--rpm N Request rate cap (Default: 40)
--all-models Benchmark full catalog instead of chat-only default
--refresh-cache Ignore the local skip cache for this run

Environment Variables

  • NVIDIA_API_KEY: Your NVIDIA API key.
  • NIMBENCH_CACHE_DIR: Set this to override the default local skip cache directory.

🧪 Testing

Run the comprehensive test suite:

python3 -m unittest discover tests

Built with 💚 for the LLM community.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nimbench-0.1.0.tar.gz (181.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nimbench-0.1.0-py3-none-any.whl (176.6 kB view details)

Uploaded Python 3

File details

Details for the file nimbench-0.1.0.tar.gz.

File metadata

  • Download URL: nimbench-0.1.0.tar.gz
  • Upload date:
  • Size: 181.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nimbench-0.1.0.tar.gz
Algorithm Hash digest
SHA256 76ced5d45a3bf585cef90a98bab9a40110fe45df3a19c305ba5e4ff89285a127
MD5 a4df68a00116fefd041a20077624ed50
BLAKE2b-256 26e71aa17f7f64dc790b6279afcc261f4e7bc885e0109e4591843805db690d0a

See more details on using hashes here.

Provenance

The following attestation bundles were made for nimbench-0.1.0.tar.gz:

Publisher: publish.yml on youxufkhan/nimbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file nimbench-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: nimbench-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 176.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nimbench-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e58d4e40e701c8cab54f8f5033bb35f4549f7bd9797382485df2406610fb215a
MD5 c0899792d8ddba6b61db01793f302c44
BLAKE2b-256 2612ead95d0d16523e5ea868b7ba90a8d6028d5193eeb6482f8a9e768893dd0f

See more details on using hashes here.

Provenance

The following attestation bundles were made for nimbench-0.1.0-py3-none-any.whl:

Publisher: publish.yml on youxufkhan/nimbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page