Production-grade HuggingFace inference benchmarking tool
HF Inference Benchmark
Production-grade benchmarking infrastructure for HuggingFace inference workloads.
hf-inference-benchmark is a reproducible, device-agnostic benchmarking system that measures the real operational cost of running LLMs — latency, throughput, and memory — under production-like conditions.
It answers one critical question:
“Will this model crash my server — and how fast can it actually run?”
Why This Tool Exists
Most public HuggingFace benchmarking scripts:
• Measure only a single forward pass
• Ignore warmup behavior
• Ignore GPU synchronization
• Ignore memory allocator behavior
• Produce misleading results
This tool implements the same benchmarking discipline used by real ML infrastructure teams.
Benchmarking Pipeline
Model Load
↓
Warmup Passes
↓
Synchronized Execution
↓
Latency Profiling (P50 / P95 / Avg)
↓
Token Counting
↓
Throughput Calculation (tokens/sec)
↓
Peak Memory Tracking
↓
Structured JSON Export
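The pipeline above can be sketched in plain Python. `benchmark` and `run_model` are illustrative names, not the package's API; `run_model` stands in for the actual HuggingFace generation call, and on GPU you would call `torch.cuda.synchronize()` before reading each timestamp and use `torch.cuda.max_memory_allocated()` instead of `tracemalloc`:

```python
import statistics
import time
import tracemalloc

def benchmark(run_model, warmup=3, iters=10):
    """Measure latency, throughput, and peak memory for run_model()."""
    # Warmup passes: let kernels, caches, and allocators stabilize first
    for _ in range(warmup):
        run_model()

    tracemalloc.start()  # peak-memory tracking (CPU-side sketch)
    latencies_ms = []
    total_tokens = 0
    for _ in range(iters):
        start = time.perf_counter()   # on GPU: synchronize before timing
        tokens = run_model()
        latencies_ms.append((time.perf_counter() - start) * 1000)
        total_tokens += len(tokens)   # token counting
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    lat = sorted(latencies_ms)
    return {
        "latency_p50": statistics.median(lat),
        "latency_p95": lat[int(0.95 * (len(lat) - 1))],
        "throughput": total_tokens / (sum(latencies_ms) / 1000.0),
        "memory_mb": peak_bytes / 1e6,
    }
```

The final stage is then just serializing this dict with `json.dumps` and writing it to the `--out` path.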
Metric Definitions
| Metric | Meaning |
|---|---|
| latency_p50 | Median inference latency (ms) |
| throughput | Real generation speed (tokens/sec) |
| memory_mb | Peak RAM/VRAM usage (MB) |
| warmup | Untimed passes run first so kernels and caches stabilize |
| synchronization | Device synchronization around timers, for accurate GPU timing |
User Installation
# From PyPI
pip install hf-inference-benchmark
# From Source (for Developers)
git clone https://github.com/rgb-99/hf-inference-benchmark.git
cd hf-inference-benchmark
pip install -e .
Basic Usage
# Run on CPU/GPU (Auto-detected)
hf-bench facebook/opt-125m
# Export results for the Reporting Suite
hf-bench gpt2 --tokens 64 --out results/gpt2_perf.json
Persisting Results
hf-bench facebook/opt-125m --tokens 64 --out results/opt125m.json
Example:
{
  "model": "facebook/opt-125m",
  "throughput": 54.36,
  "latency_p50": 878.83,
  "memory_mb": 797.56
}
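The exported JSON is straightforward to consume downstream; a minimal sketch (`summarize` is a hypothetical helper, not part of the package):

```python
import json

def summarize(path):
    # Load an hf-bench result file and format a one-line summary.
    with open(path) as f:
        r = json.load(f)
    return (f"{r['model']}: {r['throughput']:.1f} tok/s, "
            f"p50 {r['latency_p50']:.0f} ms, peak {r['memory_mb']:.0f} MB")
```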
Reproducible Benchmarking
For fair comparison:
• Fix prompt
• Fix token count
• Fix device
• Use warmup runs
• Compare structured JSON outputs
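The last step, comparing structured JSON outputs, can be sketched as follows; `compare_runs` and the 5% threshold are illustrative, not part of the CLI:

```python
import json

def compare_runs(baseline_path, candidate_path, regression_pct=5.0):
    # Compare two hf-bench JSON exports and flag a throughput regression.
    with open(baseline_path) as f:
        base = json.load(f)
    with open(candidate_path) as f:
        cand = json.load(f)
    delta = 100.0 * (cand["throughput"] - base["throughput"]) / base["throughput"]
    return {"throughput_delta_pct": round(delta, 2),
            "regression": delta < -regression_pct}
```

Because both runs fix the prompt, token count, and device, a throughput delta beyond the threshold points at a real change rather than measurement noise.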
Platform Integration
This tool is part of the unified NLP infrastructure platform:
nlp-tool benchmark facebook/opt-125m
nlp-tool report results/opt125m.json
Roadmap
• Batch-size profiling
• Streaming generation benchmarks
• Multi-GPU scaling
• Energy-cost estimation
• CI-based regression tracking
Download files
File details
Details for the file hf_inference_benchmark-0.1.4.tar.gz.
File metadata
- Download URL: hf_inference_benchmark-0.1.4.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a4aa6df006af05ae95e5e3ec24ca938a30756dc53cacc3d42f26d75f2fcdf19 |
| MD5 | cd7f546a71cad3985fc372d6b116a8f4 |
| BLAKE2b-256 | 16b909e34412e50566c355e5ec2e2f4050fdecedfa2f3036878949e35ad88424 |
File details
Details for the file hf_inference_benchmark-0.1.4-py3-none-any.whl.
File metadata
- Download URL: hf_inference_benchmark-0.1.4-py3-none-any.whl
- Upload date:
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b363243566460ad37cbdde74386d5bd9e9ecf7762a36554c2b6b0e2a750ac17 |
| MD5 | 2c03f03992ea9424f71b4cf49d6811b6 |
| BLAKE2b-256 | e2fbe6d9cbf9d4ac051d8097c7503d1584c956a01d72298dc49a8bde8a2be2e4 |
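To check a downloaded artifact against the digests above, a small stdlib sketch (`sha256_of` is illustrative; paste the expected digest from the matching table):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large artifacts need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Compare the return value against the SHA256 entry for the file you downloaded before installing it.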