Production-grade HuggingFace inference benchmarking tool
HF Inference Benchmark
Production-grade benchmarking infrastructure for HuggingFace inference workloads.
hf-inference-benchmark is a reproducible, device-agnostic benchmarking system that measures the real operational cost of running LLMs — latency, throughput, and memory — under production-like conditions.
It answers one critical question:
“Will this model crash my server — and how fast can it actually run?”
Why This Tool Exists
Most public HuggingFace benchmarking scripts:
• Measure only a single forward pass
• Ignore warmup behavior
• Ignore GPU synchronization
• Ignore memory allocator behavior
• Produce misleading results
This tool implements the same benchmarking discipline used by real ML infrastructure teams.
Benchmarking Pipeline
Model Load
↓
Warmup Passes
↓
Synchronized Execution
↓
Latency Profiling (P50 / P95 / Avg)
↓
Token Counting
↓
Throughput Calculation (tokens/sec)
↓
Peak Memory Tracking
↓
Structured JSON Export
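The pipeline above can be sketched in plain Python. `benchmark` and `run_model` are illustrative names, not the package's API; `run_model` stands in for the actual HuggingFace generation call, and on GPU you would call `torch.cuda.synchronize()` before reading each timestamp and use `torch.cuda.max_memory_allocated()` instead of `tracemalloc`:

```python
import statistics
import time
import tracemalloc

def benchmark(run_model, warmup=3, iters=10):
    """Measure latency, throughput, and peak memory for run_model()."""
    # Warmup passes: let kernels, caches, and allocators stabilize first
    for _ in range(warmup):
        run_model()

    tracemalloc.start()  # peak-memory tracking (CPU-side sketch)
    latencies_ms = []
    total_tokens = 0
    for _ in range(iters):
        start = time.perf_counter()   # on GPU: synchronize before timing
        tokens = run_model()
        latencies_ms.append((time.perf_counter() - start) * 1000)
        total_tokens += len(tokens)   # token counting
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    lat = sorted(latencies_ms)
    return {
        "latency_p50": statistics.median(lat),
        "latency_p95": lat[int(0.95 * (len(lat) - 1))],
        "throughput": total_tokens / (sum(latencies_ms) / 1000.0),
        "memory_mb": peak_bytes / 1e6,
    }
```

The final stage is then just serializing this dict with `json.dumps` and writing it to the `--out` path.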
Metric Definitions
| Metric | Meaning |
|---|---|
| latency_p50 | Median inference latency (ms) |
| throughput | Real generation speed (tokens/sec) |
| memory_mb | Peak RAM/VRAM usage (MB) |
| warmup | Untimed passes run first so kernels and caches stabilize |
| synchronization | Device synchronization around timers, for accurate GPU timing |
User Installation
# From PyPI
pip install hf-inference-benchmark
# From Source (for Developers)
git clone https://github.com/rgb-99/hf-inference-benchmark.git
cd hf-inference-benchmark
pip install -e .
Basic Usage
# Run on CPU/GPU (Auto-detected)
hf-bench facebook/opt-125m
# Export results for the Reporting Suite
hf-bench gpt2 --tokens 64 --out results/gpt2_perf.json
Persisting Results
hf-bench facebook/opt-125m --tokens 64 --out results/opt125m.json
Example:
{
  "model": "facebook/opt-125m",
  "throughput": 54.36,
  "latency_p50": 878.83,
  "memory_mb": 797.56
}
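The exported JSON is straightforward to consume downstream; a minimal sketch (`summarize` is a hypothetical helper, not part of the package):

```python
import json

def summarize(path):
    # Load an hf-bench result file and format a one-line summary.
    with open(path) as f:
        r = json.load(f)
    return (f"{r['model']}: {r['throughput']:.1f} tok/s, "
            f"p50 {r['latency_p50']:.0f} ms, peak {r['memory_mb']:.0f} MB")
```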
Reproducible Benchmarking
For fair comparison:
• Fix prompt
• Fix token count
• Fix device
• Use warmup runs
• Compare structured JSON outputs
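The last step, comparing structured JSON outputs, can be sketched as follows; `compare_runs` and the 5% threshold are illustrative, not part of the CLI:

```python
import json

def compare_runs(baseline_path, candidate_path, regression_pct=5.0):
    # Compare two hf-bench JSON exports and flag a throughput regression.
    with open(baseline_path) as f:
        base = json.load(f)
    with open(candidate_path) as f:
        cand = json.load(f)
    delta = 100.0 * (cand["throughput"] - base["throughput"]) / base["throughput"]
    return {"throughput_delta_pct": round(delta, 2),
            "regression": delta < -regression_pct}
```

Because both runs fix the prompt, token count, and device, a throughput delta beyond the threshold points at a real change rather than measurement noise.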
Platform Integration
This tool is part of the unified NLP infrastructure platform:
nlp-tool benchmark facebook/opt-125m
nlp-tool report results/opt125m.json
Roadmap
• Batch-size profiling
• Streaming generation benchmarks
• Multi-GPU scaling
• Energy-cost estimation
• CI-based regression tracking
Download files
File details
Details for the file hf_inference_benchmark-0.1.4.tar.gz.
File metadata
- Download URL: hf_inference_benchmark-0.1.4.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a4aa6df006af05ae95e5e3ec24ca938a30756dc53cacc3d42f26d75f2fcdf19 |
| MD5 | cd7f546a71cad3985fc372d6b116a8f4 |
| BLAKE2b-256 | 16b909e34412e50566c355e5ec2e2f4050fdecedfa2f3036878949e35ad88424 |
File details
Details for the file hf_inference_benchmark-0.1.4-py3-none-any.whl.
File metadata
- Download URL: hf_inference_benchmark-0.1.4-py3-none-any.whl
- Upload date:
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4b363243566460ad37cbdde74386d5bd9e9ecf7762a36554c2b6b0e2a750ac17 |
| MD5 | 2c03f03992ea9424f71b4cf49d6811b6 |
| BLAKE2b-256 | e2fbe6d9cbf9d4ac051d8097c7503d1584c956a01d72298dc49a8bde8a2be2e4 |
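To check a downloaded artifact against the digests above, a small stdlib sketch (`sha256_of` is illustrative; paste the expected digest from the matching table):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large artifacts need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Compare the return value against the SHA256 entry for the file you downloaded before installing it.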