
HF Inference Benchmark


Production-grade benchmarking infrastructure for HuggingFace inference workloads.

hf-inference-benchmark is a reproducible, device-agnostic benchmarking system that measures the real operational cost of running LLMs (latency, throughput, and memory) under production-like conditions.

It answers one critical question:

“Will this model crash my server — and how fast can it actually run?”


Why This Tool Exists

Most public HuggingFace benchmarking scripts:

• Measure only a single forward pass
• Ignore warmup behavior
• Ignore GPU synchronization
• Ignore memory allocator behavior
• Produce misleading results

This tool implements the same benchmarking discipline used by real ML infrastructure teams.
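As an illustration of that discipline, here is a minimal, hypothetical timing harness (a sketch, not the tool's actual implementation). It runs untimed warmup passes first and accepts an optional device-sync callable, such as torch.cuda.synchronize, so that queued GPU kernels are included in each measurement:

```python
import time
import statistics

def benchmark(fn, warmup=3, runs=10, sync=None):
    """Time a callable with warmup passes and optional device sync.

    sync: a no-arg callable (e.g. torch.cuda.synchronize) invoked
    before each timestamp so in-flight GPU work is fully counted.
    """
    for _ in range(warmup):
        fn()  # untimed passes: stabilize kernels, caches, allocator
    latencies = []
    for _ in range(runs):
        if sync:
            sync()
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    return {
        "latency_p50": statistics.median(latencies),
        "latency_avg": statistics.fmean(latencies),
    }
```

On CPU the sync argument is simply omitted; on CUDA, omitting it would time only kernel launch, not execution — exactly the mistake listed above.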


Benchmarking Pipeline

Model Load
   ↓
Warmup Passes
   ↓
Synchronized Execution
   ↓
Latency Profiling (P50 / P95 / Avg)
   ↓
Token Counting
   ↓
Throughput Calculation (tokens/sec)
   ↓
Peak Memory Tracking
   ↓
Structured JSON Export
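The latency-profiling and throughput stages above can be sketched as follows (a simplified illustration of the arithmetic; the tool's internals may differ):

```python
def compute_metrics(latencies_ms, new_tokens_per_run):
    """Summarize per-run latencies (ms) into P50/P95/Avg and tokens/sec."""
    lat = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile over the sorted latencies
        idx = min(len(lat) - 1, int(round(p / 100 * (len(lat) - 1))))
        return lat[idx]

    total_s = sum(lat) / 1000.0
    return {
        "latency_p50": pct(50),
        "latency_p95": pct(95),
        "latency_avg": sum(lat) / len(lat),
        "throughput": (new_tokens_per_run * len(lat)) / total_s,
    }
```

Note that throughput is derived from total wall-clock generation time across all timed runs, not from a single pass.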

Metric Definitions

Metric            Meaning
latency_p50       Median inference latency
latency_p95       95th-percentile inference latency
throughput        Real generation speed in tokens/sec
memory_mb         Peak RAM/VRAM usage in MB
warmup            Untimed passes run first so kernels and caches stabilize
synchronization   Device sync around each timing point for accurate GPU measurements
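For the memory_mb metric, peak CPU usage can be captured with the standard-library tracemalloc module (GPU runs would typically use torch.cuda.max_memory_allocated() instead). A minimal sketch, not the tool's actual mechanism:

```python
import tracemalloc

def peak_memory_mb(fn):
    """Run fn and return the peak Python heap usage in MB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return peak / (1024 * 1024)
```

tracemalloc only sees allocations made through Python's allocator, so native-library buffers (e.g. framework tensors) need the framework's own memory counters.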

User Installation

# From PyPI
pip install hf-inference-benchmark

# From Source (for Developers)
git clone https://github.com/rgb-99/hf-inference-benchmark.git
cd hf-inference-benchmark
pip install -e .

Basic Usage

# Run on CPU/GPU (Auto-detected)
hf-bench facebook/opt-125m

# Export results for the Reporting Suite
hf-bench gpt2 --tokens 64 --out results/gpt2_perf.json

Persisting Results

hf-bench facebook/opt-125m --tokens 64 --out results/opt125m.json

Example output:

{
  "model": "facebook/opt-125m",
  "throughput": 54.36,
  "latency_p50": 878.83,
  "memory_mb": 797.56
}

Reproducible Benchmarking

For fair comparison:

• Fix prompt
• Fix token count
• Fix device
• Use warmup runs
• Compare structured JSON outputs
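Comparing two structured JSON outputs can be as simple as diffing their shared metrics. A hypothetical helper (the field names follow the example output above):

```python
import json

METRICS = ("throughput", "latency_p50", "memory_mb")

def load_result(path):
    """Load one exported benchmark result from its JSON file."""
    with open(path) as f:
        return json.load(f)

def compare(baseline, candidate):
    """Return candidate-minus-baseline deltas for the shared metrics."""
    return {k: round(candidate[k] - baseline[k], 2) for k in METRICS}
```

A positive throughput delta with a negative latency_p50 delta indicates the candidate run is strictly faster, provided prompt, token count, and device were held fixed as listed above.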


Platform Integration

This tool is part of the unified NLP infrastructure platform:

nlp-tool benchmark facebook/opt-125m
nlp-tool report results/opt125m.json

Roadmap

• Batch-size profiling
• Streaming generation benchmarks
• Multi-GPU scaling
• Energy-cost estimation
• CI-based regression tracking

