Hopper-native CUDA kernels for Whisper large-v3 on H100 GPU

whisper-blaze

Hopper-native CUDA kernels for Whisper large-v3 on NVIDIA H100 GPUs.

Replaces standard PyTorch operations with hand-tuned CUDA kernels that exploit H100-specific hardware:

  • WGMMA (Warpgroup MMA) GEMM in FP16 and FP8 (E4M3 / E5M2)
  • TMA (Tensor Memory Accelerator) async bulk copy
  • Flash Attention 3 for encoder self-attention, decoder self/cross-attention
  • Fused residual + LayerNorm / RMSNorm
  • GPU mel spectrogram (replaces CPU librosa/HuggingFace preprocessor)
  • FP8 quantize/dequantize with per-tensor scaling
  • Dynamic cross-request batching — fuses concurrent API calls into one model.generate() pass to fill all 80 GB of H100 VRAM

Requirements

Component      Version
GPU            NVIDIA H100 (Hopper, SM90)
CUDA toolkit   12.2+ (12.6 recommended)
PyTorch        2.1.0+ with matching CUDA
Python         3.9+
OS             Linux x86_64
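
The requirements above can be checked before building. The following is a minimal sketch, not part of whisper-blaze: check_requirements is a hypothetical helper, and it relies only on torch.cuda.get_device_capability() and torch.version.cuda. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which is not necessarily the toolkit that nvcc belongs to.

```python
import sys

def check_requirements(compute_capability, cuda_version, python_version=sys.version_info[:2]):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration, not part of whisper-blaze.
    """
    problems = []
    if tuple(compute_capability) != (9, 0):
        problems.append(f"GPU compute capability {tuple(compute_capability)} is not SM90 (Hopper)")
    if cuda_version is None or tuple(int(p) for p in cuda_version.split(".")[:2]) < (12, 2):
        problems.append(f"CUDA {cuda_version} is older than 12.2")
    if tuple(python_version) < (3, 9):
        problems.append(f"Python {tuple(python_version)} is older than 3.9")
    return problems

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if torch.cuda.is_available():
            # torch.version.cuda is the CUDA version PyTorch was compiled with
            found = check_requirements(torch.cuda.get_device_capability(), torch.version.cuda)
            print(found or "OK")
        else:
            print("No CUDA device visible")
```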

Installation

Step 1 — Install PyTorch with CUDA support (if you haven't already):

pip install torch --index-url https://download.pytorch.org/whl/cu124

Step 2 — Install whisper-blaze:

pip install whisper-blaze --no-build-isolation

--no-build-isolation is required — it tells pip to use your existing PyTorch instead of fetching it into an isolated build environment.

From source:

git clone https://github.com/YOUR_USERNAME/whisper-blaze.git
cd whisper-blaze
pip install -e . --no-build-isolation

If your CUDA toolkit isn't at /usr/local/cuda, set CUDA_HOME first:

export CUDA_HOME=/usr/local/cuda-12.6

Quick Start

from whisper_blaze import WhisperBlaze
from whisper_blaze.precision import mixed_fp8

model = WhisperBlaze.from_pretrained(
    "openai/whisper-large-v3",
    precision=mixed_fp8(),
)

# Single input: NumPy array or torch tensor, float32, sampled at 16 kHz.
# Both 1D [samples] and 2D [channels, samples] shapes are accepted.
# `audio` is assumed to be loaded already.
result = model.transcribe(audio, language="en")
print(result["text"])

Batch Transcription

transcribe_batch() accepts multiple audio files and fuses all their 30-second chunks into a single model.generate() call, maximising VRAM utilisation on an 80 GB H100.

# results is a list of dicts, one per input audio
results = model.transcribe_batch(
    [audio1, audio2, audio3],
    language="en",
    task="transcribe",
)
for r in results:
    print(r["text"])

Why it matters: a single 15-minute file uses ~40 GB of VRAM, which leaves roughly half of an 80 GB H100 idle. With transcribe_batch() you can process a second 15-minute file in the same GPU pass, for ~78 GB in total.

VRAM usage guide

Batch contents      Approx. VRAM   Notes
1 × 15-min audio    ~40 GB         baseline
2 × 15-min audio    ~78 GB         safe on 80 GB H100
4 × 5-min audio     ~45 GB         short calls batch very efficiently
8 × 2-min audio     ~40 GB         high-concurrency call-centre workload

Rule of thumb: ~1.15 GB per 30-second chunk plus ~6 GB for model weights.
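
The rule of thumb can be turned into a quick pre-flight estimate. This is a sketch under stated assumptions: estimate_vram_gb and fits are hypothetical helpers, not whisper-blaze API, and each input is assumed to be split into whole 30-second chunks with the last partial chunk rounded up. The rule is coarse, so its results can differ from the figures in the table above by several GB.

```python
import math

GB_PER_CHUNK = 1.15   # per 30-second chunk, from the rule of thumb
GB_WEIGHTS = 6.0      # model weights, from the rule of thumb
CHUNK_SECONDS = 30

def estimate_vram_gb(durations_s):
    """Estimate VRAM for a batch, given each audio duration in seconds."""
    chunks = sum(math.ceil(d / CHUNK_SECONDS) for d in durations_s)
    return GB_WEIGHTS + GB_PER_CHUNK * chunks

def fits(durations_s, budget_gb=80.0):
    """True if the estimated batch fits in the given VRAM budget."""
    return estimate_vram_gb(durations_s) <= budget_gb

if __name__ == "__main__":
    for batch in ([900], [900, 900], [900, 900, 900]):
        print(batch, round(estimate_vram_gb(batch), 1), "GB, fits:", fits(batch))
```

Leaving a few GB of headroom below the 80 GB budget is prudent, since the estimate ignores allocator fragmentation and temporary buffers.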

Precision Presets

Preset             When to use
full_fp16()        Maximum quality, no quantization
mixed_fp8()        Recommended: FP8 on FFN/QKV, FP16 on attention
aggressive_fp8()   Maximum throughput, FP8 everywhere

from whisper_blaze import WhisperBlaze
from whisper_blaze.precision import full_fp16, mixed_fp8, aggressive_fp8

model = WhisperBlaze.from_pretrained(
    "openai/whisper-large-v3",
    precision=aggressive_fp8(),
)

Serving at Scale

For production deployments, pair whisper-blaze with a dynamic-batching API server that holds a pool of concurrent requests in flight and automatically groups them into GPU batches:

Client pool (10 concurrent)
        │
        ▼
  FastAPI server              ← collect requests for 400 ms
        │
        ▼
  transcribe_batch()          ← one model.generate() for the whole batch
        │
        ▼
  Results returned individually
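
The collection step in the diagram can be sketched with asyncio alone. This is an illustration, not the server shipped with or recommended by whisper-blaze: BatchCollector, its max_batch cap, and fake_transcribe_batch are hypothetical names, and the batch function stands in for model.transcribe_batch(). The first request opens a collection window, later arrivals join until the window closes, and each caller gets its own result back.

```python
import asyncio

class BatchCollector:
    """Groups requests arriving within `window_s` into one batch call."""

    def __init__(self, batch_fn, window_s=0.4, max_batch=8):
        self.batch_fn = batch_fn      # e.g. model.transcribe_batch
        self.window_s = window_s      # 400 ms window, as in the diagram
        self.max_batch = max_batch    # hypothetical cap, tune to your VRAM budget
        self.queue = asyncio.Queue()

    async def submit(self, audio):
        """Called once per request; resolves with that request's result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((audio, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]       # wait for the first request
            deadline = loop.time() + self.window_s
            while len(items) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            audios = [audio for audio, _ in items]
            try:
                # Run the blocking GPU call in a thread so the loop stays responsive
                results = await loop.run_in_executor(None, self.batch_fn, audios)
            except Exception as exc:
                for _, fut in items:
                    fut.set_exception(exc)
                continue
            for (_, fut), res in zip(items, results):
                fut.set_result(res)

def fake_transcribe_batch(audios):
    # Stand-in for model.transcribe_batch(): one dict per input, in order
    return [{"text": f"clip-{a}"} for a in audios]

async def demo():
    collector = BatchCollector(fake_transcribe_batch, window_s=0.05)
    worker = asyncio.create_task(collector.run())
    results = await asyncio.gather(*(collector.submit(i) for i in range(3)))
    worker.cancel()
    return results

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a FastAPI handler, each request would await collector.submit(audio), with collector.run() started once at application startup. A longer window yields larger batches at the cost of added latency for the first request in each batch.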

Throughput on a single H100 (80 GB) with 10-minute audio:

Mode                               Calls/day
Single request, serialised         ~30,000
Dynamic batching (2 concurrent)    ~60,000
Dynamic batching (mixed lengths)   up to ~90,000
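
To relate these figures to latency and real-time speed, the calls/day numbers can be converted as follows. This is plain arithmetic on the table values, assuming the GPU is busy for the full 24 hours; the helper names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_call_seconds(calls_per_day):
    """Effective GPU time available per call at the given daily throughput."""
    return SECONDS_PER_DAY / calls_per_day

def realtime_factor(calls_per_day, audio_minutes=10):
    """Seconds of audio transcribed per second of wall-clock time."""
    return calls_per_day * audio_minutes * 60 / SECONDS_PER_DAY

if __name__ == "__main__":
    for calls in (30_000, 60_000, 90_000):
        print(calls, round(per_call_seconds(calls), 2), "s/call,",
              round(realtime_factor(calls)), "x real time")
```

For example, ~30,000 calls/day of 10-minute audio corresponds to about 2.88 s of GPU time per call.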

GPU Mel Spectrogram

from whisper_blaze import WhisperBlazeProcessor

proc = WhisperBlazeProcessor(device="cuda")
mel = proc(audio_tensor, sampling_rate=16000)   # [1, 128, T] fp16 on GPU

# Long audio with overlapping chunks
mels = proc.process_chunks(long_audio, sampling_rate=16000, overlap_s=1.0)
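
For intuition, overlapping chunk boundaries can be computed as in the sketch below. It assumes 30-second windows (the chunk length used elsewhere in this document) and a stride of chunk length minus overlap. chunk_bounds is a hypothetical helper; how process_chunks actually splits, pads, or handles the final partial chunk is not documented here.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_s=30.0, overlap_s=1.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sampling_rate)
    stride = chunk - int(overlap_s * sampling_rate)
    if stride <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds, start = [], 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:   # last chunk reached the end of the audio
            break
        start += stride
    return bounds

if __name__ == "__main__":
    # 75 seconds of 16 kHz audio yields three chunks, the last one shorter
    print(chunk_bounds(75 * 16000))
```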

Direct Kernel API

import torch
import whisper_blaze_kernels as k

# FP8 quantize / dequantize
x = torch.randn(512, 512, dtype=torch.float16, device="cuda")
fp8, scale = k.quantise_e4m3(x)
x_back = k.dequantise_e4m3(fp8, scale, [512, 512])

# Fused residual + LayerNorm
out = k.layernorm_fused(hidden, residual, gamma, beta, 1e-5)

# Fused RMSNorm
out = k.rmsnorm_fused(hidden, residual, gamma, 1e-5)

# Flash Attention 3
out = k.encoder_self_attn(Q, K, V)    # no causal mask
out = k.decoder_self_attn(Q, K, V)    # causal mask
out = k.decoder_cross_attn(Q, K, V)   # no causal mask

# GPU mel spectrogram
mel = k.mel_spectrogram(audio_cpu_float32)  # → [1, 128, T] fp16 on GPU
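
As a reference for what per-tensor FP8 scaling does numerically, here is a pure-Python sketch of E4M3 quantization. It is not the kernel's implementation. It assumes the scale is the tensor's maximum magnitude divided by 448 (the largest finite E4M3 value) and that values are rounded to nearest-even with saturation; the convention used by k.quantise_e4m3 may differ. All function names below are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 (e4m3fn) format

def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    a = min(abs(v), E4M3_MAX)
    # frexp returns a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1
    exp = max(math.frexp(a)[1] - 1, -6)   # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)               # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantise_per_tensor(values):
    """Return (quantized values, scale), with one scale for the whole tensor."""
    amax = max((abs(v) for v in values), default=0.0)
    scale = amax / E4M3_MAX if amax > 0 else 1.0   # assumed convention
    return [round_to_e4m3(v / scale) for v in values], scale

def dequantise_per_tensor(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]

if __name__ == "__main__":
    x = [0.5, -2.0, 3.5, 7.0]
    q, scale = quantise_per_tensor(x)
    print(q, scale, dequantise_per_tensor(q, scale))
```

With only 3 mantissa bits, rounding error for values in the normal range is at most 1/16 of the value, which is why a single outlier that inflates the tensor maximum can push small entries into the coarser subnormal range.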

Troubleshooting

RuntimeError: CUDA version mismatch — Your PyTorch was compiled against a different CUDA version than your system toolkit. Reinstall PyTorch from the correct index:

pip install torch --index-url https://download.pytorch.org/whl/cu124

ninja not found — Install ninja for faster builds:

pip install ninja

nvcc does not support sm_90a — Upgrade your CUDA toolkit to 12.2+. The H100 Hopper architecture requires sm_90a.

License

MIT

Download files

Source Distribution

whisper_blaze-0.1.8.tar.gz (37.1 kB)

File hashes

Hashes for whisper_blaze-0.1.8.tar.gz
Algorithm     Hash digest
SHA256        3c6d05723929f482aced49d42609ea05b7c5c14e3eb92d3899025590827c44c3
MD5           4d0743647dc5976937c6b685ed07bc07
BLAKE2b-256   b8bc12bda093cc34044d1d9e3cbb6a7bf88bf4545b2c415de4369051005efbfa