NVIDIA Nemotron Speech Streaming ASR on Apple Silicon via MLX
nemotron-asr-mlx
NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.
93 minutes of audio transcribed in under a minute on an M-series Mac. No GPU drivers, no CUDA, no Docker. Just pip install and go.
This is a native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding. State lives in fixed-size ring buffers so latency stays flat no matter how long you talk.
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+
- ffmpeg installed and on PATH (for audio loading)
Install
pip install nemotron-asr-mlx
Model weights (~1.2 GB) download automatically on first run from HuggingFace.
Quick Start
from nemotron_asr_mlx import from_pretrained
model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text) # full transcription string
print(result.tokens) # list of BPE token IDs
# Optional: beam search for maximum accuracy (slower)
result = model.transcribe("meeting.wav", beam_size=4)
# Maximum accuracy: beam search + ILM subtraction
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)
transcribe() accepts a file path (any format ffmpeg supports: wav, mp3, flac, m4a, ogg, opus, webm, mp4, etc.) or a numpy array of float32 PCM samples at 16 kHz.
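As noted above, transcribe() also accepts an in-memory buffer. A minimal sketch of building a compatible array (the model call is commented out and assumes a model loaded as in the Quick Start):

```python
import numpy as np

# One second of float32 PCM at 16 kHz (a quiet 440 Hz sine) --
# the in-memory format transcribe() accepts per the description above.
sr = 16_000
t = np.arange(sr) / sr
samples = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# With a loaded model (see Quick Start), the array is passed directly:
# result = model.transcribe(samples)
```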
It returns a StreamEvent with these fields:
| Field | Type | Description |
|---|---|---|
| text | str | Full transcription text |
| text_delta | str | New text (same as text in batch mode) |
| tokens | list[int] | BPE token IDs |
| is_final | bool | Always True in batch mode |
CLI
nemotron-asr transcribe meeting.wav # transcribe a file
nemotron-asr transcribe recording.mp3 # any format ffmpeg supports
nemotron-asr transcribe meeting.wav --beam-size 4 # beam search (slower, lower WER)
nemotron-asr transcribe meeting.wav --beam-size 4 --ilm-scale 0.15 # + ILM subtraction
nemotron-asr listen # stream from microphone
Benchmark
Official WER (Open ASR Leaderboard datasets)
Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.
| Dataset | WER | NVIDIA ref | RTFx |
|---|---|---|---|
| LibriSpeech test-clean | 2.70% | 2.31% | 112x |
| LibriSpeech test-other | 5.57% | 4.75% | — |
| TED-LIUM v3 | 6.25% | 4.50% | — |
NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.
v0.2.0 improvements: mel frontend parity fixes (periodic Hann window, center-padded STFT) and a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and raised speed from 76x to 112x realtime.
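The "periodic Hann window" fix above refers to the DFT-even window variant used for STFT analysis, which NumPy's np.hanning does not produce directly. A standalone illustration of the difference (not the package's actual frontend code):

```python
import numpy as np

n = 512

# Symmetric Hann (np.hanning): both endpoints are zero; standard for filter design.
symmetric = np.hanning(n)

# Periodic Hann: take n samples of the (n+1)-point symmetric window, dropping
# the duplicated endpoint. This matches torch.hann_window(n, periodic=True),
# the variant conventionally used for STFT analysis frames.
periodic = np.hanning(n + 1)[:-1]
```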
Run the evaluation yourself:
pip install datasets jiwer torchcodec
python eval_wer.py librispeech-clean librispeech-other tedlium
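The WER figures above are word-level edit distance divided by reference length. eval_wer.py uses jiwer for this, but the metric itself can be sketched in pure Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```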
Speed benchmark
| Content | Duration | Inference | Speed | Tokens |
|---|---|---|---|---|
| Short conversation | 5s | 0.09s | 55x RT | 35 |
| Technical explainer | 98s | 1.04s | 95x RT | 474 |
| Audiobook excerpt | 9s | 0.15s | 58x RT | 57 |
| Long-form analysis | 25.6 min | 17.0s | 91x RT | 10,572 |
| Lecture recording | 36.1 min | 23.5s | 92x RT | 14,688 |
| Meeting recording | 29.4 min | 17.6s | 101x RT | 7,796 |
| Total | 93.0 min | 59.3s | 94x RT | 33,622 |
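The Speed column is the realtime factor: audio duration divided by inference time. Checking the Total row:

```python
# Realtime factor = audio duration / inference time.
audio_seconds = 93.0 * 60       # 93.0 minutes of audio
inference_seconds = 59.3
rtf = audio_seconds / inference_seconds
print(f"{rtf:.0f}x realtime")   # ~94x, matching the Total row
```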
618.5M parameters. 3.4 GB peak GPU memory. 112x realtime on M4 Max. Model loads in 0.1s after first download.
python benchmark.py /path/to/audio/files
Why this exists
Most "streaming" ASR on Mac is either (a) Whisper with overlapping windows reprocessing the same audio over and over, or (b) cloud APIs adding network latency to every utterance. Nemotron's cache-aware conformer is architecturally different:
- Each frame processed once — state carried forward in fixed-size ring buffers, not recomputed
- Constant memory — no growing KV caches, no memory spikes on long recordings
- Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
- 112x realtime — an hour of audio in 32 seconds
Architecture
FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows [[70,13], [70,6], [70,1], [70,0]] for progressive causal restriction. Greedy decoding with blank-frame skipping (batched joint network evaluation skips ~90% of silent frames). Optional beam search with n-gram LM shallow fusion and ILM (Internal Language Model) subtraction.
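The constant-memory property described above can be illustrated with a toy fixed-size cache holding the last N frames of left context. This is a sketch of the idea only; the real encoder keeps per-layer key/value caches, not raw frames:

```python
from collections import deque

class LeftContextCache:
    """Toy fixed-size cache: retains only the most recent `size` frames."""

    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # old frames drop off automatically

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        return list(self.frames)

cache = LeftContextCache(size=70)  # matches the left context of 70 above
for i in range(1000):              # memory stays flat however long the stream runs
    cache.push(i)
print(len(cache.context()))        # 70
print(cache.context()[0])          # 930 (oldest retained frame)
```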
Based on Cache-aware Streaming Conformer and the NeMo toolkit.
Live Demo
A browser-based demo with live mic transcription. Mic is captured in the terminal via sounddevice; the browser displays the transcript.
pip install websockets sounddevice
python demo/server.py
Open http://localhost:8765 and click Record.
Weight Conversion
If you have a .nemo checkpoint and want to convert it yourself:
pip install torch safetensors pyyaml # conversion deps only
nemotron-asr convert model.nemo ./output_dir
Produces config.json + model.safetensors. Conversion deps are not needed for inference.
Dependencies
Deliberately minimal:
- mlx — Apple's ML framework
- huggingface-hub — model download
- numpy — mel spectrogram
- librosa — mel filterbank (optional, improves accuracy)
- sounddevice — mic access (for live streaming)
- websockets — live demo server (optional)
- typer — CLI
Links
- PyPI: nemotron-asr-mlx
- HuggingFace: dboris/nemotron-asr-mlx
- Original model: nvidia/nemotron-asr-speech-streaming-en-0.6b
License
Apache 2.0