
NVIDIA Nemotron Speech Streaming ASR on Apple Silicon via MLX

nemotron-asr-mlx

NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.


93 minutes of audio transcribed in under a minute on an M-series Mac. No GPU drivers, no CUDA, no Docker. Just pip install and go.

This is a native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding. State lives in fixed-size ring buffers so latency stays flat no matter how long you talk.
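The fixed-size state idea can be illustrated with a toy ring-buffer cache (not the actual implementation; `cache_len=70` echoes the left-context size in the architecture config, everything else here is made up for illustration):

```python
import numpy as np

class RingCache:
    """Toy fixed-size left-context cache: memory stays constant
    no matter how many frames stream through."""

    def __init__(self, cache_len: int, dim: int):
        self.buf = np.zeros((cache_len, dim), dtype=np.float32)

    def update(self, frames: np.ndarray) -> np.ndarray:
        # Append the new frames, keep only the most recent `cache_len` rows.
        joined = np.concatenate([self.buf, frames], axis=0)
        self.buf = joined[-len(self.buf):]
        return self.buf

cache = RingCache(cache_len=70, dim=4)
for _ in range(1000):                   # stream 1000 chunks of 8 frames
    state = cache.update(np.ones((8, 4), dtype=np.float32))
print(state.shape)  # (70, 4) -- constant regardless of stream length
```

Because the cache never grows, per-chunk latency and memory are flat whether you stream for ten seconds or ten hours.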

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • Python 3.10+
  • ffmpeg installed and on PATH (for audio loading)
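A quick way to verify the ffmpeg requirement before first use (stdlib only; the brew command in the message is just a suggestion):

```python
import shutil

# Audio loading relies on ffmpeg, so it must be discoverable on PATH.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg not found; install it first (e.g. `brew install ffmpeg`)")
else:
    print(f"ffmpeg found: {ffmpeg_path}")
```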

Install

pip install nemotron-asr-mlx

Model weights (~1.2 GB) download automatically from Hugging Face on first run.

Quick Start

from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)    # full transcription string
print(result.tokens)  # list of BPE token IDs

# Optional: beam search for maximum accuracy (slower)
result = model.transcribe("meeting.wav", beam_size=4)

# Maximum accuracy: beam search + ILM subtraction
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)

transcribe() accepts a file path (any format ffmpeg supports: wav, mp3, flac, m4a, ogg, opus, webm, mp4, etc.) or a numpy array of float32 PCM samples at 16 kHz.
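If you already have raw PCM (say, 16-bit samples from a recorder or a decoded WAV), converting it to the float32 array `transcribe()` expects is a one-liner; a sketch with synthetic sample data standing in for real audio:

```python
import numpy as np

SAMPLE_RATE = 16_000  # transcribe() expects 16 kHz mono

# Synthetic stand-in for int16 PCM (one second of a 440 Hz tone).
pcm_int16 = (np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
             * 32767).astype(np.int16)

# Normalize int16 to float32 in [-1, 1], the format the model expects.
audio = pcm_int16.astype(np.float32) / 32768.0

# result = model.transcribe(audio)  # pass the array directly
print(audio.dtype, audio.shape)  # float32 (16000,)
```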

It returns a StreamEvent with these fields:

Field       Type       Description
----------  ---------  -------------------------------------
text        str        Full transcription text
text_delta  str        New text (same as text in batch mode)
tokens      list[int]  BPE token IDs
is_final    bool       Always True in batch mode
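In batch mode the result behaves like a plain record; a minimal stand-in with the documented fields (the real class ships in the package, this mirror is purely illustrative, as are the sample values):

```python
from dataclasses import dataclass

@dataclass
class StreamEventMirror:
    """Illustrative stand-in for the package's StreamEvent result."""
    text: str
    text_delta: str
    tokens: list[int]
    is_final: bool

event = StreamEventMirror(text="hello world", text_delta="hello world",
                          tokens=[17, 482], is_final=True)
print(event.text, event.is_final)  # hello world True
```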

CLI

nemotron-asr transcribe meeting.wav                 # transcribe a file
nemotron-asr transcribe recording.mp3               # any format ffmpeg supports
nemotron-asr transcribe meeting.wav --beam-size 4   # beam search (slower, lower WER)
nemotron-asr transcribe meeting.wav --beam-size 4 --ilm-scale 0.15  # + ILM subtraction
nemotron-asr listen                                 # stream from microphone

Benchmark

Official WER (Open ASR Leaderboard datasets)

Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.

Dataset                 WER      NVIDIA ref   RTFx
LibriSpeech test-clean  2.70%    2.31%        112x
LibriSpeech test-other  5.57%    4.75%
TED-LIUM v3             6.25%    4.50%

NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.

v0.2.0 improvements: mel-frontend parity fixes (periodic Hann window, center-padded STFT) plus a blank-frame-skipping decoder cut WER from 2.79% to 2.70% and raised speed from 76x to 112x realtime.

Run the evaluation yourself:

pip install datasets jiwer torchcodec
python eval_wer.py librispeech-clean librispeech-other tedlium

Speed benchmark

Content              Duration   Inference   Speed    Tokens
Short conversation   5s         0.09s       55x RT   35
Technical explainer  98s        1.04s       95x RT   474
Audiobook excerpt    9s         0.15s       58x RT   57
Long-form analysis   25.6 min   17.0s       91x RT   10,572
Lecture recording    36.1 min   23.5s       92x RT   14,688
Meeting recording    29.4 min   17.6s       101x RT  7,796
Total                93.0 min   59.3s       94x RT   33,622
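RTFx here is simply audio duration divided by inference time; the Total row checks out:

```python
# RTFx = audio duration / inference time (higher = faster than realtime).
audio_seconds = 93.0 * 60    # 93.0 min total from the table
infer_seconds = 59.3
rtfx = audio_seconds / infer_seconds
print(f"{rtfx:.0f}x realtime")  # 94x realtime
```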

618.5M parameters. 3.4 GB peak GPU memory. 112x realtime on M4 Max. Model loads in 0.1s after first download.

python benchmark.py /path/to/audio/files

Why this exists

Most "streaming" ASR on Mac is either (a) Whisper with overlapping windows reprocessing the same audio over and over, or (b) cloud APIs adding network latency to every utterance. Nemotron's cache-aware conformer is architecturally different:

  • Each frame processed once — state carried forward in fixed-size ring buffers, not recomputed
  • Constant memory — no growing KV caches, no memory spikes on long recordings
  • Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
  • 112x realtime — an hour of audio in 32 seconds

Architecture

FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows [[70,13], [70,6], [70,1], [70,0]] for progressive causal restriction. Greedy decoding with blank-frame skipping (batched joint network evaluation skips ~90% of silent frames). Optional beam search with n-gram LM shallow fusion and ILM (Internal Language Model) subtraction.
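The blank-frame skipping trick can be sketched on toy data. A real RNNT decoder also advances a prediction network conditioned on previous tokens; this sketch only shows the idea of batching the joint-network argmax and masking out blank (silent) frames in one vectorized pass, everything else is made up:

```python
import numpy as np

BLANK = 0

def greedy_decode(joint_logits):
    """Naive loop: evaluate every frame one at a time."""
    tokens = []
    for frame in joint_logits:
        tok = int(frame.argmax())
        if tok != BLANK:              # blank frames emit nothing
            tokens.append(tok)
    return tokens

def greedy_decode_skip(joint_logits):
    """Same result: batch the argmax over all frames first, then
    drop blank frames in one vectorized mask instead of looping."""
    best = joint_logits.argmax(axis=-1)   # one batched evaluation
    keep = best != BLANK                  # skip blank (silent) frames
    return [int(t) for t in best[keep]]

rng = np.random.default_rng(0)
logits = rng.standard_normal((50, 8)).astype(np.float32)
logits[:, BLANK] += 2.0                   # blanks dominate, like silence
assert greedy_decode(logits) == greedy_decode_skip(logits)
```

When ~90% of frames are blank, skipping them means the expensive per-frame work runs on only the remaining ~10%.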

Based on Cache-aware Streaming Conformer and the NeMo toolkit.
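Schematically, beam search with shallow fusion and ILM subtraction scores each hypothesis like this (a sketch; the function name, scale names, and values are illustrative, with 0.15 matching the `ilm_scale` used in the examples above):

```python
def fused_score(log_p_rnnt: float, log_p_ngram: float, log_p_ilm: float,
                lm_scale: float = 0.5, ilm_scale: float = 0.15) -> float:
    """Schematic hypothesis score: add the external n-gram LM,
    subtract the internal LM so its implicit language bias is
    not counted twice."""
    return log_p_rnnt + lm_scale * log_p_ngram - ilm_scale * log_p_ilm

score = fused_score(-1.0, -2.0, -0.5)  # approximately -1.925
```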

Live Demo

A browser-based demo with live mic transcription. The microphone is captured by the terminal process via sounddevice; the browser displays the transcript.

pip install websockets sounddevice
python demo/server.py

Open http://localhost:8765 and click Record.

Weight Conversion

If you have a .nemo checkpoint and want to convert it yourself:

pip install torch safetensors pyyaml  # conversion deps only
nemotron-asr convert model.nemo ./output_dir

Produces config.json + model.safetensors. Conversion deps are not needed for inference.

Dependencies

Deliberately minimal:

  • mlx — Apple's ML framework
  • huggingface-hub — model download
  • numpy — mel spectrogram
  • librosa — mel filterbank (optional, improves accuracy)
  • sounddevice — mic access (for live streaming)
  • websockets — live demo server (optional)
  • typer — CLI

License

Apache 2.0

Author

Boris Djordjevic / 199 Biotechnologies / @longevityboris
