NVIDIA Nemotron Speech Streaming ASR on Apple Silicon via MLX
nemotron-asr-mlx
NVIDIA Nemotron ASR on Apple Silicon. 112x realtime. Pure MLX.
93 minutes of audio transcribed in under a minute on an M-series Mac. No GPU drivers, no CUDA, no Docker. Just pip install and go.
This is a native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding. State lives in fixed-size ring buffers so latency stays flat no matter how long you talk.
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- Python 3.10+
- ffmpeg installed and on PATH (for audio loading)
Install
pip install nemotron-asr-mlx
Model weights (~1.2 GB) download automatically on first run from HuggingFace.
Quick Start
from nemotron_asr_mlx import from_pretrained
model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text) # full transcription string
print(result.tokens) # list of BPE token IDs
# Optional: beam search for maximum accuracy (slower)
result = model.transcribe("meeting.wav", beam_size=4)
# Maximum accuracy: beam search + ILM subtraction
result = model.transcribe("meeting.wav", beam_size=4, ilm_scale=0.15)
transcribe() accepts a file path (any format ffmpeg supports: wav, mp3, flac, m4a, ogg, opus, webm, mp4, etc.) or a numpy array of float32 PCM samples at 16 kHz.
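As noted above, transcribe() also accepts an in-memory buffer. A minimal sketch of building a compatible array (the model call is commented out and assumes a model loaded as in the Quick Start):

```python
import numpy as np

# One second of float32 PCM at 16 kHz (a quiet 440 Hz sine) --
# the in-memory format transcribe() accepts per the description above.
sr = 16_000
t = np.arange(sr) / sr
samples = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# With a loaded model (see Quick Start), the array is passed directly:
# result = model.transcribe(samples)
```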
It returns a StreamEvent with these fields:
| Field | Type | Description |
|---|---|---|
| text | str | Full transcription text |
| text_delta | str | New text (same as text in batch mode) |
| tokens | list[int] | BPE token IDs |
| is_final | bool | Always True in batch mode |
CLI
nemotron-asr transcribe meeting.wav # transcribe a file
nemotron-asr transcribe recording.mp3 # any format ffmpeg supports
nemotron-asr transcribe meeting.wav --beam-size 4 # beam search (slower, lower WER)
nemotron-asr transcribe meeting.wav --beam-size 4 --ilm-scale 0.15 # + ILM subtraction
nemotron-asr listen # stream from microphone
Benchmark
Official WER (Open ASR Leaderboard datasets)
Evaluated on the standard Open ASR Leaderboard datasets. Machine: Apple M4 Max, 16-core, 64 GB.
| Dataset | WER | NVIDIA ref | RTFx |
|---|---|---|---|
| LibriSpeech test-clean | 2.70% | 2.31% | 112x |
| LibriSpeech test-other | 5.57% | 4.75% | — |
| TED-LIUM v3 | 6.25% | 4.50% | — |
NVIDIA reference numbers are from nemotron-asr-speech-streaming-en-0.6b at 1120ms chunk size (PyTorch, A100 GPU). Our MLX port runs in batch mode on Apple Silicon.
v0.2.0 improvements: mel frontend parity fixes (periodic Hann window, center-padded STFT) and a blank-frame-skipping decoder reduced WER from 2.79% to 2.70% and raised speed from 76x to 112x realtime.
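The "periodic Hann window" fix above refers to the DFT-even window variant used for STFT analysis, which NumPy's np.hanning does not produce directly. A standalone illustration of the difference (not the package's actual frontend code):

```python
import numpy as np

n = 512

# Symmetric Hann (np.hanning): both endpoints are zero; standard for filter design.
symmetric = np.hanning(n)

# Periodic Hann: take n samples of the (n+1)-point symmetric window, dropping
# the duplicated endpoint. This matches torch.hann_window(n, periodic=True),
# the variant conventionally used for STFT analysis frames.
periodic = np.hanning(n + 1)[:-1]
```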
Run the evaluation yourself:
pip install datasets jiwer torchcodec
python eval_wer.py librispeech-clean librispeech-other tedlium
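The WER figures above are word-level edit distance divided by reference length. eval_wer.py uses jiwer for this, but the metric itself can be sketched in pure Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```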
Speed benchmark
| Content | Duration | Inference | Speed | Tokens |
|---|---|---|---|---|
| Short conversation | 5s | 0.09s | 55x RT | 35 |
| Technical explainer | 98s | 1.04s | 95x RT | 474 |
| Audiobook excerpt | 9s | 0.15s | 58x RT | 57 |
| Long-form analysis | 25.6 min | 17.0s | 91x RT | 10,572 |
| Lecture recording | 36.1 min | 23.5s | 92x RT | 14,688 |
| Meeting recording | 29.4 min | 17.6s | 101x RT | 7,796 |
| Total | 93.0 min | 59.3s | 94x RT | 33,622 |
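The Speed column is the realtime factor: audio duration divided by inference time. Checking the Total row:

```python
# Realtime factor = audio duration / inference time.
audio_seconds = 93.0 * 60       # 93.0 minutes of audio
inference_seconds = 59.3
rtf = audio_seconds / inference_seconds
print(f"{rtf:.0f}x realtime")   # ~94x, matching the Total row
```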
618.5M parameters. 3.4 GB peak GPU memory. 112x realtime on M4 Max. Model loads in 0.1s after first download.
python benchmark.py /path/to/audio/files
Why this exists
Most "streaming" ASR on Mac is either (a) Whisper with overlapping windows reprocessing the same audio over and over, or (b) cloud APIs adding network latency to every utterance. Nemotron's cache-aware conformer is architecturally different:
- Each frame processed once — state carried forward in fixed-size ring buffers, not recomputed
- Constant memory — no growing KV caches, no memory spikes on long recordings
- Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
- 112x realtime — an hour of audio in 32 seconds
Architecture
FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows [[70,13], [70,6], [70,1], [70,0]] for progressive causal restriction. Greedy decoding with blank-frame skipping (batched joint network evaluation skips ~90% of silent frames). Optional beam search with n-gram LM shallow fusion and ILM (Internal Language Model) subtraction.
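The constant-memory property described above can be illustrated with a toy fixed-size cache holding the last N frames of left context. This is a sketch of the idea only; the real encoder keeps per-layer key/value caches, not raw frames:

```python
from collections import deque

class LeftContextCache:
    """Toy fixed-size cache: retains only the most recent `size` frames."""

    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # old frames drop off automatically

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        return list(self.frames)

cache = LeftContextCache(size=70)  # matches the left context of 70 above
for i in range(1000):              # memory stays flat however long the stream runs
    cache.push(i)
print(len(cache.context()))        # 70
print(cache.context()[0])          # 930 (oldest retained frame)
```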
Based on Cache-aware Streaming Conformer and the NeMo toolkit.
Live Demo
A browser-based demo with live mic transcription. Mic is captured in the terminal via sounddevice; the browser displays the transcript.
pip install websockets sounddevice
python demo/server.py
Open http://localhost:8765 and click Record.
Weight Conversion
If you have a .nemo checkpoint and want to convert it yourself:
pip install torch safetensors pyyaml # conversion deps only
nemotron-asr convert model.nemo ./output_dir
Produces config.json + model.safetensors. Conversion deps are not needed for inference.
Dependencies
Deliberately minimal:
- mlx — Apple's ML framework
- huggingface-hub — model download
- numpy — mel spectrogram
- librosa — mel filterbank (optional, improves accuracy)
- sounddevice — mic access (for live streaming)
- websockets — live demo server (optional)
- typer — CLI
Links
- PyPI: nemotron-asr-mlx
- HuggingFace: dboris/nemotron-asr-mlx
- Original model: nvidia/nemotron-asr-speech-streaming-en-0.6b
License
Apache 2.0