# nemotron-asr-mlx

NVIDIA Nemotron Speech Streaming ASR on Apple Silicon via MLX. 94x realtime. Pure MLX.
93 minutes of audio transcribed in 59 seconds on an M-series Mac. No GPU drivers, no CUDA, no Docker. Just pip install and go.
This is a native MLX port of NVIDIA's Nemotron-ASR 0.6B — the cache-aware streaming conformer that processes each audio frame exactly once. No sliding windows, no recomputation, no rewinding. State lives in fixed-size ring buffers so latency stays flat no matter how long you talk.
```bash
pip install nemotron-asr-mlx
```

```python
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")
result = model.transcribe("meeting.wav")
print(result.text)
```

That's it. The model downloads on first run (~1.2 GB).
## Benchmark
Tested on Apple Silicon. All times are wall-clock inference only (no I/O).
| Content | Duration | Inference | Speed | Tokens |
|---|---|---|---|---|
| Short conversation | 5s | 0.09s | 55x RT | 35 |
| Technical explainer | 98s | 1.04s | 95x RT | 474 |
| Audiobook excerpt | 9s | 0.15s | 58x RT | 57 |
| Long-form analysis | 25.6 min | 17.0s | 91x RT | 10,572 |
| Lecture recording | 36.1 min | 23.5s | 92x RT | 14,688 |
| Meeting recording | 29.4 min | 17.6s | 101x RT | 7,796 |
| **Total** | **93.0 min** | **59.3s** | **94x RT** | **33,622** |
618.5M parameters. 3.4 GB peak GPU memory. Model loads in 0.1s after first download.
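The realtime factor in the table is just audio duration divided by inference wall-clock time; the totals check out:

```python
# Realtime factor = audio duration / inference time, using the totals row above.
audio_s = 93.0 * 60   # 93.0 minutes of audio, in seconds
infer_s = 59.3        # total inference wall-clock time
rtf = audio_s / infer_s
print(f"{rtf:.0f}x realtime")
```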
Run your own:

```bash
python benchmark.py /path/to/audio/files
```
## Why this exists
Most "streaming" ASR on Mac is either (a) Whisper with overlapping windows reprocessing the same audio over and over, or (b) cloud APIs adding network latency to every utterance. Nemotron's cache-aware conformer is architecturally different:
- Each frame processed once — state carried forward in fixed-size ring buffers, not recomputed
- Constant memory — no growing KV caches, no memory spikes on long recordings
- Native Metal — no PyTorch, no ONNX, no bridge layers. Direct MLX on Apple GPU
- 94x realtime — an hour of audio in under a minute
The model achieves 2.43% WER on LibriSpeech test-clean, competitive with much larger models.
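The fixed-size cache idea is easy to see in miniature. This toy ring buffer (not the library's actual cache implementation) shows why memory stays constant no matter how long the stream runs:

```python
from collections import deque

class RingCache:
    """Toy fixed-size attention cache: keeps only the last `size` frames."""
    def __init__(self, size: int):
        self.frames = deque(maxlen=size)  # oldest frames fall off automatically

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        return list(self.frames)  # what attention would look back over

cache = RingCache(size=70)   # 70 frames of left context, matching the model config
for t in range(10_000):      # stream arbitrarily long...
    cache.push(t)
print(len(cache.context()))  # ...memory stays pinned at 70 frames
```

A sliding-window Whisper setup, by contrast, re-encodes overlapping audio on every step; here each frame is pushed exactly once and then only its cached representation survives.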
## Install

```bash
pip install nemotron-asr-mlx
```

Requires Python 3.10+ and an Apple Silicon Mac.
## Usage

### CLI

```bash
nemotron-asr transcribe meeting.wav   # batch transcribe a file
nemotron-asr listen                   # stream from microphone
nemotron-asr listen --chunk-ms 80     # lowest-latency streaming
```
### Python API

```python
from nemotron_asr_mlx import from_pretrained

model = from_pretrained("dboris/nemotron-asr-mlx")

# Batch — transcribe a file or numpy array
result = model.transcribe("audio.wav")
print(result.text)
print(result.tokens)  # BPE token IDs

# Streaming — push audio chunks, get text back incrementally
session = model.create_stream(chunk_ms=160)
event = session.push(pcm_chunk)   # StreamEvent with text_delta
print(event.text_delta, end="")
final = session.flush()           # final result
session.reset()                   # reuse for next utterance

# Live mic streaming
with model.listen(chunk_ms=160) as stream:
    for event in stream:
        print(event.text_delta, end="", flush=True)
```
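To stream a pre-recorded file through the push API, you split the PCM buffer into fixed-duration chunks yourself. A minimal chunker (the chunk math is ours; only `session.push` comes from the API above):

```python
def chunk_pcm(pcm, sample_rate=16_000, chunk_ms=160):
    """Yield successive fixed-duration chunks of a 1-D PCM buffer."""
    samples_per_chunk = sample_rate * chunk_ms // 1000  # 2560 samples at 16 kHz / 160 ms
    for start in range(0, len(pcm), samples_per_chunk):
        yield pcm[start:start + samples_per_chunk]

# Feed each chunk to a streaming session:
#   for chunk in chunk_pcm(pcm):
#       event = session.push(chunk)
chunks = list(chunk_pcm([0.0] * 16_000))  # one second of silence at 16 kHz
print(len(chunks))  # six full 160 ms chunks plus a 40 ms tail
```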
### StreamEvent

Every `push()` and `flush()` returns a `StreamEvent`:

| Field | Type | Description |
|---|---|---|
| `text_delta` | `str` | New text since last event |
| `text` | `str` | Full accumulated text |
| `is_final` | `bool` | `True` only from `flush()` |
| `tokens` | `list[int]` | All accumulated BPE token IDs |
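The delta/full-text contract is easy to mirror on the consumer side. This hypothetical accumulator (ours, not part of the library) shows how the fields relate:

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    """Consumer-side mirror of the StreamEvent contract."""
    text: str = ""
    tokens: list = field(default_factory=list)

    def apply(self, text_delta: str, new_tokens: list):
        self.text += text_delta      # event.text is the concatenation of all deltas
        self.tokens.extend(new_tokens)

t = Transcript()
t.apply("hello ", [17, 42])
t.apply("world", [99])
print(t.text)  # hello world
```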
## Architecture
FastConformer encoder (24 layers, 1024-dim) with 8x depthwise striding subsampling. RNNT decoder with 2-layer LSTM prediction network and joint network. Per-layer-group attention context windows [[70,13], [70,6], [70,1], [70,0]] for progressive causal restriction. Greedy decoding with blank suppression.
Based on Cache-aware Streaming Conformer and the NeMo toolkit.
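A back-of-envelope latency sketch, assuming the standard FastConformer 10 ms feature hop (an assumption; not stated above): with 8x subsampling, each encoder frame covers 80 ms, so the per-group left/right context windows translate to time as follows:

```python
hop_ms = 10                        # assumed mel-frame hop (typical for FastConformer)
subsampling = 8                    # from the architecture description
frame_ms = hop_ms * subsampling    # 80 ms per encoder frame

# Per-layer-group [left, right] attention context from the config above
for left, right in [(70, 13), (70, 6), (70, 1), (70, 0)]:
    print(f"left context {left * frame_ms / 1000:.1f} s, "
          f"lookahead {right * frame_ms} ms")
```

Under these assumptions the lookahead shrinks from roughly a second in the first layer group to zero in the last, which is what "progressive causal restriction" buys: early layers see a little future context while the top of the stack stays strictly causal.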
## Live Demo

A browser-based demo with live mic transcription:

```bash
pip install websockets
python demo/server.py
```

Open http://localhost:8765, click Record, and start speaking. Transcription updates in real time with inference stats.
## Weight conversion

If you have a `.nemo` checkpoint and want to convert it yourself:

```bash
pip install torch safetensors pyyaml   # conversion deps only
nemotron-asr convert model.nemo ./output_dir
```

This produces `config.json` + `model.safetensors`. The conversion dependencies are not needed for inference.
## Dependencies

Deliberately minimal:

- `mlx` — Apple's ML framework
- `huggingface-hub` — model download
- `numpy` — mel spectrogram
- `sounddevice` — mic access (optional)
- `websockets` — live demo server (optional)
- `typer` — CLI
## License
Apache 2.0