Qwen3-ASR speech recognition on Apple Silicon via MLX


mlx-qwen3-asr


Run Qwen3-ASR — one of the strongest open-source speech recognition models — natively on Apple Silicon.

A ground-up reimplementation of the official PyTorch model using Apple's MLX framework. Same weights, validated against official reference outputs and ground-truth eval sets, and optimized for Mac GPUs via Metal. No PyTorch dependency for core transcription.

Why this exists

Qwen3-ASR is one of the strongest open-source ASR models available, with benchmark results exceeding Whisper-large-v3 across multiple languages and datasets. It supports 30 languages plus 22 Chinese dialects. But the official implementation is PyTorch + NVIDIA CUDA — it doesn't use Apple GPUs.

This project rewrites every layer for MLX so the same model runs natively on M1/M2/M3/M4 hardware. Not a wrapper — a full reimplementation with correct interleaved MRoPE, per-chunk windowed encoder attention, and all the architectural details that matter for output quality.

What's included

  • Full encoder-decoder pipeline — audio encoder (Conv2d stem + windowed transformer) and text decoder (Qwen3-style with interleaved MRoPE), reimplemented from scratch for MLX
  • Whisper-compatible mel frontend — native log-mel spectrogram computation with cached filterbank and Hann window
  • Both model sizes — 0.6B (fast, default) and 1.7B (higher accuracy)
  • Long audio support — energy-based chunking up to 20 minutes per chunk, no 30-second feature truncation
  • Word-level timestamps — native MLX forced aligner (default, 2.6x faster than PyTorch alternative) with O(n log n) LIS-based timestamp correction
  • Speaker diarization (optional) — offline speaker-labeled outputs via pyannote integration (--diarize)
  • 4-bit and 8-bit quantization — up to 4.7x speedup with measured quality reporting on 100 speaker-balanced samples
  • Multiple output formats — txt, json, srt, vtt, tsv
  • Built-in HTTP server — mlx-qwen3-asr serve exposes the pipeline over HTTP with async jobs, OpenAI API compatibility, and Bearer token auth
  • Session API — explicit model/tokenizer ownership with no hidden global state
  • Speculative decoding — experimental opt-in path (0.6B drafts for 1.7B target), parity-verified
  • Streaming — KV-cache streaming with linear complexity, context trimming, and tail refinement
  • Native WAV fast-path — custom binary WAV parser bypasses ffmpeg for PCM/float WAV files
  • 462 tests — every optimization is benchmark-gated with committed JSON artifacts
  • Minimal dependencies — mlx, numpy, regex, huggingface-hub
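The energy-based chunker above picks split points in quiet regions instead of cutting mid-word. A minimal sketch of the idea, with hypothetical helper names (not the library's implementation): search backward from each target boundary for the lowest-energy frame.

```python
import numpy as np

def pick_split_point(audio, target, search, frame=320):
    """Pick a chunk boundary at or before `target` (in samples), at the
    lowest-energy frame within the preceding `search` samples.
    Hypothetical helper illustrating energy-based chunking."""
    lo = max(0, target - search)
    best, best_e = target, float("inf")
    for start in range(lo, target - frame + 1, frame):
        e = float(np.mean(audio[start:start + frame] ** 2))  # frame energy
        if e < best_e:
            best, best_e = start, e
    return best

def chunk_audio(audio, sr=16000, max_chunk_sec=1200, search_sec=5.0):
    """Split audio into chunks no longer than max_chunk_sec, cutting at
    low-energy points near each target boundary."""
    max_len = int(max_chunk_sec * sr)
    chunks, pos = [], 0
    while len(audio) - pos > max_len:
        cut = pick_split_point(audio, pos + max_len, int(search_sec * sr))
        if cut <= pos:               # safety: never produce an empty chunk
            cut = pos + max_len
        chunks.append(audio[pos:cut])
        pos = cut
    chunks.append(audio[pos:])
    return chunks
```

The real chunker supports chunks up to 20 minutes; the sketch only shows the boundary-selection principle.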

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4) — this is an MLX project, Metal GPU required
  • Python 3.10+
  • ffmpeg — required for non-WAV audio formats (mp3, m4a, flac, mp4, etc.). WAV files work without ffmpeg via the native fast-path loader
  • ~1.2 GB memory for 0.6B model (fp16), ~3.4 GB for 1.7B

Installation

Install from PyPI:

pip install mlx-qwen3-asr

For video and most non-WAV audio formats, install ffmpeg on your system:

brew install ffmpeg

Install with optional timestamp alignment extras (for Japanese/Korean tokenization parity):

pip install "mlx-qwen3-asr[aligner]"

Install with optional microphone capture support:

pip install "mlx-qwen3-asr[mic]"

Install with HTTP server support:

pip install "mlx-qwen3-asr[serve]"

Install with diarization extras:

pip install "mlx-qwen3-asr[diarize]"

Note: --diarize uses pyannote models. The default model is gated on Hugging Face, so you usually need to accept model terms and set a token:

export PYANNOTE_AUTH_TOKEN=hf_...

Core ASR does not require any Hugging Face token.

For development:

git clone https://github.com/moona3k/mlx-qwen3-asr.git
cd mlx-qwen3-asr
pip install -e ".[dev]"

Quick start

Python API

from mlx_qwen3_asr import transcribe

result = transcribe("audio.wav")
print(result.text)
print(result.language)

By default, transcribe() uses Qwen/Qwen3-ASR-0.6B for fast local usage on Mac. Use Qwen/Qwen3-ASR-1.7B when you want higher accuracy and can afford higher latency/memory.

With options:

result = transcribe(
    "meeting.mp3",
    model="Qwen/Qwen3-ASR-1.7B",
    language="English",
    return_chunks=True,
    on_progress=lambda e: print(e["event"], e.get("progress", 0.0)),
    verbose=True,
)
print(result.text)
print(result.chunks)

Session API (recommended for repeated calls)

The Session object owns model and tokenizer state explicitly — no hidden globals, no cache surprises:

from mlx_qwen3_asr import Session

session = Session(model="Qwen/Qwen3-ASR-0.6B")

# Fast repeated transcription — model stays loaded
for audio_file in audio_files:
    result = session.transcribe(audio_file)
    print(result.text)

Loading models explicitly

from mlx_qwen3_asr import load_model, load_audio, transcribe

model, config = load_model("Qwen/Qwen3-ASR-0.6B")
audio = load_audio("speech.wav")
result = transcribe(audio, model=model)

CLI

mlx-qwen3-asr audio.wav

Specify model, language, and output format:

mlx-qwen3-asr recording.mp3 --model Qwen/Qwen3-ASR-0.6B --language English -f srt -o output/

Word-level timestamps:

mlx-qwen3-asr audio.wav --timestamps

Speaker-labeled output (experimental, offline):

mlx-qwen3-asr meeting.wav --diarize --num-speakers 2 -f json

Multiple files with all output formats:

mlx-qwen3-asr *.wav -f all -o transcripts/ --verbose

Stdout/file behavior:

mlx-qwen3-asr audio.wav --stdout-only        # print only (no output file)
mlx-qwen3-asr audio.wav --quiet -o out/      # write files only (no stdout text)

Language discovery:

mlx-qwen3-asr --list-languages

Environment diagnostics (ffmpeg, optional diarization deps, token status):

mlx-qwen3-asr --doctor

Run mlx-qwen3-asr --help for the full list of options.

HTTP server

Serve transcriptions over HTTP. Two endpoint styles: an async job API and an OpenAI-compatible synchronous endpoint.

pip install "mlx-qwen3-asr[serve]"
mlx-qwen3-asr serve --api-key $(openssl rand -hex 16)

Submit audio and poll for results:

# Submit
curl -X POST http://localhost:8765/transcribe \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "audio=@recording.wav"

# Poll
curl http://localhost:8765/jobs/JOB_ID \
  -H "Authorization: Bearer YOUR_KEY"

Or use the OpenAI-compatible endpoint with existing SDK code:

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="http://localhost:8765/v1")
result = client.audio.transcriptions.create(
    model="Qwen/Qwen3-ASR-0.6B",
    file=open("recording.wav", "rb"),
)
print(result.text)

The async API is better for long audio (no HTTP timeout risk). The OpenAI endpoint blocks until done — simpler for short clips and SDK integration.
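A client for the async API just polls GET /jobs/JOB_ID until the job settles. A sketch of such a polling loop; the status and field names ("running", "done", "error", "result") are assumptions about the response schema, not documented values, and `fetch` stands in for any HTTP wrapper:

```python
import time

def poll_job(fetch, job_id, interval=1.0, timeout=600):
    """Poll the async job endpoint until the job completes.
    `fetch` is any callable returning the decoded JSON for
    GET /jobs/{job_id} (e.g. a requests/urllib wrapper).
    Status values here are assumed, not verified against the server."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = fetch(job_id)
        status = job.get("status")
        if status == "done":
            return job.get("result")
        if status == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        time.sleep(interval)  # still queued/running; back off and retry
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Injecting `fetch` keeps the loop testable and independent of any particular HTTP client.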

See docs/server/ for the full API spec, deployment guide, and architecture decision record.

Performance on Apple Silicon

Measured on Apple M4 Pro, macOS 26.2. All numbers from controlled runs with benchmark JSON artifacts committed to the repo. See docs/BENCHMARKS.md for the full breakdown.

Latency (0.6B)

| Configuration | Short clip (~2.5s) | 10s clip | Real-time factor | vs fp16 |
|---|---|---|---|---|
| fp16 (baseline) | 0.46s | 0.83s | 0.08x | 1.00x |
| 8-bit (q8, group 64) | 0.11s | 0.27s | 0.03x | 3.11x faster |
| 4-bit (q4, group 64) | 0.13s | 0.18s | 0.02x | 4.68x faster |

English quality refresh (LibriSpeech, 100 speaker-balanced samples per subset)

| Model | Subset | WER | CER | Mean eval latency | RTF |
|---|---|---|---|---|---|
| 0.6B | test-clean | 2.29% | 0.59% | 0.86s | 0.0957 |
| 0.6B | test-other | 4.20% | 2.09% | 0.71s | 0.0985 |
| 1.7B | test-clean | 1.99% | 0.61% | 2.43s | 0.2708 |
| 1.7B | test-other | 3.45% | 1.42% | 2.02s | 0.2814 |

Artifacts: docs/benchmarks/2026-02-15-librispeech-test-clean-100.json, docs/benchmarks/2026-02-15-librispeech-test-other-100.json, docs/benchmarks/2026-02-15-librispeech-test-clean-100-1p7b.json, docs/benchmarks/2026-02-15-librispeech-test-other-100-1p7b.json.

Quantization quality (0.6B, LibriSpeech test-clean, 100 speaker-balanced samples)

| Configuration | WER | CER | Mean eval latency | Speed vs fp16 |
|---|---|---|---|---|
| fp16 | 2.29% | 0.59% | 1.09s | 1.00x |
| 8-bit (g64) | 2.33% | 0.59% | 0.34s | 3.11x |
| 4-bit (g64) | 2.72% | 0.88% | 0.30s | 4.68x |

8-bit is near-fp16 quality (+0.04pp WER). 4-bit trades +0.43pp WER for maximum speed.

On the harder test-other lane (n=100, speaker-balanced), 8-bit remains near-fp16 (-0.05pp WER) while 4-bit shows a larger quality tradeoff (+1.38pp WER). Speedups remain strong (3.66x for 8-bit, 4.37x for 4-bit on the 10s benchmark clip).

Artifact: docs/benchmarks/2026-02-15-quant-matrix-test-other-speaker100.md.

Multilingual quality (FLEURS, 10 languages x 10 samples)

| Model | Primary error rate | Mean latency | Best languages | Weakest languages |
|---|---|---|---|---|
| 0.6B fp16 | 9.37% | 1.44s | Spanish 3.0%, Chinese 4.4%, English 4.6% | Hindi 16.7%, French 18.2%, Arabic 21.5% |
| 1.7B fp16 | 6.70% | 4.12s | Spanish 0.7%, Japanese 3.6%, French 4.1% | Chinese 8.5%, Arabic 16.5%, Hindi 17.4% |

The 1.7B delivers a 28% relative improvement, with the biggest gains on French (-14.1pp), Japanese (-4.9pp), and Arabic (-5.0pp). The 1.7B runs ~2.86x slower.

Artifacts: docs/benchmarks/2026-02-15-manifest-quality-multilingual100-0p6b-refresh.json, docs/benchmarks/2026-02-15-manifest-quality-multilingual100-1p7b-refresh.json.

MLX vs PyTorch quality (0.6B, Multilingual-100)

| Metric | MLX | PyTorch | Delta |
|---|---|---|---|
| Primary error rate | 9.54% | 10.34% | -0.81pp |
| WER | 16.00% | 16.69% | -0.70pp |
| CER | 5.43% | 5.64% | -0.21pp |

67% of samples produce identical text output. Remaining differences are minor lexical shifts, numeric surface forms (10,000 vs zehntausend), or punctuation — not quality regressions.

On long-form audio (75-90s clips), MLX is 4.19x faster than PyTorch on the same machine.

On an expanded real-world mixed lane (AMI IHM meetings + Earnings22 chunked, n=200), MLX remains near parity with PyTorch (23.23% vs 23.04% WER, +0.19pp delta) while staying 3.27x faster on the same machine (1.34s vs 4.39s mean latency).

Optimizations applied

  • Preallocated KV cache with in-place slice writes and rollback-safe trimming
  • Direct grouped-query fused attention via mx.fast.scaled_dot_product_attention (no explicit K/V head expansion)
  • Hybrid encoder windowing — dense block-diagonal mask for short audio, segmented per-window execution for long contexts (up to 4.2x faster on long audio)
  • Cached mel filterbank and Hann window — computed once, reused across calls
  • Native WAV fast-path — custom binary parser bypasses ffmpeg process startup for PCM/float WAV files (up to 25% faster on quantized short clips)
  • Native in-repo BPE tokenizer — no transformers dependency in runtime transcription path
  • Cached model and tokenizer instances — repeated transcribe() calls skip reload overhead
  • 4-bit / 8-bit quantization — up to 4.7x speed gain with explicit per-profile quality reporting
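To illustrate why the native WAV fast-path is cheap: a minimal 16-bit PCM RIFF reader needs only a handful of struct unpacks and no subprocess. This is a sketch, not the library's parser (which also handles float32 WAVs and more edge cases):

```python
import struct

def read_wav_pcm16(path):
    """Minimal RIFF/WAV reader for 16-bit mono/interleaved PCM.
    Sketch only: walks chunks until it finds 'fmt ' and 'data',
    skipping anything else (LIST, fact, ...)."""
    with open(path, "rb") as f:
        riff, _, wave_id = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave_id == b"WAVE", "not a WAV file"
        channels = rate = None
        while True:
            hdr = f.read(8)
            if len(hdr) < 8:
                raise ValueError("no data chunk found")
            chunk_id, size = struct.unpack("<4sI", hdr)
            if chunk_id == b"fmt ":
                fmt = f.read(size)
                audio_fmt, channels, rate = struct.unpack_from("<HHI", fmt)
                assert audio_fmt == 1, "sketch handles integer PCM only"
            elif chunk_id == b"data":
                raw = f.read(size)
                n = size // 2
                # int16 -> float in [-1, 1)
                samples = [s / 32768.0 for s in struct.unpack(f"<{n}h", raw)]
                return samples, rate, channels
            else:
                f.seek(size, 1)  # skip unrelated chunk
```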

Full benchmark report: docs/BENCHMARKS.md. Latest refresh snapshot: docs/benchmarks/2026-02-15-quality-matrix-refresh.md. All benchmark artifacts are committed under docs/benchmarks/ for reproducibility.

Model quality

Word error rates from the Qwen3-ASR technical report compared against current open-source and proprietary leaders (lower is better):

English benchmarks

| Benchmark | GPT-4o-Transcribe | Parakeet-TDT-0.6B | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|
| LibriSpeech test-clean | 1.39 | 1.93 | 1.51 | 2.11 | 1.63 |
| LibriSpeech test-other | 3.75 | 3.59 | 3.97 | 4.55 | 3.38 |
| FLEURS-en | 2.40 | 4.85 | 4.08 | 4.39 | 3.35 |
| GigaSpeech | 25.50 | n/a | 9.76 | 8.88 | 8.45 |

Chinese + multilingual benchmarks

| Benchmark | GPT-4o-Transcribe | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|
| WenetSpeech test-net | 15.30 | 9.86 | 5.97 | 4.97 |
| AISHELL-2 test | 4.24 | 5.06 | 3.15 | 2.71 |
| FLEURS (12-lang avg) | n/a | 5.27 | 7.57 | 4.90 |
| CommonVoice | n/a | 10.77 | 12.75 | 9.18 |

Robustness benchmarks

| Benchmark | GPT-4o-Transcribe | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|
| Accented English | 28.56 | 21.30 | 16.62 | 16.07 |
| Extreme Noise | 36.11 | 63.17 | 17.88 | 16.17 |
| Elders & Kids (Mandarin) | 14.27 | 10.61 | 4.48 | 3.81 |

GPT-4o-Transcribe leads on clean English read speech (1.39 WER). Parakeet-TDT-0.6B is strong on English. But Qwen3-ASR dominates on Chinese, multilingual, noisy, and accented speech — and is the only open-source model competitive across all categories.

Parakeet numbers from model card. All other numbers from the Qwen3-ASR paper. Robustness benchmarks are Qwen3-ASR internal test sets.

Correctness validation

This implementation is validated against the official PyTorch model via multiple parity gates:

  • MLX vs PyTorch head-to-head — on the current multilingual-100 artifact, MLX shows lower aggregate primary error than PyTorch (9.54% vs 10.34%)
  • Token-level greedy parity — current multilingual-100 parity artifact shows 67% exact text match and 64% exact token match across 10 languages; remaining diffs are mostly lexical/numeric surface-form differences
  • Expanded parity suite — tested across LibriSpeech test-clean, test-other, synthetic long mixes, and noise variants (SNR 10dB, 5dB)
  • Long-form parity — 10 multilingual clips (75-90s each) transcribed correctly with no chunking artifacts, 4.19x faster than PyTorch
  • Mel spectrogram parity — custom MLX mel matches HuggingFace WhisperFeatureExtractor with MAE < 3e-7
  • Native aligner parity — MLX forced aligner matches official qwen-asr backend with 100% text match rate, <6ms timing MAE, and 2.64x speed advantage on 50 LibriSpeech samples
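The WER figures these gates report follow the standard definition: word-level Levenshtein distance (substitutions + insertions + deletions) divided by reference length. A self-contained sketch; the repo's eval_metrics module may normalize text before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via edit distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

CER is the same recurrence over characters instead of words.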

Model variants

| | Qwen3-ASR-0.6B (default) | Qwen3-ASR-1.7B |
|---|---|---|
| Parameters | 0.6B | 1.7B |
| Audio encoder layers | 18 | 24 |
| Audio encoder dim | 896 | 1024 |
| Text decoder layers | 28 | 28 |
| Text attention (Q/KV heads) | GQA (16/8) | GQA (16/8) |
| Text hidden size | 1024 | 2048 |
| RoPE theta | 1,000,000 | 1,000,000 |
| HuggingFace | Qwen/Qwen3-ASR-0.6B | Qwen/Qwen3-ASR-1.7B |

Both models use interleaved Multi-dimensional RoPE (MRoPE) with sections [24, 20, 20], 128-bin mel spectrograms, and the same tokenizer (vocabulary size 151,936).

# Default: 0.6B (fast, ~1.2 GB memory)
result = transcribe("audio.wav")

# Accuracy-first: 1.7B (~3.4 GB memory)
result = transcribe("audio.wav", model="Qwen/Qwen3-ASR-1.7B")

Timestamps

Word-level timestamps via forced alignment using a dedicated aligner model (Qwen/Qwen3-ForcedAligner-0.6B). This path is native MLX (no PyTorch backend bridge):

mlx-qwen3-asr audio.wav --timestamps

Python API:

result = transcribe("audio.wav", return_timestamps=True)
for segment in result.segments:
    print(f"{segment['start']:.2f}s - {segment['end']:.2f}s: {segment['text']}")

SRT/VTT outputs are grouped into subtitle-friendly phrase segments (not one word per cue). When -f srt or -f vtt is requested in offline mode, timestamps are auto-enabled.

Measured parity (LibriSpeech test-clean, n=50):

| Metric | Value |
|---|---|
| Text match rate (MLX vs official) | 100% |
| Timing MAE (all word boundaries) | 5.69 ms |
| MLX aligner mean latency | 0.21s |
| Official backend mean latency | 0.56s |
| Relative speed | 2.64x faster |

The aligner uses O(n log n) LIS-based timestamp correction (Fenwick tree) for monotonicity repair, validated against the legacy O(n^2) implementation via randomized parity tests.
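The LIS-based correction can be sketched with patience sorting in place of the Fenwick tree the repo uses; both achieve the O(n log n) bound. Word times outside the longest non-decreasing subsequence are treated as misaligned and re-interpolated. Function names here are hypothetical:

```python
import bisect

def lis_keep_mask(times):
    """Mark the longest non-decreasing subsequence of `times`
    (patience-sorting variant of the O(n log n) LIS)."""
    tails, tails_idx, parent = [], [], [-1] * len(times)
    for i, t in enumerate(times):
        j = bisect.bisect_right(tails, t)  # bisect_right allows ties
        if j == len(tails):
            tails.append(t); tails_idx.append(i)
        else:
            tails[j] = t; tails_idx[j] = i
        parent[i] = tails_idx[j - 1] if j > 0 else -1
    keep = [False] * len(times)
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        keep[i] = True
        i = parent[i]
    return keep

def repair_monotonic(times):
    """Replace out-of-order timestamps by linear interpolation between
    the nearest kept neighbors (simple fill; sketch only)."""
    keep = lis_keep_mask(times)
    out = list(times)
    kept = [i for i, k in enumerate(keep) if k]
    for i in range(len(out)):
        if not keep[i]:
            left = max((k for k in kept if k < i), default=None)
            right = min((k for k in kept if k > i), default=None)
            if left is None:
                out[i] = out[right]
            elif right is None:
                out[i] = out[left]
            else:
                frac = (i - left) / (right - left)
                out[i] = out[left] + frac * (out[right] - out[left])
    return out
```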

For Japanese/Korean timestamp alignment, install the [aligner] extra so nagisa/soynlp tokenization matches the official path.

Speaker diarization (optional)

Speaker attribution is available as an offline optional path powered by pyannote.audio:

result = transcribe("meeting.wav", diarize=True)
print(result.speaker_segments)

CLI:

mlx-qwen3-asr meeting.wav --diarize -f json

Current status:

  • The public API/CLI and output schema are stable.
  • The diarization backend is pyannote (installed via [diarize] extra).
  • Some pyannote models require Hugging Face token/terms acceptance. Configure PYANNOTE_AUTH_TOKEN (or HF_TOKEN) when needed.
  • --diarize auto-enables timestamps and is not supported in --streaming/--mic mode.
  • Migration note (2026-02-15): legacy diarization window/hop controls were removed (diarization_window_sec, diarization_hop_sec, --diarization-window-sec, --diarization-hop-sec). Speaker-count controls remain (--num-speakers, --min-speakers, --max-speakers).

Diarization setup troubleshooting

  1. Install optional diarization dependencies:
    pip install "mlx-qwen3-asr[diarize]"
    
  2. Set a Hugging Face token if your selected diarization model is gated:
    export PYANNOTE_AUTH_TOKEN=hf_...
    
  3. Run a quick smoke test:
    mlx-qwen3-asr meeting.wav --diarize -f json
    

Common errors and fixes:

  • requires optional dependency 'pyannote.audio': install [diarize] extra.
  • requires PyTorch via pyannote dependencies: reinstall [diarize] extra in the active environment.
  • Failed to initialize pyannote pipeline ...: accept model terms on Hugging Face and set PYANNOTE_AUTH_TOKEN (or HF_TOKEN).
  • --streaming does not support --diarize / --mic does not support --diarize: use offline file transcription mode for diarization.

Quantization

Convert and run a quantized model:

python scripts/convert.py \
  --model Qwen/Qwen3-ASR-0.6B \
  --quantize 4 --group-size 64 \
  --output-dir ./qwen3-asr-4bit

mlx-qwen3-asr audio.wav --model ./qwen3-asr-4bit

Recommended profiles:

  • Speed-first: 4-bit, group_size=64 — 4.68x faster / +0.43pp WER (test-clean), 4.37x faster / +1.38pp WER (test-other)
  • Quality-first: 8-bit, group_size=64 — 3.11x faster / +0.04pp WER (test-clean), 3.66x faster / -0.05pp WER (test-other)

Publish quantized models to HuggingFace:

HF_TOKEN=... python scripts/publish_quantized.py \
  --source-model Qwen/Qwen3-ASR-0.6B \
  --repo-id YOUR_USER/mlx-qwen3-asr-0.6b-4bit \
  --bits 4

Output formats

mlx-qwen3-asr audio.wav -f txt           # plain text
mlx-qwen3-asr audio.wav -f srt -o out/   # SRT subtitles
mlx-qwen3-asr audio.wav -f json          # structured JSON
mlx-qwen3-asr audio.wav -f vtt -o out/   # WebVTT
mlx-qwen3-asr *.wav -f all -o out/       # all formats at once

Supported: txt, json, srt, vtt, tsv. Subtitle formats (srt/vtt) require timestamp segments and are only supported in offline mode.

Supported languages

Qwen3-ASR officially lists 30 core languages:

Arabic Cantonese Chinese Czech
Danish Dutch English Filipino
Finnish French German Greek
Hindi Hungarian Indonesian Italian
Japanese Korean Macedonian Malay
Persian Polish Portuguese Romanian
Russian Spanish Swedish Thai
Turkish Vietnamese

Plus 22 Chinese dialects (Sichuan, Shanghai, Cantonese, and others), for 52 total language/dialect variants.

Print CLI-accepted aliases/codes:

mlx-qwen3-asr --list-languages

Experimental features

Speculative decoding

Uses the 0.6B model as a draft to accelerate 1.7B inference. Currently parity-safe but slower on tested workloads due to draft audio encoder overhead:

mlx-qwen3-asr audio.wav \
  --model Qwen/Qwen3-ASR-1.7B \
  --draft-model Qwen/Qwen3-ASR-0.6B \
  --num-draft-tokens 4

Python API:

result = transcribe(
    "audio.wav",
    model="Qwen/Qwen3-ASR-1.7B",
    draft_model="Qwen/Qwen3-ASR-0.6B",
    num_draft_tokens=4,
)

Status: greedy parity verified, but throughput is only 0.53-0.55x of plain 1.7B decoding on short/10s clips. It stays off by default until benchmarks show a net speed win.
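The control flow of greedy speculative decoding, abstracted over the models: the draft proposes k tokens, the target verifies them and keeps the longest agreeing prefix plus its own correction on the first mismatch, so output always matches plain greedy decoding. `target_step` and `draft_step` are hypothetical single-step callables, not the library's API:

```python
def speculative_decode(target_step, draft_step, prompt,
                       max_tokens=64, k=4, eos=None):
    """Greedy speculative decoding sketch. target_step/draft_step map a
    token list to the next greedy token; a real implementation verifies
    all k draft positions in one target forward pass."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        # draft proposes k tokens greedily
        draft = []
        for _ in range(k):
            draft.append(draft_step(tokens + draft))
        # target verifies: keep agreeing prefix; on mismatch, keep the
        # target's own token and discard the rest of the draft
        accepted = []
        for i in range(k):
            t = target_step(tokens + accepted)
            accepted.append(t)
            if t != draft[i]:
                break
        tokens += accepted
        if eos is not None and tokens[-1] == eos:
            break
    return tokens[len(prompt):len(prompt) + max_tokens]
```

The parity property is visible directly: even a useless draft model changes only speed, never the decoded tokens.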

Domain vocabulary context

When transcribing specialized audio — earnings calls, medical dictation, legal proceedings — the model can confuse rare terms with more common homophones. The context parameter lets you provide a hint: a string of domain-specific words or phrases that gets injected into the system prompt, nudging the decoder toward the correct vocabulary.

This matches the official Qwen3-ASR context API. The format is space-separated terms:

# Finance: avoids "e-bit-da" → "EBITDA", "FX" not "effects", etc.
result = transcribe("earnings-call.wav", context="EBITDA non-GAAP FX hedging")

# Medical
result = transcribe("consult.wav", context="metformin HbA1c nephropathy")

# Also works with streaming
state = init_streaming(context="EBITDA non-GAAP FX hedging")

CLI:

mlx-qwen3-asr earnings-call.wav --context "EBITDA non-GAAP FX hedging"

For batch transcription, pass a list of per-audio context strings:

results = transcribe_batch(
    [audio_en, audio_zh],
    context=["EBITDA non-GAAP", "交易 停滞"],
)

When omitted, the system prompt is empty (matching the official default) — no domain bias is applied.

Streaming

Rolling decode implementation for near-real-time transcription:

from mlx_qwen3_asr.streaming import (
    init_streaming,
    feed_audio,
    finish_streaming,
    streaming_metrics,
)

state = init_streaming(chunk_size_sec=2.0, max_context_sec=30.0)
for chunk in audio_chunks:
    state = feed_audio(chunk, state)
    print(state.text)
state = finish_streaming(state)
print(streaming_metrics(state))

CLI:

mlx-qwen3-asr --streaming --stream-finalization-mode accuracy audio.wav
# Optional: speech-aware boundary selection near chunk edges
mlx-qwen3-asr --streaming --stream-endpointing-mode energy audio.wav

Live microphone transcription:

mlx-qwen3-asr --mic
mlx-qwen3-asr --mic --language Japanese

Optional microphone flags: --mic-device, --mic-duration-sec, --mic-sample-rate.

  • Ingests small PCM chunks (default 2s)
  • Incremental decoder KV-cache reuse across chunk turns (avoids O(n²) re-transcription)
  • Bounded context window (default 30s) for stable memory/runtime
  • Prefix rollback controls (unfixed_chunk_num, unfixed_token_num)
  • stable_text is monotonic by design: corrections that would shorten the already-stable prefix are intentionally not applied, favoring stability over maximal editability of partial output
  • Optional speech-aware endpointing (endpointing_mode="energy") that selects low-energy boundaries near chunk edges
  • Configurable finalization policy: finalization_mode="accuracy" (default) or "latency"
  • Backward-compatible override: enable_tail_refine=True|False
  • Input validation: handles int16 PCM normalization, non-1D arrays, empty input
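The int16 normalization mentioned above amounts to scaling by 1/32768 and flattening to 1-D. A sketch of that input validation (not the library's exact code; the channel downmix is a naive assumption):

```python
import numpy as np

def to_float_mono(pcm):
    """Normalize raw PCM to 1-D float32 in [-1, 1)."""
    a = np.asarray(pcm)
    if a.dtype == np.int16:
        a = a.astype(np.float32) / 32768.0  # int16 full scale
    if a.ndim > 1:
        a = a.mean(axis=1)                  # naive multi-channel downmix
    return np.ascontiguousarray(a.reshape(-1), dtype=np.float32)
```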

API reference

transcribe(audio, *, model, draft_model, context, language, return_timestamps, diarize, diarization_num_speakers, diarization_min_speakers, diarization_max_speakers, return_chunks, forced_aligner, dtype, max_new_tokens, num_draft_tokens, verbose, on_progress)

Transcribe audio to text. Accepts a file path, numpy array, mx.array, or (array, sample_rate) tuple. Returns a TranscriptionResult.

Additional Python entry points:

  • transcribe_batch(audios, ...) and transcribe_batch_async(audios, ...)
  • transcribe_async(audio, ...)

Session(model, *, dtype, tokenizer_model)

Explicit transcription session. Owns model and tokenizer state with no hidden globals.

  • Offline: session.transcribe(audio, ...) with the same parameters as top-level transcribe.
  • Async: await session.transcribe_async(audio, ...).
  • Streaming: session.init_streaming(...), session.feed_audio(pcm, state), session.finish_streaming(state).
  • Introspection: session.model_info (model id/path, dtype, vocab size, model-declared language codes).

streaming_metrics(state)

Return streaming diagnostics for a session state:

  • partial_stability
  • rewrite_rate
  • finalization_delta_chars

load_model(name_or_path, *, dtype)

Load a Qwen3-ASR model and config from HuggingFace or local path. Returns (model, config).

load_audio(path_or_url)

Load and resample audio to mono 16 kHz. Returns an mx.array.

ForcedAligner(model_path, *, dtype, backend)

Word-level forced aligner. Native backend: mlx (default).

TranscriptionResult

Frozen dataclass:

  • text (str) — transcribed text
  • language (str) — detected or forced language (canonicalized names, e.g. English)
  • segments (list[dict] | None) — word-level timestamps when requested: [{"text": "hello", "start": 0.5, "end": 0.8}, ...]
  • chunks (list[dict] | None) — chunk-level transcript metadata when return_chunks=True
  • speaker_segments (list[dict] | None) — speaker-attributed spans when diarize=True: [{"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": "..."}, ...]

Quality gates

This project enforces parity with the official PyTorch implementation. No optimization lands without passing quality gates and committing benchmark artifacts.

# Unit tests (462 tests)
pytest -q

# Fast quality gate
python scripts/quality_gate.py --mode fast

# Release gate with token-level parity (downloads model weights)
RUN_REFERENCE_PARITY=1 python scripts/quality_gate.py --mode release

# Speaker-balanced WER evaluation (100 samples)
python scripts/eval_librispeech.py --subset test-clean --samples 100 --sampling speaker_round_robin

# Latency benchmark
python scripts/benchmark_asr.py tests/fixtures/test_speech.wav \
  --model Qwen/Qwen3-ASR-0.6B --runs 5 \
  --json-output docs/benchmarks/latest.json

Additional quality lanes available:

  • Aligner parity: RUN_ALIGNER_PARITY=1 — validates MLX aligner against official backend
  • Expanded parity suite: RUN_REFERENCE_PARITY_SUITE=1 — test-clean, test-other, long mixes, noise variants with Unicode-safe text comparison
  • Multilingual parity: manifest-driven workflow via scripts/build_multilingual_manifest.py for cross-language validation
  • Streaming manifest quality: RUN_STREAMING_MANIFEST_QUALITY_EVAL=1 with STREAMING_MANIFEST_QUALITY_EVAL_JSONL=... — multi-file streaming stability/rewrite/finalization lane via scripts/eval_streaming_manifest.py
  • Real-world long-form quality: RUN_REALWORLD_LONGFORM_EVAL=1 on full-recording Earnings22 manifests
  • Diarization quality: RUN_DIARIZATION_QUALITY_EVAL=1 with DIARIZATION_QUALITY_EVAL_JSONL=... — DER/JER lane via scripts/eval_diarization.py

See docs/QUALITY_GATE.md for full documentation. Evaluation coverage status and prioritized gaps are tracked in docs/EVAL_GAPS.md.

Architecture overview

Audio (16kHz mono)
  → 128-bin log-mel spectrogram (native MLX, Whisper-compatible)
  → Conv2d stem (3 layers, stride 2 each → 8x downsample)
  → Sinusoidal position embeddings
  → Windowed transformer encoder (18 or 24 layers, hybrid dense/segmented attention)
  → LayerNorm + GELU projection → audio features

Chat-template prompt (context is optional domain vocabulary, empty by default):
  <|im_start|>system\n{context}<|im_end|>
  <|im_start|>user\n<|audio_start|><|audio_pad|>*N<|audio_end|><|im_end|>
  <|im_start|>assistant\n

  → Token embedding (151,936 vocab)
  → Replace audio_pad positions with encoded audio features
  → Qwen3 text decoder (28 layers, interleaved MRoPE, SwiGLU, RMSNorm)
  → Autoregressive decode with preallocated KV cache
  → Parse output: "language English<asr_text>transcribed text here"

Key architectural details:

  • Interleaved MRoPE — sections [24, 20, 20] with stride-3 frequency assignment across temporal, height, and width dimensions. This is the detail other MLX ports get wrong (using standard RoPE or chunked assignment).
  • Audio encoder uses LayerNorm + bias — different from the text decoder which uses RMSNorm without bias.
  • Q/K norms — RMSNorm applied per-head on queries and keys before attention (Qwen3 innovation).
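The interleaved assignment can be pictured as dealing the 64 rotary frequency indices round-robin across the three MRoPE sections until each budget is spent, rather than carving contiguous blocks [0:24], [24:44], [44:64]. A sketch of one plausible version of that pattern (an assumption for illustration; mrope.py is the authoritative layout):

```python
def interleaved_assignment(head_dim_half=64, sections=(24, 20, 20)):
    """Assign each rotary frequency index to a position dimension
    (0=temporal, 1=height, 2=width) with stride-3 interleaving."""
    assert sum(sections) == head_dim_half
    remaining = list(sections)
    assign = []
    dim = 0
    for _ in range(head_dim_half):
        while remaining[dim % 3] == 0:  # skip exhausted sections
            dim += 1
        assign.append(dim % 3)
        remaining[dim % 3] -= 1
        dim += 1
    return assign
```

With sections [24, 20, 20] this yields 0,1,2,0,1,2,... for the first 60 frequencies, then the temporal section's remaining 4 slots, which is exactly where a chunked assignment would diverge.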

Project structure

mlx_qwen3_asr/           # 7,602 lines of source
├── transcribe.py         # High-level pipeline + batch/async + diarization (739 lines)
├── cli.py                # CLI entry point and UX guardrails (664 lines)
├── streaming.py          # KV-cache streaming + context trimming (624 lines)
├── tokenizer.py          # Native BPE tokenizer + output parsing (607 lines)
├── diarization.py        # Optional pyannote integration + attribution helpers
├── audio.py              # Mel spectrogram + audio I/O (526 lines)
├── encoder.py            # Audio encoder (512 lines)
├── decoder.py            # Text decoder + KV cache (464 lines)
├── forced_aligner.py     # Forced alignment + LIS correction (439 lines)
├── model.py              # Top-level model + audio-text fusion (372 lines)
├── generate.py           # Autoregressive + speculative decode (350 lines)
├── load_models.py        # Model loading + caching (256 lines)
├── config.py             # Dataclass configs (228 lines)
├── server.py             # HTTP server + OpenAI compat (697 lines)
├── session.py            # Session API (224 lines)
├── writers.py            # Output format writers (221 lines)
├── mrope.py              # Interleaved MRoPE (167 lines)
├── chunking.py           # Long audio splitting (104 lines)
├── attention.py          # Attention utilities (67 lines)
├── convert.py            # Weight remapping (67 lines)
├── eval_metrics.py       # WER/CER/BERTScore helpers (65 lines)
└── cache_utils.py        # KV cache utilities (57 lines)

tests/                    # 7,391 lines, 462 tests
scripts/                  # Benchmarks, evaluation, conversion, publishing
docs/                     # Architecture, decisions, benchmarks, roadmap
docs/benchmarks/          # 160+ committed artifacts for reproducibility

Development

git clone https://github.com/moona3k/mlx-qwen3-asr.git
cd mlx-qwen3-asr
pip install -e ".[dev]"
pytest -q                 # 462 tests

Acknowledgments

License

Apache 2.0. See LICENSE for details.
