Production-grade Traditional Chinese / Taiwan Mandarin ASR toolkit — Qwen3-ASR + Breeze-ASR-25, LLM polish, speaker diarization, RTX 5090 / Blackwell tuned, RTF up to 1554x.
Taiwan ASR Toolkit
Production-grade Traditional Chinese (Taiwan Mandarin) speech-to-text — RTF up to 1554x on a single RTX 5090
Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 · Hot-word injection · LLM context polish · Speaker diarization · OpenCC s2twp · 69 TDD tests
Why this exists · Quick start · Benchmarks · Usage · Architecture · vs alternatives
Why this exists
If you've tried openai/whisper-large-v3 or whisperX on Taiwan Mandarin recordings, you've hit:
- Output is Simplified Chinese by default (you keep getting 软件 instead of 軟體)
- Whisper's built-in VAD silently fails on long sparse audio → a 48-minute runaway segment
- Proper nouns die: 延三舍 / 研三舍 (NTU dorms) become 圓三 / 圓山
- Generic Whisper is not tuned for Taiwan vocabulary — homophone errors everywhere
- Variable-length VAD chunks waste 5-10x compute through padding
This toolkit fixes all of those. Two production-grade Mandarin ASR models, identical pipeline for fair comparison, glossary-driven hot-word injection at the source, LLM polish with proper-noun protection, OpenCC s2twp baked in. Tested on real lecture/interview/standard recordings.
Star this repo if you've been burned by `condition_on_previous_text=True` on Mandarin. We feel you.
Quick start
Zero-effort try (Colab)
Click the Open in Colab badge at the top — runs the full pipeline on a bundled 30-second sample with a Colab GPU, no install on your machine.
Install from PyPI
```bash
pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install taiwan-asr-toolkit

# transcribe with the bundled NTU glossary (packaged in the wheel)
asr-breeze your_audio.mp3 --glossary-file builtin
```
30-second local test (clone, no audio needed)
```bash
git clone https://github.com/thc1006/taiwan-asr-toolkit.git && cd taiwan-asr-toolkit
pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install -e ".[all]"
asr-breeze tests/fixtures/clip_30s.wav --glossary-file builtin
cat transcripts/breeze/clip_30s_breeze.txt
```
That's it — 30 s of Taiwan Mandarin in, Traditional Chinese transcript out.
On your own audio
```bash
asr-breeze path/to/your_audio.mp3 --glossary-file builtin

# or use your own glossary file:
asr-breeze path/to/your_audio.mp3 --glossary-file my_terms.txt

# Output: transcripts/breeze/your_audio_breeze.{txt,srt,json}
```
Output is Traditional Chinese (Taiwan), with timestamps, segment-level + word-level (Breeze) timing, and proper SRT subtitles.
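The SRT output follows the standard subtitle format. A minimal sketch of how segment timestamps become SRT blocks (function names here are illustrative, not the toolkit's actual API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT subtitle blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"

print(segments_to_srt([(0.0, 2.5, "大家好"), (2.5, 5.0, "歡迎收聽")]))
```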
Benchmarks
Real numbers on a single RTX 5090 (Blackwell sm_120, 32 GB GDDR7) + i9-14900 (24 threads). Test corpus: 11 audio files, 712.6 minutes (≈12 hours) of Taiwan-Mandarin lectures + interviews.
Speed (Real-Time Factor — higher = faster)
| Audio file | Length | Breeze RTF | Qwen3 RTF |
|---|---|---|---|
| 4-min standard recording | 4 min | 189x | 136x |
| 24-min standard recording | 24 min | 239x | 199x |
| 65-min interview | 65 min | 341x | 297x |
| 140-min lecture (TASA) | 140 min | 546x | 448x |
| 189-min sparse audio | 189 min | 1554x | 1497x |
| All 11 files combined | 712 min | 382x | 354x |
| Total ASR time | 712 min | 111.9 s | 126.2 s |
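RTF is simply audio duration divided by wall-clock ASR time. A quick sanity check against the combined Breeze row above:

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-Time Factor: audio duration / processing time (higher = faster)."""
    return audio_seconds / wall_seconds

# Combined Breeze run: 712.6 min of audio transcribed in 111.9 s of ASR time.
print(round(rtf(712.6 * 60, 111.9)))  # → 382
```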
Quality vs hallucination (proxy metrics, no GT)
| Metric | Qwen3 | Breeze |
|---|---|---|
| Avg quality score (35% coverage + 20% c/s + 15% Trad + 15% vocab + 15% no-halluc) | 0.815 | 0.808 |
| Per-file wins (out of 11) | 8 | 3 |
| Catastrophic >60s segments | 0 | 0 |
| Coverage % (transcribed vs audio time) | 83.3% | 79.8% |
| OpenCC s2twp Traditional ratio | 0.97 | 0.97 |
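The quality score is a weighted sum of the proxy metrics listed in the table header. A minimal sketch with the published weights (the exact sub-metric definitions live in the toolkit's benchmark code; the dict keys here are my own labels):

```python
WEIGHTS = {
    "coverage": 0.35,       # transcribed time vs audio time
    "chars_per_sec": 0.20,  # plausible character rate
    "traditional": 0.15,    # Traditional-Chinese ratio after s2twp
    "vocab": 0.15,          # vocabulary plausibility
    "no_halluc": 0.15,      # absence of hallucination markers
}

def quality_score(metrics: dict) -> float:
    """Weighted sum of per-file proxy metrics, each normalized to [0, 1]."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

print(round(quality_score({k: 1.0 for k in WEIGHTS}), 2))  # → 1.0
```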
Real CER on hand-corrected fixture (886 chars, first 55 s)
| Model | CER | Notes |
|---|---|---|
| Breeze-ASR-25 + glossary | 2.34% | hot-word injection fixes 圓三→研三 at source |
| Qwen3-ASR-1.7B | 68.42% | over-transcribes (97 extra chars not in fixed-time GT) |
Want to verify? `pytest tests/test_glossary_effect.py -v` locks in the 圓三 → 研三 improvement as a regression test.
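CER is Levenshtein edit distance over characters divided by reference length; jiwer computes exactly this. A self-contained stdlib sketch of the metric:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        curr = [i]
        for j, hc in enumerate(h, 1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(r)

print(cer("研三舍", "圓三舍"))  # one substitution over 3 chars ≈ 0.333
```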
Features
| Feature | What it does |
|---|---|
| Two SOTA Mandarin ASR models | Qwen/Qwen3-ASR-1.7B + MediaTek-Research/Breeze-ASR-25. Both run, both compared. |
| Traditional Chinese always | OpenCC s2twp post-processing converts any leftover 簡體 → 繁體 (Taiwan idioms): 軟件→軟體, 激光→雷射, 視頻→影片. |
| Hot-word injection | Pass --glossary-file builtin for the packaged NTU glossary (or your own .txt); proper nouns get fed to Whisper's initial_prompt + hotwords. Fixes 圓三 → 研三, 祝福二族 → 住輔二組, etc. at the source. |
| Symmetric pipeline | Same Silero VAD ONNX, same chunking, same dtype on both models. The benchmark measures the model, not the plumbing. |
| Multi-file pool batching | Cross-file length-sorted batching keeps batch=48 fully utilized when transcribing folders of mixed-length files. |
| LLM context polish | Optional Qwen3-8B post-correction with NTU glossary protection (won't accidentally "fix" 研三舍 to 延長). |
| Speaker diarization | Optional pyannote 3.x integration with open-mirror fallback (no gated-license blocker). |
| Real CER measurement | jiwer-based CER with s2twp normalization. Bring your own ground-truth or use the included approximate fixture. |
| 69 TDD tests | Including 5 invariant tests that lock the Breeze model ID so optimizations can't accidentally swap to a different Whisper variant. |
| Blackwell-native | bf16 + cuDNN-SDPA + torch.compile for RTX 5090. Auto-falls back gracefully on Hopper/Ada/Ampere/CPU. |
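The s2twp conversion mentioned above is dictionary-driven. A toy sketch with a handful of the mappings from the table (the real OpenCC s2twp recipe covers thousands of characters plus Taiwan idiom phrases):

```python
# A few illustrative s2twp mappings; OpenCC's dictionaries are far larger.
S2TWP_SAMPLE = {"軟件": "軟體", "激光": "雷射", "視頻": "影片", "软件": "軟體"}

def to_taiwan(text: str) -> str:
    """Toy phrase-level replacement; OpenCC does this with full dictionaries."""
    for simplified, taiwan in S2TWP_SAMPLE.items():
        text = text.replace(simplified, taiwan)
    return text

print(to_taiwan("這個軟件用激光掃描"))  # → 這個軟體用雷射掃描
```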
Usage
Single-file transcription
```bash
# Breeze (Whisper-Large-v2 fine-tune, fastest)
asr-breeze "music/lecture.mp3" --glossary-file builtin

# Qwen3-ASR (more comprehensive coverage)
asr-qwen3 "music/interview.m4a"

# Both at once (same audio, two transcripts to compare)
./run.sh both "music/standard_recording.mp3"
```
Batch transcription (auto pool batching)
```bash
# Transcribes everything in music/ via Qwen3 with pool batching
asr-qwen3 music/*.mp3 music/*.m4a

# Same with Breeze (now also gets cross-file batched if multiple files)
asr-breeze music/*.{mp3,m4a,wav} --glossary-file builtin
```
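Cross-file pool batching works by sorting VAD chunks from all input files by duration before slicing them into batches, so similar-length chunks share a batch and padding waste stays low. A simplified sketch of the idea (dict shape and function name are illustrative, not the toolkit's internals):

```python
def pool_batches(chunks, batch_size=48):
    """Length-sort VAD chunks from all files, then slice into fixed batches."""
    ordered = sorted(chunks, key=lambda c: c["duration"])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

chunks = [{"file": f, "duration": d} for f, d in
          [("a.mp3", 28.0), ("b.m4a", 3.2), ("a.mp3", 12.5), ("b.m4a", 27.1)]]
for batch in pool_batches(chunks, batch_size=2):
    print([c["duration"] for c in batch])  # → [3.2, 12.5] then [27.1, 28.0]
```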
Power-user flags
| Flag | Effect |
|---|---|
| `--glossary-file PATH` | (Breeze) Inject domain terms via Whisper's prompt + hotwords. Use `--glossary-file builtin` for the packaged NTU dorm/dept glossary, or pass your own .txt file. |
| `--fast` | (Breeze) int8_bfloat16 quantization, ~1.5x speedup, +0.3-0.5% CER on Mandarin |
| `--beam N` | (Breeze) Beam size; default 5. `--beam 1` = greedy (fastest), `--beam 10` = max accuracy |
| `--no-aligner` | (Qwen3) Skip ForcedAligner-0.6B; ~25% faster but loses word-level timestamps |
| `--no-pool` | (Qwen3) Disable cross-file chunk pooling for multi-file runs |
| `--internal-vad` | (Breeze) Use faster-whisper's built-in VAD instead of our ONNX Silero (not recommended) |
| `--no-s2tw` | Disable OpenCC s2twp post-processing |
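Hot-word injection boils down to turning the glossary file into the `initial_prompt` and `hotwords` strings that faster-whisper's `transcribe()` accepts. A minimal sketch under that assumption (the helper name and the exact join separators are mine, not the toolkit's):

```python
import os
import tempfile

def build_hotword_args(glossary_path: str) -> dict:
    """Read one term per line; build Whisper prompt + hotword strings."""
    with open(glossary_path, encoding="utf-8") as f:
        terms = [t.strip() for t in f if t.strip() and not t.startswith("#")]
    # faster-whisper's transcribe() accepts both of these keyword arguments.
    return {"initial_prompt": "、".join(terms), "hotwords": " ".join(terms)}

# Demo with a tiny two-term glossary (terms from the NTU examples above).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("研三舍\n住輔二組\n")
args = build_hotword_args(tmp.name)
os.unlink(tmp.name)
print(args["hotwords"])  # → 研三舍 住輔二組
```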
LLM context polish (post-process)
```bash
# Polish a Breeze output with Qwen3-8B + glossary protection
asr-polish transcripts/breeze/lecture_breeze.json --glossary-file builtin
# → transcripts/breeze-polished/lecture_breeze-polished.{txt,srt,json}
```
Speaker diarization
```bash
# Adds [SPEAKER_00] / [SPEAKER_01] labels to each segment
asr-diarize transcripts/breeze/interview_breeze.json music/interview.m4a

# Requires HF license accept on:
# - https://hf.co/pyannote/speaker-diarization-3.1
# - https://hf.co/pyannote/speaker-diarization-community-1
# - https://hf.co/pyannote/segmentation-3.0
```
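Attaching speaker labels to ASR segments is an interval-overlap problem: each segment gets the speaker whose diarization turn overlaps it most. A minimal stdlib sketch of that merge step (data shapes are illustrative, not the toolkit's JSON schema):

```python
def overlap(a, b):
    """Length of the temporal intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def label_segments(asr_segments, speaker_turns):
    """Attach the speaker whose turn overlaps each ASR segment the most."""
    labeled = []
    for seg in asr_segments:
        best = max(speaker_turns,
                   key=lambda t: overlap((seg["start"], seg["end"]),
                                         (t["start"], t["end"])))
        labeled.append({**seg, "speaker": best["speaker"]})
    return labeled

turns = [{"start": 0.0, "end": 4.0, "speaker": "SPEAKER_00"},
         {"start": 4.0, "end": 9.0, "speaker": "SPEAKER_01"}]
segs = [{"start": 0.5, "end": 3.5, "text": "你好"},
        {"start": 4.2, "end": 8.0, "text": "請說"}]
for s in label_segments(segs, turns):
    print(s["speaker"], s["text"])  # → SPEAKER_00 你好, then SPEAKER_01 請說
```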
Benchmark + CER report
```bash
# Generates docs/BENCHMARK.md with speed + quality metrics
asr-bench --gt-dir tests/fixtures
```
Architecture
```
audio (.mp3/.m4a/.wav/...)
  ↓ ffmpeg pipe → numpy float32 16kHz mono
  ↓
Silero VAD ONNX (CPU SIMD, ~3-5x faster than PyTorch backend)
  ↓ ≤28s chunks
  ├──→ Qwen3-ASR-1.7B + ForcedAligner (HF transformers, bf16, batch=48)
  └──→ Breeze-ASR-25 (CTranslate2, bf16, batch=32, beam=5, hotwords)
  ↓
OpenCC s2twp 簡→繁 (Taiwan idioms)
  ↓
┌───────────┬─────────────┐
↓           ↓             ↓
TXT        SRT          JSON

┌──── Optional post-processing ────┐
│ asr-polish  asr-diarize  asr-bench │
│ Qwen3-8B    pyannote     CER + RTF │
└──────────────────────────────────┘
```
Full architectural details: docs/ARCHITECTURE.md
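The ≤28 s chunking step in the diagram can be sketched as a greedy merge of consecutive Silero speech segments (a simplified illustration under assumed segment shapes; the real chunker also has to split single segments longer than the cap):

```python
MAX_CHUNK = 28.0  # seconds; matches the ≤28 s chunks in the diagram

def pack_chunks(speech_segments, max_len=MAX_CHUNK):
    """Greedily merge consecutive (start, end) VAD segments into ≤max_len chunks."""
    chunks, cur = [], None
    for start, end in speech_segments:
        if cur and end - cur[0] <= max_len:
            cur = (cur[0], end)   # extend the current chunk through this segment
        else:
            if cur:
                chunks.append(cur)
            cur = (start, end)    # start a new chunk
    if cur:
        chunks.append(cur)
    return chunks

print(pack_chunks([(0, 10), (11, 20), (21, 33), (40, 50)]))
# → [(0, 20), (21, 33), (40, 50)]
```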
Project structure
```
taiwan-asr-toolkit/
├── src/taiwan_asr/
│   ├── __init__.py   ← package; minimal export to avoid heavy import cascades
│   ├── common.py     ← shared utils: ffmpeg pipe, OpenCC, Silero VAD, glossary, Segment
│   ├── qwen3.py      ← Qwen3-ASR + multi-file chunk pool
│   ├── breeze.py     ← Breeze-ASR-25 with manual VAD + hot-word injection
│   ├── polish.py     ← Qwen3-8B LLM context correction (glossary-protected)
│   ├── diarize.py    ← pyannote.audio speaker diarization
│   ├── cer_eval.py   ← jiwer-based CER with s2twp normalization
│   └── benchmark.py  ← speed + accuracy report
├── glossary.txt      ← default NTU glossary (dorm/dept names)
├── run.sh            ← convenience wrapper around asr-* CLI commands
├── pyproject.toml    ← project metadata, deps, CLI scripts (asr-qwen3, asr-breeze, …)
├── tests/            ← 69 TDD tests (including 5 Breeze invariants)
├── docs/             ← BENCHMARK.md / ARCHITECTURE.md / INSTALL.md
└── archive/          ← legacy Colab notebooks (kept for reference only)
```
After `pip install -e .` the following CLI commands are on PATH: `asr-qwen3`, `asr-breeze`, `asr-polish`, `asr-diarize`, `asr-bench`, `asr-cer`.
Testing & contributing
```bash
# All 69 tests, no model load required for "fast" tier
pytest -m fast

# Breeze contract tests (NEVER allowed to fail)
pytest -m breeze_invariant

# Full suite (some need VAD load)
pytest
```
The toolkit follows strict TDD. Any contribution must:
- Have a failing test that now passes
- Keep all 69 existing tests green
- Pass `pytest -m breeze_invariant` (which locks `MediaTek-Research/Breeze-ASR-25` as Breeze's model)
- Keep Traditional Chinese (Taiwan) output
See CONTRIBUTING.md for the full guide.
vs alternatives
| | This toolkit | whisperX | faster-whisper (raw) | openai/whisper |
|---|---|---|---|---|
| Taiwan Traditional Chinese by default | s2twp baked-in | — | — | — |
| Two SOTA Mandarin models compared | Qwen3 + Breeze | Whisper only | Whisper only | Whisper only |
| Fixes 圓三/延三 → 研三 proper-noun ASR errors | glossary hot-word | manual prompt | — | — |
| LLM context polish with proper-noun protection | Qwen3-8B + glossary | — | — | — |
| Speaker diarization (open-mirror fallback) | tensorlake mirror | pyannote (gated) | — | — |
| RTX 5090 / Blackwell native (bf16 + cuDNN-SDPA) | ✓ | — | — | — |
| TDD with model-invariant lock | 69 tests | — | — | — |
| Best RTF on long Mandarin audio | 1554x | ~70x | ~250x | ~30x |
Citation & credits
This toolkit is integration plumbing — credit goes to the model authors:
- MediaTek-Research/Breeze-ASR-25 (HuggingFace) — Whisper-Large-v2 fine-tune for Taiwan Mandarin
- Alibaba/Tongyi Qwen3-ASR-1.7B (HuggingFace) — multilingual ASR with ForcedAligner
- OpenAI Whisper — base architecture for Breeze-ASR-25
- CTranslate2 / faster-whisper — Whisper inference engine
- pyannote/audio — speaker diarization
- OpenCC — simplified-traditional Chinese conversion (s2twp recipe)
- Silero VAD — fast voice activity detection
If this toolkit helps your research, please cite the underlying models. A toolkit-level citation is fine but optional:
```bibtex
@software{taiwan_asr_toolkit,
  title  = {Taiwan ASR Toolkit: Production-grade Traditional Chinese Speech-to-Text Pipeline},
  author = {Taiwan ASR Toolkit Contributors},
  year   = {2026},
  url    = {https://github.com/thc1006/taiwan-asr-toolkit},
  note   = {Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 with hot-word injection, LLM polish, and speaker diarization}
}
```
License
MIT for this toolkit's code. See LICENSE.
Third-party model licenses (you must comply with each):
- Qwen models: Apache 2.0
- Breeze-ASR-25: Apache 2.0
- Silero VAD: MIT
- pyannote: gated, requires HF license accept
**Made with bf16 tensor cores in Taiwan**
If this toolkit saved you hours, drop a star — it helps more people find it.
File details
Details for the file taiwan_asr_toolkit-0.5.3.tar.gz.
File metadata
- Download URL: taiwan_asr_toolkit-0.5.3.tar.gz
- Upload date:
- Size: 64.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `86a1ca3853901bc6d22113eef3ae4e8b2d1ac47d3a6d4237c045aae4f6e8a3bb` |
| MD5 | `218836a05a54657c78cfcaaf181bc2cc` |
| BLAKE2b-256 | `2f72bbc6b425123243be6ccffcf45cf960cc29c3fe335ff6e2421b1d8c02959c` |
File details
Details for the file taiwan_asr_toolkit-0.5.3-py3-none-any.whl.
File metadata
- Download URL: taiwan_asr_toolkit-0.5.3-py3-none-any.whl
- Upload date:
- Size: 50.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f0007d010835bea946c88b519cf83a30b16df2f4237b0eb9e451ff930e37050e` |
| MD5 | `c01dd5e52628db93f93a1b25303430e6` |
| BLAKE2b-256 | `b8e3f6d5337c4b1ea75ef92002b20e2b7cd17ac6ecada7fe3c8ef2a4f7344ecc` |