Production-grade Traditional Chinese / Taiwan Mandarin ASR toolkit — Qwen3-ASR + Breeze-ASR-25, LLM polish, speaker diarization, RTX 5090 / Blackwell tuned, RTF up to 1554x.
Taiwan ASR Toolkit
Production-grade Traditional Chinese (Taiwan Mandarin) speech-to-text — RTF up to 1554x on a single RTX 5090
Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 · Hot-word injection · LLM context polish · Speaker diarization · OpenCC s2twp · 109 TDD tests
Why this exists · Quick start · Benchmarks · Usage · Architecture · Benchmark deep dive
Why this exists
If you've tried openai/whisper-large-v3 or whisperX on Taiwan Mandarin recordings, you've hit:
- Output is Simplified Chinese by default (you keep getting 软件 instead of 軟體)
- Whisper's built-in VAD silently fails on long sparse audio → one runaway 48-minute segment
- Proper nouns die: 延三舍 / 研三舍 (NTU dorms) become 圓三 / 圓山
- Generic Whisper is not tuned for Taiwan vocabulary, so homophone errors are everywhere
- Variable-length VAD chunks waste 5-10x compute through padding
This toolkit fixes all of those. Two production-grade Mandarin ASR models, identical pipeline for fair comparison, glossary-driven hot-word injection at the source, LLM polish with proper-noun protection, OpenCC s2twp baked in. Tested on real lecture/interview/standard recordings.
Star this repo if you've been burned by condition_on_previous_text=True on Mandarin. We feel you.
Quick start
Zero-effort try (Colab)
Click the Open in Colab badge at the top — opens a file picker so you can upload any Taiwan-Mandarin clip from your machine and run the full pipeline on a Colab GPU. No local install required.
Install from PyPI
pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install taiwan-asr-toolkit
# transcribe with the bundled NTU glossary (packaged in the wheel)
asr-breeze your_audio.mp3 --glossary-file builtin
Run on your own audio (clone-and-go)
git clone https://github.com/thc1006/taiwan-asr-toolkit.git && cd taiwan-asr-toolkit
pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install -e ".[all]"
# Drop any Taiwan-Mandarin audio file (m4a / mp3 / wav / mp4 / flac) into the repo:
asr-breeze your_audio.m4a --glossary-file builtin
cat "transcripts/breeze/$(basename your_audio .m4a)_breeze.txt"
That's it — Taiwan Mandarin in, Traditional Chinese transcript out.
No fixture is bundled in this repo. The toolkit deliberately ships zero real-voice audio to avoid any chance of leaking identifiable speakers. Bring your own clip; the Colab quickstart notebook opens a file picker so users can upload their own.
On your own audio
asr-breeze path/to/your_audio.mp3 --glossary-file builtin
# or use your own glossary file:
asr-breeze path/to/your_audio.mp3 --glossary-file my_terms.txt
# Output: transcripts/breeze/{filename}_breeze.{txt,srt,json}
Output is Traditional Chinese (Taiwan), with timestamps, segment-level + word-level (Breeze) timing, and proper SRT subtitles.
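If you want to post-process the timed segments yourself, the sketch below shows how start/end times map onto standard SRT cues. The `(start, end, text)` tuple layout and both function names are assumptions made for illustration; they are not the toolkit's actual JSON schema or API.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render (start, end, text) tuples (hypothetical layout) as SRT cues numbered from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)
```

Each cue is an index line, a time range line, and the text, which is the same shape a subtitle player reads from the generated .srt file.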
Benchmarks
Real numbers on a single RTX 5090 (Blackwell sm_120, 32 GB GDDR7) + i9-14900 (24 threads). Test corpus: 11 audio files, 712.6 minutes (≈12 hours) of Taiwan-Mandarin lectures + interviews.
Speed (Real-Time Factor — higher = faster)
| Audio file | Length | Breeze RTF | Qwen3 RTF |
|---|---|---|---|
| 4-min standard recording | 4 min | 189x | 136x |
| 24-min standard recording | 24 min | 239x | 199x |
| 65-min interview | 65 min | 341x | 297x |
| 140-min lecture (TASA) | 140 min | 546x | 448x |
| 189-min sparse audio | 189 min | 1554x | 1497x |
| All 11 files combined | 712 min | 382x | 354x |
| Total ASR time | | 111.9 s | 126.2 s |
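The table reports RTF in the "higher = faster" sense. A minimal sketch of that arithmetic, assuming the figure is audio duration divided by wall-clock ASR time (the function name is hypothetical; the inputs are the corpus length and Breeze total from this section):

```python
def speed_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


corpus_seconds = 712.6 * 60   # 11 files, 712.6 minutes of audio
breeze_seconds = 111.9        # total Breeze ASR time from the table

print(f"Breeze: {speed_multiple(corpus_seconds, breeze_seconds):.0f}x")  # Breeze: 382x
```

Note that long, sparse files score highest under this definition, because silence removed by VAD still counts toward audio duration.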
Quality vs hallucination (proxy metrics, no GT)
| Metric | Qwen3 | Breeze |
|---|---|---|
| Avg quality score (35% coverage + 20% c/s + 15% Trad + 15% vocab + 15% no-halluc) | 0.815 | 0.808 |
| Per-file wins (out of 11) | 8 | 3 |
| Catastrophic >60s segments | 0 | 0 |
| Coverage % (transcribed vs audio time) | 83.3% | 79.8% |
| OpenCC s2twp Traditional ratio | 0.97 | 0.97 |
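The average quality score is a weighted sum with the weights listed in the table. The sketch below shows that combination only; the key names are hypothetical, and it assumes each sub-score has already been normalized to the range 0 to 1, which the README does not spell out.

```python
# Weights copied from the table row; key names are illustrative assumptions.
WEIGHTS = {
    "coverage": 0.35,
    "chars_per_sec": 0.20,
    "traditional_ratio": 0.15,
    "vocab": 0.15,
    "no_hallucination": 0.15,
}


def quality_score(subscores: dict) -> float:
    """Combine normalized sub-scores (each assumed in [0, 1]) into one proxy score."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    # Clamp defensively so one out-of-range metric cannot dominate the total.
    return sum(
        weight * min(max(subscores[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

Because this is a proxy without ground truth, a higher score does not guarantee a lower CER, as the next section shows.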
Real CER on a 55-second hand-corrected sample
| Model | CER | Notes |
|---|---|---|
| Breeze-ASR-25 + glossary | 2.34% | hot-word injection fixes 圓三 → 研三 at source |
| Qwen3-ASR-1.7B | 68.42% | over-transcribes (97 extra chars not in fixed-time GT) |
Numbers are from one author-held internal recording; the audio itself is not redistributed. Bring your own ground-truth and run asr-bench --gt-dir path/to/your_gts/ to reproduce on your data. The hot-word effect is locked separately as a regression test in tests/test_glossary_effect.py, which pytest.skip()s cleanly when the audio is not present locally.
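For intuition on what the CER numbers mean, here is a dependency-free sketch of character error rate as edit distance divided by reference length. It is not the toolkit's implementation (which is described as jiwer-based with s2twp normalization); the function names are hypothetical and no script conversion is applied here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; whitespace is ignored in both strings."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

Insertions count as errors, so a model that transcribes speech beyond a fixed-length ground-truth window is penalized heavily even when the extra text is correct.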
Features
| Feature | What it does |
|---|---|
| Two SOTA Mandarin ASR models | Qwen/Qwen3-ASR-1.7B + MediaTek-Research/Breeze-ASR-25. Both run, both compared. |
| Traditional Chinese always | OpenCC s2twp post-processing converts any leftover Simplified Chinese to Traditional Chinese with Taiwan phrasing: 軟件→軟體, 激光→雷射, 視頻→影片. |
| Hot-word injection | Pass --glossary-file builtin for the packaged NTU glossary (or your own .txt); proper nouns get fed to Whisper's initial_prompt + hotwords. Fixes homophone errors like 圓三 → 研三 (NTU graduate dorm) at the source. |
| Symmetric pipeline | Same Silero VAD ONNX, same chunking, same dtype on both models. The benchmark measures the model, not the plumbing. |
| Multi-file pool batching (Qwen3) | Cross-file length-sorted batching keeps batch=48 fully utilized when transcribing folders of mixed-length files. Breeze relies on faster-whisper's internal batched inference per call instead. |
| LLM context polish | Optional Qwen3-8B post-correction with NTU glossary protection (won't accidentally "fix" 研三舍 to 延長). |
| Speaker diarization | Optional pyannote 3.x integration with open-mirror fallback (no gated-license blocker). |
| Real CER measurement | jiwer-based CER with s2twp normalization. Bring your own ground truth; no audio fixture ships with the repo. |
| 109 TDD tests | Including 5 invariant tests that lock the Breeze model ID so optimizations can't accidentally swap to a different Whisper variant. |
| Blackwell-native | bf16 + cuDNN-SDPA + torch.compile for RTX 5090. Auto-falls back gracefully on Hopper/Ada/Ampere/CPU. |
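The pool-batching row is easiest to see with a small model of padding cost. The following is a sketch under stated assumptions, not the toolkit's code: chunks are hypothetical `(file_id, chunk_index, duration_seconds)` tuples, and every item in a batch is assumed to be padded to the longest item in that batch.

```python
def length_sorted_batches(chunks, batch_size):
    """Pool chunks from all files, sort by duration, and cut into batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(chunks, key=lambda chunk: chunk[2], reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_waste(batches):
    """Fraction of padded compute that is spent on padding rather than audio."""
    padded = sum(len(batch) * max(chunk[2] for chunk in batch) for batch in batches)
    useful = sum(chunk[2] for batch in batches for chunk in batch)
    return 1.0 - useful / padded
```

Keeping `file_id` and `chunk_index` on each chunk is what lets results be regrouped per file and put back in time order after the batches run.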
Usage
Single-file transcription
# Breeze (Whisper-Large-v2 fine-tune, fastest)
asr-breeze "music/lecture.mp3" --glossary-file builtin
# Qwen3-ASR (more comprehensive coverage)
asr-qwen3 "music/interview.m4a"
# Both at once (same audio, two transcripts to compare)
./run.sh both "music/standard_recording.mp3"
Batch transcription (auto pool batching)
# Transcribes everything in music/ via Qwen3 with cross-file pool batching.
asr-qwen3 music/*.mp3 music/*.m4a
# Breeze processes multi-file inputs sequentially (faster-whisper already
# batches internally per call via BatchedInferencePipeline).
asr-breeze music/*.{mp3,m4a,wav} --glossary-file builtin
Power-user flags
| Flag | Effect |
|---|---|
| --glossary-file PATH | (Breeze) Inject domain terms via Whisper's prompt + hotwords. Use --glossary-file builtin for the packaged NTU dorm/dept glossary, or pass your own .txt file. |
| --fast | (Breeze) int8_bfloat16 quantization, ~1.5x speedup, +0.3-0.5% CER on Mandarin |
| --beam N | (Breeze) Beam size; default 5. --beam 1 = greedy (fastest), --beam 10 = max accuracy |
| --no-aligner | (Qwen3) Skip ForcedAligner-0.6B; ~25% faster but loses word-level timestamps |
| --no-pool | (Qwen3) Disable cross-file chunk pooling for multi-file runs |
| --internal-vad | (Breeze) Use faster-whisper's built-in VAD instead of our ONNX Silero (not recommended) |
| --no-s2tw | Disable OpenCC s2twp post-processing |
LLM context polish (post-process)
# Polish a Breeze output with Qwen3-8B + glossary protection
asr-polish transcripts/breeze/lecture_breeze.json --glossary-file builtin
# → transcripts/breeze-polished/lecture_breeze-polished.{txt,srt,json}
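How the glossary protection works internally is not described here. One plausible guard, shown purely as an assumption, is to reject any LLM rewrite that removes a glossary term the ASR output already contained. The function name and the fall-back-to-original policy are both hypothetical.

```python
def keep_if_terms_preserved(original: str, polished: str, glossary: list[str]) -> str:
    """Return the polished text unless it lost an occurrence of a glossary term."""
    for term in glossary:
        # If the LLM dropped or rewrote a protected term, distrust the whole edit.
        if polished.count(term) < original.count(term):
            return original
    return polished
```

A guard of this kind lets the LLM fix ordinary typos while refusing edits such as rewriting 研三舍 into a more common phrase.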
Speaker diarization
# Adds [SPEAKER_00] / [SPEAKER_01] labels to each segment
asr-diarize transcripts/breeze/interview_breeze.json music/interview.m4a
# Requires HF license accept on:
# - https://hf.co/pyannote/speaker-diarization-3.1
# - https://hf.co/pyannote/speaker-diarization-community-1
# - https://hf.co/pyannote/segmentation-3.0
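Attaching speaker labels to transcript segments is commonly done by picking the speaker whose turns overlap each segment the most. The sketch below illustrates that idea with plain tuples; it is an assumption about the general approach, not the asr-diarize implementation, and the tuple layouts and the "UNKNOWN" fallback are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Prefix each (start, end, text) segment with its dominant speaker.

    turns is a list of (start, end, speaker) tuples from a diarizer.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        totals = {}
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        labeled.append(f"[{speaker}] {text}")
    return labeled
```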
Benchmark + CER report
# Generates docs/BENCHMARK.md with speed + quality metrics.
# --gt-dir points at a folder of {audio_stem}_first_{N}s_gt.txt files
# you provide yourself; the repo no longer ships any voice fixtures.
asr-bench --gt-dir path/to/your_gt_dir
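When preparing a ground-truth folder, it can help to check file names against the `{audio_stem}_first_{N}s_gt.txt` pattern before running the benchmark. This is a small validation sketch; the helper name is hypothetical and it assumes N is a whole number of seconds.

```python
import re
from pathlib import Path

# Matches names such as "lecture_first_55s_gt.txt" (N assumed to be an integer).
GT_PATTERN = re.compile(r"^(?P<stem>.+)_first_(?P<seconds>\d+)s_gt\.txt$")


def parse_gt_name(path):
    """Return (audio_stem, seconds) for a ground-truth file name, or None."""
    match = GT_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return match.group("stem"), int(match.group("seconds"))
```

Running it over `Path("path/to/your_gt_dir").glob("*.txt")` and looking for `None` results is a quick way to catch misnamed files.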
Architecture
audio (.mp3/.m4a/.wav/...)
↓ ffmpeg pipe → numpy float32 16kHz mono
↓
Silero VAD ONNX (CPU SIMD, ~3-5x faster than PyTorch backend)
↓ ≤28s chunks
├──→ Qwen3-ASR-1.7B + ForcedAligner (HF transformers, bf16, batch=48)
└──→ Breeze-ASR-25 (CTranslate2, bf16, batch=32, beam=5, hotwords)
↓
OpenCC s2twp Simplified → Traditional (Taiwan phrasing)
↓
┌───────────┬─────────────┐
↓ ↓ ↓
TXT SRT JSON
┌──── Optional post-processing ────┐
│ asr-polish asr-diarize asr-bench │
│ Qwen3-8B pyannote CER + RTF │
└──────────────────────────────────┘
Full architectural details: docs/ARCHITECTURE.md
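The "≤28s chunks" step in the diagram can be pictured as packing VAD speech spans into bounded windows. The greedy packer below is a sketch of that idea only, under the assumptions that spans arrive sorted and non-overlapping and that bridging short silences inside a chunk is acceptable; the function name and the exact policy are hypothetical, so see docs/ARCHITECTURE.md for the real behavior.

```python
def pack_spans(spans, max_len=28.0):
    """Greedily merge sorted (start, end) speech spans into chunks of at most max_len seconds."""
    chunks = []
    for start, end in spans:
        # A single span longer than the limit is cut into fixed-size pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        # Extend the previous chunk if the combined window still fits.
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks
```

Bounding every chunk this way is what rules out the runaway multi-minute segments described under "Why this exists".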
Project structure
taiwan-asr-toolkit/
├── src/taiwan_asr/
│ ├── __init__.py ← package; minimal export to avoid heavy import cascades
│ ├── common.py ← shared utils: ffmpeg pipe, OpenCC, Silero VAD, glossary, Segment
│ ├── qwen3.py ← Qwen3-ASR + multi-file chunk pool
│ ├── breeze.py ← Breeze-ASR-25 with manual VAD + hot-word injection
│ ├── polish.py ← Qwen3-8B LLM context correction (glossary-protected)
│ ├── diarize.py ← pyannote.audio speaker diarization
│ ├── cer_eval.py ← jiwer-based CER with s2twp normalization
│ └── benchmark.py ← speed + accuracy report
├── glossary.txt ← default NTU glossary (dorm/dept names)
├── run.sh ← convenience wrapper around asr-* CLI commands
├── pyproject.toml ← project metadata, deps, CLI scripts (asr-qwen3, asr-breeze, …)
├── tests/ ← 109 TDD tests (including 5 Breeze invariants)
├── docs/ ← BENCHMARK.md / ARCHITECTURE.md / INSTALL.md
└── archive/ ← legacy Colab notebooks (kept for reference only)
After pip install -e . the following CLI commands are on PATH: asr-qwen3, asr-breeze, asr-polish, asr-diarize, asr-bench, asr-cer.
Testing & contributing
# Most fast-tier tests run without model load (~2-3 s); a small subset
# requires local audio fixtures and pytest.skip()s gracefully when absent.
pytest -m fast
# Breeze contract tests (NEVER allowed to fail)
pytest -m breeze_invariant
# Full suite (some need VAD load)
pytest
The toolkit follows strict TDD. Any contribution must:
- Include a test that failed before the change and passes after it
- Keep all 109 existing tests green
- Pass pytest -m breeze_invariant (which locks MediaTek-Research/Breeze-ASR-25 as Breeze's model)
- Keep Traditional Chinese (Taiwan) output
See CONTRIBUTING.md for the full guide.
vs alternatives
| Feature | This toolkit | whisperX | faster-whisper (raw) | openai/whisper |
|---|---|---|---|---|
| Taiwan Traditional Chinese by default | s2twp baked-in | |||
| Two SOTA Mandarin models compared | Qwen3 + Breeze | Whisper only | Whisper only | Whisper only |
| Fixes 圓三/延三 → 研三 proper-noun ASR errors | glossary hot-word | manual prompt | | |
| LLM context polish with proper-noun protection | Qwen3-8B + glossary | |||
| Speaker diarization (open-mirror fallback) | tensorlake mirror | pyannote (gated) | ||
| RTX 5090 / Blackwell native (bf16 + cuDNN-SDPA) | ||||
| TDD with model-invariant lock | 109 tests | |||
| Best RTF on long Mandarin audio | 1554x | ~70x | ~250x | ~30x |
Citation & credits
This toolkit is integration plumbing — credit goes to the model authors:
- MediaTek-Research/Breeze-ASR-25 (HuggingFace) — Whisper-Large-v2 fine-tune for Taiwan Mandarin
- Alibaba/Tongyi Qwen3-ASR-1.7B (HuggingFace) — multilingual ASR with ForcedAligner
- OpenAI Whisper — base architecture for Breeze-ASR-25
- CTranslate2 / faster-whisper — Whisper inference engine
- pyannote/audio — speaker diarization
- OpenCC — simplified-traditional Chinese conversion (s2twp recipe)
- Silero VAD — fast voice activity detection
If this toolkit helps your research, please cite the underlying models. A toolkit-level citation is fine but optional:
@software{taiwan_asr_toolkit,
title = {Taiwan ASR Toolkit: Production-grade Traditional Chinese Speech-to-Text Pipeline},
author = {Taiwan ASR Toolkit Contributors},
year = {2026},
url = {https://github.com/thc1006/taiwan-asr-toolkit},
note = {Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 with hot-word injection, LLM polish, and speaker diarization}
}
License
MIT for this toolkit's code. See LICENSE.
Third-party model licenses (you must comply with each):
- Qwen models: Apache 2.0
- Breeze-ASR-25: Apache 2.0
- Silero VAD: MIT
- pyannote: gated, requires HF license accept
Made with bf16 tensor cores in Taiwan
If this toolkit saved you hours, drop a star — it helps more people find it.