Production-grade Traditional Chinese / Taiwan Mandarin ASR toolkit — Qwen3-ASR + Breeze-ASR-25, LLM polish, speaker diarization, RTX 5090 / Blackwell tuned, RTF up to 1554x.

Taiwan ASR Toolkit

Production-grade Traditional Chinese (Taiwan Mandarin) speech-to-text — RTF up to 1554x on a single RTX 5090

Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 · Hot-word injection · LLM context polish · Speaker diarization · OpenCC s2twp · 109 TDD tests

Why this exists

If you've tried openai/whisper-large-v3 or whisperX on Taiwan Mandarin recordings, you've hit:

Output is Simplified Chinese by default (you keep getting 软件 instead of 軟體)
Whisper's built-in VAD silently fails on long sparse audio → a single runaway 48-minute segment
Proper nouns die: 延三舍 / 研三舍 (NTU dorms) become 圓三 / 圓山
Generic Whisper is not tuned for Taiwan vocabulary — homophone errors everywhere
Variable-length VAD chunks waste 5-10x compute through padding

This toolkit fixes all of those. Two production-grade Mandarin ASR models, identical pipeline for fair comparison, glossary-driven hot-word injection at the source, LLM polish with proper-noun protection, OpenCC s2twp baked in. Tested on real lecture/interview/standard recordings.

Star this repo if you've been burned by condition_on_previous_text=True on Mandarin. We feel you.

Quick start

Zero-effort try (Colab)

Click the Open in Colab badge at the top — opens a file picker so you can upload any Taiwan-Mandarin clip from your machine and run the full pipeline on a Colab GPU. No local install required.

Install from PyPI

pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install taiwan-asr-toolkit

# transcribe with the bundled NTU glossary (packaged in the wheel)
asr-breeze your_audio.mp3 --glossary-file builtin

Run on your own audio (clone-and-go)

git clone https://github.com/thc1006/taiwan-asr-toolkit.git && cd taiwan-asr-toolkit
pip install --index-url https://download.pytorch.org/whl/cu128 torch torchaudio
pip install -e ".[all]"

# Drop any Taiwan-Mandarin audio file (m4a / mp3 / wav / mp4 / flac) into the repo:
asr-breeze your_audio.m4a --glossary-file builtin
cat "transcripts/breeze/$(basename your_audio.m4a .m4a)_breeze.txt"

That's it — Taiwan Mandarin in, Traditional Chinese transcript out.

No fixture is bundled in this repo. The toolkit deliberately ships zero real-voice audio to avoid any chance of leaking identifiable speakers. Bring your own clip; the Colab quickstart notebook opens a file picker so users can upload their own.

On your own audio

asr-breeze path/to/your_audio.mp3 --glossary-file builtin
# or use your own glossary file:
asr-breeze path/to/your_audio.mp3 --glossary-file my_terms.txt
# Output: transcripts/breeze/{filename}_breeze.{txt,srt,json}

Output is Traditional Chinese (Taiwan), with timestamps, segment-level + word-level (Breeze) timing, and proper SRT subtitles.
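
For readers who post-process the transcripts, the SRT timing layout can be sketched as follows. This is a minimal illustration of the standard SubRip cue format, not the toolkit's own writer; `srt_timestamp` and `srt_block` are hypothetical helper names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SubRip timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue: index line, time range line, text line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Example cue for a made-up segment (0.0 s to 2.5 s).
print(srt_block(1, 0.0, 2.5, "歡迎來到研三舍"))
```

Cues are separated by a blank line when concatenated into a full `.srt` file.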

Benchmarks

Real numbers on a single RTX 5090 (Blackwell sm_120, 32 GB GDDR7) + i9-14900 (24 threads). Test corpus: 11 audio files, 712.6 minutes (≈12 hours) of Taiwan-Mandarin lectures + interviews.

Speed (Real-Time Factor, reported here as audio duration ÷ processing time; higher = faster)

Audio file	Length	Breeze RTF	Qwen3 RTF
4-min standard recording	4 min	189x	136x
24-min standard recording	24 min	239x	199x
65-min interview	65 min	341x	297x
140-min lecture (TASA)	140 min	546x	448x
189-min sparse audio	189 min	1554x	1497x
All 11 files combined	712 min	382x	354x
Total ASR time		111.9 s	126.2 s
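
The RTF figures in this table read as audio duration divided by processing time. A minimal sketch of that arithmetic, checked against the Breeze combined row (712.6 minutes of audio, 111.9 s of ASR time); `speedup_factor` is a hypothetical helper name, not part of the toolkit:

```python
def speedup_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of audio are processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Breeze, all 11 files: 712.6 min of audio transcribed in 111.9 s.
combined = speedup_factor(712.6 * 60, 111.9)
print(f"{combined:.0f}x")  # 382x
```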

Quality vs hallucination (proxy metrics, no ground truth)

Metric	Qwen3	Breeze
Avg quality score (35% coverage + 20% c/s + 15% Trad + 15% vocab + 15% no-halluc)	0.815	0.808
Per-file wins (out of 11)	8	3
Catastrophic >60s segments	0	0
Coverage % (transcribed vs audio time)	83.3%	79.8%
OpenCC `s2twp` Traditional ratio	0.97	0.97
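
The average quality score is a weighted sum with the weights given in the table. The sketch below shows only that combination step. It assumes each component has already been normalized to the range 0 to 1; how the toolkit normalizes them is not stated here, and the dictionary keys are hypothetical names.

```python
# Weights from the table: 35% coverage, 20% c/s, 15% Trad, 15% vocab, 15% no-halluc.
WEIGHTS = {
    "coverage": 0.35,
    "chars_per_sec": 0.20,   # hypothetical key for the "c/s" component
    "traditional": 0.15,
    "vocab": 0.15,
    "no_halluc": 0.15,
}


def quality_score(components: dict[str, float]) -> float:
    """Combine normalized components (each assumed in [0, 1]) into one score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up inputs, only to show the call shape.
print(quality_score({name: 0.5 for name in WEIGHTS}))
```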

Real CER on a 55-second hand-corrected sample

Model	CER	Notes
Breeze-ASR-25 + glossary	2.34%	hot-word injection fixes `圓三 → 研三` at source
Qwen3-ASR-1.7B	68.42%	over-transcribes (97 extra chars not in fixed-time GT)

Numbers are from one author-held internal recording; the audio itself is not redistributed. Bring your own ground-truth and run asr-bench --gt-dir path/to/your_gts/ to reproduce on your data. The hot-word effect is locked separately as a regression test in tests/test_glossary_effect.py, which pytest.skip()s cleanly when the audio is not present locally.
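
To make the CER figures concrete, here is a dependency-free sketch of character error rate as Levenshtein distance divided by reference length. The toolkit itself uses jiwer with s2twp normalization; this version skips normalization, and the strings are made-up examples. It also shows why over-transcription hurts: every extra character counts as an insertion, so CER can exceed 100%.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("住在研三舍", "住在圓三舍"))  # one substitution in five characters: 0.2
print(cer("好", "好的好的"))            # three insertions against one character: 3.0
```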

Features

Feature	What it does
Two SOTA Mandarin ASR models	Qwen/Qwen3-ASR-1.7B + MediaTek-Research/Breeze-ASR-25. Both run, both compared.
Traditional Chinese always	OpenCC `s2twp` post-processing converts any leftover Simplified Chinese to Traditional Chinese with Taiwan idioms: 軟件→軟體, 激光→雷射, 視頻→影片.
Hot-word injection	Pass `--glossary-file builtin` for the packaged NTU glossary (or your own .txt); proper nouns get fed to Whisper's `initial_prompt` + `hotwords`. Fixes homophone errors like `圓三 → 研三` (NTU graduate dorm) at the source.
Symmetric pipeline	Same Silero VAD ONNX, same chunking, same dtype on both models. The benchmark measures the model, not the plumbing.
Multi-file pool batching (Qwen3)	Cross-file length-sorted batching keeps batch=48 fully utilized when transcribing folders of mixed-length files. Breeze relies on faster-whisper's internal batched inference per call instead.
LLM context polish	Optional Qwen3-8B post-correction with NTU glossary protection (won't accidentally "fix" `研三舍` to `延長`).
Speaker diarization	Optional pyannote 3.x integration with open-mirror fallback (no gated-license blocker).
Real CER measurement	jiwer-based CER with s2twp normalization. Bring your own ground truth; no audio fixture is bundled.
109 TDD tests	Including 5 invariant tests that lock the Breeze model ID so optimizations can't accidentally swap to a different Whisper variant.
Blackwell-native	bf16 + cuDNN-SDPA + torch.compile for RTX 5090. Auto-falls back gracefully on Hopper/Ada/Ampere/CPU.
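
The padding cost that length-sorted batching avoids can be illustrated with a small calculation. This is a sketch under simple assumptions (each batch is padded to its longest chunk, durations are made up, batch size is 2 for readability); it is not the toolkit's pooling code, and `padded_cost` is a hypothetical helper.

```python
def padded_cost(lengths: list[float], batch_size: int) -> float:
    """Total padded duration when every batch is padded to its longest item."""
    total = 0.0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


# Made-up chunk durations (seconds) pooled from several files.
chunks = [28.0, 2.0, 27.0, 3.0, 26.0, 1.5, 25.0, 2.5]

useful = sum(chunks)                                          # 115.0
arrival_order = padded_cost(chunks, batch_size=2)             # 212.0
length_sorted = padded_cost(sorted(chunks, reverse=True), 2)  # 118.0

print(useful, arrival_order, length_sorted)
```

Sorting the pooled chunks by length puts similar durations in the same batch, so almost no compute is spent on padding.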

Usage

Single-file transcription

# Breeze (Whisper-Large-v2 fine-tune, fastest)
asr-breeze "music/lecture.mp3" --glossary-file builtin

# Qwen3-ASR (more comprehensive coverage)
asr-qwen3 "music/interview.m4a"

# Both at once (same audio, two transcripts to compare)
./run.sh both "music/standard_recording.mp3"

Batch transcription (auto pool batching)

# Transcribes everything in music/ via Qwen3 with cross-file pool batching.
asr-qwen3 music/*.mp3 music/*.m4a

# Breeze processes multi-file inputs sequentially (faster-whisper already
# batches internally per call via BatchedInferencePipeline).
asr-breeze music/*.{mp3,m4a,wav} --glossary-file builtin

Power-user flags

Flag	Effect
`--glossary-file PATH`	(Breeze) Inject domain terms via Whisper's prompt + hotwords. Use `--glossary-file builtin` for the packaged NTU dorm/dept glossary, or pass your own .txt file.
`--fast`	(Breeze) `int8_bfloat16` quantization, ~1.5x speedup, +0.3-0.5% CER on Mandarin
`--beam N`	(Breeze) Beam size; default 5. `--beam 1` = greedy (fastest), `--beam 10` = max accuracy
`--no-aligner`	(Qwen3) Skip ForcedAligner-0.6B; ~25% faster but loses word-level timestamps
`--no-pool`	(Qwen3) Disable cross-file chunk pooling for multi-file runs
`--internal-vad`	(Breeze) Use faster-whisper's built-in VAD instead of our ONNX Silero (not recommended)
`--no-s2tw`	Disable OpenCC s2twp post-processing

LLM context polish (post-process)

# Polish a Breeze output with Qwen3-8B + glossary protection
asr-polish transcripts/breeze/lecture_breeze.json --glossary-file builtin
# → transcripts/breeze-polished/lecture_breeze-polished.{txt,srt,json}
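
One way to picture glossary protection is a guard that rejects an LLM rewrite whenever a protected term from the original segment goes missing. The sketch below is a hypothetical illustration of that idea, not the logic in polish.py; `accept_polish` is an invented name.

```python
def accept_polish(original: str, polished: str, glossary: list[str]) -> str:
    """Keep the polished text only if no glossary term was lost in the rewrite."""
    for term in glossary:
        if original.count(term) > polished.count(term):
            return original  # a protected proper noun disappeared: keep the ASR text
    return polished


glossary = ["研三舍"]
print(accept_polish("我住研三舍", "我住延長", glossary))        # rewrite rejected
print(accept_polish("我住研三舍阿", "我住研三舍啊", glossary))  # rewrite accepted
```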

Speaker diarization

# Adds [SPEAKER_00] / [SPEAKER_01] labels to each segment
asr-diarize transcripts/breeze/interview_breeze.json music/interview.m4a
# Requires accepting the Hugging Face license for:
# - https://hf.co/pyannote/speaker-diarization-3.1
# - https://hf.co/pyannote/speaker-diarization-community-1
# - https://hf.co/pyannote/segmentation-3.0
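
Attaching speaker labels to ASR segments is commonly done by picking the diarization turn with the largest time overlap. The sketch below shows that rule in plain Python; the segment keys (`start`, `end`, `text`) and the `(start, end, speaker)` turn tuples are assumptions for illustration and may not match the toolkit's JSON schema or its actual assignment logic.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list[dict], turns: list[tuple]) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(seg["start"], seg["end"], t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        labeled.append({**seg, "speaker": best_speaker})  # None if nothing overlaps
    return labeled


segments = [
    {"start": 0.0, "end": 4.0, "text": "first segment"},
    {"start": 4.0, "end": 9.0, "text": "second segment"},
]
turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 10.0, "SPEAKER_01")]
labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(f"[{seg['speaker']}] {seg['text']}")
```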

Benchmark + CER report

# Generates docs/BENCHMARK.md with speed + quality metrics.
# --gt-dir points at a folder of {audio_stem}_first_{N}s_gt.txt files
# you provide yourself; the repo no longer ships any voice fixtures.
asr-bench --gt-dir path/to/your_gt_dir

Architecture

audio (.mp3/.m4a/.wav/...)
       ↓ ffmpeg pipe → numpy float32 16kHz mono
       ↓
Silero VAD ONNX (CPU SIMD, ~3-5x faster than PyTorch backend)
       ↓ ≤28s chunks
       ├──→ Qwen3-ASR-1.7B + ForcedAligner (HF transformers, bf16, batch=48)
       └──→ Breeze-ASR-25 (CTranslate2, bf16, batch=32, beam=5, hotwords)
                                ↓
                  OpenCC s2twp (Simplified → Traditional, Taiwan phrasing)
                                ↓
              ┌───────────┬─────────────┐
              ↓           ↓             ↓
            TXT          SRT          JSON

   ┌───── Optional post-processing ──────┐
   │ asr-polish  asr-diarize  asr-bench  │
   │ Qwen3-8B    pyannote     CER + RTF  │
   └─────────────────────────────────────┘

Full architectural details: docs/ARCHITECTURE.md
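
The "≤28s chunks" step in the diagram can be sketched as a greedy packing of VAD speech spans. This is one plausible strategy under stated assumptions (merge neighboring spans while the chunk stays within the limit, split any single span that is too long); the actual rules live in the toolkit's code and may differ. `pack_chunks` is a hypothetical name.

```python
MAX_CHUNK_S = 28.0  # chunk limit taken from the architecture diagram


def pack_chunks(speech: list[tuple[float, float]],
                max_len: float = MAX_CHUNK_S) -> list[tuple[float, float]]:
    """Greedily merge (start, end) speech spans into chunks no longer than max_len."""
    chunks: list[tuple[float, float]] = []
    for start, end in speech:
        # Split any single span longer than max_len into fixed-size pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))        # start a new chunk
    return chunks


# Made-up VAD output in seconds.
spans = [(0.0, 10.0), (12.0, 20.0), (25.0, 40.0), (41.0, 80.0)]
packed = pack_chunks(spans)
print(packed)
```

Note that merged chunks include the short silences between the spans they cover, which keeps each chunk contiguous in the original audio.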

Project structure

taiwan-asr-toolkit/
├── src/taiwan_asr/
│   ├── __init__.py    ← package; minimal export to avoid heavy import cascades
│   ├── common.py      ← shared utils: ffmpeg pipe, OpenCC, Silero VAD, glossary, Segment
│   ├── qwen3.py       ← Qwen3-ASR + multi-file chunk pool
│   ├── breeze.py      ← Breeze-ASR-25 with manual VAD + hot-word injection
│   ├── polish.py      ← Qwen3-8B LLM context correction (glossary-protected)
│   ├── diarize.py     ← pyannote.audio speaker diarization
│   ├── cer_eval.py    ← jiwer-based CER with s2twp normalization
│   └── benchmark.py   ← speed + accuracy report
├── glossary.txt       ← default NTU glossary (dorm/dept names)
├── run.sh             ← convenience wrapper around asr-* CLI commands
├── pyproject.toml     ← project metadata, deps, CLI scripts (asr-qwen3, asr-breeze, …)
├── tests/             ← 109 TDD tests (including 5 Breeze invariants)
├── docs/              ← BENCHMARK.md / ARCHITECTURE.md / INSTALL.md
└── archive/           ← legacy Colab notebooks (kept for reference only)

After pip install -e . the following CLI commands are on PATH: asr-qwen3, asr-breeze, asr-polish, asr-diarize, asr-bench, asr-cer.

Testing & contributing

# Most fast-tier tests run without model load (~2-3 s); a small subset
# requires local audio fixtures and pytest.skip()s gracefully when absent.
pytest -m fast

# Breeze contract tests (NEVER allowed to fail)
pytest -m breeze_invariant

# Full suite (some need VAD load)
pytest

The toolkit follows strict TDD. Any contribution must:

Have a failing test that now passes
Keep all 109 existing tests green
Pass pytest -m breeze_invariant (which locks MediaTek-Research/Breeze-ASR-25 as Breeze's model)
Keep Traditional Chinese (Taiwan) output

See CONTRIBUTING.md for the full guide.

vs alternatives

Capability	This toolkit	`whisperX`	`faster-whisper` (raw)	`openai/whisper`
Taiwan Traditional Chinese by default	s2twp baked-in
Two SOTA Mandarin models compared	Qwen3 + Breeze	Whisper only	Whisper only	Whisper only
Fixes `圓三/延三 → 研三` proper-noun ASR errors	glossary hot-word		manual prompt
LLM context polish with proper-noun protection	Qwen3-8B + glossary
Speaker diarization (open-mirror fallback)	tensorlake mirror	pyannote (gated)
RTX 5090 / Blackwell native (bf16 + cuDNN-SDPA)
TDD with model-invariant lock	109 tests
Best RTF on long Mandarin audio	1554x	~70x	~250x	~30x

Citation & credits

This toolkit is integration plumbing — credit goes to the model authors:

MediaTek-Research/Breeze-ASR-25 (HuggingFace) — Whisper-Large-v2 fine-tune for Taiwan Mandarin
Alibaba/Tongyi Qwen3-ASR-1.7B (HuggingFace) — multilingual ASR with ForcedAligner
OpenAI Whisper — base architecture for Breeze-ASR-25
CTranslate2 / faster-whisper — Whisper inference engine
pyannote/audio — speaker diarization
OpenCC — simplified-traditional Chinese conversion (s2twp recipe)
Silero VAD — fast voice activity detection

If this toolkit helps your research, please cite the underlying models. A toolkit-level citation is fine but optional:

@software{taiwan_asr_toolkit,
  title = {Taiwan ASR Toolkit: Production-grade Traditional Chinese Speech-to-Text Pipeline},
  author = {Taiwan ASR Toolkit Contributors},
  year   = {2026},
  url    = {https://github.com/thc1006/taiwan-asr-toolkit},
  note   = {Qwen3-ASR-1.7B + MediaTek Breeze-ASR-25 with hot-word injection, LLM polish, and speaker diarization}
}

License

MIT for this toolkit's code. See LICENSE.

Third-party model licenses (you must comply with each):

Qwen models: Apache 2.0
Breeze-ASR-25: Apache 2.0
Silero VAD: MIT
pyannote: gated, requires accepting the license on Hugging Face

Made with bf16 tensor cores in Taiwan

If this toolkit saved you hours, drop a star — it helps more people find it.

Release history

0.5.5 (this version): May 7, 2026
0.5.4: May 7, 2026
0.5.3: May 7, 2026
