Unified ASR inference framework — multi-backend, optimized for consumer GPUs
Project description
SonicScribe
SonicScribe is a unified speech-to-text framework that picks the best ASR model for your hardware and gives you one consistent API. It wraps five production backends — from lightweight CPU models to the fastest GPU engines — so you don't have to choose between speed, accuracy, and ease of use.
On a single consumer GPU (RTX 4070 Ti), SonicScribe transcribes audio up to 53x faster than real-time with 0.67% WER on LibriSpeech — better accuracy and speed than any single open-source ASR tool available today.
Benchmark
Comparison against popular ASR tools
All numbers measured on the same 30-utterance LibriSpeech val.clean slice (131.8s of real English speech), single RTX 4070 Ti GPU, 16-bit precision where applicable.
| Tool / model | WER | Speed | Notes |
|---|---|---|---|
| SonicScribe (Parakeet TDT v3, CUDA) | 0.67% | 53.4x RT | best WER + fastest |
| SonicScribe (Parakeet TDT v3, CPU) | 0.67% | 19.9x RT | no GPU needed |
| SonicScribe (Moonshine/base, CPU) | 1.34% | 20.9x RT | MIT, 61M params |
| SonicScribe (Qwen3-ASR-0.6B, CUDA) | 1.57% | 14.1x RT | 30+ languages |
| faster-whisper large-v3 (fp16 CUDA) | 1.79% | 11.4x RT | via SonicScribe whisper backend |
| HF Whisper large-v2 (fp16 CUDA) | 2.01% | 13.6x RT | via SonicScribe hf_whisper backend |
| openai/whisper large-v2 (fp16 CUDA)* | ~3% | ~4x RT | reference from upstream benchmarks |
* openai/whisper numbers are approximate from published benchmarks on comparable hardware; not measured on our test slice.
Batched transcription (native)
| Backend | Batch N=8 | Batch N=32 | Speedup over sequential |
|---|---|---|---|
| Qwen3-ASR-0.6B | 0.65s | 2.55s | 4.4x |
| HF Whisper-small | 0.53s | 1.86s | 2.3x |
| Parakeet TDT v3 (CPU) | via onnx-asr list input | — | native batch |
CPU-only (no GPU required)
| Tool | WER | Speed | Model size |
|---|---|---|---|
| SonicScribe (Moonshine/tiny) | 3.13% | 39.3x RT | 27M |
| SonicScribe (Moonshine/base) | 1.34% | 20.9x RT | 61M |
| SonicScribe (Parakeet, CPU EP) | 0.67% | 19.9x RT | 600M |
| faster-whisper large-v3 (int8) | 1.57% | 1.2x RT | 1.5B |
Moonshine/tiny on CPU is 33x faster than faster-whisper int8 on CPU with comparable accuracy.
Multi-dataset evaluation (8 benchmarks × 5 backends)
The numbers above use a single 30-utterance LibriSpeech slice. The table below extends evaluation to 8 diverse benchmarks: clean English, three multilingual splits (French / Spanish / German), accented business English, long-form (15-min) talks, dialectal English, and noisy multi-speaker meetings. 200 utterances for short-form, 20 utterances (~5h audio) for TED-LIUM long-form, and 10 utterances (~7h) for CORAAL — same RTX 4070 Ti, same jiwer transform across all cells.
WER (%):
| Dataset (domain) | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine | best |
|---|---|---|---|---|---|---|
| LibriSpeech clean (en, audiobook) | 3.80% | 2.57% | 6.32% | 3.06% | 3.52% | qwen |
| VoxPopuli FR (fr, parliament) | 11.96% | 14.35% | 12.74% | 14.98% | N/A | parakeet |
| VoxPopuli ES (es, parliament) | 8.11% | 10.08% | 8.38% | 9.15% | N/A | parakeet |
| MLS DE (de, audiobook) | 9.93% | 6.73% | 6.36% | 3.98% | N/A | hf-whisper |
| Earnings-22 (en, accented finance) | 20.48% | 16.92% | 15.35% | 17.50% | 31.78% | whisper-CT2 |
| AMI IHM (en, noisy meetings) | 31.86% | 16.73% | 22.09% | 23.21% | 59.87% | qwen |
| TED-LIUM long-form (en, 15-min talks) | 8.63% | 93.07%* | 7.31% | 25.52% | N/A | whisper-CT2 |
| CORAAL Atlanta (en, dialect, 42-min interviews) | 22.94% | ERR* | 24.76% | 48.50% | N/A | parakeet |
* Honest correction (round 2): My initial reading was that Qwen had a lower long-form ceiling than other backends. Follow-up debugging on real continuous TED audio showed Qwen works well at 5 min (RTFx 10.1×) and 10 min (RTFx 10.2×). The 93% WER and CORAAL hang are caused by two separate issues: (1) max_new_tokens=256 was silently truncating dense long transcripts (fixed in qwen/backend.py to 4096); (2) TED utt 1 is 20.8 min, just past qwen-asr's MAX_ASR_INPUT_SECONDS=1200, so the library auto-splits into two chunks and per-chunk decoder cost grows non-linearly. Manual sweep confirms degradation isn't a chunk-boundary artifact: WER rises monotonically from < 5% at 10 min to 33.88% at 15 min, 45.33% at 18 min, 53.48% at 19 min — driven by attention/KV-cache numerical drift on long Qwen3-ASR contexts.
I tried the qwen-asr vLLM backend as the alleged production fix. After upgrading torch 2.4 → 2.9.1, installing vllm 0.14.0, and validating 5-min audio (vLLM RTFx 16.5× vs transformers 10.1×), vLLM turned out to be slower on long audio: 10-min audio takes 246s on vLLM (RTFx 2.4×) vs 58s on transformers (RTFx 10.2×) — a 4.2× regression. Root cause: Qwen3-ASR is prefill-bound (13K audio tokens fed in one shot), but vLLM is engineered for decode-bound LLM serving (paged KV cache for autoregressive generation). The optimization target doesn't match. The vLLM backend wiring stays in the codebase (inference_backend="vllm") for future hardware/version upgrades, but the recommended production path for >10-min audio remains Parakeet or whisper-CT2, not Qwen.
RTFx (real-time factor):
| Dataset | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine |
|---|---|---|---|---|---|
| LibriSpeech clean | 62.3x | 16.0x | 13.3x | 17.0x | 22.1x |
| VoxPopuli FR | 72.1x | 14.1x | 15.3x | 19.2x | N/A |
| VoxPopuli ES | 75.9x | 15.2x | 16.9x | 20.5x | N/A |
| MLS DE | 92.0x | 17.3x | 19.5x | 23.6x | N/A |
| Earnings-22 | 56.3x | 17.1x | 14.9x | 17.6x | 23.1x |
| AMI IHM | 39.4x | 13.1x | 7.6x | 8.8x | 14.3x |
| TED-LIUM long-form | 97.6x | 56.0x | 12.0x | 24.0x | N/A |
| CORAAL Atlanta | 102.3x | ERR | 12.2x | 34.2x | N/A |
Three takeaways the headline number doesn't show:
- No single backend wins every category. Parakeet leads on multilingual + long-form + dialectal; qwen wins clean English and noisy meetings (16.73% WER on AMI is 5pp ahead of #2); HF Whisper takes German (3.98%); whisper-CT2 takes finance and TED long-form. Picking the right backend per domain matters.
- Long-form is genuinely hard. WER roughly doubles or triples on long-form vs short-form for the same backend. The Qwen rows on TED-LIUM and CORAAL look catastrophic, but see the footnote — those numbers reflect a pre-fix configuration plus a transformers-backend cost cliff at the 20-min chunk boundary, not a Qwen model limitation.
- Moonshine is English-only and short-form-only by design. N/A cells are intentional — Moonshine is the right choice for ≤30s English on CPU.
Reproduce: python benchmarks/bench_multi_dataset.py --all then --collect. Per-combo JSONs land in benchmarks/results/; runs are resumable.
Requirements
- Python 3.10 or greater
- No FFmpeg required (audio decoded via soundfile/numpy)
- GPU optional (all backends work on CPU; CUDA gives 2-50x speedup)
Installation
pip install sonic-scribe[parakeet] # Fastest + best WER (CC-BY-4.0)
pip install sonic-scribe[moonshine] # Best CPU option (MIT)
pip install sonic-scribe[whisper] # faster-whisper / CTranslate2 (MIT)
pip install sonic-scribe[qwen] # Multilingual champion (Apache-2.0)
pip install sonic-scribe[hf-whisper] # PyTorch Whisper, supports encoder hooks (MIT)
pip install sonic-scribe[all] # All backends
For GPU acceleration with Parakeet:
pip install sonic-scribe[parakeet-gpu] # includes CUDA 12 wheels
Usage
Simplest possible
import sonic_scribe
result = sonic_scribe.transcribe("meeting.wav")
print(result.text)
SonicScribe auto-detects installed backends and picks the best one.
Choose a backend
from sonic_scribe import Engine
engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("podcast.mp3", language="en")
print(result.text)
print(f"Duration: {result.duration:.1f}s")
for seg in result.segments:
print(f" [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")
Long-form audio (podcasts, lectures, meetings)
SonicScribe automatically handles audio of any length. Parakeet, faster-whisper, HF Whisper, and Qwen all support long-form transcription out of the box:
engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("2hour_lecture.mp3") # just works
print(f"{len(result.segments)} segments transcribed")
Word-level timestamps
result = engine.transcribe("audio.wav", word_timestamps=True)
for seg in result.segments:
for word in seg.words:
print(f"[{word.start:.2f}s → {word.end:.2f}s] {word.word}")
Hotwords / prompt injection
Guide the model with domain-specific vocabulary (whisper and hf_whisper backends):
result = engine.transcribe("meeting.wav", initial_prompt="SonicScribe, PyTorch, RTFx")
Batch transcription
Process multiple files in one GPU pass for maximum throughput:
files = ["clip1.wav", "clip2.wav", "clip3.wav", "clip4.wav"]
results = engine.transcribe_batch(files, language="en")
# Qwen: 4.4x faster than sequential
# HF Whisper: 2.3x faster than sequential
Parallel file processing
Process many files across threads (different from GPU batching):
results = engine.transcribe_files(["a.wav", "b.wav", "c.wav", "d.wav"], workers=4)
Speaker diarization (who said what)
Add speaker labels to transcribed segments via pyannote.audio 4.0 as a post-process pipeline stage. Works on top of any backend; turns plain transcription into speaker-attributed dialogue:
from sonic_scribe import Engine
from sonic_scribe.optimizations import DiarizationStage, DiarizationConfig, Pipeline
pipeline = Pipeline(stages=[
DiarizationStage(DiarizationConfig(num_speakers=2, hf_token="hf_xxx")),
])
engine = Engine(backend="parakeet", device="cuda", pipeline=pipeline)
result = engine.transcribe("meeting.wav")
for seg in result.segments:
print(f"[{seg.speaker_id}] {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
# Output:
# [SPEAKER_00] 0.5s-2.1s: Welcome to the meeting.
# [SPEAKER_01] 2.4s-4.0s: Thanks for having me.
Install with pip install "sonic-scribe[diarization]". The pyannote/speaker-diarization-community-1 model is gated; accept the license at https://huggingface.co/pyannote/speaker-diarization-community-1 and pass an HF read token via hf_token= (or set HF_TOKEN in the environment).
Uses pyannote 4.0's exclusive diarization timeline (one speaker per frame) by default for clean ASR alignment. Set use_exclusive=False in DiarizationConfig to expose overlapping-speech turns instead.
Punctuation restoration + truecasing
ASR output is typically lowercased and unpunctuated. Add a PunctuationStage to fix this automatically (47 languages via punctuators ONNX model):
from sonic_scribe.optimizations import Pipeline, PunctuationStage
pipeline = Pipeline(stages=[PunctuationStage()])
engine = Engine(backend="moonshine", pipeline=pipeline)
result = engine.transcribe("audio.wav")
# Input: "hello world how are you"
# Output: "Hello world, how are you?"
Install with pip install "sonic-scribe[punctuation]".
Export to SRT / VTT / JSON
result = engine.transcribe("lecture.wav")
print(result.to_srt()) # SRT subtitle format
print(result.to_vtt()) # WebVTT subtitle format
print(result.to_json()) # structured JSON
Async streaming (real-time)
Get partial transcripts as audio comes in — sub-500ms first-partial latency:
import asyncio
from sonic_scribe.streaming import (
StreamingConfig, transcribe_stream,
PartialTranscript, FinalSegment,
)
async def live_transcribe(audio_frames):
engine = Engine(backend="hf_whisper", device="cuda")
backend = engine.backend
config = StreamingConfig(chunk_seconds=4.0)
async for event in transcribe_stream(audio_frames, backend, config):
if isinstance(event, PartialTranscript):
print(f"... {event.text}", end="\r")
elif isinstance(event, FinalSegment):
print(f"✓ {event.segment.text}")
CLI
# Transcribe a single file
sonic-scribe transcribe interview.wav -b parakeet -d cuda
# Multiple files as JSON
sonic-scribe transcribe a.wav b.wav c.wav --json
# SRT / WebVTT subtitle output
sonic-scribe transcribe lecture.wav --srt > lecture.srt
sonic-scribe transcribe lecture.wav --vtt > lecture.vtt
# Word timestamps in JSON output
sonic-scribe transcribe audio.wav --word-timestamps --json
# List installed backends
sonic-scribe info
Encoder optimizations
SonicScribe includes two novel encoder compression algorithms that speed up PyTorch-based backends without retraining:
Token Merging (loss-less 1.4x speedup)
from sonic_scribe.optimizations.tome import AudioToMeStack, ToMeConfig
stack = AudioToMeStack(ToMeConfig(sim_threshold=0.95, mode="shrink")).apply(model)
# Encoder runs 1.4x faster, WER unchanged
LiteASR low-rank compression
# Offline: compress encoder weights
python tools/liteasr_compress.py --model openai/whisper-small --out factors.npz
# Runtime: swap in compressed weights (19% fewer encoder params)
from sonic_scribe.optimizations.liteasr import apply_factors_to_encoder, load_factors
entries = load_factors("factors.npz")
apply_factors_to_encoder(model, entries)
Composable optimization pipeline
Stack multiple optimizations together. The Pipeline validates conflicts at construction time:
from sonic_scribe import Engine
from sonic_scribe.optimizations.pipeline import Pipeline, VADStage, EncoderToMeStage
pipeline = Pipeline(stages=[
VADStage(), # silence removal
EncoderToMeStage(sim_threshold=0.95, mode="shrink"), # token merging
])
engine = Engine(backend="hf_whisper", device="cuda", pipeline=pipeline)
result = engine.transcribe("lecture.wav", language="en")
VAD backend selection
Three voice activity detection backends, switchable at pipeline level:
| VAD | F1 (FLEURS-102) | Latency | License |
|---|---|---|---|
| Silero (default) | 95.95% | Good | MIT |
| FireRedVAD | 97.57% | Good | Apache 2.0 |
| TEN-VAD | 95.19% | <10ms | Apache 2.0 + Agora restrictions |
from sonic_scribe.optimizations.vad import load_vad
vad = load_vad("firered") # or "silero", "ten"
pipeline = Pipeline(stages=[VADStage(vad=vad)])
Install with pip install "sonic-scribe[firered-vad]" or pip install "sonic-scribe[ten-vad]".
Available backends
| Backend | Best model | LibriSpeech WER | Speed | License |
|---|---|---|---|---|
parakeet |
Parakeet TDT v3 (0.6B) | 0.67% | 53.4x CUDA | CC-BY-4.0 |
moonshine |
moonshine/base (61M) | 1.34% | 20.9x CPU | MIT |
qwen |
Qwen3-ASR-0.6B | 1.57% | 14.1x CUDA | Apache-2.0 |
whisper |
large-v3 (CT2 int8) | 1.57% | 1.2x CPU | MIT |
hf_whisper |
whisper-large-v2 | 2.01% | 16.5x CUDA | MIT |
Reproducing benchmarks
Every number in this README is reproducible:
python benchmarks/bench_multi_dataset.py --all # 8 datasets × 5 backends, ~6h wall
python benchmarks/bench_multi_dataset.py --collect # aggregate to markdown
python benchmarks/bench_parakeet_cuda.py # Parakeet CUDA vs CPU
python benchmarks/bench_moonshine_whisper_real.py # Moonshine + faster-whisper
python benchmarks/bench_batch.py # Batch throughput comparison
python benchmarks/bench_tome_e2e.py # ToMe speedup sweep
All benchmarks use real audio (auto-downloaded via HuggingFace datasets). The multi-dataset runner writes per-combo JSON to benchmarks/results/ so interrupted runs resume cleanly.
Examples
Runnable scripts in examples/:
python examples/quickstart.py # 3-line transcription
python examples/multi_backend_compare.py # compare all installed backends
python examples/streaming_demo.py audio.wav # async streaming from file
python examples/optimizations_demo.py audio.wav # VAD + ToMe pipeline
Development
git clone https://github.com/SeaL773/SonicScribe.git
cd SonicScribe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,moonshine,whisper,hf-whisper]"
pytest -q # 330 tests
ruff check src/ tests/ benchmarks/ tools/ # zero warnings
CUDA-gated tests skip cleanly without a GPU. See CONTRIBUTING.md for project conventions.
License
MIT. Individual backend models have their own licenses (see table above).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sonicscribe_asr-0.1.0.tar.gz.
File metadata
- Download URL: sonicscribe_asr-0.1.0.tar.gz
- Upload date:
- Size: 265.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e4e1765dac7b26819f5e5f24ebd8cb75f17479e8fd94829a1514e1752903e75f
|
|
| MD5 |
1327a3173810f565eb6d4b25cf782c61
|
|
| BLAKE2b-256 |
6d4c8a494ef0557be3239c36245d344c4205088a31abae60c1c6cdc8653ee5b0
|
File details
Details for the file sonicscribe_asr-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sonicscribe_asr-0.1.0-py3-none-any.whl
- Upload date:
- Size: 80.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1fcfea708f496f0744b2c27ccf5a254407f44872b7f4101104baff46ceb1911d
|
|
| MD5 |
6c60b5a9c1b858aef70ce2374fdad410
|
|
| BLAKE2b-256 |
33c4c6f3e43a0d79ac958a6fd45ab1bc0b4799493106722361905ad77d97f3e4
|