
Unified ASR inference framework — multi-backend, optimized for consumer GPUs


SonicScribe


SonicScribe is a unified speech-to-text framework that picks the best ASR model for your hardware and gives you one consistent API. It wraps five production backends — from lightweight CPU models to the fastest GPU engines — so you don't have to choose between speed, accuracy, and ease of use.

On a single consumer GPU (RTX 4070 Ti), SonicScribe transcribes audio up to 53x faster than real time at 0.67% WER on a 30-utterance LibriSpeech slice, the best accuracy and speed among the tools compared in the benchmark below.

Benchmark

Comparison against popular ASR tools

All numbers measured on the same 30-utterance LibriSpeech val.clean slice (131.8s of real English speech), single RTX 4070 Ti GPU, 16-bit precision where applicable.

| Tool / model | WER | Speed | Notes |
|---|---|---|---|
| SonicScribe (Parakeet TDT v3, CUDA) | 0.67% | 53.4x RT | best WER + fastest |
| SonicScribe (Parakeet TDT v3, CPU) | 0.67% | 19.9x RT | no GPU needed |
| SonicScribe (Moonshine/base, CPU) | 1.34% | 20.9x RT | MIT, 61M params |
| SonicScribe (Qwen3-ASR-0.6B, CUDA) | 1.57% | 14.1x RT | 30+ languages |
| faster-whisper large-v3 (fp16 CUDA) | 1.79% | 11.4x RT | via SonicScribe whisper backend |
| HF Whisper large-v2 (fp16 CUDA) | 2.01% | 13.6x RT | via SonicScribe hf_whisper backend |
| openai/whisper large-v2 (fp16 CUDA)* | ~3% | ~4x RT | reference from upstream benchmarks |

* openai/whisper numbers are approximate from published benchmarks on comparable hardware; not measured on our test slice.
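
For readers unfamiliar with the WER column, the metric is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a minimal standard-library illustration, not the scoring code used for the numbers above; the function name `word_error_rate` and the plain lowercase-and-split normalization are assumptions made for this example and may differ from the normalization behind the reported figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over lazy dog"
    # One substitution and one deletion against nine reference words.
    print(f"WER = {word_error_rate(ref, hyp):.2%}")
```

On a 30-utterance slice the denominator is small, so a single misrecognized word moves the percentage noticeably.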

Batched transcription (native)

| Backend | Batch N=8 | Batch N=32 | Speedup over sequential |
|---|---|---|---|
| Qwen3-ASR-0.6B | 0.65s | 2.55s | 4.4x |
| HF Whisper-small | 0.53s | 1.86s | 2.3x |
| Parakeet TDT v3 (CPU) | via onnx-asr | list input | native batch |

CPU-only (no GPU required)

| Tool | WER | Speed | Model size |
|---|---|---|---|
| SonicScribe (Moonshine/tiny) | 3.13% | 39.3x RT | 27M |
| SonicScribe (Moonshine/base) | 1.34% | 20.9x RT | 61M |
| SonicScribe (Parakeet, CPU EP) | 0.67% | 19.9x RT | 600M |
| faster-whisper large-v3 (int8) | 1.57% | 1.2x RT | 1.5B |

Moonshine/tiny on CPU is about 33x faster than faster-whisper int8 on CPU (39.3x vs 1.2x RT), at roughly twice the WER (3.13% vs 1.57%).

Multi-dataset evaluation (8 benchmarks × 5 backends)

The numbers above use a single 30-utterance LibriSpeech slice. The table below extends evaluation to 8 diverse benchmarks: clean English, three multilingual splits (French / Spanish / German), accented business English, long-form (15-min) talks, dialectal English, and noisy multi-speaker meetings. 200 utterances for short-form, 20 utterances (~5h audio) for TED-LIUM long-form, and 10 utterances (~7h) for CORAAL — same RTX 4070 Ti, same jiwer transform across all cells.

WER (%):

| Dataset (domain) | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine | best |
|---|---|---|---|---|---|---|
| LibriSpeech clean (en, audiobook) | 3.80% | 2.57% | 6.32% | 3.06% | 3.52% | qwen |
| VoxPopuli FR (fr, parliament) | 11.96% | 14.35% | 12.74% | 14.98% | N/A | parakeet |
| VoxPopuli ES (es, parliament) | 8.11% | 10.08% | 8.38% | 9.15% | N/A | parakeet |
| MLS DE (de, audiobook) | 9.93% | 6.73% | 6.36% | 3.98% | N/A | hf-whisper |
| Earnings-22 (en, accented finance) | 20.48% | 16.92% | 15.35% | 17.50% | 31.78% | whisper-CT2 |
| AMI IHM (en, noisy meetings) | 31.86% | 16.73% | 22.09% | 23.21% | 59.87% | qwen |
| TED-LIUM long-form (en, 15-min talks) | 8.63% | 93.07%* | 7.31% | 25.52% | N/A | whisper-CT2 |
| CORAAL Atlanta (en, dialect, 42-min interviews) | 22.94% | ERR* | 24.76% | 48.50% | N/A | parakeet |

* Honest correction (round 2): My initial reading was that Qwen had a lower long-form ceiling than the other backends. Follow-up debugging on real continuous TED audio showed that Qwen works well at 5 min (RTFx 10.1×) and 10 min (RTFx 10.2×). The 93% WER and the CORAAL hang have two separate causes:

  (1) max_new_tokens=256 was silently truncating dense long transcripts (fixed in qwen/backend.py by raising it to 4096);
  (2) TED utt 1 is 20.8 min, just past qwen-asr's MAX_ASR_INPUT_SECONDS=1200, so the library auto-splits it into two chunks and per-chunk decoder cost grows non-linearly.

A manual sweep confirms that the degradation is not a chunk-boundary artifact: WER rises monotonically from < 5% at 10 min to 33.88% at 15 min, 45.33% at 18 min, and 53.48% at 19 min, driven by attention/KV-cache numerical drift on long Qwen3-ASR contexts.

I tried the qwen-asr vLLM backend as the alleged production fix. After upgrading torch 2.4 → 2.9.1, installing vllm 0.14.0, and validating 5-min audio (vLLM RTFx 16.5× vs transformers 10.1×), vLLM turned out to be slower on long audio: 10-min audio takes 246s on vLLM (RTFx 2.4×) vs 58s on transformers (RTFx 10.2×) — a 4.2× regression. Root cause: Qwen3-ASR is prefill-bound (13K audio tokens fed in one shot), but vLLM is engineered for decode-bound LLM serving (paged KV cache for autoregressive generation). The optimization target doesn't match. The vLLM backend wiring stays in the codebase (inference_backend="vllm") for future hardware/version upgrades, but the recommended production path for >10-min audio remains Parakeet or whisper-CT2, not Qwen.
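
The RTFx figures quoted in this note are seconds of audio divided by seconds of wall time, so higher is faster. The sketch below recomputes them; the helper name `rtfx` is hypothetical, and it assumes the clip is exactly 600 s long, so the results agree with the quoted figures only up to rounding.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per wall-clock second."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


if __name__ == "__main__":
    audio = 10 * 60          # assumed: a clip of exactly 10 minutes
    vllm_wall = 246          # seconds, from the note above
    transformers_wall = 58   # seconds, from the note above

    print(f"vLLM RTFx:         {rtfx(audio, vllm_wall):.1f}x")
    print(f"transformers RTFx: {rtfx(audio, transformers_wall):.1f}x")
    # The regression is the ratio of wall times on the same clip.
    print(f"slowdown:          {vllm_wall / transformers_wall:.1f}x")
```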

RTFx (inverse real-time factor: seconds of audio transcribed per second of wall time; higher is faster):

| Dataset | parakeet | qwen | whisper-CT2 | hf-whisper | moonshine |
|---|---|---|---|---|---|
| LibriSpeech clean | 62.3x | 16.0x | 13.3x | 17.0x | 22.1x |
| VoxPopuli FR | 72.1x | 14.1x | 15.3x | 19.2x | N/A |
| VoxPopuli ES | 75.9x | 15.2x | 16.9x | 20.5x | N/A |
| MLS DE | 92.0x | 17.3x | 19.5x | 23.6x | N/A |
| Earnings-22 | 56.3x | 17.1x | 14.9x | 17.6x | 23.1x |
| AMI IHM | 39.4x | 13.1x | 7.6x | 8.8x | 14.3x |
| TED-LIUM long-form | 97.6x | 56.0x | 12.0x | 24.0x | N/A |
| CORAAL Atlanta | 102.3x | ERR | 12.2x | 34.2x | N/A |

Three takeaways the headline number doesn't show:

  1. No single backend wins every category. Parakeet leads on French, Spanish, and dialectal long-form (CORAAL); qwen wins clean English and noisy meetings (16.73% WER on AMI is about 5pp ahead of #2); HF Whisper takes German (3.98%); whisper-CT2 takes finance and TED long-form. Picking the right backend per domain matters.
  2. Long-form is genuinely hard. For most backends, WER on long-form audio is a multiple of the LibriSpeech figure (Parakeet: 3.80% on LibriSpeech vs 8.63% on TED-LIUM and 22.94% on CORAAL; HF Whisper: 3.06% vs 25.52% and 48.50%). The Qwen rows on TED-LIUM and CORAAL look catastrophic, but see the footnote: those numbers reflect a pre-fix configuration (max_new_tokens=256) plus degradation beyond roughly 10 minutes of continuous audio. Qwen works well at 5 and 10 minutes.
  3. Moonshine is English-only and short-form-only by design. N/A cells are intentional — Moonshine is the right choice for ≤30s English on CPU.
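
The "best" column and the first takeaway can be rechecked mechanically from the WER table. The sketch below copies the table values into a dictionary (with `None` for N/A and ERR cells) and picks the lowest measured WER per dataset; the names `WER` and `best_backend` are made up for this illustration and are not part of SonicScribe.

```python
from collections import Counter
from typing import Dict, Optional

# WER (%) copied from the multi-dataset table; None marks N/A or ERR cells.
WER: Dict[str, Dict[str, Optional[float]]] = {
    "LibriSpeech clean": {"parakeet": 3.80, "qwen": 2.57, "whisper-CT2": 6.32, "hf-whisper": 3.06, "moonshine": 3.52},
    "VoxPopuli FR": {"parakeet": 11.96, "qwen": 14.35, "whisper-CT2": 12.74, "hf-whisper": 14.98, "moonshine": None},
    "VoxPopuli ES": {"parakeet": 8.11, "qwen": 10.08, "whisper-CT2": 8.38, "hf-whisper": 9.15, "moonshine": None},
    "MLS DE": {"parakeet": 9.93, "qwen": 6.73, "whisper-CT2": 6.36, "hf-whisper": 3.98, "moonshine": None},
    "Earnings-22": {"parakeet": 20.48, "qwen": 16.92, "whisper-CT2": 15.35, "hf-whisper": 17.50, "moonshine": 31.78},
    "AMI IHM": {"parakeet": 31.86, "qwen": 16.73, "whisper-CT2": 22.09, "hf-whisper": 23.21, "moonshine": 59.87},
    "TED-LIUM long-form": {"parakeet": 8.63, "qwen": 93.07, "whisper-CT2": 7.31, "hf-whisper": 25.52, "moonshine": None},
    "CORAAL Atlanta": {"parakeet": 22.94, "qwen": None, "whisper-CT2": 24.76, "hf-whisper": 48.50, "moonshine": None},
}


def best_backend(scores: Dict[str, Optional[float]]) -> str:
    """Return the backend with the lowest measured WER, skipping missing cells."""
    measured = {name: wer for name, wer in scores.items() if wer is not None}
    return min(measured, key=measured.get)


# Count how many datasets each backend wins.
wins = Counter(best_backend(row) for row in WER.values())

if __name__ == "__main__":
    for dataset, row in WER.items():
        print(f"{dataset:20s} -> {best_backend(row)}")
    print(dict(wins))
```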

Reproduce: python benchmarks/bench_multi_dataset.py --all then --collect. Per-combo JSONs land in benchmarks/results/; runs are resumable.

Requirements

  • Python 3.10 or greater
  • No FFmpeg required (audio decoded via soundfile/numpy)
  • GPU optional (all backends work on CPU; CUDA gives 2-50x speedup)

Installation

pip install "sonic-scribe[parakeet]"      # Fastest + best WER (CC-BY-4.0)
pip install "sonic-scribe[moonshine]"     # Best CPU option (MIT)
pip install "sonic-scribe[whisper]"       # faster-whisper / CTranslate2 (MIT)
pip install "sonic-scribe[qwen]"          # Multilingual champion (Apache-2.0)
pip install "sonic-scribe[hf-whisper]"    # PyTorch Whisper, supports encoder hooks (MIT)
pip install "sonic-scribe[all]"           # All backends

For GPU acceleration with Parakeet:

pip install "sonic-scribe[parakeet-gpu]"  # includes CUDA 12 wheels

Usage

Simplest possible

import sonic_scribe

result = sonic_scribe.transcribe("meeting.wav")
print(result.text)

SonicScribe auto-detects installed backends and picks the best one.

Choose a backend

from sonic_scribe import Engine

engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("podcast.mp3", language="en")

print(result.text)
print(f"Duration: {result.duration:.1f}s")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")

Long-form audio (podcasts, lectures, meetings)

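SonicScribe accepts audio of any length. Parakeet, faster-whisper, HF Whisper, and Qwen all support long-form transcription out of the box, but per the multi-dataset benchmark above, Parakeet or whisper-CT2 is the recommended path for audio longer than about 10 minutes: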

engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("2hour_lecture.mp3")  # just works
print(f"{len(result.segments)} segments transcribed")

Word-level timestamps

result = engine.transcribe("audio.wav", word_timestamps=True)
for seg in result.segments:
    for word in seg.words:
        print(f"[{word.start:.2f}s → {word.end:.2f}s] {word.word}")

Hotwords / initial prompt

Guide the model with domain-specific vocabulary (whisper and hf_whisper backends):

result = engine.transcribe("meeting.wav", initial_prompt="SonicScribe, PyTorch, RTFx")

Batch transcription

Process multiple files in one GPU pass for maximum throughput:

files = ["clip1.wav", "clip2.wav", "clip3.wav", "clip4.wav"]
results = engine.transcribe_batch(files, language="en")
# Qwen: 4.4x faster than sequential
# HF Whisper: 2.3x faster than sequential

Parallel file processing

Process many files across threads (different from GPU batching):

results = engine.transcribe_files(["a.wav", "b.wav", "c.wav", "d.wav"], workers=4)
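
The internals of transcribe_files are not shown here, but the general pattern of fanning blocking per-file calls out to a thread pool can be sketched with the standard library. Everything below is illustrative: `fake_transcribe` is a stand-in for a real transcription call and `transcribe_many` is a hypothetical helper, not SonicScribe code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_transcribe(path: str) -> str:
    """Stand-in for a blocking per-file transcription call (hypothetical)."""
    time.sleep(0.05)  # simulate I/O or inference time
    return f"transcript of {path}"


def transcribe_many(
    paths: List[str],
    workers: int = 4,
    transcribe: Callable[[str], str] = fake_transcribe,
) -> List[str]:
    """Run `transcribe` over `paths` on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(transcribe, paths))


if __name__ == "__main__":
    for line in transcribe_many(["a.wav", "b.wav", "c.wav", "d.wav"]):
        print(line)
```

Threads help when the per-file call releases the GIL (file decoding, native inference runtimes); GPU batching, by contrast, packs several clips into one forward pass.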

Speaker diarization (who said what)

Add speaker labels to transcribed segments via pyannote.audio 4.0 as a post-process pipeline stage. Works on top of any backend; turns plain transcription into speaker-attributed dialogue:

from sonic_scribe import Engine
from sonic_scribe.optimizations import DiarizationStage, DiarizationConfig, Pipeline

pipeline = Pipeline(stages=[
    DiarizationStage(DiarizationConfig(num_speakers=2, hf_token="hf_xxx")),
])
engine = Engine(backend="parakeet", device="cuda", pipeline=pipeline)
result = engine.transcribe("meeting.wav")

for seg in result.segments:
    print(f"[{seg.speaker_id}] {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
# Output:
# [SPEAKER_00] 0.5s-2.1s: Welcome to the meeting.
# [SPEAKER_01] 2.4s-4.0s: Thanks for having me.

Install with pip install "sonic-scribe[diarization]". The pyannote/speaker-diarization-community-1 model is gated; accept the license at https://huggingface.co/pyannote/speaker-diarization-community-1 and pass an HF read token via hf_token= (or set HF_TOKEN in the environment).

Uses pyannote 4.0's exclusive diarization timeline (one speaker per frame) by default for clean ASR alignment. Set use_exclusive=False in DiarizationConfig to expose overlapping-speech turns instead.
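
One common way to attach speaker labels to ASR segments, given a one-speaker-per-frame timeline, is to give each segment the speaker whose turn overlaps it the most. The sketch below shows that idea under stated assumptions: the `Turn` and `Segment` classes, the `assign_speakers` helper, and the turn boundaries are all invented for illustration, and DiarizationStage may align segments differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """A diarization turn: one speaker active from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """An ASR segment; speaker_id is filled in by assign_speakers."""
    start: float
    end: float
    text: str
    speaker_id: Optional[str] = None


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Label each segment with the speaker of the turn it overlaps most."""
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
            default=None,
        )
        # Leave speaker_id unset when no turn overlaps the segment at all.
        if best is not None and overlap(seg.start, seg.end, best.start, best.end) > 0:
            seg.speaker_id = best.speaker
    return segments


if __name__ == "__main__":
    turns = [Turn(0.0, 2.2, "SPEAKER_00"), Turn(2.2, 4.5, "SPEAKER_01")]
    segments = [
        Segment(0.5, 2.1, "Welcome to the meeting."),
        Segment(2.4, 4.0, "Thanks for having me."),
    ]
    for seg in assign_speakers(segments, turns):
        print(f"[{seg.speaker_id}] {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
```

With non-overlapping turns, as in an exclusive timeline, each instant belongs to at most one speaker, which keeps the maximum-overlap choice unambiguous in most cases.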

Punctuation restoration + truecasing

ASR output is typically lowercased and unpunctuated. Add a PunctuationStage to fix this automatically (47 languages via punctuators ONNX model):

from sonic_scribe import Engine
from sonic_scribe.optimizations import Pipeline, PunctuationStage

pipeline = Pipeline(stages=[PunctuationStage()])
engine = Engine(backend="moonshine", pipeline=pipeline)
result = engine.transcribe("audio.wav")
# Input:  "hello world how are you"
# Output: "Hello world, how are you?"

Install with pip install "sonic-scribe[punctuation]".

Export to SRT / VTT / JSON

result = engine.transcribe("lecture.wav")
print(result.to_srt())   # SRT subtitle format
print(result.to_vtt())   # WebVTT subtitle format
print(result.to_json())  # structured JSON
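
For reference, SRT is a plain-text format: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the caption text, and a blank line between cues. The sketch below builds that layout from a list of segments. It is an independent illustration of the format, not the body of result.to_srt(); the `Segment` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Minimal stand-in for a transcribed segment (times in seconds)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as SRT cues, numbered from 1 and separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([
        Segment(0.5, 2.1, "Welcome to the meeting."),
        Segment(2.4, 4.0, "Thanks for having me."),
    ]))
```

WebVTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.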

Async streaming (real-time)

Get partial transcripts as audio comes in — sub-500ms first-partial latency:

import asyncio

from sonic_scribe import Engine
from sonic_scribe.streaming import (
    StreamingConfig, transcribe_stream,
    PartialTranscript, FinalSegment,
)

async def live_transcribe(audio_frames):
    engine = Engine(backend="hf_whisper", device="cuda")
    backend = engine.backend
    config = StreamingConfig(chunk_seconds=4.0)

    async for event in transcribe_stream(audio_frames, backend, config):
        if isinstance(event, PartialTranscript):
            print(f"... {event.text}", end="\r")
        elif isinstance(event, FinalSegment):
            print(f"✓ {event.segment.text}")
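
The example above leaves `audio_frames` undefined. One plausible shape is an async iterator that yields short, fixed-size frames of samples. The sketch below produces such an iterator from an in-memory list; the 16 kHz sample rate, the 0.5 s frame length, and the helper names are assumptions for illustration, and whether transcribe_stream expects exactly this shape is not confirmed here.

```python
import asyncio
from typing import AsyncIterator, List

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


async def frames_from_samples(
    samples: List[float], frame_seconds: float = 0.5
) -> AsyncIterator[List[float]]:
    """Yield consecutive fixed-size frames; the last frame may be shorter."""
    frame_len = int(SAMPLE_RATE * frame_seconds)
    for start in range(0, len(samples), frame_len):
        yield samples[start:start + frame_len]
        # Hand control back to the event loop so consumers can run between frames.
        await asyncio.sleep(0)


async def collect(frames: AsyncIterator[List[float]]) -> List[List[float]]:
    """Drain an async frame iterator into a list (useful for testing)."""
    return [frame async for frame in frames]


if __name__ == "__main__":
    silence = [0.0] * 20_000  # 1.25 s of silence at the assumed sample rate
    frames = asyncio.run(collect(frames_from_samples(silence)))
    print([len(frame) for frame in frames])
```

A live source such as a microphone callback would typically push frames into an asyncio.Queue and yield from that instead of slicing a list.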

CLI

# Transcribe a single file
sonic-scribe transcribe interview.wav -b parakeet -d cuda

# Multiple files as JSON
sonic-scribe transcribe a.wav b.wav c.wav --json

# SRT / WebVTT subtitle output
sonic-scribe transcribe lecture.wav --srt > lecture.srt
sonic-scribe transcribe lecture.wav --vtt > lecture.vtt

# Word timestamps in JSON output
sonic-scribe transcribe audio.wav --word-timestamps --json

# List installed backends
sonic-scribe info

Encoder optimizations

SonicScribe includes two encoder compression techniques that speed up PyTorch-based backends without retraining:

Token Merging (1.4x encoder speedup, WER unchanged in our benchmark)

from sonic_scribe.optimizations.tome import AudioToMeStack, ToMeConfig

stack = AudioToMeStack(ToMeConfig(sim_threshold=0.95, mode="shrink")).apply(model)
# Encoder runs 1.4x faster, WER unchanged
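
The intuition behind token merging is that neighbouring encoder frames are often nearly identical, so similar ones can be averaged into a single token and the sequence gets shorter. The sketch below is a deliberately simplified, standard-library illustration that greedily merges adjacent vectors whose cosine similarity reaches a threshold. It is not the AudioToMeStack algorithm; the function names are hypothetical, and reading mode="shrink" as "the sequence length decreases" is an assumption.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_adjacent(tokens: List[Vector], sim_threshold: float = 0.95) -> List[Vector]:
    """Greedily average each token into the previous merged token when similar enough."""
    merged: List[Vector] = []
    counts: List[int] = []  # how many original tokens each merged token represents
    for tok in tokens:
        if merged and cosine(merged[-1], tok) >= sim_threshold:
            n = counts[-1]
            # Running mean: the merged token stays the average of its members.
            merged[-1] = [(m * n + t) / (n + 1) for m, t in zip(merged[-1], tok)]
            counts[-1] = n + 1
        else:
            merged.append(list(tok))
            counts.append(1)
    return merged


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.0, 1.0]]
    shrunk = merge_adjacent(tokens, sim_threshold=0.95)
    print(f"{len(tokens)} tokens -> {len(shrunk)} tokens")
```

Fewer tokens mean less attention and feed-forward work in later encoder layers, which is where a speedup of this kind comes from.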

LiteASR low-rank compression

# Offline: compress encoder weights
python tools/liteasr_compress.py --model openai/whisper-small --out factors.npz

# Runtime: swap in compressed weights (19% fewer encoder params)
from sonic_scribe.optimizations.liteasr import apply_factors_to_encoder, load_factors
entries = load_factors("factors.npz")
apply_factors_to_encoder(model, entries)

Composable optimization pipeline

Stack multiple optimizations together. The Pipeline validates conflicts at construction time:

from sonic_scribe import Engine
from sonic_scribe.optimizations.pipeline import Pipeline, VADStage, EncoderToMeStage

pipeline = Pipeline(stages=[
    VADStage(),                                          # silence removal
    EncoderToMeStage(sim_threshold=0.95, mode="shrink"), # token merging
])
engine = Engine(backend="hf_whisper", device="cuda", pipeline=pipeline)
result = engine.transcribe("lecture.wav", language="en")

VAD backend selection

Three voice activity detection backends, switchable at pipeline level:

| VAD | F1 (FLEURS-102) | Latency | License |
|---|---|---|---|
| Silero (default) | 95.95% | Good | MIT |
| FireRedVAD | 97.57% | Good | Apache 2.0 |
| TEN-VAD | 95.19% | <10ms | Apache 2.0 + Agora restrictions |

from sonic_scribe.optimizations.pipeline import Pipeline, VADStage
from sonic_scribe.optimizations.vad import load_vad

vad = load_vad("firered")  # or "silero", "ten"
pipeline = Pipeline(stages=[VADStage(vad=vad)])

Install with pip install "sonic-scribe[firered-vad]" or pip install "sonic-scribe[ten-vad]".

Available backends

| Backend | Best model | LibriSpeech WER | Speed | License |
|---|---|---|---|---|
| parakeet | Parakeet TDT v3 (0.6B) | 0.67% | 53.4x CUDA | CC-BY-4.0 |
| moonshine | moonshine/base (61M) | 1.34% | 20.9x CPU | MIT |
| qwen | Qwen3-ASR-0.6B | 1.57% | 14.1x CUDA | Apache-2.0 |
| whisper | large-v3 (CT2 int8) | 1.57% | 1.2x CPU | MIT |
| hf_whisper | whisper-large-v2 | 2.01% | 13.6x CUDA | MIT |

Reproducing benchmarks

Every number in this README is reproducible:

python benchmarks/bench_multi_dataset.py --all     # 8 datasets × 5 backends, ~6h wall
python benchmarks/bench_multi_dataset.py --collect # aggregate to markdown
python benchmarks/bench_parakeet_cuda.py           # Parakeet CUDA vs CPU
python benchmarks/bench_moonshine_whisper_real.py  # Moonshine + faster-whisper
python benchmarks/bench_batch.py                   # Batch throughput comparison
python benchmarks/bench_tome_e2e.py                # ToMe speedup sweep

All benchmarks use real audio (auto-downloaded via HuggingFace datasets). The multi-dataset runner writes per-combo JSON to benchmarks/results/ so interrupted runs resume cleanly.

Examples

Runnable scripts in examples/:

python examples/quickstart.py                  # 3-line transcription
python examples/multi_backend_compare.py       # compare all installed backends
python examples/streaming_demo.py audio.wav    # async streaming from file
python examples/optimizations_demo.py audio.wav  # VAD + ToMe pipeline

Development

git clone https://github.com/SeaL773/SonicScribe.git
cd SonicScribe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,moonshine,whisper,hf-whisper]"

pytest -q                                          # 330 tests
ruff check src/ tests/ benchmarks/ tools/          # zero warnings

CUDA-gated tests skip cleanly without a GPU. See CONTRIBUTING.md for project conventions.

License

MIT. Individual backend models have their own licenses (see table above).
