Unified ASR inference framework — multi-backend, optimized for consumer GPUs

SonicScribe

SonicScribe is a unified speech-to-text framework that picks the best ASR model for your hardware and gives you one consistent API. It wraps five production backends — from lightweight CPU models to the fastest GPU engines — so you don't have to choose between speed, accuracy, and ease of use.

On a single consumer GPU (RTX 4070 Ti), SonicScribe transcribes audio up to 53x faster than real time with 0.67% WER on a 30-utterance LibriSpeech slice, the best accuracy and speed among the tools compared in the benchmark below.

Benchmark

Comparison against popular ASR tools

All numbers measured on the same 30-utterance LibriSpeech val.clean slice (131.8s of real English speech), single RTX 4070 Ti GPU, 16-bit precision where applicable.

Tool / model	WER	Speed	Notes
SonicScribe (Parakeet TDT v3, CUDA)	0.67%	53.4x RT	best WER + fastest
SonicScribe (Parakeet TDT v3, CPU)	0.67%	19.9x RT	no GPU needed
SonicScribe (Moonshine/base, CPU)	1.34%	20.9x RT	MIT, 61M params
SonicScribe (Qwen3-ASR-0.6B, CUDA)	1.57%	14.1x RT	30+ languages
faster-whisper large-v3 (fp16 CUDA)	1.79%	11.4x RT	via SonicScribe whisper backend
HF Whisper large-v2 (fp16 CUDA)	2.01%	13.6x RT	via SonicScribe hf_whisper backend
openai/whisper large-v2 (fp16 CUDA)*	~3%	~4x RT	reference from upstream benchmarks

* openai/whisper numbers are approximate from published benchmarks on comparable hardware; not measured on our test slice.
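For readers unfamiliar with the metric, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. The sketch below is a dependency-free illustration only: the function name `word_error_rate` is ours, and it skips the text normalization that a real evaluation applies before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words consumed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog"
    hyp = "the quick brown fox jumped over lazy dog"
    # One substitution (jumps -> jumped) and one deletion (the) over 9 reference words.
    print(f"WER = {word_error_rate(ref, hyp):.2%}")
```

A WER of 0.67% on this scale corresponds to roughly one word error per 150 reference words.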

Batched transcription (native)

Backend	Batch N=8	Batch N=32	Speedup over sequential
Qwen3-ASR-0.6B	0.65s	2.55s	4.4x
HF Whisper-small	0.53s	1.86s	2.3x
Parakeet TDT v3 (CPU)	via onnx-asr list input	—	native batch

CPU-only (no GPU required)

Tool	WER	Speed	Model size
SonicScribe (Moonshine/tiny)	3.13%	39.3x RT	27M
SonicScribe (Moonshine/base)	1.34%	20.9x RT	61M
SonicScribe (Parakeet, CPU EP)	0.67%	19.9x RT	600M
faster-whisper large-v3 (int8)	1.57%	1.2x RT	1.5B

Moonshine/tiny on CPU is about 33x faster than faster-whisper int8 on CPU (39.3x vs 1.2x RT), at roughly twice the WER (3.13% vs 1.57%).
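The comparison can be checked directly from the CPU-only table. The snippet below only redoes that arithmetic; the dictionary layout is ours and nothing here calls SonicScribe.

```python
# Speeds (x real-time) and WERs (%) copied from the CPU-only table.
moonshine_tiny = {"rtfx": 39.3, "wer": 3.13}
faster_whisper_int8 = {"rtfx": 1.2, "wer": 1.57}

# How many times faster Moonshine/tiny runs, and how many times higher its WER is.
speed_ratio = moonshine_tiny["rtfx"] / faster_whisper_int8["rtfx"]
wer_ratio = moonshine_tiny["wer"] / faster_whisper_int8["wer"]

print(f"speed ratio: {speed_ratio:.1f}x")
print(f"WER ratio:   {wer_ratio:.2f}x")
```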

Multi-dataset evaluation (8 benchmarks × 5 backends)

The numbers above use a single 30-utterance LibriSpeech slice. The tables below extend the evaluation to 8 benchmarks: clean English, three non-English sets (French, Spanish, German), accented business English, long-form (15-min) talks, dialectal English, and noisy multi-speaker meetings. Each short-form dataset uses 200 utterances, TED-LIUM long-form uses 20 utterances (~5h of audio), and CORAAL uses 10 utterances (~7h). All cells use the same RTX 4070 Ti and the same jiwer transform.

WER (%):

Dataset (domain)	parakeet	qwen	whisper-CT2	hf-whisper	moonshine	best
LibriSpeech clean (en, audiobook)	3.80%	2.57%	6.32%	3.06%	3.52%	qwen
VoxPopuli FR (fr, parliament)	11.96%	14.35%	12.74%	14.98%	N/A	parakeet
VoxPopuli ES (es, parliament)	8.11%	10.08%	8.38%	9.15%	N/A	parakeet
MLS DE (de, audiobook)	9.93%	6.73%	6.36%	3.98%	N/A	hf-whisper
Earnings-22 (en, accented finance)	20.48%	16.92%	15.35%	17.50%	31.78%	whisper-CT2
AMI IHM (en, noisy meetings)	31.86%	16.73%	22.09%	23.21%	59.87%	qwen
TED-LIUM long-form (en, 15-min talks)	8.63%	93.07%*	7.31%	25.52%	N/A	whisper-CT2
CORAAL Atlanta (en, dialect, 42-min interviews)	22.94%	ERR*	24.76%	48.50%	N/A	parakeet

* Correction (round 2): my initial reading was that Qwen had a lower long-form ceiling than the other backends. Follow-up debugging on real continuous TED audio showed that Qwen works well at 5 min (RTFx 10.1×) and 10 min (RTFx 10.2×). The 93% WER and the CORAAL hang have two separate causes:

(1) max_new_tokens=256 was silently truncating dense long transcripts (fixed in qwen/backend.py by raising it to 4096).
(2) TED utt 1 is 20.8 min, just past qwen-asr's MAX_ASR_INPUT_SECONDS=1200, so the library auto-splits it into two chunks and per-chunk decoder cost grows non-linearly.

A manual sweep shows that the degradation is not only a chunk-boundary artifact: WER rises monotonically from < 5% at 10 min to 33.88% at 15 min, 45.33% at 18 min, and 53.48% at 19 min, all below the 20-min split point. I attribute this to attention/KV-cache numerical drift on long Qwen3-ASR contexts.

I tried the qwen-asr vLLM backend as the suggested production fix. After upgrading torch 2.4 → 2.9.1, installing vllm 0.14.0, and validating on 5-min audio (vLLM RTFx 16.5× vs transformers 10.1×), vLLM turned out to be slower on long audio: 10-min audio takes 246s on vLLM (RTFx 2.4×) vs 58s on transformers (RTFx 10.2×), a 4.2× slowdown. Likely root cause: Qwen3-ASR is prefill-bound (13K audio tokens fed in one shot), whereas vLLM is engineered for decode-bound LLM serving (paged KV cache for autoregressive generation), so the optimization target doesn't match. The vLLM backend wiring stays in the codebase (inference_backend="vllm") for future hardware and version upgrades, but the recommended production path for >10-min audio remains Parakeet or whisper-CT2, not Qwen.
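To make the chunk-boundary point concrete, here is a minimal sketch of fixed-window splitting under the 1200-second limit quoted in the footnote. It is an illustration under assumptions: `fixed_chunks` is a hypothetical helper, and qwen-asr's real splitter may pick boundaries differently (for example at silences). It only shows why an utterance slightly over the limit becomes two chunks while a 10-minute clip stays whole.

```python
MAX_ASR_INPUT_SECONDS = 1200  # limit quoted in the footnote


def fixed_chunks(duration_s, max_chunk_s=MAX_ASR_INPUT_SECONDS):
    """Return (start, end) pairs that cover [0, duration_s] in windows of at most max_chunk_s."""
    if duration_s <= 0 or max_chunk_s <= 0:
        raise ValueError("duration_s and max_chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    print(fixed_chunks(1248))  # TED utt 1: 20.8 min = 1248 s, just over the limit
    print(fixed_chunks(600))   # a 10-minute clip fits in one window
```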

RTFx (inverse real-time factor: audio duration divided by processing time, so higher is faster):

Dataset	parakeet	qwen	whisper-CT2	hf-whisper	moonshine
LibriSpeech clean	62.3x	16.0x	13.3x	17.0x	22.1x
VoxPopuli FR	72.1x	14.1x	15.3x	19.2x	N/A
VoxPopuli ES	75.9x	15.2x	16.9x	20.5x	N/A
MLS DE	92.0x	17.3x	19.5x	23.6x	N/A
Earnings-22	56.3x	17.1x	14.9x	17.6x	23.1x
AMI IHM	39.4x	13.1x	7.6x	8.8x	14.3x
TED-LIUM long-form	97.6x	56.0x	12.0x	24.0x	N/A
CORAAL Atlanta	102.3x	ERR	12.2x	34.2x	N/A
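RTFx is the audio duration divided by the wall-clock processing time, so a quoted RTFx converts back into an expected processing time. The helper names below are ours; the inputs are the 131.8 s benchmark slice and two speeds quoted in the first tables.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time(audio_seconds: float, rtfx_value: float) -> float:
    """Processing time implied by a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


if __name__ == "__main__":
    slice_seconds = 131.8  # duration of the 30-utterance LibriSpeech slice
    print(f"Parakeet CUDA (53.4x):      {wall_time(slice_seconds, 53.4):.1f} s")
    print(f"faster-whisper int8 (1.2x): {wall_time(slice_seconds, 1.2):.1f} s")
```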

Three takeaways the headline number doesn't show:

No single backend wins every category. Parakeet leads on French, Spanish, and dialectal English (CORAAL); qwen wins clean English and noisy meetings (16.73% WER on AMI is about 5pp ahead of #2); HF Whisper takes German (3.98%); whisper-CT2 takes finance and TED long-form. Picking the right backend per domain matters.
Long-form is genuinely hard. Compared with LibriSpeech clean, every backend's WER is higher on both long-form sets, from about 1.2x (whisper-CT2 on TED-LIUM) to about 16x (hf-whisper on CORAAL). The Qwen rows on TED-LIUM and CORAAL look catastrophic, but see the footnote: they reflect a pre-fix max_new_tokens truncation plus degradation on inputs longer than about 10 minutes, which is why Qwen is not the recommended backend for >10-min audio.
Moonshine is English-only and short-form-only by design. Its N/A cells are intentional; Moonshine is the right choice for ≤30s English on CPU.
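The "best" column of the WER table can be recomputed from the raw cells. The sketch below copies the WER values from that table (None marks N/A or ERR cells) and picks the lowest-WER backend per dataset; the `best_backend` helper is ours and is not part of SonicScribe.

```python
BACKENDS = ("parakeet", "qwen", "whisper-CT2", "hf-whisper", "moonshine")

# WER (%) per dataset, in BACKENDS order; None = N/A or ERR in the table.
WER = {
    "LibriSpeech clean": (3.80, 2.57, 6.32, 3.06, 3.52),
    "VoxPopuli FR": (11.96, 14.35, 12.74, 14.98, None),
    "VoxPopuli ES": (8.11, 10.08, 8.38, 9.15, None),
    "MLS DE": (9.93, 6.73, 6.36, 3.98, None),
    "Earnings-22": (20.48, 16.92, 15.35, 17.50, 31.78),
    "AMI IHM": (31.86, 16.73, 22.09, 23.21, 59.87),
    "TED-LIUM long-form": (8.63, 93.07, 7.31, 25.52, None),
    "CORAAL Atlanta": (22.94, None, 24.76, 48.50, None),
}


def best_backend(row):
    """Return the backend with the lowest WER, ignoring missing cells."""
    scored = [(wer, name) for name, wer in zip(BACKENDS, row) if wer is not None]
    return min(scored)[1]


best = {dataset: best_backend(row) for dataset, row in WER.items()}

if __name__ == "__main__":
    for dataset, name in best.items():
        print(f"{dataset:22s} -> {name}")
```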

Reproduce: python benchmarks/bench_multi_dataset.py --all then --collect. Per-combo JSONs land in benchmarks/results/; runs are resumable.

Requirements

Python 3.10 or greater
No FFmpeg required (audio decoded via soundfile/numpy)
GPU optional (all backends work on CPU; CUDA gives 2-50x speedup)

Installation

pip install "sonic-scribe[parakeet]"      # Fastest + best WER (CC-BY-4.0)
pip install "sonic-scribe[moonshine]"     # Best CPU option (MIT)
pip install "sonic-scribe[whisper]"       # faster-whisper / CTranslate2 (MIT)
pip install "sonic-scribe[qwen]"          # Multilingual champion (Apache-2.0)
pip install "sonic-scribe[hf-whisper]"    # PyTorch Whisper, supports encoder hooks (MIT)
pip install "sonic-scribe[all]"           # All backends

For GPU acceleration with Parakeet:

pip install "sonic-scribe[parakeet-gpu]"  # includes CUDA 12 wheels

Usage

Simplest possible

import sonic_scribe

result = sonic_scribe.transcribe("meeting.wav")
print(result.text)

SonicScribe auto-detects installed backends and picks the best one.
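One common way to implement this kind of auto-detection is to probe for each backend's Python package in a preference order. The sketch below is an assumption, not SonicScribe's actual selection logic: the preference order and the module names (`onnx_asr`, `qwen_asr`, `faster_whisper`, `transformers`, `moonshine_onnx`) are illustrative guesses.

```python
from importlib.util import find_spec

# Hypothetical preference order: (backend name, top-level module that provides it).
PREFERENCE = [
    ("parakeet", "onnx_asr"),
    ("qwen", "qwen_asr"),
    ("whisper", "faster_whisper"),
    ("hf_whisper", "transformers"),
    ("moonshine", "moonshine_onnx"),
]


def pick_backend(preference=PREFERENCE):
    """Return the first backend whose module is importable, or None if none are."""
    for backend, module in preference:
        # find_spec returns None for a missing top-level module without importing it.
        if find_spec(module) is not None:
            return backend
    return None


if __name__ == "__main__":
    print(pick_backend() or "no ASR backend installed")
```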

Choose a backend

from sonic_scribe import Engine

engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("podcast.mp3", language="en")

print(result.text)
print(f"Duration: {result.duration:.1f}s")
for seg in result.segments:
    print(f"  [{seg.start:.1f}s → {seg.end:.1f}s] {seg.text}")

Long-form audio (podcasts, lectures, meetings)

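SonicScribe accepts audio of any length. Parakeet, faster-whisper, HF Whisper, and Qwen all support long-form transcription out of the box, although the multi-dataset benchmark above recommends Parakeet or whisper-CT2 for audio longer than about 10 minutes: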

engine = Engine(backend="parakeet", device="cuda")
result = engine.transcribe("2hour_lecture.mp3")  # just works
print(f"{len(result.segments)} segments transcribed")

Word-level timestamps

result = engine.transcribe("audio.wav", word_timestamps=True)
for seg in result.segments:
    for word in seg.words:
        print(f"[{word.start:.2f}s → {word.end:.2f}s] {word.word}")

Hotwords / prompt injection

Guide the model with domain-specific vocabulary (whisper and hf_whisper backends):

result = engine.transcribe("meeting.wav", initial_prompt="SonicScribe, PyTorch, RTFx")

Batch transcription

Process multiple files in one GPU pass for maximum throughput:

files = ["clip1.wav", "clip2.wav", "clip3.wav", "clip4.wav"]
results = engine.transcribe_batch(files, language="en")
# Qwen: 4.4x faster than sequential
# HF Whisper: 2.3x faster than sequential

Parallel file processing

Process many files across threads (different from GPU batching):

results = engine.transcribe_files(["a.wav", "b.wav", "c.wav", "d.wav"], workers=4)
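Thread-level parallelism of this kind can be sketched with the standard library. This is not the implementation behind `transcribe_files`; `map_files` and the stand-in `fake_transcribe` are hypothetical names used to show the pattern (one task per file, results returned in input order).

```python
from concurrent.futures import ThreadPoolExecutor


def map_files(paths, transcribe_one, workers=4):
    """Run transcribe_one on every path in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of `paths`, not completion order.
        return list(pool.map(transcribe_one, paths))


def fake_transcribe(path):
    """Stand-in for a real per-file transcription call."""
    return f"transcript of {path}"


if __name__ == "__main__":
    for text in map_files(["a.wav", "b.wav", "c.wav", "d.wav"], fake_transcribe):
        print(text)
```

Threads help here when the per-file work releases the GIL, as native inference runtimes and file I/O typically do.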

Speaker diarization (who said what)

Add speaker labels to transcribed segments via pyannote.audio 4.0 as a post-process pipeline stage. Works on top of any backend; turns plain transcription into speaker-attributed dialogue:

from sonic_scribe import Engine
from sonic_scribe.optimizations import DiarizationStage, DiarizationConfig, Pipeline

pipeline = Pipeline(stages=[
    DiarizationStage(DiarizationConfig(num_speakers=2, hf_token="hf_xxx")),
])
engine = Engine(backend="parakeet", device="cuda", pipeline=pipeline)
result = engine.transcribe("meeting.wav")

for seg in result.segments:
    print(f"[{seg.speaker_id}] {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
# Output:
# [SPEAKER_00] 0.5s-2.1s: Welcome to the meeting.
# [SPEAKER_01] 2.4s-4.0s: Thanks for having me.

Install with pip install "sonic-scribe[diarization]". The pyannote/speaker-diarization-community-1 model is gated; accept the license at https://huggingface.co/pyannote/speaker-diarization-community-1 and pass an HF read token via hf_token= (or set HF_TOKEN in the environment).

Uses pyannote 4.0's exclusive diarization timeline (one speaker per frame) by default for clean ASR alignment. Set use_exclusive=False in DiarizationConfig to expose overlapping-speech turns instead.
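With an exclusive timeline (no overlapping turns), a simple way to attach speakers to ASR segments is to give each segment the speaker whose turn overlaps it the most. The sketch below illustrates that idea with plain tuples; it is an assumed approach, not necessarily what DiarizationStage does, and `assign_speakers` is a hypothetical helper.

```python
def assign_speakers(segments, turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text)
    turns:    list of (start, end, speaker), assumed non-overlapping
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, seg_start, seg_end, text))
    return labelled


if __name__ == "__main__":
    segments = [(0.5, 2.1, "Welcome to the meeting."), (2.4, 4.0, "Thanks for having me.")]
    turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 4.5, "SPEAKER_01")]
    for speaker, start, end, text in assign_speakers(segments, turns):
        print(f"[{speaker}] {start:.1f}s-{end:.1f}s: {text}")
```

Segments that overlap no turn keep a speaker of None in this sketch.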

Punctuation restoration + truecasing

ASR output is typically lowercased and unpunctuated. Add a PunctuationStage to fix this automatically (47 languages via punctuators ONNX model):

from sonic_scribe.optimizations import Pipeline, PunctuationStage

pipeline = Pipeline(stages=[PunctuationStage()])
engine = Engine(backend="moonshine", pipeline=pipeline)
result = engine.transcribe("audio.wav")
# Input:  "hello world how are you"
# Output: "Hello world, how are you?"

Install with pip install "sonic-scribe[punctuation]".

Export to SRT / VTT / JSON

result = engine.transcribe("lecture.wav")
print(result.to_srt())   # SRT subtitle format
print(result.to_vtt())   # WebVTT subtitle format
print(result.to_json())  # structured JSON
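For reference, the SRT format numbers each cue from 1 and writes timestamps as `HH:MM:SS,mmm --> HH:MM:SS,mmm`. The sketch below shows that layout with plain (start, end, text) tuples; it is an illustration of the file format, not the code behind `result.to_srt()`, and both helper names are ours.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render (start, end, text) tuples as SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    print(segments_to_srt([
        (0.5, 2.1, "Welcome to the meeting."),
        (2.4, 4.0, "Thanks for having me."),
    ]))
```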

Async streaming (real-time)

Get partial transcripts as audio comes in — sub-500ms first-partial latency:

import asyncio

from sonic_scribe import Engine
from sonic_scribe.streaming import (
    StreamingConfig, transcribe_stream,
    PartialTranscript, FinalSegment,
)

async def live_transcribe(audio_frames):
    engine = Engine(backend="hf_whisper", device="cuda")
    backend = engine.backend
    config = StreamingConfig(chunk_seconds=4.0)

    async for event in transcribe_stream(audio_frames, backend, config):
        if isinstance(event, PartialTranscript):
            print(f"... {event.text}", end="\r")
        elif isinstance(event, FinalSegment):
            print(f"✓ {event.segment.text}")
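The `audio_frames` argument above is an asynchronous source of audio. As a minimal sketch of how such a source might be built for testing, the async generator below slices an in-memory sample buffer into fixed-size frames. The frame size and the use of plain Python lists are assumptions; check the streaming module for the frame type and sample rate that `transcribe_stream` actually expects.

```python
import asyncio


async def frames_from_samples(samples, frame_size=1600, delay_s=0.0):
    """Yield consecutive slices of `samples`; set delay_s > 0 to simulate a live feed."""
    for start in range(0, len(samples), frame_size):
        yield samples[start:start + frame_size]
        await asyncio.sleep(delay_s)  # hand control back to the event loop


async def collect(frames):
    """Drain an async frame source into a list (useful in tests)."""
    return [frame async for frame in frames]


if __name__ == "__main__":
    silence = [0.0] * 4000
    chunks = asyncio.run(collect(frames_from_samples(silence)))
    print([len(chunk) for chunk in chunks])  # [1600, 1600, 800]
```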

CLI

# Transcribe a single file
sonic-scribe transcribe interview.wav -b parakeet -d cuda

# Multiple files as JSON
sonic-scribe transcribe a.wav b.wav c.wav --json

# SRT / WebVTT subtitle output
sonic-scribe transcribe lecture.wav --srt > lecture.srt
sonic-scribe transcribe lecture.wav --vtt > lecture.vtt

# Word timestamps in JSON output
sonic-scribe transcribe audio.wav --word-timestamps --json

# List installed backends
sonic-scribe info

Encoder optimizations

SonicScribe includes two encoder compression techniques that speed up PyTorch-based backends without retraining:

Token Merging (1.4x encoder speedup, WER unchanged)

from sonic_scribe.optimizations.tome import AudioToMeStack, ToMeConfig

# `model` is an already-loaded PyTorch ASR model (loading is not shown here)
stack = AudioToMeStack(ToMeConfig(sim_threshold=0.95, mode="shrink")).apply(model)
# Encoder runs 1.4x faster, WER unchanged
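To give an intuition for what a similarity threshold does, the sketch below merges neighbouring token vectors whose cosine similarity is at least `sim_threshold` by averaging them. This is a deliberately simplified stand-in: it is not AudioToMeStack, and it uses greedy adjacent merging rather than the bipartite matching of the original Token Merging method. All names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_adjacent(tokens, sim_threshold=0.95):
    """Greedily average neighbouring token pairs whose similarity is >= sim_threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and cosine(tokens[i], tokens[i + 1]) >= sim_threshold:
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
    print(len(tokens), "->", len(merge_adjacent(tokens)))  # 3 -> 2
```

Fewer tokens mean less work in the later encoder layers, which is where the speedup comes from.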

LiteASR low-rank compression

# Offline: compress encoder weights
python tools/liteasr_compress.py --model openai/whisper-small --out factors.npz

# Runtime: swap in compressed weights (19% fewer encoder params)
from sonic_scribe.optimizations.liteasr import apply_factors_to_encoder, load_factors
entries = load_factors("factors.npz")
apply_factors_to_encoder(model, entries)
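The parameter saving from a low-rank factorization follows from counting weights: a dense m×n matrix has m·n parameters, while a rank-r factorization into an m×r and an r×n matrix has r·(m + n). The sketch below works through that arithmetic for an illustrative 768×3072 layer; the shape and rank are example values, not the factors that liteasr_compress.py actually produces.

```python
def dense_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix."""
    return m * n


def factored_params(m: int, n: int, rank: int) -> int:
    """Parameters in a rank-`rank` factorization (m x rank) @ (rank x n)."""
    return rank * (m + n)


def savings(m: int, n: int, rank: int) -> float:
    """Fraction of parameters removed by the factorization (negative if it grows)."""
    return 1 - factored_params(m, n, rank) / dense_params(m, n)


if __name__ == "__main__":
    m, n = 768, 3072  # illustrative layer shape
    for rank in (128, 384, 614):
        print(f"rank {rank:4d}: {savings(m, n, rank):6.1%} fewer parameters")
```

The factorization only saves parameters when the rank is below m·n / (m + n), which is 614.4 for this shape.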

Composable optimization pipeline

Stack multiple optimizations together. The Pipeline validates conflicts at construction time:

from sonic_scribe import Engine
from sonic_scribe.optimizations.pipeline import Pipeline, VADStage, EncoderToMeStage

pipeline = Pipeline(stages=[
    VADStage(),                                          # silence removal
    EncoderToMeStage(sim_threshold=0.95, mode="shrink"), # token merging
])
engine = Engine(backend="hf_whisper", device="cuda", pipeline=pipeline)
result = engine.transcribe("lecture.wav", language="en")
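Construction-time validation means an incompatible combination fails when the pipeline is built, before any audio is processed. The sketch below shows one way to get that behaviour. `SketchStage`, `SketchPipeline`, and the `conflicts_with` field are hypothetical and say nothing about which SonicScribe stages actually conflict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchStage:
    name: str
    conflicts_with: frozenset = frozenset()  # names of stages this one cannot run with


class SketchPipeline:
    def __init__(self, stages):
        names = {stage.name for stage in stages}
        for stage in stages:
            # Any other stage in the pipeline that this stage declares a conflict with.
            clash = stage.conflicts_with & (names - {stage.name})
            if clash:
                raise ValueError(
                    f"stage {stage.name!r} conflicts with {sorted(clash)}"
                )
        self.stages = list(stages)


if __name__ == "__main__":
    SketchPipeline([SketchStage("stage_a"), SketchStage("stage_b")])  # accepted
    try:
        SketchPipeline([SketchStage("stage_a", frozenset({"stage_b"})), SketchStage("stage_b")])
    except ValueError as err:
        print("rejected at construction:", err)
```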

VAD backend selection

Three voice activity detection backends, switchable at pipeline level:

VAD	F1 (FLEURS-102)	Latency	License
Silero (default)	95.95%	Good	MIT
FireRedVAD	97.57%	Good	Apache 2.0
TEN-VAD	95.19%	<10ms	Apache 2.0 + Agora restrictions

from sonic_scribe.optimizations.pipeline import Pipeline, VADStage
from sonic_scribe.optimizations.vad import load_vad

vad = load_vad("firered")  # or "silero", "ten"
pipeline = Pipeline(stages=[VADStage(vad=vad)])

Install with pip install "sonic-scribe[firered-vad]" or pip install "sonic-scribe[ten-vad]".

Available backends

Backend	Best model	LibriSpeech WER	Speed	License
`parakeet`	Parakeet TDT v3 (0.6B)	0.67%	53.4x CUDA	CC-BY-4.0
`moonshine`	moonshine/base (61M)	1.34%	20.9x CPU	MIT
`qwen`	Qwen3-ASR-0.6B	1.57%	14.1x CUDA	Apache-2.0
`whisper`	large-v3 (CT2 int8)	1.57%	1.2x CPU	MIT
`hf_whisper`	whisper-large-v2	2.01%	13.6x CUDA	MIT

Reproducing benchmarks

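Every measured number in this README is reproducible (the openai/whisper reference row is taken from published benchmarks, not from these scripts):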

python benchmarks/bench_multi_dataset.py --all     # 8 datasets × 5 backends, ~6h wall
python benchmarks/bench_multi_dataset.py --collect # aggregate to markdown
python benchmarks/bench_parakeet_cuda.py           # Parakeet CUDA vs CPU
python benchmarks/bench_moonshine_whisper_real.py  # Moonshine + faster-whisper
python benchmarks/bench_batch.py                   # Batch throughput comparison
python benchmarks/bench_tome_e2e.py                # ToMe speedup sweep

All benchmarks use real audio (auto-downloaded via HuggingFace datasets). The multi-dataset runner writes per-combo JSON to benchmarks/results/ so interrupted runs resume cleanly.

Examples

Runnable scripts in examples/:

python examples/quickstart.py                  # 3-line transcription
python examples/multi_backend_compare.py       # compare all installed backends
python examples/streaming_demo.py audio.wav    # async streaming from file
python examples/optimizations_demo.py audio.wav  # VAD + ToMe pipeline

Development

git clone https://github.com/SeaL773/SonicScribe.git
cd SonicScribe
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,moonshine,whisper,hf-whisper]"

pytest -q                                          # 330 tests
ruff check src/ tests/ benchmarks/ tools/          # zero warnings

CUDA-gated tests skip cleanly without a GPU. See CONTRIBUTING.md for project conventions.

License

MIT. Individual backend models have their own licenses (see table above).

Project details

sonicscribe-asr 0.1.0, released Jun 1, 2026

Download files

File	Type	Size	Uploaded	SHA256
sonicscribe_asr-0.1.0.tar.gz	Source	265.6 kB	Jun 1, 2026	`e4e1765dac7b26819f5e5f24ebd8cb75f17479e8fd94829a1514e1752903e75f`
sonicscribe_asr-0.1.0-py3-none-any.whl	Wheel (Python 3)	80.2 kB	Jun 1, 2026	`1fcfea708f496f0744b2c27ccf5a254407f44872b7f4101104baff46ceb1911d`

Both files were uploaded via twine/6.2.0 CPython/3.12.3, without Trusted Publishing.