Speech-to-text pipeline supporting all OpenAI-compatible automatic speech recognition (ASR) providers, with added resilience and metadata through LLM-powered transcript repair, audio preprocessing, VAD, and pyannote diarization
Resilient STT
The only Speech-To-Text pipeline you need.
A provider-agnostic speech-to-text pipeline that plugs into any OpenAI-compatible ASR endpoint (OpenAI, OpenRouter, vLLM, faster-whisper, or the bundled Qwen worker) and returns rich, timestamped transcripts with speaker labels and validated metadata. Resilient STT handles everything around inference: audio preprocessing, voice-activity detection, intelligent chunking, pyannote diarization, and LLM-powered transcript repair. It has been tested on English, Hindi, and Hinglish, and is designed to work with any language your chosen ASR model supports.
Features
- Universal ASR: connect to any OpenAI-compatible endpoint and switch models with the --model flag.
- Automatic discovery: finds a local or remote ASR service, retries if needed, and starts the bundled Qwen fallback automatically.
- Audio prep: ffmpeg normalization to mono 16 kHz WAV, with optional enhancement (--enhance-audio).
- Silence skipping: uses Silero, webrtcvad, or RMS to transcribe only speech regions.
- Chunking and stitching: splits long audio into chunks and stitches the results back onto global timestamps.
- Speaker diarization: word-level speaker labels with pyannote.
- LLM repair: optional transcript refinement via your preferred chat endpoint.
- Export: JSON, SRT, and VTT output, including rich metadata.
- Lightweight: orchestrator only; no ASR model weights are embedded.
- CLI and Python API: run from the terminal or integrate in code.
Architecture & design decisions: docs/design.md
CLI reference: docs/cli.md
Quickstart
Prerequisites: Python 3.11 or 3.12, ffmpeg on PATH.
pip install "resilient-stt[full]" # minimal: pip install resilient-stt (no pyannote/Silero)
resilient-stt \
--audio /path/to/audio.wav \
--output /path/to/output-dir \
--language hi
# Output (under --output):
# transcript.json — segments, words, speakers, repair metadata
# transcript.srt — subtitles
# transcript.vtt — WebVTT
#
# Example transcript.json (truncated):
# {
# "audio_file": "/path/to/audio.wav",
# "duration": 142.5,
# "language": "hi",
# "asr_provider": "qwen-asr-fallback",
# "asr_model": "Qwen/Qwen3-ASR-0.6B",
# "segments": [{
# "speaker": "SPEAKER_00",
# "start": 0.12,
# "end": 4.85,
# "raw_text": "Namaste, aaj hum meeting shuru karte hain.",
# "clean_text": "Namaste, aaj hum meeting shuru karte hain.",
# "repair_status": "unchanged",
# "words": [{ "word": "Namaste,", "start": 0.12, "end": 0.58, "speaker": "SPEAKER_00" }]
# }]
# }
No ASR setup required — the CLI auto-detects local workers and hosted APIs, or starts a bundled Qwen worker. Quick smoke test without pyannote or repair: add --skip-diarization --repair false. For OpenAI or OpenRouter, set OPENAI_API_KEY or OPENROUTER_API_KEY in .env and pass --no-asr-fallback. Full flags: docs/cli.md. Repository layout: docs/design.md.
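A short sketch of consuming the JSON export from Python. It assumes only the field names visible in the truncated example above (segments, speaker, start, end, raw_text, clean_text). The helper names `load_transcript` and `speaker_lines` are hypothetical and are not part of the resilient-stt API.

```python
import json
from pathlib import Path


def load_transcript(output_dir):
    """Load transcript.json from the directory that was passed as --output."""
    path = Path(output_dir) / "transcript.json"
    return json.loads(path.read_text(encoding="utf-8"))


def speaker_lines(transcript):
    """Render one 'SPEAKER [start-end] text' line per segment."""
    lines = []
    for seg in transcript.get("segments", []):
        # Prefer the repaired text; fall back to the raw ASR text.
        text = seg.get("clean_text") or seg.get("raw_text", "")
        # Runs with --skip-diarization may carry no speaker label (assumption).
        speaker = seg.get("speaker") or "UNKNOWN"
        lines.append(f"{speaker} [{seg['start']:.2f}-{seg['end']:.2f}] {text}")
    return lines
```

Typical use would be `print("\n".join(speaker_lines(load_transcript("/path/to/output-dir"))))`.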
Note
🚧 Active development – not production-ready.
Expect changes, incomplete features, and ongoing improvements.
Architecture
Audio input
-> ffmpeg normalize (mono 16 kHz WAV; optional --enhance-audio)
-> VAD (Silero if installed, else webrtcvad / RMS; skip silent regions)
-> chunk speech regions (fixed: 60s/2s overlap; or pause-aligned: ~120s at onsets)
-> ASR microservice calls (OpenAI-compatible)
-> normalize + stitch global timestamps
-> pyannote diarization on the full normalized audio
-> speaker assignment (word IoU + segment fallback)
-> optional forced alignment
-> optional LLM transcript repair (two passes, validated)
-> export JSON / SRT / VTT
The orchestrator never embeds ASR model weights. Point --asr-endpoint at any
service that implements POST /v1/audio/transcriptions per the OpenAI
contract.
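The fixed chunking mode (60 s windows with 2 s overlap) and the timestamp stitching step can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the (start, end) region tuples, and the word dictionaries are assumptions made for illustration, and de-duplicating words that fall inside the overlap is left out.

```python
def fixed_chunks(regions, chunk_s=60.0, overlap_s=2.0):
    """Split (start, end) speech regions into windows of at most chunk_s seconds.

    Consecutive windows inside one region overlap by overlap_s seconds.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    step = chunk_s - overlap_s
    chunks = []
    for start, end in regions:
        cursor = start
        while cursor < end:
            chunks.append((cursor, min(cursor + chunk_s, end)))
            if cursor + chunk_s >= end:
                break  # this window already reaches the end of the region
            cursor += step
    return chunks


def to_global(words, chunk_start):
    """Shift chunk-local word timestamps onto the global timeline."""
    return [
        {**w, "start": w["start"] + chunk_start, "end": w["end"] + chunk_start}
        for w in words
    ]
```

Because the orchestrator owns chunking, the ASR service only ever sees short WAV slices; adding each chunk's start offset is what puts its words back on the timeline of the full file.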
Supported models
The orchestrator is provider-agnostic: any service that implements
POST /v1/audio/transcriptions (OpenAI multipart shape) or OpenRouter STT
(JSON + base64) works when you set the right endpoint, model id, and
API key. The lists below are examples, not an exhaustive allowlist —
use whatever model id your provider documents.
Configure via --asr-endpoint, --model, and env vars (see
docs/cli.md and ASR auto-detection).
ASR — OpenAI API
| Setting | Value |
|---|---|
| Endpoint | https://api.openai.com/v1 or omit + OPENAI_API_KEY for auto-detect |
| API key | OPENAI_API_KEY or ASR_API_KEY |
| Default --model | whisper-1 |
Example transcription models on OpenAI (Audio API):
| Model | Timestamps |
|---|---|
| whisper-1 | Word + segment (verbose_json) |
| gpt-4o-transcribe | Text only |
| gpt-4o-mini-transcribe | Text only |
| gpt-4o-transcribe-diarize | Speaker labels in response |
New OpenAI transcription models generally work with --model as long as they
use the same endpoint.
ASR — OpenRouter
| Setting | Value |
|---|---|
| Endpoint | https://openrouter.ai/api/v1 or omit + OPENROUTER_API_KEY for auto-detect |
| API key | OPENROUTER_API_KEY or ASR_API_KEY |
| Default --model | openai/whisper-1 |
Example STT models on OpenRouter (STT docs; browse speech-to-text models):
| Model slug | Notes |
|---|---|
| openai/whisper-1 | Default auto-detect |
| openai/whisper-large-v3 | Whisper via OpenRouter |
| google/chirp-3 | Google Chirp |
| mistralai/voxtral-mini-transcribe | Mistral Voxtral |
Any OpenRouter model with transcription output modality works — pass its slug
to --model. Responses are text-only (no segment timestamps); the pipeline
sets weak_timestamps and may run optional alignment.
ASR — Local OpenAI-compatible workers
Point --asr-endpoint at any local or remote server that speaks the
worker contract (multipart upload, verbose_json preferred).
The orchestrator sends 16 kHz mono WAV chunks.
| Worker | Endpoint (default) | Example --model | Status |
|---|---|---|---|
| qwen_transformers_service | http://127.0.0.1:8002/v1 | Qwen/Qwen3-ASR-0.6B | Bundled; auto-started when no other ASR is reachable |
| qwen_vllm_service | http://127.0.0.1:8001/v1 | Qwen/Qwen3-ASR-1.7B | Bundled bootstrap (Linux + NVIDIA GPU) |
| Custom (faster-whisper, vLLM, hosted proxy, etc.) | Your URL | Your model id | Supported — implement or deploy separately |
| whisper_openai_service | — | — | Planned Roadmap |
| parakeet_openai_service | — | — | Planned Roadmap |
For a third-party server, only the API shape must match; the model name is whatever that server expects.
Diarization (orchestrator-local)
Not routed through OpenAI-compatible ASR — runs inside the pipeline on the full normalized file.
| Model | Configure |
|---|---|
| pyannote/speaker-diarization-community-1 (default) | HF_TOKEN or --diarization-model-path; skip with --skip-diarization |
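The architecture lists speaker assignment as word IoU with a segment fallback. The sketch below shows one way that idea can work: each word takes the speaker of the diarization turn it overlaps best, and falls back to a supplied label when it overlaps none. The function names, the turn dictionaries, and the meaning of the fallback argument are assumptions, not the project's actual code.

```python
def iou(a_start, a_end, b_start, b_end):
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    # The span below equals the true union whenever the intervals overlap;
    # when they do not, inter is 0 and the result is 0 either way.
    span = max(a_end, b_end) - min(a_start, b_start)
    return inter / span if span > 0 else 0.0


def assign_speaker(word, turns, fallback=None):
    """Return the speaker of the best-overlapping turn, else the fallback.

    fallback stands in for the segment-level speaker (assumption).
    """
    best_speaker, best_score = fallback, 0.0
    for turn in turns:
        score = iou(word["start"], word["end"], turn["start"], turn["end"])
        if score > best_score:
            best_speaker, best_score = turn["speaker"], score
    return best_speaker
```

Running diarization once on the full normalized file, rather than per chunk, keeps speaker labels consistent across the whole recording, so the turns passed in here share one label space.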
LLM transcript repair
Any OpenAI-compatible /chat/completions endpoint. Set REPAIR_BASE_URL,
REPAIR_MODEL, and REPAIR_API_KEY (auto-filled from OPENAI_API_KEY or
OPENROUTER_API_KEY).
| Provider | Default REPAIR_MODEL | Example alternatives |
|---|---|---|
| OpenAI API | gpt-4o-mini | gpt-4o, gpt-4.1-mini, … |
| OpenRouter | openai/gpt-4o-mini | Any chat model slug your key can access |
| Other | — | Any id your endpoint accepts |
Forced alignment
| Component | Status |
|---|---|
| Qwen forced aligner (src/resilient_stt/alignment/qwen_aligner.py) | Planned Roadmap |
| WhisperX | Planned Roadmap |
Today, alignment runs only when --align is set or ASR returned weak
timestamps; the default aligner is a no-op pass-through.
Install (PyPI)
# Minimal orchestrator (webrtcvad VAD; ASR via API or bundled qwen worker)
pip install resilient-stt
# Recommended on Apple Silicon / Linux (Silero VAD + pyannote diarization + torch)
pip install "resilient-stt[full]"
# Contributors
pip install "resilient-stt[full,dev]"
Verify the CLI:
resilient-stt --help
Usage after pip install
Run from any directory (creates data/work/ under the current working dir unless you pass --work-root):
resilient-stt \
--audio /path/to/audio.wav \
--output /path/to/output-dir \
--language hi
With diarization and Silero VAD you need the [full] extra and usually HF_TOKEN in .env (see Environment variables). Quick smoke test without pyannote:
resilient-stt \
--audio /path/to/audio.wav \
--output /path/to/output-dir \
--language hi \
--skip-diarization \
--repair false
Hosted ASR (no local qwen worker) — set OPENAI_API_KEY or OPENROUTER_API_KEY in .env:
resilient-stt \
--audio /path/to/audio.wav \
--output /path/to/output-dir \
--no-asr-fallback \
--skip-diarization
Bundled qwen-asr worker: On first auto-start, the CLI creates an isolated venv at
~/.cache/resilient-stt/qwen-transformers-worker/.venv (requires network to download
qwen-asr and model weights). If inference fails on Apple Silicon, start the worker
manually with python scripts/bootstrap_qwen_asr_fallback.py --no-aligner (from a git
checkout) or see the bundled worker README.
Platform notes for the [full] extra are the same as in Platform notes (torch / diarization) below; on an Intel Mac, use the base install with --skip-diarization.
Install from source (uv)
Prerequisites: uv, ffmpeg on PATH. Configuring ASR manually is optional; see ASR auto-detection below.
cd resilient-stt
# 1) Create a project virtualenv (recommended; uv also creates one on sync if missing)
uv venv
# 2) Install dependencies into .venv
# --extra full = Silero VAD + diarization + torch (recommended on Apple Silicon / Linux)
uv sync --extra full --extra dev
# 3) Configure secrets (optional)
cp .env.example .env
# edit .env — loaded automatically on each run
ASR-only install (no torch / pyannote):
uv venv
uv sync --extra dev
Activate the venv if you prefer plain python without uv run:
source .venv/bin/activate # Windows: .venv\Scripts\activate
Extras: silero (Qwen-aligned VAD), diarization (pyannote), full (both + torch).
Without silero, VAD falls back to webrtcvad.
Platform notes (torch / diarization)
pyannote.audio 4.x requires torch ≥ 2.8. On Apple Silicon (M1–M4), PyTorch uses the Metal MPS backend (Accelerated PyTorch on Mac). Use a native arm64 Python/venv — not Rosetta x86_64.
| Platform | Install |
|---|---|
| Apple Silicon (M1–M4) | Native arm64 venv, then uv sync --extra full --extra dev. Optional GPU: --diarization-device mps |
| Linux or Windows | uv sync --extra full --extra dev |
| Intel Mac (x86_64) | Diarization extra is not supported (no compatible torch wheels). Use uv sync --extra dev and --skip-diarization. |
If uv errors mention x86_64 on an M-series Mac, your Python/venv is Rosetta. Recreate it:
rm -rf .venv uv.lock
uv venv --python 3.12
python -c "import platform; print(platform.machine())" # must print arm64
uv sync --extra full --extra dev
Verify MPS after install:
uv run python -c "import torch; print('mps', torch.backends.mps.is_available())"
Usage
From a git checkout, prefix commands with uv run (or activate .venv and use resilient-stt directly). After pip install, use resilient-stt only.
Minimal run (no ASR setup — auto-starts local qwen-asr on CPU/MPS when needed):
uv run resilient-stt \
--audio data/input/meeting.mp3 \
--output data/output/meeting \
--language hi
With an external ASR service (vLLM, hosted API, etc.) — the endpoint must already be running and respond to GET {base}/v1/models:
uv run resilient-stt \
--audio data/input/meeting.mp3 \
--asr-endpoint http://localhost:8001/v1 \
--model Qwen/Qwen3-ASR-1.7B \
--language hi \
--repair true \
--output data/output/meeting
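Before launching a long job against an external service, it can help to run the same kind of reachability check yourself. The sketch below uses only the standard library and assumes that the models URL is the --asr-endpoint value with /models appended; the function names are hypothetical and the CLI's own probe may differ.

```python
import urllib.error
import urllib.request
from typing import Optional


def models_url(asr_endpoint: str) -> str:
    """Build the model-listing URL for a base such as http://localhost:8001/v1."""
    return asr_endpoint.rstrip("/") + "/models"


def endpoint_is_reachable(asr_endpoint: str, api_key: Optional[str] = None,
                          timeout: float = 3.0) -> bool:
    """Return True when GET <endpoint>/models answers with a 2xx status."""
    try:
        request = urllib.request.Request(models_url(asr_endpoint))
        if api_key:
            # Hosted providers usually expect a Bearer token.
            request.add_header("Authorization", f"Bearer {api_key}")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs.
        return False
```

For example, `endpoint_is_reachable("http://localhost:8001/v1")` should return True once the vLLM worker from the command above is up.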
OpenAI API (whisper-1) — set OPENAI_API_KEY in .env, then either
auto-detect (no local ASR on :8001/:8002) or point at the API explicitly:
# Auto-detect when no local ASR is running (--no-asr-fallback avoids starting qwen-asr)
uv run resilient-stt \
--audio data/input/speech.wav \
--output data/output/openai \
--no-asr-fallback \
--skip-diarization
# Explicit endpoint
uv run resilient-stt \
--audio data/input/speech.wav \
--output data/output/openai \
--asr-endpoint https://api.openai.com/v1 \
--model whisper-1 \
--skip-diarization
OpenRouter (google/chirp-3) — set OPENROUTER_API_KEY in .env or pass
the key via ASR_API_KEY:
uv run resilient-stt \
--audio data/input/speech.wav \
--output data/output/openrouter/chirp \
--asr-endpoint https://openrouter.ai/api/v1 \
--model google/chirp-3 \
--skip-diarization \
--repair false
For music or other non-speech audio, add --no-vad so VAD does not skip the
file. Full flag reference: docs/cli.md.
ASR auto-detection
When --asr-endpoint is omitted (and ASR_BASE_URL / ASR_ENDPOINT are unset):
- Probe vLLM at http://127.0.0.1:8001/v1 (optional bootstrap)
- Probe an existing qwen-asr worker at http://127.0.0.1:8002/v1
- OpenRouter when OPENROUTER_API_KEY is set (no --model)
- OpenAI when OPENAI_API_KEY is set (no --model)
- Otherwise start the local qwen-asr fallback (bundled worker), which is slow on CPU/MPS but needs no NVIDIA GPU
Use --no-asr-fallback to require an explicit or already-running ASR service.
If you pass --asr-endpoint explicitly, that URL must respond to GET …/v1/models before the pipeline runs (otherwise the CLI exits with “endpoint configured but unreachable”).
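The selection rules in this section can be summarized as a small pure function. This is a sketch of the documented behavior, assuming the list is evaluated top to bottom; `pick_asr`, its return labels, and the `reachable` set are illustrative names, not the CLI's internals.

```python
VLLM_URL = "http://127.0.0.1:8001/v1"
QWEN_URL = "http://127.0.0.1:8002/v1"


def pick_asr(env, reachable, allow_fallback=True):
    """Return (label, base_url) for the ASR source to use.

    env: mapping of environment variables.
    reachable: set of local base URLs that answered a probe (hypothetical input).
    allow_fallback: False models the --no-asr-fallback flag.
    """
    explicit = env.get("ASR_BASE_URL") or env.get("ASR_ENDPOINT")
    if explicit:
        return ("explicit", explicit)
    # Local workers are probed before hosted APIs are considered.
    for label, url in (("vllm", VLLM_URL), ("qwen-worker", QWEN_URL)):
        if url in reachable:
            return (label, url)
    if env.get("OPENROUTER_API_KEY"):
        return ("openrouter", "https://openrouter.ai/api/v1")
    if env.get("OPENAI_API_KEY"):
        return ("openai", "https://api.openai.com/v1")
    if allow_fallback:
        return ("qwen-fallback", QWEN_URL)
    raise RuntimeError("no ASR service available and fallback is disabled")
```

Under this reading, a machine with both API keys set and no local worker would use OpenRouter, since it appears before OpenAI in the list.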
Useful flags:
| Flag | Effect |
|---|---|
| --no-asr-fallback | Require explicit/running ASR; do not auto-start qwen-asr on :8002 |
| --no-vad | Disable VAD; transcribe the entire timeline |
| --vad-backend | auto, silero, webrtcvad, or rms (default auto) |
| --chunk-mode | fixed (60s/2s overlap) or pause-aligned (~120s at onsets, max 180s) |
| --enhance-audio | High-pass + denoise + loudnorm during normalize |
| --skip-diarization | Skip pyannote; export without speaker labels |
| --diarization-model-path PATH | Load a local git clone of the pyannote model (offline) |
| --num-speakers / --min-speakers / --max-speakers | Hint pyannote speaker count |
| --align | Force the optional alignment stage even when ASR returned timestamps |
| --repair true | Run the LLM repair stage (needs REPAIR_BASE_URL/REPAIR_MODEL) |
| --resume | Reuse existing intermediates under data/work/<job_id>/ |
Environment variables
Copy [.env.example](.env.example) to .env and fill in what you need. The CLI
loads .env on startup; shell exports take precedence.
| Variable | Purpose |
|---|---|
| ASR_BASE_URL / ASR_ENDPOINT | Optional fixed ASR base URL (same as --asr-endpoint) |
| ASR_API_KEY | Optional Bearer token for the ASR endpoint |
| OPENROUTER_API_KEY | OpenRouter key; enables hosted ASR/repair presets |
| OPENAI_API_KEY | OpenAI key; enables hosted ASR/repair presets |
| REPAIR_BASE_URL | OpenAI-compatible chat endpoint (e.g. https://api.openai.com/v1) |
| REPAIR_MODEL | Repair model id (e.g. gpt-4o-mini) |
| REPAIR_API_KEY | Bearer token for the repair endpoint |
| HF_TOKEN | Used only to download gated pyannote weights. Skip with --skip-diarization or use --diarization-model-path after a local clone. |
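The rule that shell exports take precedence over .env values can be pictured as "set only if missing". The sketch below is a simplified illustration with a hand-rolled parser; the project's actual loader is not shown in this README, and both function names are hypothetical.

```python
import os


def parse_dotenv(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so HF_TOKEN="abc" yields abc.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def apply_dotenv(text, environ=None):
    """Copy .env values into environ without overriding existing entries."""
    environ = os.environ if environ is None else environ
    for key, value in parse_dotenv(text).items():
        environ.setdefault(key, value)  # existing (exported) values win
    return environ
```

With this behavior, running `REPAIR_MODEL=gpt-4o resilient-stt ...` in a shell would override a REPAIR_MODEL line in .env for that one run.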
The default diarization model is
[pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)
(CC-BY-4.0). Accept the model card terms once on Hugging Face, then either
provide HF_TOKEN for the first download or follow the model card's "Offline
use" instructions to clone the repo and point --diarization-model-path at
the local copy.
Privacy / telemetry
On startup the orchestrator disables optional usage metrics from dependencies
(pyannote [PYANNOTE_METRICS_ENABLED=0](https://github.com/pyannote/pyannote-audio#telemetry),
Hugging Face Hub HF_HUB_DISABLE_TELEMETRY=1). Shell exports and .env values
take precedence — set PYANNOTE_METRICS_ENABLED=1 to opt back in.
Intermediate artifacts
Each run materializes everything under data/work/<job_id>/:
normalized.wav
speech_regions.json # {regions, speech_onsets_samples}
chunks/ # per-chunk WAV slices
chunks.json
asr_raw/<chunk_id>.json
asr_normalized.json
diarization.json
speaker_segments_raw.json
speaker_segments_repaired.json # only when --repair is on
Final exports land under --output: transcript.json, transcript.srt,
transcript.vtt.
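To show how segment times map onto subtitle cues, here is a minimal SRT rendering sketch. It assumes segments shaped like the transcript.json example (start, end, speaker, clean_text); the function names and the bracketed speaker prefix are illustrative choices, not necessarily what resilient-stt writes.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg.get("clean_text") or seg.get("raw_text", "")
        speaker = seg.get("speaker")
        if speaker:
            text = f"[{speaker}] {text}"  # speaker prefix is an assumption
        start, end = srt_timestamp(seg["start"]), srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

WebVTT cues look almost the same, except that the file starts with a WEBVTT header line and timestamps use a period instead of a comma before the milliseconds.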
Constraints (what not to do)
- Do not embed ASR models inside the orchestrator process.
- Do not chunk inside the ASR microservice; the orchestrator owns chunking.
- Do not diarize per chunk; pyannote runs on the full normalized audio.
- Do not discard raw ASR responses; they live under asr_raw/.
- Do not let LLM repair change timestamps, speaker labels, or segment count.
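The last constraint is easy to check mechanically. The following sketch compares segments before and after repair and reports any change to count, timestamps, or speaker labels; it assumes the segment fields from the transcript.json example, and `repair_violations` is a hypothetical helper rather than the project's validator.

```python
def repair_violations(raw_segments, repaired_segments):
    """List the ways a repaired transcript breaks the repair constraints."""
    problems = []
    if len(raw_segments) != len(repaired_segments):
        problems.append(
            f"segment count changed: {len(raw_segments)} -> {len(repaired_segments)}"
        )
        return problems  # per-segment comparison is meaningless after this
    for index, (raw, fixed) in enumerate(zip(raw_segments, repaired_segments)):
        # Only the text may differ; timing and speaker must be identical.
        for field in ("start", "end", "speaker"):
            if raw.get(field) != fixed.get(field):
                problems.append(f"segment {index}: {field} changed")
    return problems
```

An empty list means the repair output is safe to accept; anything else suggests keeping the raw text for the affected job.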
Tests
From source:
uv sync --extra dev
uv run pytest
After pip install "resilient-stt[dev]" from a checkout (with src/ on PYTHONPATH) or when developing in the repo, the same pytest command applies if src is configured in pyproject.toml (pythonpath = ["src"]).
The included tests use synthetic fixtures and mocked HTTP responses; they do not invoke ffmpeg, pyannote, or any LLM.
Publishing (maintainers)
- Commits on main: use Conventional Commits in PR titles or squash messages (feat:, fix:, chore:, etc.).
- release-please: manifest config in release-please-config.json (python strategy, package name). Last released version in .release-please-manifest.json. Opens a Release PR that bumps pyproject.toml, CHANGELOG.md, and the manifest.
- Ship: merge the Release PR; release-please creates GitHub Release + tag resilient-stt-vX.Y.Z.
- PyPI: runs on release: created when the tag matches version in pyproject.toml.
Do not set release-type on the GitHub Action — that disables manifest mode and can reset versions to 0.1.0. If a mistaken 0.1.0 release exists on GitHub, delete it and its tag before the next ship.
Retry a missed publish — re-run the failed Publish to PyPI workflow from Actions (the release event is preserved).
One-time setup:
- PyPI: trusted publisher for workflow publish.yml on repo gitcommitshow/resilient-stt.
- release-please: repo secret RELEASE_PLEASE_TOKEN, a fine-grained PAT (or a classic PAT with the repo scope) with Contents and Pull requests read/write on this repo. Without it, releases are created with GITHUB_TOKEN and Publish to PyPI will not run.
In org/repo settings, allow GitHub Actions to create and approve pull requests if Release PRs do not appear.
See CHANGELOG.md for release history.
License
This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See LICENSE for the full text.