# VocalFlow

Local text-to-speech and audio transcription — all in one web app.

## What is VocalFlow?

VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides two core capabilities:

- **Text-to-Speech (TTS)** — generate speech with voice design or voice cloning
- **Audio Transcription** — transcribe audio files with word-level timestamps

No cloud APIs. No data leaves your machine. Just your GPU doing the work.
## Features

### Speech Generation (TTS)

| Mode | Description |
|---|---|
| Voice Design | Describe the voice you want in natural language (age, accent, tone, emotion) and the model generates it |
| Voice Cloning | Upload a short reference audio clip and clone the speaker's voice for new text |

- 10+ languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Auto language detection
- Download generated audio as `.wav`
### Audio Transcription

- Upload any audio file (`.wav`, `.mp3`, `.flac`, `.ogg`)
- Choose from 6 Whisper model sizes (tiny → large) based on your speed/accuracy needs
- Get a full transcript with word-level timestamps
- Copy the transcript to the clipboard or download it as structured JSON
- Auto-detect language or specify manually
### Interface

- Clean dark-themed UI built with FastHTML + HTMX
- Sidebar with live-updating generation history
- Inline audio playback from history
- One-click downloads with human-readable filenames
## Models

VocalFlow uses two open-source model families:

### Qwen3-TTS (Text-to-Speech)

| Model | Parameters | Purpose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | Voice cloning from reference audio |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice generation from text descriptions |

- Runs in `bfloat16` with SDPA attention for efficient inference
- Models are downloaded automatically on first use (~3.5 GB each)
- Requires a CUDA GPU with at least 6 GB VRAM
### OpenAI Whisper (Transcription)

| Model | Parameters | VRAM | Relative Speed |
|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x realtime |
| base | 74M | ~1 GB | ~16x realtime |
| small | 244M | ~2 GB | ~6x realtime |
| medium | 769M | ~5 GB | ~2x realtime |
| large | 1.5B | ~10 GB | ~1x realtime |
| turbo | 1.5B | ~6 GB | ~8x realtime |

- Models are downloaded automatically on first use
- Falls back to CPU if no CUDA GPU is detected
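If you want to pick a model size programmatically, the VRAM figures above can drive a small helper. This is a hypothetical sketch, not part of VocalFlow's API — `pick_whisper_model` and `WHISPER_VRAM_GB` are illustrative names:

```python
# VRAM requirements (GB) from the table above — approximate figures.
WHISPER_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}

def pick_whisper_model(vram_gb: float) -> str:
    """Return the largest Whisper model that fits the given VRAM budget."""
    # Check from most to least demanding and take the first that fits.
    for name in ("large", "turbo", "medium", "small", "base", "tiny"):
        if WHISPER_VRAM_GB[name] <= vram_gb:
            return name
    return "tiny"  # smallest model; also the natural CPU-fallback choice

print(pick_whisper_model(6))    # turbo
print(pick_whisper_model(1.5))  # base
```

Whether `turbo` or `large` is "better" at 10+ GB depends on whether you value speed or accuracy; the ordering above is just one reasonable choice.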
## Requirements

- Python 3.11+
- CUDA GPU with 6+ GB VRAM (for TTS; transcription can run on CPU)
- FFmpeg — required by Whisper for audio processing
- uv package manager (recommended)

### Installing FFmpeg

```bash
# Windows
winget install ffmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```
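You can confirm FFmpeg is reachable before launching the app with a standard-library check (`has_ffmpeg` is an illustrative name, not a VocalFlow function):

```python
import shutil

def has_ffmpeg() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if not has_ffmpeg():
    print("FFmpeg not found on PATH - install it before transcribing audio")
```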
## Quick Start

### Option 1: From source (recommended)

```bash
git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py
```

### Option 2: From PyPI

```bash
pip install vocalflow

# Launch the web app
vocalflow
```

### Option 3: Transcription only (CLI)

```bash
pip install vocalflow

# Transcribe an audio file
vocalflow-transcribe audio.mp3 --model medium --language en
```

Then open http://localhost:5001 in your browser.

**First run:** The TTS and Whisper models will be downloaded automatically. This may take a few minutes depending on your connection.
## Project Structure

```
VocalFlow/
├── app.py            # Main web application (FastHTML + routes + UI)
├── transcribe.py     # Whisper transcription module (also usable as CLI)
├── pyproject.toml    # Dependencies and build config
├── uv.lock           # Locked dependency versions
├── audio/            # Generated speech files (gitignored, created at runtime)
├── uploads/          # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/      # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml        # Lint + syntax checks on push/PR
    ├── release.yml   # Auto GitHub Release on version tags
    └── publish.yml   # Auto PyPI publish on version tags
```
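Since `audio/`, `uploads/`, and `transcripts/` are gitignored and created at runtime, a fresh clone needs them created on startup. A minimal sketch of that pattern with `pathlib` (illustrative only — `ensure_runtime_dirs` is not VocalFlow's actual code):

```python
from pathlib import Path

# Runtime directories from the structure above.
RUNTIME_DIRS = ("audio", "uploads", "transcripts")

def ensure_runtime_dirs(root: Path) -> list[Path]:
    """Create each runtime directory under `root` if missing; return the paths."""
    paths = [root / name for name in RUNTIME_DIRS]
    for p in paths:
        p.mkdir(parents=True, exist_ok=True)  # no-op if it already exists
    return paths
```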
## CLI Transcription

The transcriber can also be used standalone from the command line:

```bash
# Basic usage
uv run python transcribe.py audio.mp3

# With options
uv run python transcribe.py audio.mp3 --model medium --language en --device cuda
```

Output is a JSON file with word-level timestamps:

```json
[
  { "word": "Hello", "start": 0.0, "end": 0.32 },
  { "word": "world", "start": 0.34, "end": 0.68 }
]
```
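The word-level JSON is easy to post-process. For example, here is a sketch that groups words into timed subtitle lines using only the standard library (the eight-word grouping is an arbitrary choice for illustration, not something VocalFlow prescribes):

```python
import json

def to_lines(words, max_words=8):
    """Group word entries into (start, end, text) tuples for subtitles."""
    lines = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        lines.append((chunk[0]["start"], chunk[-1]["end"], text))
    return lines

# Parse the example output shown above.
transcript = json.loads(
    '[{"word": "Hello", "start": 0.0, "end": 0.32},'
    ' {"word": "world", "start": 0.34, "end": 0.68}]'
)
print(to_lines(transcript))  # [(0.0, 0.68, 'Hello world')]
```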
## GPU Verification

To verify your GPU is detected:

```bash
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
```
## Contributing

Contributions are welcome! Here's how to get started:

### Setting up for development

```bash
git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
```

### Code style

We use ruff for linting:

```bash
uvx ruff check app.py transcribe.py
```

CI runs this automatically on every push and PR.

### Making changes

1. Fork the repository
2. Create a branch for your feature or fix: `git checkout -b feature/my-feature`
3. Make your changes and ensure lint passes: `uvx ruff check app.py transcribe.py`
4. Commit with a clear message describing what and why
5. Open a Pull Request against `main`
### Areas where help is appreciated

- Additional TTS model backends (e.g., Bark, StyleTTS2, F5-TTS)
- Speaker diarization in transcription
- Batch processing (multiple files at once)
- Audio post-processing (noise reduction, normalization)
- UI improvements and accessibility
- Documentation and tutorials
- Testing and bug reports
- Packaging for different platforms

### Reporting issues

Found a bug or have a feature request? Open an issue with:

- What you expected vs. what happened
- Steps to reproduce
- Your OS, Python version, and GPU info
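One way to gather that environment information in a single step (an illustrative snippet, not a VocalFlow command; GPU details are reported only if `torch` happens to be installed):

```python
import platform
import sys

def env_report() -> str:
    """Collect OS and Python details for a bug report."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {sys.version.split()[0]}",
    ]
    try:
        import torch  # optional: present only if the project venv is active
        lines.append(f"CUDA available: {torch.cuda.is_available()}")
    except ImportError:
        lines.append("CUDA available: unknown (torch not installed)")
    return "\n".join(lines)

print(env_report())
```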
## Releasing

Maintainers can create a new release by bumping the version in `pyproject.toml` and tagging:

```bash
# 1. Update version in pyproject.toml
# 2. Update lock file
uv lock

# 3. Commit, tag, and push
git add pyproject.toml uv.lock
git commit -m "Bump version to X.Y.Z"
git tag vX.Y.Z
git push origin main --tags
```

This automatically:

- Runs CI checks
- Creates a GitHub Release with build artifacts
- Publishes to PyPI
## License

MIT

## Acknowledgments

- Qwen3-TTS by Alibaba Cloud — text-to-speech models
- OpenAI Whisper — speech recognition
- FastHTML — Python web framework
- HTMX — frontend interactivity