Local TTS and audio transcription web app
VocalFlow
Local voice cloning, custom voices, voice design, and audio transcription — all in one web app.
What is VocalFlow?
VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides four core capabilities: three speech-synthesis modes powered by the Qwen3-TTS model family, plus transcription powered by OpenAI Whisper:
- Voice Cloning — upload a short reference audio clip and clone the speaker's voice for new text
- Custom Voice — use 9 premium preset speakers with instruction control for emotion, tone, and style
- Voice Design — create entirely new voices from natural language descriptions, no reference audio needed
- Audio Transcription — transcribe audio files with word-level timestamps
No cloud APIs. No data leaves your machine. Just your GPU doing the work.
Features
Voice Cloning (Base model)
- Upload a reference audio clip (3-10 seconds) and clone the speaker's voice
- Optionally provide a transcript of the reference audio for higher quality (ICL mode)
- Choose between 1.7B (best quality) and 0.6B (faster) model sizes
- Save cloned voice prompts as .pt files and reuse them without re-uploading audio
- 10+ languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Auto language detection
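The 3-10 second guideline for reference clips can be checked before uploading. The sketch below is illustrative only and is not part of VocalFlow: the function names are hypothetical, and it handles only uncompressed WAV files through Python's standard wave module.

```python
import wave

MIN_SECONDS = 3.0   # lower bound quoted in the Voice Cloning list
MAX_SECONDS = 10.0  # upper bound quoted in the Voice Cloning list


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(duration: float) -> bool:
    """True when the clip length falls inside the suggested 3-10 s window."""
    return MIN_SECONDS <= duration <= MAX_SECONDS
```

Other formats such as .mp3 or .flac would need a decoder (for example FFmpeg) before a similar check could be applied.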
Custom Voice (9 preset speakers)
- Speakers: Vivian, Serena, Uncle Fu, Dylan, Eric (Chinese), Ryan, Aiden (English), Ono Anna (Japanese), Sohee (Korean)
- Instruction control (1.7B only): tell the model how to speak — "say it angrily", "whisper softly", "speak with excitement"
- Choose between 1.7B (instructions + speakers) and 0.6B (speakers only) model sizes
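The mode and size rules described so far can be summarized as a small lookup. This is a minimal sketch, not VocalFlow's actual code: the mode keys ("clone", "custom", "design") and the resolve_model function are hypothetical, while the model names are the ones listed in the Models section of this README.

```python
from typing import Optional

# Model names as listed in the Models section; the mode keys are made up here.
MODELS = {
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
    ("custom", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("custom", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def resolve_model(mode: str, size: str = "1.7B",
                  instruction: Optional[str] = None) -> str:
    """Map a mode and size to a model name, enforcing the 1.7B-only rule."""
    if (mode, size) not in MODELS:
        raise ValueError(f"no {size} model for mode {mode!r}")
    # Instruction control is documented as a 1.7B-only feature.
    if instruction and mode == "custom" and size != "1.7B":
        raise ValueError("instruction control needs the 1.7B CustomVoice model")
    return MODELS[(mode, size)]
```

For example, resolve_model("custom", "0.6B", "whisper softly") raises a ValueError, while the same call with "1.7B" returns the CustomVoice model name.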
Voice Design (create a voice from text)
- Describe the voice you want in natural language: age, gender, tone, emotion, speaking style
- No reference audio needed — the model invents a completely new voice
- Uses the 1.7B VoiceDesign model
Audio Transcription
- Upload any audio file (.wav, .mp3, .flac, .ogg)
- Choose from 6 Whisper model sizes (tiny to large) based on your speed/accuracy needs
- Full transcript with word-level timestamps
- Download as structured JSON
- Auto-detect language or specify manually
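To illustrate what word-level timestamps in structured JSON might look like, here is a minimal sketch that flattens a result from the openai-whisper package into a JSON document. The output layout and the words_to_json name are assumptions for illustration; the JSON that VocalFlow actually writes may be shaped differently.

```python
import json
from typing import Any


def words_to_json(result: dict[str, Any]) -> str:
    """Flatten a Whisper result (made with word_timestamps=True) to JSON."""
    words = [
        {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
        for segment in result.get("segments", [])
        for w in segment.get("words", [])
    ]
    payload = {
        "language": result.get("language"),
        "text": result.get("text", "").strip(),
        "words": words,
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Usage with openai-whisper installed (file name is a placeholder):
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("speech.wav", word_timestamps=True)
#   print(words_to_json(result))
```

Omitting the language argument to transcribe lets Whisper detect the language itself, which corresponds to the auto-detect option above.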
Smart Model Management
- Automatic model switching — only one model is kept in VRAM at a time; switching between models (e.g. Voice Clone → Transcribe) automatically unloads the previous model and frees GPU memory
- Manual unload — one-click button to release all GPU memory
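The one-model-at-a-time policy can be sketched as a small manager that evicts the current model before loading the next. This is an illustrative pattern under stated assumptions, not VocalFlow's implementation: the SingleModelManager class and its loader callbacks are hypothetical, and the CUDA cache is cleared only if PyTorch happens to be installed.

```python
import gc
from typing import Any, Callable, Optional


class SingleModelManager:
    """Keep at most one loaded model, evicting the previous one on a switch."""

    def __init__(self) -> None:
        self.name: Optional[str] = None
        self.model: Any = None

    def get(self, name: str, loader: Callable[[], Any]) -> Any:
        """Return the model called name, loading it only if it is not current."""
        if name != self.name:
            self.unload()          # free the old model before loading a new one
            self.model = loader()
            self.name = name
        return self.model

    def unload(self) -> None:
        """Drop the current model and ask PyTorch to release cached GPU memory."""
        self.model = None
        self.name = None
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # no PyTorch available; nothing further to release
```

Calling unload() directly corresponds to the manual unload button. Unloading before loading, rather than after, keeps peak VRAM use close to the size of a single model.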
Interface
- Clean dark-themed UI built with Gradio
- Compact sidebar with live-updating generation history
- Per-item download and delete buttons in the history sidebar
- Color-coded history entries per mode (Clone, Custom Voice, Voice Design, Transcribe)
- Inline audio playback
- Flash Attention 2 support for faster inference
Models
Qwen3-TTS (Speech Synthesis)
| Model | Parameters | Purpose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | Voice cloning from reference audio |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6B | Faster voice cloning |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | 9 preset speakers + instruction control |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6B | 9 preset speakers (no instructions) |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Create voices from text descriptions |
- All models run in bfloat16 with SDPA attention
- Flash Attention 2 enabled when available
- Downloaded automatically on first use
- Requires a CUDA GPU with at least 6 GB VRAM
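Choosing between SDPA and Flash Attention 2 usually comes down to whether the optional flash_attn package is installed. The sketch below shows one way to make that choice; the returned strings mirror the attn_implementation option in Hugging Face Transformers, and whether VocalFlow passes them to its models in this form is an assumption.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Prefer Flash Attention 2 when flash_attn is installed, else use SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

This only checks that the package can be found; it does not verify that the installed build matches the local GPU or CUDA version.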
OpenAI Whisper (Transcription)
| Model | Parameters | VRAM | Relative Speed |
|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x realtime |
| base | 74M | ~1 GB | ~16x realtime |
| small | 244M | ~2 GB | ~6x realtime |
| medium | 769M | ~5 GB | ~2x realtime |
| large | 1.5B | ~10 GB | ~1x realtime |
| turbo | 809M | ~6 GB | ~8x realtime |
- Models are downloaded automatically on first use
- Falls back to CPU if no CUDA GPU is detected
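The VRAM column above can drive a simple model choice. The following sketch picks the first model whose approximate VRAM need fits the available memory and falls back to CPU otherwise. The preference order (large before turbo), the choice of tiny on CPU, and the pick_whisper_model name are all assumptions made for illustration.

```python
from typing import Optional

# (model name, approximate VRAM in GB) taken from the table above,
# listed in an assumed order of preference.
WHISPER_VRAM_GB = [
    ("large", 10.0),
    ("turbo", 6.0),
    ("medium", 5.0),
    ("small", 2.0),
    ("base", 1.0),
    ("tiny", 1.0),
]


def pick_whisper_model(free_vram_gb: Optional[float]) -> tuple[str, str]:
    """Return (model_name, device) for the given free VRAM, or a CPU fallback."""
    if free_vram_gb is not None:
        for name, needed in WHISPER_VRAM_GB:
            if free_vram_gb >= needed:
                return name, "cuda"
    # No GPU, or too little VRAM: use the smallest model on CPU (assumed default).
    return "tiny", "cpu"
```

Passing None stands for "no CUDA GPU detected". With PyTorch installed, the free memory could be read from torch.cuda.mem_get_info() and converted to gigabytes.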
Requirements
- Python 3.11 and FFmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
- SoX (winget install sox or install from SourceForge)
- CUDA GPU with 6+ GB VRAM for TTS (transcription can run on CPU)
Quick Start
Install
pip install vocalflow
Launch the web app
vocalflow
Then open http://localhost:5001 in your browser.
From source
git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py
First run: The TTS and Whisper models will be downloaded automatically on first use. Only the model you select is downloaded — you don't need all of them.
Project Structure
VocalFlow/
├── app.py # Main web application (Gradio UI + all TTS modes)
├── transcribe.py # Whisper transcription module (also usable as CLI)
├── main.py # CLI entry points (vocalflow, vocalflow-transcribe)
├── pyproject.toml # Dependencies and build config
├── uv.lock # Locked dependency versions
├── models/ # Cached model weights (gitignored, created at runtime)
├── audio/ # Generated speech files (gitignored, created at runtime)
├── voices/ # Saved voice prompt files (gitignored, created at runtime)
├── uploads/ # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/ # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml # Lint + syntax checks on push/PR
    ├── release.yml # Auto GitHub Release on version tags
    └── publish.yml # Auto PyPI publish on version tags
Contributing
Contributions are welcome!
- Fork the repo and clone it
- Run uv sync to install dependencies
- Create a branch, make changes, and ensure uvx ruff check app.py transcribe.py main.py passes
- Open a Pull Request against main
Areas where help is appreciated: streaming TTS, batch processing, audio post-processing, UI improvements, and testing.
Found a bug? Open an issue.
License
MIT
Acknowledgments
- Qwen3-TTS by Alibaba Cloud — voice cloning, custom voice, and voice design models
- OpenAI Whisper — speech recognition
- Gradio — web UI framework