Local TTS and audio transcription web app

VocalFlow

Local voice cloning, custom voices, voice design, and audio transcription — all in one web app.

What is VocalFlow?

VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides four core capabilities: three speech-synthesis modes powered by the Qwen3-TTS model family, and transcription powered by OpenAI Whisper:

Voice Cloning — upload a short reference audio clip and clone the speaker's voice for new text
Custom Voice — use 9 premium preset speakers with instruction control for emotion, tone, and style
Voice Design — create entirely new voices from natural language descriptions, no reference audio needed
Audio Transcription — transcribe audio files with word-level timestamps

No cloud APIs. No data leaves your machine. Just your GPU doing the work.

Features

Voice Cloning (Base model)

Upload a reference audio clip (3-10 seconds) and clone the speaker's voice
Optionally provide a transcript of the reference audio for higher quality (in-context learning mode, ICL)
Choose between 1.7B (best quality) and 0.6B (faster) model sizes
Save cloned voice prompts as .pt files and reuse them without re-uploading audio
10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Auto language detection
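
The 3-10 second guideline for reference clips can be checked before uploading. The sketch below uses only the standard-library `wave` module, so it handles uncompressed WAV files only. The function names and default bounds are illustrative and are not part of VocalFlow.

```python
import wave


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended 3-10 s range."""
    return min_s <= clip_duration_seconds(path) <= max_s
```

For other formats (.mp3, .flac, .ogg) a tool such as FFmpeg would be needed to read the duration; that is outside this sketch.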

Custom Voice (9 preset speakers)

Speakers: Vivian, Serena, Uncle Fu, Dylan, and Eric (Chinese); Ryan and Aiden (English); Ono Anna (Japanese); Sohee (Korean)
Instruction control (1.7B only): tell the model how to speak — "say it angrily", "whisper softly", "speak with excitement"
Choose between 1.7B (instructions + speakers) and 0.6B (speakers only) model sizes

Voice Design (create a voice from text)

Describe the voice you want in natural language: age, gender, tone, emotion, speaking style
No reference audio needed — the model invents a completely new voice
Uses the 1.7B VoiceDesign model

Audio Transcription

Upload any audio file (.wav, .mp3, .flac, .ogg)
Choose from 6 Whisper model sizes (tiny through large, plus turbo) based on your speed/accuracy needs
Full transcript with word-level timestamps
Download as structured JSON
Auto-detect language or specify manually
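
As an illustration of word-level timestamps in structured JSON, the sketch below flattens a Whisper-style result dictionary (the shape returned by the `openai-whisper` package when `word_timestamps=True` is passed to `transcribe`) into a single list of timed words. The output layout and the function names are assumptions made for this example; VocalFlow's actual JSON schema is not documented here and may differ.

```python
import json


def words_from_result(result: dict) -> list:
    """Flatten Whisper-style segments into one list of timed words."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "word": w["word"].strip(),  # Whisper words carry a leading space
                "start": round(w["start"], 2),
                "end": round(w["end"], 2),
            })
    return words


def to_transcript_json(result: dict) -> str:
    """Serialize language, full text, and timed words (hypothetical layout)."""
    payload = {
        "language": result.get("language"),
        "text": result.get("text", "").strip(),
        "words": words_from_result(result),
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Hand-written sample in the shape Whisper returns; not real model output.
sample = {
    "language": "en",
    "text": " Hello world.",
    "segments": [{
        "start": 0.0, "end": 1.2, "text": " Hello world.",
        "words": [
            {"word": " Hello", "start": 0.0, "end": 0.48},
            {"word": " world.", "start": 0.52, "end": 1.2},
        ],
    }],
}

if __name__ == "__main__":
    print(to_transcript_json(sample))
```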

Smart Model Management

Automatic model switching — only one model is kept in VRAM at a time; switching between models (e.g. Voice Clone → Transcribe) automatically unloads the previous model and frees GPU memory
Manual unload — one-click button to release all GPU memory
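
The one-model-in-VRAM policy described above can be sketched as a small holder object that evicts the current model before loading a different one. This is an assumption about how such switching might be structured, not VocalFlow's actual code; `SingleModelSlot` and its methods are hypothetical names, and the loader callables stand in for whatever loads a real model.

```python
import gc
from typing import Any, Callable, Optional


class SingleModelSlot:
    """Keeps at most one loaded model; requesting another evicts the current one."""

    def __init__(self) -> None:
        self.name: Optional[str] = None
        self.model: Any = None

    def unload(self) -> None:
        """Drop the current model and ask PyTorch to release cached GPU memory."""
        self.name, self.model = None, None
        gc.collect()
        try:
            import torch  # optional: only needed to free the CUDA cache
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass

    def get(self, name: str, loader: Callable[[], Any]) -> Any:
        """Return the model called `name`, loading it (and evicting any other) if needed."""
        if self.name != name:
            self.unload()
            self.model = loader()
            self.name = name
        return self.model
```

Repeated calls with the same name reuse the loaded model, so only a mode switch pays the load cost. A manual unload button would simply call `unload()`.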

Interface

Clean dark-themed UI built with Gradio
Compact sidebar with live-updating generation history
Per-item download and delete buttons in the history sidebar
Color-coded history entries per mode (Clone, Custom Voice, Voice Design, Transcribe)
Inline audio playback
Flash Attention 2 support for faster inference

Models

Qwen3-TTS (Speech Synthesis)

Model	Parameters	Purpose
Qwen3-TTS-12Hz-1.7B-Base	1.7B	Voice cloning from reference audio
Qwen3-TTS-12Hz-0.6B-Base	0.6B	Faster voice cloning
Qwen3-TTS-12Hz-1.7B-CustomVoice	1.7B	9 preset speakers + instruction control
Qwen3-TTS-12Hz-0.6B-CustomVoice	0.6B	9 preset speakers (no instructions)
Qwen3-TTS-12Hz-1.7B-VoiceDesign	1.7B	Create voices from text descriptions

All models run in bfloat16 with SDPA attention by default
Flash Attention 2 is used instead when it is available
Downloaded automatically on first use
Requires a CUDA GPU with at least 6 GB VRAM
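
The table above implies a simple mapping from mode and size to a model name, with two constraints: Voice Design exists only at 1.7B, and instruction control needs the 1.7B CustomVoice model. The sketch below encodes that mapping. The mode keys (`"clone"`, `"custom"`, `"design"`) and the function name are made up for this example; only the model names come from the table.

```python
from typing import Optional

QWEN3_TTS_MODELS = {
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
    ("custom", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("custom", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def resolve_model(mode: str, size: str = "1.7B", instruction: Optional[str] = None) -> str:
    """Return the model name for a mode/size pair, rejecting unsupported combinations."""
    key = (mode, size)
    if key not in QWEN3_TTS_MODELS:
        raise ValueError(f"no {size} model for mode {mode!r}")
    if mode == "custom" and instruction and size != "1.7B":
        raise ValueError("instruction control requires the 1.7B CustomVoice model")
    return QWEN3_TTS_MODELS[key]
```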

OpenAI Whisper (Transcription)

Model	Parameters	VRAM	Relative Speed (vs. `large`)
`tiny`	39M	~1 GB	~32x
`base`	74M	~1 GB	~16x
`small`	244M	~2 GB	~6x
`medium`	769M	~5 GB	~2x
`large`	1.5B	~10 GB	1x
`turbo`	809M	~6 GB	~8x

Models are downloaded automatically on first use
Falls back to CPU if no CUDA GPU is detected
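
One way to use the VRAM column above is to pick the most capable Whisper model that fits in the memory you have free. The sketch below does that with the approximate figures from the table. The preference order (`large`, then `turbo`, then progressively smaller models) is an assumption for illustration, and the function name is hypothetical.

```python
# Approximate VRAM needs in GB, copied from the table above.
WHISPER_VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10, "turbo": 6,
}

# Assumed preference order: try the most capable model first.
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]


def pick_whisper_model(free_vram_gb: float) -> str:
    """Return the first preferred model whose VRAM estimate fits the budget."""
    for name in PREFERENCE:
        if WHISPER_VRAM_GB[name] <= free_vram_gb:
            return name
    # Nothing fits on the GPU: choose the smallest model, which suits CPU use.
    return "tiny"
```

With 8 GB free this returns `turbo`; with 2 GB it returns `small`.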

Requirements

Python 3.11
FFmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
SoX (winget install sox, or install from SourceForge)
CUDA GPU with 6+ GB VRAM for TTS (transcription can run on CPU)
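
A quick way to confirm the command-line prerequisites before installing is to look for them on `PATH`. The sketch below uses only the standard library. It treats Python 3.11 as a minimum version, which is an assumption, and it does not check for a CUDA GPU. The function name is illustrative.

```python
import shutil
import sys


def check_prerequisites(min_python: tuple = (3, 11)) -> dict:
    """Report whether the interpreter version and external tools look usable."""
    return {
        "python": sys.version_info >= min_python,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "sox": shutil.which("sox") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```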

Quick Start

Install

pip install vocalflow

Launch the web app

vocalflow

Then open http://localhost:5001 in your browser.

From source

git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py

First run: The TTS and Whisper models will be downloaded automatically on first use. Only the model you select is downloaded — you don't need all of them.

Project Structure

VocalFlow/
├── app.py              # Main web application (Gradio UI + all TTS modes)
├── transcribe.py       # Whisper transcription module (also usable as CLI)
├── main.py             # CLI entry points (vocalflow, vocalflow-transcribe)
├── pyproject.toml      # Dependencies and build config
├── uv.lock             # Locked dependency versions
├── models/             # Cached model weights (gitignored, created at runtime)
├── audio/              # Generated speech files (gitignored, created at runtime)
├── voices/             # Saved voice prompt files (gitignored, created at runtime)
├── uploads/            # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/        # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml          # Lint + syntax checks on push/PR
    ├── release.yml     # Auto GitHub Release on version tags
    └── publish.yml     # Auto PyPI publish on version tags

Contributing

Contributions are welcome!

Fork the repo and clone your fork
Run uv sync to install dependencies
Create a branch, make your changes, and check that uvx ruff check app.py transcribe.py main.py passes
Open a pull request against main

Areas where help is appreciated: streaming TTS, batch processing, audio post-processing, UI improvements, and testing.

Found a bug? Open an issue.

License

MIT

Acknowledgments

Qwen3-TTS by Alibaba Cloud — voice cloning, custom voice, and voice design models
OpenAI Whisper — speech recognition
Gradio — web UI framework

Project details

Release history

Version	Released
0.7.1	Mar 23, 2026
0.7.0	Mar 23, 2026
0.6.2	Mar 23, 2026
0.6.1	Mar 23, 2026
0.6.0 (this version)	Mar 23, 2026
0.3.2	Mar 18, 2026
0.3.1	Mar 17, 2026
0.3.0	Mar 17, 2026
0.2.1	Mar 17, 2026
0.1.0	Mar 17, 2026

Download files

Source Distribution

vocalflow-0.6.0.tar.gz (69.3 kB)

Uploaded Mar 23, 2026 Source

Built Distribution

vocalflow-0.6.0-py3-none-any.whl (16.3 kB)

Uploaded Mar 23, 2026 Python 3

File details

Details for the file vocalflow-0.6.0.tar.gz.

File metadata

Download URL: vocalflow-0.6.0.tar.gz
Upload date: Mar 23, 2026
Size: 69.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vocalflow-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`95249ba09ec0686f05289311338f67e2305ccdebc001a4fafe92197ef1a7dd23`
MD5	`8f2cfa44a6277c18e1c2712bee2db9d2`
BLAKE2b-256	`ebf9813b8083ad6e54a3934d1672cd3701ed5d3ddff7afcd23d625cebdeb0580`
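
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below streams the file through `hashlib` and compares the result. The function names are illustrative, and the file path you pass is assumed to be wherever you saved `vocalflow-0.6.0.tar.gz`.

```python
import hashlib

# SHA256 digest listed above for vocalflow-0.6.0.tar.gz.
EXPECTED_SHA256 = "95249ba09ec0686f05289311338f67e2305ccdebc001a4fafe92197ef1a7dd23"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

pip can also enforce digests at install time through its hash-checking mode (`--require-hashes` with pinned hashes in a requirements file).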

Provenance

The following attestation bundles were made for vocalflow-0.6.0.tar.gz:

Publisher: publish.yml on 0xBinayak/VocalFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vocalflow-0.6.0.tar.gz
- Subject digest: 95249ba09ec0686f05289311338f67e2305ccdebc001a4fafe92197ef1a7dd23
- Sigstore transparency entry: 1156192147
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: 0xBinayak/VocalFlow@4d5ba223436d1f4e0a2215c3d614e7b0fe0409df
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/0xBinayak
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4d5ba223436d1f4e0a2215c3d614e7b0fe0409df
- Trigger Event: push

File details

Details for the file vocalflow-0.6.0-py3-none-any.whl.

File metadata

Download URL: vocalflow-0.6.0-py3-none-any.whl
Upload date: Mar 23, 2026
Size: 16.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vocalflow-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`128b39faf462c72706c1ffd52dd381e976301d9a2d704f72c8e1c2012fa1481c`
MD5	`df375d72b80723babcef9b1e03dda772`
BLAKE2b-256	`ee98d9ff62c78f80f1d387bd42ece97fe3abc014dc9c46c04fe62b2b25b0beb2`

Provenance

The following attestation bundles were made for vocalflow-0.6.0-py3-none-any.whl:

Publisher: publish.yml on 0xBinayak/VocalFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vocalflow-0.6.0-py3-none-any.whl
- Subject digest: 128b39faf462c72706c1ffd52dd381e976301d9a2d704f72c8e1c2012fa1481c
- Sigstore transparency entry: 1156192149
- Sigstore integration time: Mar 23, 2026
Source repository:
- Permalink: 0xBinayak/VocalFlow@4d5ba223436d1f4e0a2215c3d614e7b0fe0409df
- Branch / Tag: refs/tags/v0.6.0
- Owner: https://github.com/0xBinayak
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4d5ba223436d1f4e0a2215c3d614e7b0fe0409df
- Trigger Event: push
