Local TTS and audio transcription web app

VocalFlow

Local voice cloning, custom voices, voice design, and audio transcription — all in one web app.


What is VocalFlow?

VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides four core capabilities: three speech-synthesis modes powered by the Qwen3-TTS model family, and audio transcription powered by OpenAI Whisper:

  • Voice Cloning — upload a short reference audio clip and clone the speaker's voice for new text
  • Custom Voice — use 9 premium preset speakers with instruction control for emotion, tone, and style
  • Voice Design — create entirely new voices from natural language descriptions, no reference audio needed
  • Audio Transcription — transcribe audio files with word-level timestamps

No cloud APIs. No data leaves your machine. Just your GPU doing the work.

Features

Voice Cloning (Base model)

  • Upload a reference audio clip (3-10 seconds) and clone the speaker's voice
  • Optionally provide a transcript of the reference audio for higher quality (ICL mode)
  • Choose between 1.7B (best quality) and 0.6B (faster) model sizes
  • Save cloned voice prompts as .pt files and reuse them without re-uploading audio
  • 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Auto language detection

Custom Voice (9 preset speakers)

  • Speakers: Vivian, Serena, Uncle Fu, Dylan, Eric (Chinese), Ryan, Aiden (English), Ono Anna (Japanese), Sohee (Korean)
  • Instruction control (1.7B only): tell the model how to speak — "say it angrily", "whisper softly", "speak with excitement"
  • Choose between 1.7B (instructions + speakers) and 0.6B (speakers only) model sizes

Voice Design (create a voice from text)

  • Describe the voice you want in natural language: age, gender, tone, emotion, speaking style
  • No reference audio needed — the model invents a completely new voice
  • Uses the 1.7B VoiceDesign model

Audio Transcription

  • Upload any audio file (.wav, .mp3, .flac, .ogg)
  • Choose from 6 Whisper models (tiny through large, plus turbo) based on your speed/accuracy needs
  • Full transcript with word-level timestamps
  • Download as structured JSON
  • Auto-detect language or specify manually
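
As a rough illustration of what a structured transcript with word-level timestamps can look like, the sketch below flattens a Whisper-style result into JSON. The input shape follows what the openai-whisper package returns from transcribe(..., word_timestamps=True); the output layout and the function name are assumptions made for this example, not the schema VocalFlow actually writes.

```python
import json


def words_to_json(result: dict) -> str:
    """Flatten a Whisper-style result dict into a JSON transcript with word timings."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "word": w["word"].strip(),       # Whisper words carry a leading space
                "start": round(w["start"], 2),   # seconds from the start of the audio
                "end": round(w["end"], 2),
            })
    payload = {
        "language": result.get("language"),
        "text": result.get("text", "").strip(),
        "words": words,
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Hand-written sample in the shape Whisper returns with word_timestamps=True.
sample = {
    "language": "en",
    "text": " Hello world.",
    "segments": [
        {"words": [
            {"word": " Hello", "start": 0.0, "end": 0.42},
            {"word": " world.", "start": 0.42, "end": 0.9},
        ]}
    ],
}
transcript_json = words_to_json(sample)
```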

Smart Model Management

  • Automatic model switching — only one model is kept in VRAM at a time; switching between models (e.g. Voice Clone → Transcribe) automatically unloads the previous model and frees GPU memory
  • Manual unload — one-click button to release all GPU memory
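
The "one model in VRAM at a time" behaviour can be pictured as a single slot that releases its current occupant before loading the next. This is a minimal sketch of that idea, not VocalFlow's implementation: the class name, keys, and release hook are all hypothetical. In a PyTorch setting the hook would typically drop references to the model and then call gc.collect() and torch.cuda.empty_cache().

```python
from typing import Any, Callable, Optional


class SingleModelSlot:
    """Hold at most one loaded model; loading another releases the current one first."""

    def __init__(self, release: Callable[[Any], None]):
        self._release = release            # hook that frees the model's memory
        self._key: Optional[str] = None
        self._model: Any = None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the model for `key`, loading it (and evicting any other) if needed."""
        if self._key == key:
            return self._model             # already loaded, reuse it
        self.unload()
        self._model = loader()
        self._key = key
        return self._model

    def unload(self) -> None:
        """Release whatever is loaded (the manual 'unload' button in this sketch)."""
        if self._key is not None:
            self._release(self._model)
            self._model = None
            self._key = None

    @property
    def loaded(self) -> Optional[str]:
        return self._key


released = []                              # records what the release hook was given
slot = SingleModelSlot(release=released.append)
slot.get("clone-1.7B", lambda: "tts-model")
slot.get("whisper-base", lambda: "asr-model")  # releases "tts-model" first
```

Passing the loader as a callable keeps the slot independent of any particular model library.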

Interface

  • Clean dark-themed UI built with Gradio
  • Compact sidebar with live-updating generation history
  • Per-item download and delete buttons in the history sidebar
  • Color-coded history entries per mode (Clone, Custom Voice, Voice Design, Transcribe)
  • Inline audio playback
  • Flash Attention 2 support for faster inference

Models

Qwen3-TTS (Speech Synthesis)

Model                              Parameters  Purpose
Qwen3-TTS-12Hz-1.7B-Base           1.7B        Voice cloning from reference audio
Qwen3-TTS-12Hz-0.6B-Base           0.6B        Faster voice cloning
Qwen3-TTS-12Hz-1.7B-CustomVoice    1.7B        9 preset speakers + instruction control
Qwen3-TTS-12Hz-0.6B-CustomVoice    0.6B        9 preset speakers (no instructions)
Qwen3-TTS-12Hz-1.7B-VoiceDesign    1.7B        Create voices from text descriptions
  • All models run in bfloat16, with SDPA attention by default
  • Flash Attention 2 is used instead when it is available
  • Downloaded automatically on first use
  • Requires a CUDA GPU with at least 6 GB VRAM
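
The mode and size combinations in the table can be expressed as a small lookup that rejects unsupported choices, such as a 0.6B VoiceDesign model or instruction control on the 0.6B CustomVoice model. This is a sketch only: the mode keys ("clone", "custom", "design") and the function name are hypothetical labels for this example, and the model names are copied from the table as-is.

```python
from typing import Optional

# (mode, size) -> model name, taken from the table above
TTS_MODELS = {
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
    ("custom", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("custom", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def resolve_tts_model(mode: str, size: str = "1.7B",
                      instruction: Optional[str] = None) -> str:
    """Map a mode and size to a model name, rejecting unsupported combinations."""
    try:
        name = TTS_MODELS[(mode, size)]
    except KeyError:
        raise ValueError(f"no {size} model for mode {mode!r}") from None
    # Instruction control is listed for the 1.7B CustomVoice model only.
    if instruction and (mode, size) == ("custom", "0.6B"):
        raise ValueError("instruction control requires the 1.7B CustomVoice model")
    return name


clone_fast = resolve_tts_model("clone", "0.6B")
styled = resolve_tts_model("custom", "1.7B", instruction="whisper softly")
```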

OpenAI Whisper (Transcription)

Model    Parameters  VRAM     Relative Speed
tiny     39M         ~1 GB    ~32x realtime
base     74M         ~1 GB    ~16x realtime
small    244M        ~2 GB    ~6x realtime
medium   769M        ~5 GB    ~2x realtime
large    1.5B        ~10 GB   ~1x realtime
turbo    1.5B        ~6 GB    ~8x realtime
  • Models are downloaded automatically on first use
  • Falls back to CPU if no CUDA GPU is detected
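
One way to read the table is as a budget: pick the largest model that fits the available VRAM and still meets a minimum speed. The helper below is a hypothetical sketch of that selection, not something VocalFlow provides. It uses the approximate figures from the table and treats VRAM footprint as a rough stand-in for accuracy, which is an assumption.

```python
# (name, approx. VRAM in GB, approx. speed relative to realtime), from the table above
WHISPER_MODELS = [
    ("tiny", 1, 32), ("base", 1, 16), ("small", 2, 6),
    ("medium", 5, 2), ("large", 10, 1), ("turbo", 6, 8),
]


def pick_whisper_model(vram_gb: float, min_speed: float = 1.0) -> str:
    """Pick the largest-footprint model that fits in VRAM and meets the speed floor."""
    candidates = [
        m for m in WHISPER_MODELS
        if m[1] <= vram_gb and m[2] >= min_speed
    ]
    if not candidates:
        # Nothing fits: fall back to the smallest model, which is the
        # most practical choice for CPU-only transcription.
        return "tiny"
    # Largest VRAM first; on a tie prefer the slower (larger) model.
    return max(candidates, key=lambda m: (m[1], -m[2]))[0]


choice_8gb = pick_whisper_model(8)
choice_fast = pick_whisper_model(12, min_speed=4)
```

With these figures an 8 GB budget selects turbo, and so does a 12 GB budget with a 4x realtime floor, because large runs at roughly 1x.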

Requirements

  • Python 3.11 and FFmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
  • SoX (winget install sox, or install from SourceForge)
  • CUDA GPU with 6+ GB VRAM for TTS (transcription can run on CPU)
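
A quick preflight check for the non-GPU requirements can be written with the standard library alone. The function below is a hypothetical helper, not part of VocalFlow. It treats Python 3.11 as a minimum version, which is an assumption (the list above names only "Python 3.11"), and it does not check the GPU.

```python
import shutil
import sys


def missing_requirements(min_python=(3, 11), tools=("ffmpeg", "sox")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]} or newer required (found {found})"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in missing_requirements():
    print("Missing:", problem)
```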

Quick Start

Install

pip install vocalflow

Launch the web app

vocalflow

Then open http://localhost:5001 in your browser.

From source

git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py

First run: The TTS and Whisper models will be downloaded automatically on first use. Only the model you select is downloaded — you don't need all of them.

Project Structure

VocalFlow/
├── app.py              # Main web application (Gradio UI + all TTS modes)
├── transcribe.py       # Whisper transcription module (also usable as CLI)
├── main.py             # CLI entry points (vocalflow, vocalflow-transcribe)
├── pyproject.toml      # Dependencies and build config
├── uv.lock             # Locked dependency versions
├── models/             # Cached model weights (gitignored, created at runtime)
├── audio/              # Generated speech files (gitignored, created at runtime)
├── voices/             # Saved voice prompt files (gitignored, created at runtime)
├── uploads/            # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/        # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml          # Lint + syntax checks on push/PR
    ├── release.yml     # Auto GitHub Release on version tags
    └── publish.yml     # Auto PyPI publish on version tags

Contributing

Contributions are welcome!

  1. Fork the repo and clone it
  2. Run uv sync to install dependencies
  3. Create a branch, make your changes, and make sure uvx ruff check app.py transcribe.py main.py passes
  4. Open a Pull Request against main

Areas where help is appreciated: streaming TTS, batch processing, audio post-processing, UI improvements, and testing.

Found a bug? Open an issue.

License

MIT

Acknowledgments

  • Qwen3-TTS by Alibaba Cloud — voice cloning, custom voice, and voice design models
  • OpenAI Whisper — speech recognition
  • Gradio — web UI framework
