Local TTS and audio transcription web app

VocalFlow

Local voice cloning, custom voices, voice design, and audio transcription — all in one web app.


What is VocalFlow?

VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides four core capabilities: three speech-synthesis modes powered by the Qwen3-TTS model family, plus audio transcription powered by OpenAI Whisper:

  • Voice Cloning — upload a short reference audio clip and clone the speaker's voice for new text
  • Custom Voice — use 9 premium preset speakers with instruction control for emotion, tone, and style
  • Voice Design — create entirely new voices from natural language descriptions, no reference audio needed
  • Audio Transcription — transcribe audio files with word-level timestamps

No cloud APIs. No data leaves your machine. Just your GPU doing the work.

Features

Voice Cloning (Base model)

  • Upload a reference audio clip (3-10 seconds) and clone the speaker's voice
  • Optionally provide a transcript of the reference audio for higher quality (ICL mode)
  • Choose between 1.7B (best quality) and 0.6B (faster) model sizes
  • Save cloned voice prompts as .pt files and reuse them without re-uploading audio
  • 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Auto language detection
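
The 3-10 second guideline for reference clips can be checked before a clip is uploaded. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and are not part of VocalFlow, and the check only handles PCM WAV files.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the feature list above
MAX_SECONDS = 10.0  # upper bound from the feature list above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_clip(path: str) -> bool:
    """True if the clip length falls inside the recommended 3-10 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

Other formats such as .mp3 or .flac would need a decoder (for example FFmpeg) before their duration can be measured this way.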

Custom Voice (9 preset speakers)

  • Speakers: Vivian, Serena, Uncle Fu, Dylan, Eric (Chinese), Ryan, Aiden (English), Ono Anna (Japanese), Sohee (Korean)
  • Instruction control (1.7B only): tell the model how to speak — "say it angrily", "whisper softly", "speak with excitement"
  • Choose between 1.7B (instructions + speakers) and 0.6B (speakers only) model sizes
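
The speaker list and the "instructions on 1.7B only" rule can be expressed as a small validation step. This is an illustrative sketch, not VocalFlow's code: the function name is hypothetical, and it assumes each parenthesised language in the list above applies to all the names preceding it.

```python
# Preset speakers and their native language, as read from the list above.
PRESET_SPEAKERS = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle Fu": "Chinese",
    "Dylan": "Chinese",
    "Eric": "Chinese",
    "Ryan": "English",
    "Aiden": "English",
    "Ono Anna": "Japanese",
    "Sohee": "Korean",
}


def check_custom_voice_request(speaker, model_size, instruction=None):
    """Validate a request and return the speaker's language.

    Raises ValueError for an unknown speaker, an unknown model size,
    or an instruction sent to the 0.6B model (which has no instruction control).
    """
    if speaker not in PRESET_SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if model_size not in ("1.7B", "0.6B"):
        raise ValueError(f"unknown model size: {model_size!r}")
    if instruction and model_size != "1.7B":
        raise ValueError("instruction control requires the 1.7B model")
    return PRESET_SPEAKERS[speaker]
```

For example, `check_custom_voice_request("Ryan", "1.7B", "whisper softly")` passes, while the same call with `"0.6B"` is rejected.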

Voice Design (create a voice from text)

  • Describe the voice you want in natural language: age, gender, tone, emotion, speaking style
  • No reference audio needed — the model invents a completely new voice
  • Uses the 1.7B VoiceDesign model

Audio Transcription

  • Upload any audio file (.wav, .mp3, .flac, .ogg)
  • Choose from 6 Whisper models (tiny through large, plus turbo) based on your speed/accuracy needs
  • Full transcript with word-level timestamps
  • Download as structured JSON
  • Auto-detect language or specify manually
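
As a rough sketch of how word-level timestamps can be turned into structured JSON, the code below uses the `openai-whisper` package directly (`whisper.load_model` and `transcribe(..., word_timestamps=True)`). The output layout and function names are assumptions made for illustration; the JSON that VocalFlow actually writes may differ.

```python
import json


def words_from_result(result: dict) -> list:
    """Flatten a Whisper result dict into a list of word entries."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({
                "word": w["word"].strip(),
                "start": round(w["start"], 2),
                "end": round(w["end"], 2),
            })
    return words


def transcribe_to_json(audio_path: str, model_size: str = "small") -> str:
    """Transcribe a file and return JSON text (hypothetical layout)."""
    # Imported lazily so words_from_result() works without whisper installed.
    import whisper

    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path, word_timestamps=True)
    payload = {
        "language": result["language"],
        "text": result["text"].strip(),
        "words": words_from_result(result),
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Keeping the flattening step separate from the model call makes it easy to test the JSON layout with a hand-written result dict.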

Smart Model Management

  • Automatic model switching — only one model is kept in VRAM at a time; switching between models (e.g. Voice Clone → Transcribe) automatically unloads the previous model and frees GPU memory
  • Manual unload — one-click button to release all GPU memory
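
The "one model in VRAM at a time" behaviour can be pictured as a single slot that drops its current occupant before loading a different model. The class below is a minimal sketch under that assumption, not VocalFlow's implementation; `SingleModelSlot` and the loader callables are hypothetical, and the CUDA cache is cleared only if PyTorch happens to be installed.

```python
import gc
from typing import Any, Callable, Optional


class SingleModelSlot:
    """Holds at most one loaded model; switching unloads the previous one."""

    def __init__(self) -> None:
        self.name: Optional[str] = None
        self.model: Any = None

    def unload(self) -> None:
        """Drop the current model and ask the runtime to release its memory."""
        self.model = None
        self.name = None
        gc.collect()
        try:
            import torch
        except ImportError:
            return  # no PyTorch available, so there is no CUDA cache to clear
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def get(self, name: str, loader: Callable[[], Any]) -> Any:
        """Return the model called `name`, loading it only if it is not current."""
        if self.name != name:
            self.unload()
            self.model = loader()
            self.name = name
        return self.model
```

Calling `get` twice with the same name reuses the loaded model; calling it with a different name triggers an unload first, which matches the Voice Clone → Transcribe example above.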

Interface

  • Clean dark-themed UI built with Gradio
  • Compact sidebar with live-updating generation history
  • Per-item download and delete buttons in the history sidebar
  • Color-coded history entries per mode (Clone, Custom Voice, Voice Design, Transcribe)
  • Inline audio playback
  • Flash Attention 2 support for faster inference

Models

Qwen3-TTS (Speech Synthesis)

Model                             Parameters   Purpose
Qwen3-TTS-12Hz-1.7B-Base          1.7B         Voice cloning from reference audio
Qwen3-TTS-12Hz-0.6B-Base          0.6B         Faster voice cloning
Qwen3-TTS-12Hz-1.7B-CustomVoice   1.7B         9 preset speakers + instruction control
Qwen3-TTS-12Hz-0.6B-CustomVoice   0.6B         9 preset speakers (no instructions)
Qwen3-TTS-12Hz-1.7B-VoiceDesign   1.7B         Create voices from text descriptions

  • All models run in bfloat16, using SDPA attention by default
  • Flash Attention 2 is used instead when it is available
  • Downloaded automatically on first use
  • Requires a CUDA GPU with at least 6 GB VRAM
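
The table above implies a simple mapping from a mode and a model size to a checkpoint name. A possible sketch is shown below; the mode keys ("clone", "custom", "design") and the function name are invented for illustration, and only the model names come from the table.

```python
# (mode, size) -> model name, copied from the Qwen3-TTS table above.
TTS_MODELS = {
    ("clone", "1.7B"): "Qwen3-TTS-12Hz-1.7B-Base",
    ("clone", "0.6B"): "Qwen3-TTS-12Hz-0.6B-Base",
    ("custom", "1.7B"): "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("custom", "0.6B"): "Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("design", "1.7B"): "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
}


def resolve_tts_model(mode: str, size: str = "1.7B") -> str:
    """Return the model name for a mode/size pair, or raise ValueError."""
    try:
        return TTS_MODELS[(mode, size)]
    except KeyError:
        raise ValueError(f"no {size} model for mode {mode!r}") from None
```

Because Voice Design exists only at 1.7B, `resolve_tts_model("design", "0.6B")` raises instead of silently falling back to another model.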

OpenAI Whisper (Transcription)

Model    Parameters   VRAM     Relative Speed (vs. large)
tiny     39M          ~1 GB    ~32x
base     74M          ~1 GB    ~16x
small    244M         ~2 GB    ~6x
medium   769M         ~5 GB    ~2x
large    1.5B         ~10 GB   1x
turbo    809M         ~6 GB    ~8x

  • Models are downloaded automatically on first use
  • Falls back to CPU if no CUDA GPU is detected
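
One way to use the VRAM column is to pick the most demanding Whisper model that still fits the memory you have free. The helper below is a hypothetical sketch: it treats a higher VRAM requirement as a stand-in for higher accuracy, which is an assumption rather than something the table states.

```python
from typing import Optional

# Approximate VRAM needs in GB from the table above, ordered smallest to largest.
WHISPER_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("turbo", 6),
    ("large", 10),
]


def pick_whisper_model(free_vram_gb: float) -> Optional[str]:
    """Return the last model in the list that fits, or None if none fits."""
    choice = None
    for name, needed_gb in WHISPER_VRAM_GB:
        if needed_gb <= free_vram_gb:
            choice = name
    return choice
```

A return value of `None` means no model fits on the GPU, in which case running on CPU is the remaining option.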

Requirements

  • Python 3.11
  • FFmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
  • SoX (winget install sox or install from sourceforge)
  • CUDA GPU with 6+ GB VRAM for TTS (transcription can run on CPU)
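
Before launching, it can help to confirm that the external tools are on your PATH. The snippet below is a small standard-library sketch; the function name is hypothetical, and it assumes the executables are named `ffmpeg` and `sox`.

```python
import shutil

REQUIRED_TOOLS = ("ffmpeg", "sox")  # command names assumed from the list above


def missing_tools(which=shutil.which) -> list:
    """Return the required command-line tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("FFmpeg and SoX found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.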

Quick Start

Install

pip install vocalflow

Launch the web app

vocalflow

Then open http://localhost:5001 in your browser.

From source

git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py

First run: The TTS and Whisper models will be downloaded automatically on first use. Only the model you select is downloaded — you don't need all of them.

Project Structure

VocalFlow/
├── app.py              # Main web application (Gradio UI + all TTS modes)
├── transcribe.py       # Whisper transcription module (also usable as CLI)
├── main.py             # CLI entry points (vocalflow, vocalflow-transcribe)
├── pyproject.toml      # Dependencies and build config
├── uv.lock             # Locked dependency versions
├── models/             # Cached model weights (gitignored, created at runtime)
├── audio/              # Generated speech files (gitignored, created at runtime)
├── voices/             # Saved voice prompt files (gitignored, created at runtime)
├── uploads/            # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/        # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml          # Lint + syntax checks on push/PR
    ├── release.yml     # Auto GitHub Release on version tags
    └── publish.yml     # Auto PyPI publish on version tags

Contributing

Contributions are welcome!

  1. Fork the repo and clone it
  2. Run uv sync to install dependencies
  3. Create a branch, make your changes, and make sure uvx ruff check app.py transcribe.py main.py passes
  4. Open a Pull Request against main

Areas where help is appreciated: streaming TTS, batch processing, audio post-processing, UI improvements, and testing.

Found a bug? Open an issue.

License

MIT

Acknowledgments

  • Qwen3-TTS by Alibaba Cloud — voice cloning, custom voice, and voice design models
  • OpenAI Whisper — speech recognition
  • Gradio — web UI framework
