# VocalFlow

Local text-to-speech and audio transcription — all in one web app.

## What is VocalFlow?

VocalFlow is a self-hosted web application that runs entirely on your local machine. It provides two core capabilities:

- **Text-to-Speech (TTS)** — generate speech with voice design or voice cloning
- **Audio Transcription** — transcribe audio files with word-level timestamps

No cloud APIs. No data leaves your machine. Just your GPU doing the work.
## Features

### Speech Generation (TTS)

| Mode | Description |
|---|---|
| Voice Design | Describe the voice you want in natural language (age, accent, tone, emotion) and the model generates it |
| Voice Cloning | Upload a short reference audio clip and clone the speaker's voice for new text |

- 10+ languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Auto language detection
- Download generated audio as `.wav`
### Audio Transcription

- Upload any audio file (`.wav`, `.mp3`, `.flac`, `.ogg`)
- Choose from 6 Whisper model sizes (tiny → large) based on your speed/accuracy needs
- Get a full transcript with word-level timestamps
- Copy the transcript to the clipboard or download it as structured JSON
- Auto-detect language or specify manually
### Interface

- Clean dark-themed UI built with FastHTML + HTMX
- Sidebar with live-updating generation history
- Inline audio playback from history
- One-click downloads with human-readable filenames
## Models

VocalFlow uses two open-source model families:

### Qwen3-TTS (Text-to-Speech)

| Model | Parameters | Purpose |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | Voice cloning from reference audio |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Voice generation from text descriptions |

- Runs in `bfloat16` with SDPA attention for efficient inference
- Models are downloaded automatically on first use (~3.5 GB each)
- Requires a CUDA GPU with at least 6 GB VRAM
### OpenAI Whisper (Transcription)

| Model | Parameters | VRAM | Relative Speed |
|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x realtime |
| base | 74M | ~1 GB | ~16x realtime |
| small | 244M | ~2 GB | ~6x realtime |
| medium | 769M | ~5 GB | ~2x realtime |
| large | 1.5B | ~10 GB | ~1x realtime |
| turbo | 1.5B | ~6 GB | ~8x realtime |

- Models are downloaded automatically on first use
- Falls back to CPU if no CUDA GPU is detected
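If you want to pick a model size programmatically, the VRAM figures above can drive a small helper. This is a hypothetical sketch, not part of VocalFlow's API — `pick_whisper_model` and `WHISPER_VRAM_GB` are illustrative names:

```python
# VRAM requirements (GB) from the table above — approximate figures.
WHISPER_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}

def pick_whisper_model(vram_gb: float) -> str:
    """Return the largest Whisper model that fits the given VRAM budget."""
    # Check from most to least demanding and take the first that fits.
    for name in ("large", "turbo", "medium", "small", "base", "tiny"):
        if WHISPER_VRAM_GB[name] <= vram_gb:
            return name
    return "tiny"  # smallest model; also the natural CPU-fallback choice

print(pick_whisper_model(6))    # turbo
print(pick_whisper_model(1.5))  # base
```

Whether `turbo` or `large` is "better" at 10+ GB depends on whether you value speed or accuracy; the ordering above is just one reasonable choice.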
## Requirements

- Python 3.11+
- CUDA GPU with 6+ GB VRAM (for TTS; transcription can run on CPU)
- FFmpeg — required by Whisper for audio processing
- uv package manager (recommended)

### Installing FFmpeg

```bash
# Windows
winget install ffmpeg

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
```
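You can confirm FFmpeg is reachable before launching the app with a standard-library check (`has_ffmpeg` is an illustrative name, not a VocalFlow function):

```python
import shutil

def has_ffmpeg() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if not has_ffmpeg():
    print("FFmpeg not found on PATH - install it before transcribing audio")
```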
## Quick Start

### Option 1: From source (recommended)

```bash
git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
uv run app.py
```

### Option 2: From PyPI

```bash
pip install vocalflow

# Launch the web app
vocalflow
```

### Option 3: Transcription only (CLI)

```bash
pip install vocalflow

# Transcribe an audio file
vocalflow-transcribe audio.mp3 --model medium --language en
```

Then open http://localhost:5001 in your browser.

**First run:** The TTS and Whisper models will be downloaded automatically. This may take a few minutes depending on your connection.
## Project Structure

```
VocalFlow/
├── app.py            # Main web application (FastHTML + routes + UI)
├── transcribe.py     # Whisper transcription module (also usable as CLI)
├── pyproject.toml    # Dependencies and build config
├── uv.lock           # Locked dependency versions
├── audio/            # Generated speech files (gitignored, created at runtime)
├── uploads/          # Temporary uploaded files (gitignored, created at runtime)
├── transcripts/      # Transcription JSON files (gitignored, created at runtime)
└── .github/workflows/
    ├── ci.yml        # Lint + syntax checks on push/PR
    ├── release.yml   # Auto GitHub Release on version tags
    └── publish.yml   # Auto PyPI publish on version tags
```
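Since `audio/`, `uploads/`, and `transcripts/` are gitignored and created at runtime, a fresh clone needs them created on startup. A minimal sketch of that pattern with `pathlib` (illustrative only — `ensure_runtime_dirs` is not VocalFlow's actual code):

```python
from pathlib import Path

# Runtime directories from the structure above.
RUNTIME_DIRS = ("audio", "uploads", "transcripts")

def ensure_runtime_dirs(root: Path) -> list[Path]:
    """Create each runtime directory under `root` if missing; return the paths."""
    paths = [root / name for name in RUNTIME_DIRS]
    for p in paths:
        p.mkdir(parents=True, exist_ok=True)  # no-op if it already exists
    return paths
```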
## CLI Transcription

The transcriber can also be used standalone from the command line:

```bash
# Basic usage
uv run python transcribe.py audio.mp3

# With options
uv run python transcribe.py audio.mp3 --model medium --language en --device cuda
```

Output is a JSON file with word-level timestamps:

```json
[
  { "word": "Hello", "start": 0.0, "end": 0.32 },
  { "word": "world", "start": 0.34, "end": 0.68 }
]
```
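The word-level JSON is easy to post-process. For example, here is a sketch that groups words into timed subtitle lines using only the standard library (the eight-word grouping is an arbitrary choice for illustration, not something VocalFlow prescribes):

```python
import json

def to_lines(words, max_words=8):
    """Group word entries into (start, end, text) tuples for subtitles."""
    lines = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        lines.append((chunk[0]["start"], chunk[-1]["end"], text))
    return lines

# Parse the example output shown above.
transcript = json.loads(
    '[{"word": "Hello", "start": 0.0, "end": 0.32},'
    ' {"word": "world", "start": 0.34, "end": 0.68}]'
)
print(to_lines(transcript))  # [(0.0, 0.68, 'Hello world')]
```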
## GPU Verification

To verify your GPU is detected:

```bash
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
```
## Contributing

Contributions are welcome! Here's how to get started:

### Setting up for development

```bash
git clone https://github.com/0xBinayak/VocalFlow.git
cd VocalFlow
uv sync
```

### Code style

We use ruff for linting:

```bash
uvx ruff check app.py transcribe.py
```

CI runs this automatically on every push and PR.

### Making changes

1. Fork the repository
2. Create a branch for your feature or fix: `git checkout -b feature/my-feature`
3. Make your changes and ensure lint passes: `uvx ruff check app.py transcribe.py`
4. Commit with a clear message describing what and why
5. Open a Pull Request against `main`
### Areas where help is appreciated

- Additional TTS model backends (e.g., Bark, StyleTTS2, F5-TTS)
- Speaker diarization in transcription
- Batch processing (multiple files at once)
- Audio post-processing (noise reduction, normalization)
- UI improvements and accessibility
- Documentation and tutorials
- Testing and bug reports
- Packaging for different platforms

### Reporting issues

Found a bug or have a feature request? Open an issue with:

- What you expected vs. what happened
- Steps to reproduce
- Your OS, Python version, and GPU info
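One way to gather that environment information in a single step (an illustrative snippet, not a VocalFlow command; GPU details are reported only if `torch` happens to be installed):

```python
import platform
import sys

def env_report() -> str:
    """Collect OS and Python details for a bug report."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {sys.version.split()[0]}",
    ]
    try:
        import torch  # optional: present only if the project venv is active
        lines.append(f"CUDA available: {torch.cuda.is_available()}")
    except ImportError:
        lines.append("CUDA available: unknown (torch not installed)")
    return "\n".join(lines)

print(env_report())
```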
## Releasing

Maintainers can create a new release by bumping the version in `pyproject.toml` and tagging:

```bash
# 1. Update version in pyproject.toml
# 2. Update lock file
uv lock

# 3. Commit, tag, and push
git add pyproject.toml uv.lock
git commit -m "Bump version to X.Y.Z"
git tag vX.Y.Z
git push origin main --tags
```

This automatically:

- Runs CI checks
- Creates a GitHub Release with build artifacts
- Publishes to PyPI
## License

MIT

## Acknowledgments

- Qwen3-TTS by Alibaba Cloud — text-to-speech models
- OpenAI Whisper — speech recognition
- FastHTML — Python web framework
- HTMX — frontend interactivity