Skip to main content

Comprehensive bidirectional voice-text CLI tool with Whisper and VibeVoice

Project description

VoiceBridge ๐ŸŽ™๏ธ โ†”๏ธ ๐Ÿ“

License: MIT Python 3.10+ Platform Support

The ultimate bidirectional voice-text bridge. Seamlessly convert speech to text and text to speech with professional-grade accuracy, real-time processing, and hotkey-driven workflows.

๐Ÿš€ What is VoiceBridge?

VoiceBridge eliminates the friction between voice and text. Whether you're transcribing interviews, creating accessible content, building voice-driven workflows, or simply need hands-free text input, VoiceBridge provides a powerful, flexible CLI that adapts to your needs.

Built on OpenAI's Whisper for world-class speech recognition and VibeVoice for natural text-to-speech synthesis.

๐ŸŽฏ What Problems Does It Solve?

  • Content Creators: Transcribe podcasts, interviews, and videos with timestamp precision
  • Accessibility: Convert text to natural speech for screen readers and audio content
  • Productivity: Voice-to-text note-taking with hotkey triggers during meetings
  • Developers: Integrate speech processing into applications and workflows
  • Researchers: Batch process audio data with confidence analysis and quality metrics
  • Writers: Dictate drafts and have articles read back with custom voices

โœจ Key Features

๐ŸŽค Speech-to-Text (STT)

  • Real-time transcription with hotkeys (F9 toggle/hold modes)
  • Interactive mode with press-and-hold 'r' to record
  • File processing (MP3, WAV, M4A, FLAC, OGG) with chunked processing
  • Batch transcription of entire directories with parallel workers
  • Resume capability for interrupted long transcriptions with session management
  • Streaming transcription with real-time output and live updates
  • GPU acceleration (CUDA/Metal) with automatic device detection
  • Memory optimization with configurable limits and streaming
  • Custom vocabulary management for domain-specific terms
  • Export formats: JSON, SRT, VTT, plain text, CSV with timestamps and confidence
  • Confidence analysis and quality assessment with detailed reporting
  • Webhook integration for external notifications and automation
  • Post-processing with spell check, grammar correction, and custom rules
  • Profile management for different use cases and configurations
  • Performance monitoring with comprehensive metrics and benchmarking

๐Ÿ—ฃ๏ธ Text-to-Speech (TTS)

  • High-quality voice synthesis with VibeVoice neural models
  • Multiple input modes: clipboard monitoring, text selection, direct input
  • Custom voice samples with automatic detection and voice cloning
  • Streaming and non-streaming modes for real-time or complete generation
  • Daemon mode for background processing and system integration
  • Hotkey controls for hands-free operation (F12 generate, Ctrl+Alt+S stop)
  • Voice management with sample validation and quality checks
  • GPU acceleration for faster synthesis and model loading
  • Configuration profiles for different voice settings and use cases
  • Audio output options: play immediately, save to file, or both

๐Ÿ”ง Advanced Processing

  • Audio enhancement: noise reduction, normalization, silence trimming, fade effects
  • Audio splitting: by duration, silence detection, or file size with smart segmentation
  • Confidence analysis and quality assessment with detailed statistics
  • Session management with progress tracking, resume capability, and persistence
  • Performance monitoring with GPU benchmarking, memory usage, and operation tracking
  • Webhook integration for external notifications and workflow automation
  • Profile management for different use cases and quick configuration switching
  • Vocabulary management for improved recognition of technical terms and proper nouns
  • Post-processing pipeline with spell check, grammar correction, and custom rules
  • API server for integration with external applications and services
  • Comprehensive testing with E2E test suites for all major functionality

๐Ÿš€ Quick Start

Installation

VoiceBridge uses uv for fast dependency management. Install uv first if you don't have it:

# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/yourusername/voicebridge.git
cd voicebridge

# Set up environment (CPU version)
make prepare

# Or with CUDA support for GPU acceleration
make prepare-cuda

# Or with system tray support
make prepare-tray

Basic Usage

# Listen for speech and transcribe with hotkeys
uv run python -m voicebridge stt listen

# Transcribe an audio file
uv run python -m voicebridge stt transcribe audio.mp3 --output transcript.txt

# Generate speech from text
uv run python -m voicebridge tts generate "Hello, this is VoiceBridge!"

# Start clipboard monitoring for TTS
uv run python -m voicebridge tts listen-clipboard

๐Ÿ“– Examples

1. Content Creator Workflow

# Transcribe a podcast episode with timestamps
uv run python -m voicebridge stt transcribe podcast_episode.mp3 \
  --format srt \
  --output episode_subtitles.srt \
  --language en

# Analyze transcription quality
uv run python -m voicebridge stt confidence analyze session_12345 --detailed

2. Accessibility Content

# Convert article to speech with custom voice
uv run python -m voicebridge tts generate \
  --voice en-Alice_woman \
  --output article_audio.wav \
  "$(cat article.txt)"

# Batch convert multiple documents
uv run python -m voicebridge stt batch-transcribe articles/ \
  --output-dir transcripts/ \
  --workers 4

3. Developer Integration

# Start TTS daemon for background processing
uv run python -m voicebridge tts daemon start --mode clipboard

# Set up webhook notifications
uv run python -m voicebridge stt webhook add https://api.example.com/transcription-complete

# Real-time transcription with streaming
uv run python -m voicebridge stt realtime \
  --chunk-duration 2.0 \
  --output-format live

4. Research & Analysis

# Process interview recordings with resumable capability
uv run python -m voicebridge stt listen-resumable interview.wav \
  --session-name "interview-2024-01-15" \
  --language en

# Export results in multiple formats
uv run python -m voicebridge stt export session session_12345 \
  --format json \
  --include-confidence \
  --output transcript.json

๐Ÿ› ๏ธ Local Development Setup

Prerequisites

  • Python 3.10+
  • uv (Python package manager)
  • FFmpeg (for audio processing)
  • CUDA (optional, for GPU acceleration)

Installation

# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/yourusername/voicebridge.git
cd voicebridge

# 3. Choose your setup:
make prepare        # CPU version
make prepare-cuda   # With CUDA support  
make prepare-tray   # With system tray support

# 4. Install system dependencies
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg

# macOS:
brew install ffmpeg

# Windows (with Chocolatey):
choco install ffmpeg

TTS Setup

VoiceBridge includes comprehensive text-to-speech capabilities powered by VibeVoice.

Prerequisites

  1. Install VibeVoice dependencies (if using local model):

    # Clone and install VibeVoice
    git clone https://github.com/WestZhang/VibeVoice.git
    cd VibeVoice
    pip install -e .
    
  2. Voice Samples: Voice samples are included in voices/ directory:

    voices/
    โ”œโ”€โ”€ en-Alice_woman.wav
    โ”œโ”€โ”€ en-Carter_man.wav
    โ”œโ”€โ”€ en-Frank_man.wav
    โ”œโ”€โ”€ en-Maya_woman.wav
    โ”œโ”€โ”€ en-Patrick.wav
    โ””โ”€โ”€ ... (additional voices)
    

Configuration

VoiceBridge works out-of-the-box with sensible defaults. Configuration can be set via:

  1. Config file (~/.config/whisper-cli/config.json):

    {
      "tts_enabled": true,
      "tts_config": {
        "model_path": "aoi-ot/VibeVoice-7B",
        "voice_samples_dir": "voices",
        "default_voice": "en-Alice_woman",
        "cfg_scale": 1.3,
        "inference_steps": 10,
        "tts_mode": "clipboard",
        "streaming_mode": "non_streaming",
        "output_mode": "play",
        "tts_toggle_key": "f11",
        "tts_generate_key": "f12",
        "tts_stop_key": "ctrl+alt+s",
        "sample_rate": 24000,
        "auto_play": true,
        "use_gpu": true,
        "max_text_length": 2000,
        "chunk_text_threshold": 500
      }
    }
    
  2. Command-line flags (override config file):

    # Generate with custom settings
    uv run python -m voicebridge tts generate "Hello world" \
      --voice en-Patrick \
      --streaming \
      --output speech.wav \
      --cfg-scale 1.5 \
      --inference-steps 15
    

Voice Sample Requirements

  • Format: WAV (recommended), MP3, FLAC
  • Sample Rate: 24kHz (recommended), 16kHz-48kHz supported
  • Channels: Mono (preferred)
  • Duration: 3-10 seconds
  • Quality: Clear, single speaker, minimal background noise
  • Naming: language-name_gender.wav (e.g., en-Alice_woman.wav)

Quick Test

# Test TTS with default settings
uv run python -m voicebridge tts generate "Hello, this is VoiceBridge text-to-speech!"

# List available voices
uv run python -m voicebridge tts voices

# Show current TTS configuration
uv run python -m voicebridge tts config show

Development Commands

make help           # Show all available commands
make lint           # Run ruff linting and formatting
make test           # Run all tests with coverage
make test-fast      # Quick tests without coverage
make test-unit      # Run only unit tests (exclude e2e)
make test-e2e       # Run comprehensive end-to-end tests
make test-e2e-smoke # Run quick E2E smoke tests
make test-e2e-stt   # Run STT E2E tests only
make test-e2e-tts   # Run TTS E2E tests only
make test-e2e-audio # Run audio E2E tests only
make test-e2e-gpu   # Run GPU E2E tests only
make test-e2e-api   # Run API E2E tests only
make clean          # Clean cache and temporary files

Configuration

# Show current STT configuration
uv run python -m voicebridge stt config show

# Set STT configuration values
uv run python -m voicebridge stt config set use_gpu true

# Show TTS configuration
uv run python -m voicebridge tts config show

# Set up profiles for different use cases
uv run python -m voicebridge stt profile save research-setup
uv run python -m voicebridge stt profile load research-setup

๐ŸŽฎ Usage Guide

Speech-to-Text (STT) Commands

Real-time Recognition

# Listen with hotkeys (F9 to start/stop)
uv run python -m voicebridge stt listen

# Interactive mode (press 'r' to record)
uv run python -m voicebridge stt interactive

# Global hotkey listener with custom key
uv run python -m voicebridge stt hotkey --key f9 --mode toggle

File Processing

# Transcribe single file
uv run python -m voicebridge stt transcribe audio.mp3 --output transcript.txt

# Batch process directory
uv run python -m voicebridge stt batch-transcribe /path/to/audio/ --workers 4

# Long file with resume capability
uv run python -m voicebridge stt listen-resumable large_file.wav --session-name "my-session"

# Real-time streaming
uv run python -m voicebridge stt realtime --chunk-duration 2.0 --output-format live

Session Management

# List all sessions
uv run python -m voicebridge stt sessions list

# Resume interrupted session
uv run python -m voicebridge stt sessions resume --session-name "my-session"

# Clean up old sessions
uv run python -m voicebridge stt sessions cleanup

# Delete specific session
uv run python -m voicebridge stt sessions delete session_id

Advanced Features

# Add vocabulary words for better recognition
uv run python -m voicebridge stt vocabulary add "technical,terms,here" --type technical

# Export with confidence analysis
uv run python -m voicebridge stt export session session_id --format srt --confidence

# Set up webhooks for notifications
uv run python -m voicebridge stt webhook add https://api.example.com/notify

Text-to-Speech (TTS) Commands

Basic Generation

# Generate speech from text
uv run python -m voicebridge tts generate "Hello, this is VoiceBridge!"

# Use specific voice and save to file
uv run python -m voicebridge tts generate "Hello world" --voice en-Alice_woman --output speech.wav

# List available voices
uv run python -m voicebridge tts voices

Background Monitoring

# Monitor clipboard for text changes
uv run python -m voicebridge tts listen-clipboard --streaming

# Monitor text selections (use hotkey to trigger)
uv run python -m voicebridge tts listen-selection

# Start TTS daemon for background processing
uv run python -m voicebridge tts daemon start --mode clipboard
uv run python -m voicebridge tts daemon status
uv run python -m voicebridge tts daemon stop

Configuration

# Show TTS settings
uv run python -m voicebridge tts config show

# Configure TTS settings
uv run python -m voicebridge tts config set --default-voice en-Alice_woman --cfg-scale 1.5

Audio Processing

# Get audio file information
uv run python -m voicebridge audio info audio.mp3

# List supported formats
uv run python -m voicebridge audio formats

# Split large audio file
uv run python -m voicebridge audio split recording.mp3 \
  --method duration \
  --chunk-duration 300

# Enhance audio quality
uv run python -m voicebridge audio preprocess input.wav output.wav \
  --noise-reduction 0.8 \
  --normalize \
  --trim-silence

# Test audio setup
uv run python -m voicebridge audio test

System & Performance

# Check GPU status and acceleration
uv run python -m voicebridge gpu status

# Benchmark GPU performance
uv run python -m voicebridge gpu benchmark --model base

# View STT performance statistics
uv run python -m voicebridge stt performance stats

# Manage active operations
uv run python -m voicebridge stt operations list
uv run python -m voicebridge stt operations cancel operation_id

API Server

# Start API server
uv run python -m voicebridge api start --host localhost --port 8000

# Check API status
uv run python -m voicebridge api status

# Get API information
uv run python -m voicebridge api info

# Stop API server
uv run python -m voicebridge api stop

๐Ÿ“‹ Complete Command Reference

VoiceBridge uses a hierarchical command structure with five main categories:

๐ŸŽค stt - Speech-to-Text Commands

stt listen              # Real-time transcription with hotkeys
stt interactive         # Press-and-hold 'r' to record mode
stt hotkey              # Global hotkey listener
stt transcribe          # Transcribe single audio file
stt batch-transcribe    # Batch process directory
stt listen-resumable    # Long file with resume capability
stt realtime            # Real-time streaming transcription

# Session Management
stt sessions list       # List all sessions
stt sessions resume     # Resume interrupted session
stt sessions cleanup    # Clean up old sessions
stt sessions delete     # Delete specific session

# Advanced Features
stt vocabulary add      # Add custom vocabulary
stt vocabulary remove   # Remove vocabulary
stt vocabulary list     # List vocabulary
stt vocabulary import   # Import from file
stt vocabulary export   # Export to file

stt export session      # Export session data
stt export formats      # List export formats

stt confidence analyze  # Analyze transcription confidence
stt confidence analyze-all # Analyze all sessions

stt postproc config     # Configure post-processing
stt postproc test       # Test post-processing

stt webhook add         # Add webhook notification
stt webhook remove      # Remove webhook
stt webhook list        # List webhooks
stt webhook test        # Test webhook

stt performance stats   # Performance statistics
stt operations list     # List active operations
stt operations cancel   # Cancel operation
stt operations status   # Check operation status

stt config show         # Show configuration
stt config set          # Set configuration

stt profile save        # Save configuration profile
stt profile load        # Load configuration profile
stt profile list        # List profiles
stt profile delete      # Delete profile

๐Ÿ—ฃ๏ธ tts - Text-to-Speech Commands

tts generate            # Generate speech from text
tts listen-clipboard    # Monitor clipboard changes
tts listen-selection    # Monitor text selections with hotkey
tts voices              # List available voices

# Daemon Management
tts daemon start        # Start TTS daemon
tts daemon stop         # Stop TTS daemon
tts daemon status       # Check daemon status

# Configuration
tts config show         # Show TTS configuration
tts config set          # Configure TTS settings

๐Ÿ”Š audio - Audio Processing Commands

audio info              # Show audio file information
audio formats           # List supported formats
audio split             # Split audio file into chunks
audio preprocess        # Enhance audio quality
audio test              # Test audio setup

๐Ÿ–ฅ๏ธ gpu - GPU and System Commands

gpu status              # Show GPU status
gpu benchmark           # Benchmark GPU performance

๐ŸŒ api - API Server Management

api start               # Start API server
api stop                # Stop API server
api status              # Check API status
api info                # Show API information

๐Ÿ—๏ธ Architecture

VoiceBridge follows hexagonal architecture principles:

voicebridge/
โ”œโ”€โ”€ domain/          # Core business logic and models
โ”œโ”€โ”€ ports/           # Interfaces and abstractions
โ”œโ”€โ”€ adapters/        # External integrations (Whisper, VibeVoice, etc.)
โ”œโ”€โ”€ services/        # Application services and orchestration
โ”œโ”€โ”€ cli/             # Command-line interface
โ””โ”€โ”€ tests/          # Comprehensive test suite

Key Components

  • Domain Layer: Core models, configurations, and business rules
  • Ports: Abstract interfaces for transcription, TTS, audio processing
  • Adapters: Concrete implementations for Whisper, VibeVoice, FFmpeg
  • Services: Orchestration, session management, performance monitoring
  • CLI: Typer-based command interface with sub-commands

๐Ÿค Contributing

We welcome contributions! Here's how to get started:

Development Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Install development dependencies: make install-dev
  4. Make your changes following our coding standards
  5. Test your changes: make test
  6. Lint your code: make lint
  7. Commit your changes: git commit -m 'Add amazing feature'
  8. Push to your branch: git push origin feature/amazing-feature
  9. Open a Pull Request

Coding Standards

  • Python 3.10+ with comprehensive type hints
  • uv for fast dependency management and virtual environments
  • Ruff for linting and formatting (replaces Black and isort)
  • Pytest for testing with >90% coverage target
  • Hexagonal architecture for new features and clean separation of concerns
  • Comprehensive documentation for public APIs and CLI commands
  • E2E testing for all major CLI workflows and functionality
  • Makefile for standardized development commands

Areas for Contribution

  • ๐ŸŽฏ New audio formats and processing capabilities
  • ๐ŸŒ Language support and localization
  • ๐Ÿ”ง Performance optimizations and GPU utilization
  • ๐Ÿ“ฑ Platform integrations (mobile, web interfaces)
  • ๐Ÿงช Test coverage and edge case handling
  • ๐Ÿ“š Documentation and usage examples
  • ๐ŸŽจ Voice samples and TTS improvements

Reporting Issues

Please use our issue templates:

  • ๐Ÿ› Bug Report: Describe the issue with reproduction steps
  • ๐Ÿ’ก Feature Request: Propose new functionality
  • ๐Ÿ“š Documentation: Report unclear or missing docs
  • ๐Ÿƒ Performance: Report slow or resource-intensive operations

๐Ÿ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • OpenAI Whisper - State-of-the-art speech recognition
  • VibeVoice - High-quality text-to-speech synthesis
  • FFmpeg - Comprehensive audio processing
  • Typer - Modern CLI framework
  • PyTorch - Machine learning infrastructure

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

voicebridge-0.0.1.tar.gz (276.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

voicebridge-0.0.1-py3-none-any.whl (327.9 kB view details)

Uploaded Python 3

File details

Details for the file voicebridge-0.0.1.tar.gz.

File metadata

  • Download URL: voicebridge-0.0.1.tar.gz
  • Upload date:
  • Size: 276.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for voicebridge-0.0.1.tar.gz
Algorithm Hash digest
SHA256 005afa7ee77e0d9bf8419906998b044bd95a5b24c3107ba84009f207bc25b08c
MD5 14a6a4d3fe609e9a3012b8b1f57fa844
BLAKE2b-256 7e28dbc561423e3de459c2776247b3de0d3506ad2337cf1c202987006c60c0fd

See more details on using hashes here.

File details

Details for the file voicebridge-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: voicebridge-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 327.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for voicebridge-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 25e8c9e614a7c63004b4db647a25713ed68407647ecba31a127f8c9ab90658e6
MD5 ae9e12bec2bf1bd618c8144de731b785
BLAKE2b-256 b35f5a245d72f580078692f48028638329bc220200604df40b6beb4cfc0e315f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page