AI-powered Text-to-Speech web application with multiple provider support
AI Voice Generator
Transform text into high-quality speech using multiple AI-powered TTS services with an intuitive ChatGPT-inspired interface.
AI Voice Generator is a production-ready web application that converts text to speech using multiple TTS providers, including OpenAI, ElevenLabs, Google Cloud, and Gemini. It features intelligent text optimization, advanced voice controls, persistent history, and a sleek dark-themed interface.
Table of Contents
- Key Features
- Quick Start
- Installation
- Configuration
- Usage Guide
- Testing
- Architecture
- API Reference
- Customization
- Performance
- Security
- Contributing
- License
- Documentation
Key Features
Multi-Provider TTS Support
- OpenAI: High-quality voices with speed control (tts-1, tts-1-hd)
- ElevenLabs: Premium voices with advanced settings (stability, similarity, style)
- Google Cloud: 200+ voices in 40+ languages
- Gemini: Preview TTS with unique voice options
Intelligent Text Enhancement
- AI-powered optimization using OpenAI GPT or Gemini
- Multiple enhancement modes: Default, Shorter, Longer, Retry
- TTS-focused improvements: Grammar, flow, conversational tone
- Basic optimization for all services (abbreviation expansion, formatting)
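The basic optimization pass (abbreviation expansion and formatting) can be approximated with plain regular expressions. This is a minimal sketch, not the app's actual implementation; the `ABBREVIATIONS` table and the `basic_optimize` name are hypothetical.

```python
import re

# Hypothetical abbreviation table; a real deployment would use a larger list.
ABBREVIATIONS = {
    "e.g.": "for example",
    "i.e.": "that is",
    "vs.": "versus",
    "Dr.": "Doctor",
    "approx.": "approximately",
}


def basic_optimize(text: str) -> str:
    """Expand common abbreviations and normalize whitespace for TTS."""
    for abbr, expansion in ABBREVIATIONS.items():
        # Only match the abbreviation when it is not glued to other word characters
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, expansion, text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()
```

Expanding abbreviations before synthesis keeps TTS engines from reading "e.g." letter by letter.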
Advanced Voice Controls
- Service-specific settings: Speed (OpenAI), Voice parameters (ElevenLabs)
- Real-time parameter adjustment with instant feedback
- Voice search and filtering across all providers
- Dynamic voice loading based on service availability
Emotion-Aware TTS
- Intelligent emotion detection from text content
- 7 emotion types: Joy, Sadness, Anger, Fear, Surprise, Calm, Neutral
- Context-sensitive voice modulation based on detected emotion
- Confidence scoring with manual override options
- Auto-analysis with real-time text processing
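How the app detects emotion is not described here (it may well use an LLM). Purely as an illustration of emotion detection with a confidence score, a simple keyword-counting heuristic could look like this; the `EMOTION_KEYWORDS` lexicon and `detect_emotion` function are hypothetical.

```python
import re
from collections import Counter
from typing import Tuple

# Hypothetical cue-word lexicon; "neutral" is the fallback when nothing matches.
EMOTION_KEYWORDS = {
    "joy": {"happy", "excited", "amazing", "wonderful", "great"},
    "sadness": {"sad", "sorry", "unfortunately", "miss", "lost"},
    "anger": {"angry", "furious", "unacceptable", "hate"},
    "fear": {"afraid", "scared", "worried", "nervous"},
    "surprise": {"wow", "unexpected", "suddenly", "unbelievable"},
    "calm": {"relax", "gently", "peaceful", "quiet"},
}


def detect_emotion(text: str) -> Tuple[str, float]:
    """Return (emotion, confidence) using simple keyword counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for emotion, cues in EMOTION_KEYWORDS.items():
            if word in cues:
                counts[emotion] += 1
    if not counts:
        return "neutral", 0.0  # no emotional cue words found
    emotion, hits = counts.most_common(1)[0]
    # Confidence: share of all cue-word hits that belong to the winning emotion
    return emotion, hits / sum(counts.values())
```

A low confidence value is a natural trigger for the manual override mentioned above.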
Real-Time Streaming
- Chunk-based audio generation for long texts
- Live progress tracking with detailed status updates
- Streaming-optimized for OpenAI TTS services
- Cancellable operations with real-time feedback
- Enhanced UX for processing large content
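Chunk-based generation needs a way to split long text into pieces each provider will accept; OpenAI's speech endpoint, for example, caps input at 4096 characters per request. A minimal sketch, assuming chunks should break at sentence boundaries where possible (`split_into_chunks` is a hypothetical name):

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 4096) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized in turn, which is what makes per-chunk progress updates and cancellation possible.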
Persistent Data Management
- SQLite database for reliable history storage
- Automatic file management in organized outputs directory
- Server-side audio storage with secure file serving
- Comprehensive generation metadata tracking
Premium User Experience
- ChatGPT-inspired dark theme with Tailwind CSS
- Responsive design for desktop and mobile
- Auto-playing history with visual indicators
- Real-time status updates and notifications
- Drag-and-drop file support for text input
Developer-Friendly
- Comprehensive test suite (unit, integration, performance)
- Provider pattern architecture for easy service addition
- RESTful API design with clear documentation
- Environment-based configuration with sample files
- Extensive error handling and logging
Quick Start
Installation from PyPI (Recommended)
```bash
# Install from PyPI
pip install aivoicegen
# Run the application
aivoicegen
# Or run directly with Python
python -m aivoicegen
```
Development Installation
Prerequisites
- Python 3.8+
- Git
- FFmpeg (for audio processing)
Setup
```bash
# Clone the repository
git clone https://github.com/yourusername/aivoicegen.git
cd aivoicegen
# Set up virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
# Install in development mode
pip install -e .
# Configure environment
cp .env.sample .env
# Edit .env with your API keys
# Run the application
aivoicegen
```
Open http://127.0.0.1:5000 in your browser and start generating speech!
Installation
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.8+ | 3.11+ |
| RAM | 512MB | 2GB+ |
| Storage | 1GB | 5GB+ |
| Network | Basic | Broadband |
Detailed Installation
1. Clone Repository
```bash
git clone https://github.com/yourusername/aivoicegen.git
cd aivoicegen
```
2. Virtual Environment
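```bash
# Create environment
python -m venv venv
# Activate (choose your platform)
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows
# Conda users: create and activate a named environment instead of the venv above
conda create -n aivoicegen python=3.11
conda activate aivoicegen
```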
3. Install Dependencies
```bash
# Core dependencies
pip install -r requirements.txt
# Development dependencies (optional)
pip install -r requirements-dev.txt
```
4. Install FFmpeg
macOS:
```bash
brew install ffmpeg
```
Ubuntu/Debian:
```bash
sudo apt update && sudo apt install ffmpeg
```
Windows:
- Download a build from the FFmpeg website
- Extract it to `C:\ffmpeg`
- Add `C:\ffmpeg\bin` to your PATH

Verify installation:
```bash
ffmpeg -version
```
Configuration
Environment Setup
Copy the sample environment file and configure your API keys:
```bash
cp .env.sample .env
```
API Provider Configuration
OpenAI Configuration
```bash
OPENAI_API_KEY=sk-your-openai-api-key-here
```
Features:
- 6 high-quality voices (alloy, echo, fable, onyx, nova, shimmer)
- 2 models (tts-1 for speed, tts-1-hd for quality)
- Speed control (0.25x - 4x)
- Text optimization with GPT
Setup:
- Visit OpenAI Platform
- Create API key
- Add it to your `.env` file
Pricing: ~$0.015 per 1K characters with tts-1 (tts-1-hd costs about twice as much)
ElevenLabs Configuration
```bash
ELEVENLABS_API_KEY=your-elevenlabs-api-key-here
```
Features:
- Premium voice cloning
- Advanced voice controls (stability, similarity, style)
- Multilingual support
- Custom voice training
Setup:
- Visit ElevenLabs
- Get API key from profile
- Add it to your `.env` file
Pricing: Freemium model, paid plans available
Google Cloud Configuration
```bash
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
Features:
- 200+ voices in 40+ languages
- WaveNet, Neural2, and Standard voices
- SSML support
- High-quality synthesis
Setup:
- Create Google Cloud Project
- Enable Text-to-Speech API
- Create service account with TTS permissions
- Download JSON key file
- Set its path in `.env`
Pricing: $4 per 1M characters for Standard voices (WaveNet and Neural2 voices cost more); a monthly free tier is included
Gemini Configuration
```bash
GEMINI_API_KEY=your-gemini-api-key-here
```
Features:
- Preview TTS service
- Unique voice options (Kore, Puck, Charon, Fenrir, Aoede)
- Text optimization with Gemini Pro
- Integration with Google AI ecosystem
Setup:
- Visit Google AI Studio
- Generate API key
- Add it to your `.env` file
Status: Preview (subject to changes)
Optional Configuration
```bash
# Application settings
FLASK_ENV=development
FLASK_DEBUG=true
PORT=5000
# Database
DATABASE_PATH=./history.db
# Cache settings
MAX_CACHE_SIZE=100
# Storage
OUTPUTS_DIR=outputs
# Security
SECRET_KEY=your-secret-key-here
```
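A minimal sketch of how these optional settings might be read at startup, assuming the app takes them from the environment and uses the sample values as defaults (except debug, which is off when unset). The `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Settings:
    port: int
    database_path: str
    max_cache_size: int
    outputs_dir: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read optional settings, falling back to defaults when a variable is unset."""
    return Settings(
        port=int(env.get("PORT", "5000")),
        database_path=env.get("DATABASE_PATH", "./history.db"),
        max_cache_size=int(env.get("MAX_CACHE_SIZE", "100")),
        outputs_dir=env.get("OUTPUTS_DIR", "outputs"),
        debug=env.get("FLASK_DEBUG", "false").lower() in {"1", "true", "yes"},
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to unit test.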
Usage Guide
Basic Workflow
- Enter Text: Type or upload your content
- Enhance (Optional): Use AI to optimize text for speech
- Configure Voice: Select service and voice options
- Adjust Settings: Fine-tune voice parameters
- Generate: Create high-quality speech
- Review History: Access previous generations
Text Enhancement Modes
| Mode | Purpose | Best For |
|---|---|---|
| Default | Balanced enhancement | General use |
| Shorter | Concise version (50-70% length) | Quick summaries |
| Longer | Expanded version (130-150% length) | Detailed narration |
| Retry | Alternative enhancement | Finding better phrasing |
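The length targets in the table translate directly into character ranges. How the app enforces them (for example, through the optimization prompt) is not documented, so this sketch only computes the target range, which could be used to check an enhanced result; `MODE_LENGTH_RATIOS` and `target_length` are hypothetical names.

```python
from typing import Optional, Tuple

# Length ratios from the table above; None means the mode has no explicit target.
MODE_LENGTH_RATIOS = {
    "default": None,
    "shorter": (0.5, 0.7),
    "longer": (1.3, 1.5),
    "retry": None,
}


def target_length(text: str, mode: str) -> Optional[Tuple[int, int]]:
    """Return the (min, max) character count an enhanced text should aim for."""
    ratios = MODE_LENGTH_RATIOS[mode.lower()]
    if ratios is None:
        return None
    low, high = ratios
    return round(len(text) * low), round(len(text) * high)
```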
Advanced Voice Settings
OpenAI Settings
- Speech Speed: 0.25x (very slow) to 4x (very fast)
- Model Selection: tts-1 (fast) vs tts-1-hd (high quality)
ElevenLabs Settings
- Stability: 0-1 (higher values give more consistent delivery, lower values more variation)
- Similarity Boost: 0-1 (how closely the output matches the original voice)
- Style: 0-1 (higher values are more expressive, 0 is neutral)
- Speaker Boost: Enable for clarity improvement
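Keeping user input inside the documented ranges before calling a provider avoids rejected requests. A minimal sketch: the field names follow the ElevenLabs `voice_settings` object (`stability`, `similarity_boost`, `style`, `use_speaker_boost`) and OpenAI's `speed` range of 0.25 to 4.0, but how the app actually assembles its payloads is an assumption, and the helper names are hypothetical.

```python
from typing import Dict


def clamp(value: float, low: float, high: float) -> float:
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def openai_speed(speed: float) -> float:
    # OpenAI TTS accepts speeds from 0.25 to 4.0
    return clamp(speed, 0.25, 4.0)


def elevenlabs_voice_settings(stability: float, similarity_boost: float,
                              style: float = 0.0, speaker_boost: bool = True) -> Dict:
    # Keys mirror the ElevenLabs voice_settings object; numeric values are 0-1
    return {
        "stability": clamp(stability, 0.0, 1.0),
        "similarity_boost": clamp(similarity_boost, 0.0, 1.0),
        "style": clamp(style, 0.0, 1.0),
        "use_speaker_boost": speaker_boost,
    }
```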
File Management
- Auto-save: All generated audio is saved to the `outputs/` directory
- Persistent History: SQLite database tracks all generations
- Organized Storage: Files named with timestamps and service info
- Easy Access: Direct play/download from history sidebar
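Generated files follow a title, service, timestamp pattern (the API example below returns `My_Audio_openai_20231201123456.mp3`). A minimal sketch that reproduces that pattern and also strips characters that could escape the outputs directory; `build_output_filename` is a hypothetical name, and the exact sanitizing rules the app uses are an assumption.

```python
import re
from datetime import datetime
from typing import Optional


def build_output_filename(title: str, service: str, extension: str = "mp3",
                          now: Optional[datetime] = None) -> str:
    """Build names like My_Audio_openai_20231201123456.mp3."""
    now = now or datetime.now()
    # Turn whitespace into underscores, then keep only letters, digits, "_" and "-"
    safe_title = re.sub(r"\s+", "_", title.strip())
    safe_title = re.sub(r"[^\w-]", "", safe_title) or "audio"
    return f"{safe_title}_{service}_{now:%Y%m%d%H%M%S}.{extension}"
```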
Testing
Test Suite Overview
Our comprehensive test suite ensures reliability and performance:
```bash
# Run all tests
python -m unittest discover tests
# Run specific test categories
python tests/test_app.py          # Core functionality
python tests/test_integration.py  # End-to-end workflows
python tests/test_performance.py  # Performance benchmarks
# Run with verbose output
python -m unittest discover tests -v
```
Test Categories
Unit Tests (test_app.py)
- TTS provider functionality
- Text optimization algorithms
- Database operations
- API endpoint validation
- Error handling
- Configuration management
Integration Tests (test_integration.py)
- End-to-end workflows
- Service integration
- Data persistence
- Concurrent operations
- Error recovery
Performance Tests (test_performance.py)
- Response time benchmarks
- Memory usage monitoring
- Scalability testing
- Cache effectiveness
- Database performance
Continuous Integration
```yaml
# Example GitHub Actions workflow
name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y ffmpeg
          pip install -r requirements.txt
      - name: Run tests
        run: python -m unittest discover tests
```
Architecture
System Overview
```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Frontend     │     │    Flask API    │     │  TTS Services   │
│    (HTML/JS)    │◄───►│    (Python)     │◄───►│  (OpenAI, etc)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Browser     │     │    SQLite DB    │     │   File System   │
│     Storage     │     │    (History)    │     │  (Audio Files)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Provider Pattern
Each TTS service implements a consistent interface:
```python
from typing import Dict, List, Tuple


class TTSProvider:
    def __init__(self):
        """Initialize with API credentials."""

    def list_voices(self) -> List[Dict]:
        """Return available voices."""
        raise NotImplementedError

    def generate_speech(self, text: str, title: str, options: Dict) -> Tuple[bytes, str]:
        """Generate audio and return (content, filename)."""
        raise NotImplementedError
```
Key Components
Backend (Flask)
- Provider Management: Dynamic service loading and configuration
- API Endpoints: RESTful design with comprehensive error handling
- Database Layer: SQLite with connection pooling and transactions
- File Management: Organized storage with secure serving
- Caching System: In-memory LRU cache for performance
Frontend (Vanilla JS + Tailwind)
- Responsive Design: Mobile-first approach with dark theme
- Real-time Updates: Dynamic service/voice loading
- State Management: Local state with server synchronization
- User Experience: Intuitive controls with immediate feedback
Data Layer
- SQLite Database: Lightweight, serverless, ACID-compliant
- File System: Organized audio storage with metadata
- Caching: Multi-level caching strategy
- Configuration: Environment-based with validation
Database Schema
```sql
CREATE TABLE generation_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    service TEXT NOT NULL,
    voice TEXT NOT NULL,
    text_snippet TEXT,
    filename TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    file_path TEXT
);
```
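Writing to and reading from this table with Python's built-in `sqlite3` module might look like the following. This is a sketch, not the app's data layer: the helper names are hypothetical, and storing the first 100 characters as `text_snippet` is an assumption. Note the parameterized `?` placeholders, which are what prevent SQL injection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS generation_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    service TEXT NOT NULL,
    voice TEXT NOT NULL,
    text_snippet TEXT,
    filename TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    file_path TEXT
)
"""


def record_generation(conn, title, service, voice, text, filename, file_path):
    # Values are bound as parameters, never interpolated into the SQL string
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO generation_history"
            " (title, service, voice, text_snippet, filename, file_path)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (title, service, voice, text[:100], filename, file_path),
        )


def recent_history(conn, limit=20):
    # Newest first; ordering by id avoids ties between equal timestamps
    return conn.execute(
        "SELECT title, service, voice, filename FROM generation_history"
        " ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
```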
API Reference
Endpoints Overview
| Endpoint | Method | Purpose | Authentication |
|---|---|---|---|
| `/` | GET | Serve frontend | None |
| `/services` | GET | List available TTS services | None |
| `/voices/<service>` | GET | Get voices for service | None |
| `/generate` | POST | Generate speech | None |
| `/generate-stream` | POST | Generate speech with streaming | None |
| `/analyze-emotion` | POST | Analyze text emotion | None |
| `/optimize` | POST | Enhance text | None |
| `/history` | GET | Get generation history | None |
| `/outputs/<filename>` | GET | Serve audio files | None |
| `/api-config` | GET/POST | Manage API configuration | None |
Detailed API Documentation
Generate Speech
```http
POST /generate
Content-Type: application/json

{
  "service": "openai",
  "title": "My Audio",
  "text": "Hello, world!",
  "voice_model_options": {
    "api_params": {
      "voice": "alloy",
      "model": "tts-1"
    },
    "speed": 1.0
  }
}
```
Response:
```http
200 OK
Content-Type: audio/mp3
Content-Disposition: attachment; filename="My_Audio_openai_20231201123456.mp3"

[Binary audio data]
```
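A minimal client sketch for this endpoint using only the standard library, assuming the development server from Quick Start is running at `http://127.0.0.1:5000`. The function names are hypothetical; the payload mirrors the request shown above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # local development server (assumption)


def build_generate_request(text: str, title: str, service: str = "openai",
                           voice: str = "alloy", model: str = "tts-1",
                           speed: float = 1.0) -> urllib.request.Request:
    """Build a POST /generate request matching the documented payload."""
    payload = {
        "service": service,
        "title": title,
        "text": text,
        "voice_model_options": {
            "api_params": {"voice": voice, "model": model},
            "speed": speed,
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_to_file(text: str, title: str, path: str) -> None:
    # Sends the request and writes the returned audio bytes to disk
    with urllib.request.urlopen(build_generate_request(text, title)) as response:
        with open(path, "wb") as f:
            f.write(response.read())
```

For example, `generate_to_file("Hello, world!", "My Audio", "hello.mp3")` saves the response body locally.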
Optimize Text
```http
POST /optimize
Content-Type: application/json

{
  "text": "This is a test about AI",
  "service": "openai",
  "mode": "default"
}
```
Response:
```json
{
  "optimized_text": "Here's a test story about artificial intelligence and its fascinating applications."
}
```
Get Services
```http
GET /services
```
Response:
```json
{
  "openai": {
    "label": "OpenAI",
    "configured": true,
    "voices_endpoint": "/voices/openai",
    "voice_type": "static",
    "unavailable_reason": null
  },
  "elevenlabs": {
    "label": "ElevenLabs",
    "configured": false,
    "unavailable_reason": "API key not configured"
  }
}
```
Streaming Speech Generation
```http
POST /generate-stream
Content-Type: application/json

{
  "service": "openai",
  "title": "My Project",
  "text": "Long text content for streaming generation...",
  "voice_model_options": {
    "api_params": {"voice": "alloy", "model": "tts-1"},
    "speed": 1.0
  }
}
```
Response: a text stream with one JSON object per line:
```json
{"type": "status", "message": "Processing 3 text chunks...", "progress": 0, "total_chunks": 3}
{"type": "status", "message": "Generating audio for chunk 1/3...", "progress": 20}
{"type": "chunk_complete", "chunk_index": 0, "message": "Chunk 1/3 completed"}
{"type": "complete", "filename": "audio_file.mp3", "message": "Speech generated successfully!"}
```
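A client can read the stream line by line and react to each message type. A minimal sketch, assuming one JSON object per line as shown; the `"error"` message type is an assumption (it is not documented above), and `consume_generation_stream` is a hypothetical name.

```python
import json
from typing import Iterable, Optional


def consume_generation_stream(lines: Iterable[str]) -> Optional[str]:
    """Print progress messages from /generate-stream and return the final filename."""
    filename = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        message = json.loads(line)
        kind = message.get("type")
        if kind in ("status", "chunk_complete"):
            print(message.get("message", ""))
        elif kind == "complete":
            filename = message.get("filename")
        elif kind == "error":  # assumed message type, not documented above
            raise RuntimeError(message.get("message", "generation failed"))
    return filename
```

With `urllib`, iterating over the response yields `bytes` lines, so pass `(raw.decode("utf-8") for raw in response)`.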
Emotion Analysis
```http
POST /analyze-emotion
Content-Type: application/json

{
  "text": "I'm so excited about this amazing opportunity!",
  "service": "openai",
  "voice_model_options": {
    "api_params": {"voice": "alloy"},
    "speed": 1.0
  }
}
```
Response:
```json
{
  "emotion": "joy",
  "confidence": 0.95,
  "original_params": {
    "api_params": {"voice": "alloy"},
    "speed": 1.0
  },
  "suggested_params": {
    "api_params": {"voice": "alloy"},
    "speed": 1.1
  },
  "emotion_description": "Positive, upbeat, and energetic tone"
}
```
Customization
Theming
The application uses a ChatGPT-inspired dark theme built with Tailwind CSS:
```css
/* Custom color palette */
:root {
  --gpt-dark: #202123;
  --gpt-dark-secondary: #343541;
  --gpt-dark-tertiary: #40414F;
  --gpt-accent: #10A37F;
  --gpt-light-gray: #D1D5DB;
}
```
Adding TTS Providers
- Create Provider Class:
```python
class NewTTSProvider:
    def __init__(self):
        # Initialize with API credentials
        pass

    def list_voices(self):
        # Return a list of voice dicts
        return []

    def generate_speech(self, text, title, options):
        # Call the provider's API here and return (audio_bytes, filename)
        raise NotImplementedError
```
- Register Provider:
```python
TTS_PROVIDERS_CONFIG["new_service"] = {
    "instance": NewTTSProvider(),
    "label": "New Service",
    "voice_type": "dynamic",
}
```
- Add Frontend Support:
```javascript
// Add service-specific controls (inside the existing switch over the selected service)
case 'new_service':
  showNewServiceControls();
  break;
```
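Because every provider exposes the same three methods, a stand-in provider that needs no API key is easy to write, for example to exercise history and file-serving code in tests. This is a hypothetical `SilenceProvider` built on the standard-library `wave` module, not part of the project; the roughly 15 characters per second duration estimate is arbitrary.

```python
import io
import wave
from typing import Dict, List, Tuple


class SilenceProvider:
    """Hypothetical offline provider that returns silent WAV audio."""

    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate

    def list_voices(self) -> List[Dict]:
        return [{"id": "silence", "name": "Silence"}]

    def generate_speech(self, text: str, title: str, options: Dict) -> Tuple[bytes, str]:
        # Rough duration estimate: about 15 characters per second, at least 1 second
        seconds = max(1, len(text) // 15)
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 16-bit samples
            wav.setframerate(self.sample_rate)
            wav.writeframes(b"\x00\x00" * self.sample_rate * seconds)
        filename = f"{title.replace(' ', '_')}_silence.wav"
        return buffer.getvalue(), filename
```

It could be registered like any other provider, e.g. `TTS_PROVIDERS_CONFIG["silence"] = {"instance": SilenceProvider(), "label": "Silence", "voice_type": "static"}`.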
Environment Customization
```bash
# Custom branding
APP_NAME=My Voice Generator
APP_LOGO=/path/to/logo.png
# Feature flags
ENABLE_FILE_UPLOAD=true
ENABLE_HISTORY=true
ENABLE_OPTIMIZATION=true
# Performance tuning
MAX_TEXT_LENGTH=10000
MAX_CACHE_SIZE=500
CACHE_TTL=3600
```
Performance
Benchmarks
| Operation | Average Time | 95th Percentile |
|---|---|---|
| Page Load | 150ms | 300ms |
| Service List | 50ms | 100ms |
| Voice List | 200ms | 500ms |
| Text Optimization | 800ms | 2000ms |
| Audio Generation | 2-5s | 10s |
Optimization Features
Caching Strategy
- In-memory LRU cache for TTS results
- Browser caching for static assets
- Service response caching for voice lists
- Database query optimization with proper indexing
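An in-memory LRU cache for TTS results can be built on `collections.OrderedDict`. This is a sketch under the assumption that results are keyed by service, voice, speed, and text; `AudioLRUCache` is a hypothetical name, and `max_size` plays the role of `MAX_CACHE_SIZE`.

```python
import hashlib
from collections import OrderedDict
from typing import Optional, Tuple


class AudioLRUCache:
    """In-memory LRU cache mapping request parameters to (audio_bytes, filename)."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._items = OrderedDict()

    @staticmethod
    def make_key(service: str, voice: str, text: str, speed: float = 1.0) -> str:
        # Hash the parameters so long texts do not become huge dictionary keys
        raw = f"{service}|{voice}|{speed}|{text}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Tuple[bytes, str]]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Tuple[bytes, str]) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry
```

`functools.lru_cache` would also work for a pure function, but an explicit class keeps the size limit configurable at runtime and the eviction behavior visible.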
Performance Monitoring
```python
from flask import Flask, request

app = Flask(__name__)

# Built-in request/response logging
@app.before_request
def log_request_info():
    app.logger.info('Request: %s %s', request.method, request.url)

@app.after_request
def log_response_info(response):
    app.logger.info('Response: %s', response.status_code)
    return response
```
Scalability Considerations
- Stateless design for horizontal scaling
- Database connection pooling for concurrent requests
- Async processing potential for background tasks
- CDN integration ready for static assets
Resource Usage
| Component | CPU Usage | Memory Usage | Storage |
|---|---|---|---|
| Flask App | Low | 50-200MB | Minimal |
| SQLite DB | Minimal | 10-50MB | Growing |
| Audio Files | None | None | 1-10MB/file |
| Cache | Low | 10-100MB | Temporary |
Security
Security Features
API Security
- Input validation on all endpoints
- SQL injection prevention with parameterized queries
- File upload restrictions (type, size validation)
- Rate limiting capability (configurable)
Data Protection
- Environment variable isolation for API keys
- Secure file serving with path validation
- No credential logging in application logs
- Temporary file cleanup after processing
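Path validation for served audio means refusing any requested filename that resolves outside the outputs directory. Flask's `send_from_directory` already performs a safe join of this kind; the sketch below shows the underlying check with `pathlib` (`resolve_output_path` is a hypothetical name).

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(outputs_dir: str, filename: str) -> Optional[Path]:
    """Return the file's absolute path if it stays inside outputs_dir, else None."""
    base = Path(outputs_dir).resolve()
    candidate = (base / filename).resolve()
    # Reject traversal such as "../.env" or absolute paths outside the directory
    if candidate == base or base not in candidate.parents:
        return None
    return candidate
```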
Best Practices
```python
# Example security measures
# validate_csrf_token and validate_file_upload are placeholders for your own helpers
from flask import Flask, request

app = Flask(__name__)


@app.before_request
def validate_request():
    # CSRF protection for state-changing requests
    if request.method == 'POST':
        validate_csrf_token(request)
    # File upload validation
    if 'file' in request.files:
        validate_file_upload(request.files['file'])


@app.after_request
def add_security_headers(response):
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['X-Frame-Options'] = 'DENY'
    return response
```
Production Deployment
Environment Security
```bash
# Production .env example
FLASK_ENV=production
FLASK_DEBUG=false
SECRET_KEY=complex-random-string-here
# Use secrets management
OPENAI_API_KEY=${OPENAI_API_KEY}
ELEVENLABS_API_KEY=${ELEVENLABS_API_KEY}
```
Server Configuration
- HTTPS enforcement with SSL certificates
- Reverse proxy (nginx) for static file serving
- Process management with gunicorn/uwsgi
- Monitoring with application performance tools
Database Security
- Regular backups of SQLite database
- File permissions properly configured
- Database encryption at rest (if required)
- Access logging for audit trails
Contributing
We welcome contributions! Please follow our guidelines:
Development Setup
```bash
# Fork and clone repository
git clone https://github.com/yourusername/aivoicegen.git
cd aivoicegen
# Create feature branch
git checkout -b feature/amazing-feature
# Set up development environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run tests
python -m unittest discover tests
# Start development server
python app.py
```
Contribution Types
- Bug Reports: Use GitHub issues with detailed descriptions
- Feature Requests: Propose new functionality with use cases
- Documentation: Improve README, API docs, or code comments
- Tests: Add test coverage for existing or new functionality
- UI/UX: Enhance the user interface and experience
- Performance: Optimize code for better performance
Code Standards
- Python: Follow PEP 8 style guide
- JavaScript: Use ES6+ features consistently
- HTML/CSS: Semantic markup with Tailwind classes
- Documentation: Clear docstrings and comments
- Testing: Maintain 80%+ test coverage
Pull Request Process
- Create descriptive PR title and detailed description
- Reference related issues with closing keywords
- Ensure all tests pass before requesting review
- Add tests for new functionality
- Update documentation as needed
- Follow security guidelines for sensitive changes
License
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License Summary
- Commercial use allowed
- Modification allowed
- Distribution allowed
- Private use allowed
- Liability not provided
- Warranty not provided
Documentation
Additional documentation is available in the docs/ directory:
| Document | Description |
|---|---|
| `todolist.md` | Development tasks organized by priority |
| `secrets.md` | Environment variables and API setup guide |
| `PRD.md` | Product Requirements Document with user stories |
| `design.md` | System architecture and enhancement roadmap |
Made with ❤️ for the community
File details
Details for the file aivoicegen-1.0.0.tar.gz.
File metadata
- Download URL: aivoicegen-1.0.0.tar.gz
- Upload date:
- Size: 52.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6460b7d1f6f10f47d325609207f423e680a45faaeb8b290055599786d418fda2` |
| MD5 | `0c9ea1bd7e47a92d95ebe9a40ac5fe77` |
| BLAKE2b-256 | `9dbab766528c3ee89b47b9c24fa961b3c10a617081fc251869cb3d1207dd2516` |
File details
Details for the file aivoicegen-1.0.0-py3-none-any.whl.
File metadata
- Download URL: aivoicegen-1.0.0-py3-none-any.whl
- Upload date:
- Size: 39.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `785c2e8423023b994498154a631adedb28f042bf396e09ddea0069f3e59e1dcb` |
| MD5 | `2dc4e002f862e828f0007cef75ced9fa` |
| BLAKE2b-256 | `47ba9464b2c8ef2fdd7de4ab18047e4127a04d73e97d965ed7ab5d635315c9bd` |