A Python library for building production-ready voice agents with multi-provider support. It offers real-time streaming TTS/STT, integration with OpenAI, ElevenLabs, and Groq, audio processing utilities, and conversational AI capabilities.
🗣️ Voice-Agents
Enterprise-Grade Voice Agent Framework
🏠 Swarms Website • 📚 Documentation • 📦 Examples
Overview
Voice-Agents is a production-ready Python library for building enterprise-grade voice-enabled agentic applications. Built by Swarms Corporation, it provides seamless integration with multiple TTS/STT providers including OpenAI, ElevenLabs, and Groq, with real-time streaming capabilities optimized for agent-based architectures.
Voice-Agents delivers the infrastructure required to build conversational agentic assistants, voice-enabled agents, and real-time audio processing systems, enabling rapid deployment from prototype to production.
Built by Swarms Corporation
Voice-Agents is part of the Swarms ecosystem, an enterprise-grade multi-agent orchestration framework. Learn more at swarms.ai and docs.swarms.world.
Features
Core Capabilities
| Feature | Description |
|---|---|
| Multi-Provider TTS Support | Seamlessly switch between OpenAI, ElevenLabs, and Groq |
| Real-Time Streaming | Low-latency audio streaming for live agent interactions |
| Speech-to-Text | High-accuracy transcription using OpenAI Whisper |
| Audio Processing | Built-in utilities for recording, playback, and format conversion |
| Production-Ready | Enterprise-grade error handling, authentication, and logging |
Advanced Features
| Feature | Description |
|---|---|
| Streaming Callbacks | Real-time TTS callbacks for agent streaming outputs |
| Multiple Audio Formats | Support for PCM, MP3, Opus, AAC, FLAC, and more |
| Voice Customization | 10+ OpenAI voices and 30+ ElevenLabs voices |
| Sentence Detection | Intelligent text formatting for natural speech pauses |
| FastAPI Integration | Generator-based streaming for web applications |
| Type Safety | Full type hints and Literal types for better IDE support |
Installation
Basic Installation
pip install voice-agents
Development Installation
git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e .
Requirements
- Python 3.10+
- API keys for your chosen providers:
  - OpenAI API key (for TTS and Whisper STT)
  - ElevenLabs API key (optional, for ElevenLabs TTS)
  - Groq API key (optional, for Groq)
Quick Start
Environment Setup
Create a .env file or set environment variables:
export OPENAI_API_KEY="your-openai-api-key"
export ELEVENLABS_API_KEY="your-elevenlabs-api-key" # Optional
export GROQ_API_KEY="your-groq-api-key" # Optional
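Before calling a provider, it can help to check which keys are actually set. The helper below is a minimal sketch and is not part of Voice-Agents: the name `missing_keys` and the provider-to-variable mapping are assumptions built from the variable names listed above.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Environment variable expected for each provider (names taken from the setup above).
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "groq": "GROQ_API_KEY",
}


def missing_keys(
    providers: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the variable names that are unset or blank for the given providers."""
    env = os.environ if env is None else env
    return [
        PROVIDER_ENV_VARS[provider]
        for provider in providers
        if not env.get(PROVIDER_ENV_VARS[provider], "").strip()
    ]
```

Calling `missing_keys(["openai", "elevenlabs"])` at startup and failing fast on a non-empty result tends to give clearer errors than a failed API call later.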
Basic Text-to-Speech
from voice_agents import stream_tts, format_text_for_speech
# Format text for natural speech
text = "Hello! This is a voice agent speaking. How can I help you today?"
chunks = format_text_for_speech(text)
# Convert to speech and play
stream_tts(chunks, model="openai/tts-1", voice="alloy")
Speech-to-Text
from voice_agents import speech_to_text, record_audio
# Record audio from microphone
audio = record_audio(duration=5.0, sample_rate=16000)
# Transcribe to text
transcription = speech_to_text(audio_data=audio, sample_rate=16000)
print(f"Transcribed: {transcription}")
Core Functions
Text-to-Speech (OpenAI)
from voice_agents import stream_tts, format_text_for_speech, VOICES
# Available voices: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
text_chunks = format_text_for_speech("Your text here")
# Basic usage - plays audio
stream_tts(text_chunks, model="openai/tts-1", voice="nova")
# Streaming mode for real-time processing
stream_tts(
    text_chunks,
    model="openai/tts-1",
    voice="alloy",
    stream_mode=True,  # Process chunks as they arrive
    response_format="pcm",
)
# For FastAPI/web streaming
from fastapi.responses import StreamingResponse

def audio_endpoint():
    generator = stream_tts(
        text_chunks,
        voice="alloy",
        return_generator=True,
    )
    return StreamingResponse(generator, media_type="audio/pcm")
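A generator handed to `StreamingResponse` only needs to yield `bytes` as they become available. The sketch below shows one way to re-chunk such a stream into fixed-size pieces before sending it; `rechunk` is a hypothetical helper, not a Voice-Agents function, and it assumes nothing about how `stream_tts` sizes its own chunks.

```python
from typing import Iterable, Iterator


def rechunk(stream: Iterable[bytes], chunk_size: int = 4096) -> Iterator[bytes]:
    """Re-yield a byte stream in pieces of exactly chunk_size bytes (last piece may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer = bytearray()
    for piece in stream:
        buffer.extend(piece)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_size:
            yield bytes(buffer[:chunk_size])
            del buffer[:chunk_size]
    if buffer:  # leftover bytes at the end of the stream
        yield bytes(buffer)
```

With 16-bit PCM, an even `chunk_size` keeps every full chunk aligned to sample boundaries.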
Text-to-Speech (ElevenLabs)
from voice_agents import (
    ELEVENLABS_VOICE_NAMES,
    format_text_for_speech,
    stream_tts_elevenlabs,
)
# Available voices: rachel, domi, bella, antoni, elli, josh, and 25+ more
print(f"Available voices: {ELEVENLABS_VOICE_NAMES}")
text_chunks = format_text_for_speech("Your text here")
# Basic usage
stream_tts_elevenlabs(
    text_chunks,
    voice_id="rachel",  # Use friendly name or voice ID
    model_id="eleven_multilingual_v2",
    stability=0.5,
    similarity_boost=0.75,
)
# High-quality streaming for web
generator = stream_tts_elevenlabs(
    text_chunks,
    voice_id="domi",
    output_format="mp3_44100_128",  # Recommended for web
    return_generator=True,
)
Speech-to-Text
from voice_agents import speech_to_text, record_audio
import numpy as np
# From audio file
transcription = speech_to_text(
    audio_file_path="recording.wav",
    model="whisper-1",
    language="en",  # Optional: auto-detect if None
    response_format="text",
)
# From numpy array (recorded audio)
audio = record_audio(duration=5.0, sample_rate=16000)
transcription = speech_to_text(
    audio_data=audio,
    sample_rate=16000,
    prompt="This is a technical conversation about AI",  # Optional context
)
# Get structured output
result = speech_to_text(
    audio_file_path="meeting.mp3",
    response_format="verbose_json",  # Returns detailed metadata
)
Audio Recording
from voice_agents import record_audio
# Record 5 seconds of audio
audio = record_audio(duration=5.0, sample_rate=16000, channels=1)
# Use with speech-to-text
from voice_agents import speech_to_text
text = speech_to_text(audio_data=audio, sample_rate=16000)
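If you want to save recorded samples or upload them yourself, they usually need to be encoded as a WAV file first. The sketch below is an illustration using only the standard library: `floats_to_wav_bytes` is a hypothetical helper, and it assumes mono float samples in the range [-1.0, 1.0], which this README does not confirm as the output format of `record_audio`.

```python
import io
import struct
import wave
from typing import Sequence


def floats_to_wav_bytes(samples: Sequence[float], sample_rate: int = 16000) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as an in-memory 16-bit PCM WAV file."""
    # Clip out-of-range values so they do not overflow the 16-bit range.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

A NumPy array can be passed after `audio.flatten().tolist()`, assuming it holds floats in that range.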
Streaming TTS Callback for Agents
from voice_agents import StreamingTTSCallback
# Create callback for real-time agent responses
tts_callback = StreamingTTSCallback(
    voice="alloy",
    model="openai/tts-1",
    min_sentence_length=10,  # Minimum chars before speaking
)
# Use with any streaming text generator
def agent_stream():
    for chunk in some_agent.generate():
        tts_callback(chunk)  # Automatically speaks complete sentences
    tts_callback.flush()  # Speak any remaining text
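To make the buffering idea concrete, here is a minimal sketch of a callback that collects streamed text and emits complete sentences. It is not the `StreamingTTSCallback` implementation, whose internals this README does not show: the class name `SentenceBuffer`, the injected `speak` function, and the exact interpretation of `min_sentence_length` are all assumptions.

```python
from typing import Callable


class SentenceBuffer:
    """Accumulate streamed text and pass complete sentences to a speak() function."""

    def __init__(self, speak: Callable[[str], None], min_sentence_length: int = 10):
        self.speak = speak
        self.min_sentence_length = min_sentence_length
        self.buffer = ""

    def __call__(self, chunk: str) -> None:
        self.buffer += chunk
        while True:
            # Earliest sentence end that leaves at least min_sentence_length characters.
            cut = next(
                (
                    i + 1
                    for i, ch in enumerate(self.buffer)
                    if ch in ".!?" and i + 1 >= self.min_sentence_length
                ),
                None,
            )
            if cut is None:
                return  # no complete sentence yet; wait for more text
            self.speak(self.buffer[:cut].strip())
            self.buffer = self.buffer[cut:]

    def flush(self) -> None:
        """Speak whatever is left in the buffer, then clear it."""
        if self.buffer.strip():
            self.speak(self.buffer.strip())
        self.buffer = ""
```

Passing `print` as `speak` shows which text would be sent to the TTS engine and when. Very short fragments such as "Ok." are held back until more text arrives or `flush()` is called.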
Audio Format Utilities
from voice_agents import get_media_type_for_format
# Get MIME type for FastAPI
media_type = get_media_type_for_format("mp3_44100_128")
# Returns: "audio/mpeg"
media_type = get_media_type_for_format("pcm_44100")
# Returns: "audio/pcm"
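A lookup like this can be approximated by keying on the codec prefix of the format string. The sketch below is an assumption about the approach, not the library's code: `guess_media_type` is a hypothetical name, and only the `mp3` and `pcm` rows are confirmed by the examples above.

```python
# Codec prefix -> MIME type. Only the mp3 and pcm rows come from the examples above;
# the other rows are standard MIME types added for illustration.
_MEDIA_TYPES = {
    "mp3": "audio/mpeg",
    "pcm": "audio/pcm",
    "opus": "audio/opus",
    "aac": "audio/aac",
    "flac": "audio/flac",
}


def guess_media_type(
    output_format: str, default: str = "application/octet-stream"
) -> str:
    """Map a format string such as 'mp3_44100_128' to a MIME type via its codec prefix."""
    codec = output_format.lower().split("_", 1)[0]
    return _MEDIA_TYPES.get(codec, default)
```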
Swarms Integration
Voice-Agents is designed to work seamlessly with Swarms, the enterprise-grade multi-agent orchestration framework.
Complete Example: Voice-Enabled Trading Agent
from swarms import Agent
from voice_agents import StreamingTTSCallback, format_text_for_speech
# Initialize the Swarms agent
agent = Agent(
    agent_name="Quantitative-Trading-Agent",
    agent_description="Advanced quantitative trading and algorithmic analysis agent",
    model_name="gpt-4",
    dynamic_temperature_enabled=True,
    max_loops=1,
    dynamic_context_window=True,
    top_p=None,
    streaming_on=True,
    interactive=False,
)
# Create the streaming TTS callback
tts_callback = StreamingTTSCallback(voice="alloy", model="openai/tts-1")
# Run the agent with streaming TTS callback
out = agent.run(
    task="What are the top five best energy stocks across nuclear, solar, gas, and other energy sources?",
    streaming_callback=tts_callback,
)
# Flush any remaining text in the buffer
tts_callback.flush()
print(out)
Examples
The examples/ directory contains comprehensive examples demonstrating all features of Voice-Agents, organized into logical categories. See the Examples README for detailed documentation.
Examples by Category
Text-to-Speech (examples/text_to_speech/)
| Example File | Description |
|---|---|
| example_stream_tts.py | Unified TTS with OpenAI models, list_models() |
| example_stream_tts_elevenlabs.py | ElevenLabs TTS, unified and direct functions |
| example_streaming_tts_callback.py | StreamingTTSCallback for real-time TTS |
| example_voice_selection.py | Voice selection with list_voices() |
Speech-to-Text (examples/speech_to_text/)
| Example File | Description |
|---|---|
| example_speech_to_text.py | OpenAI Whisper transcription |
| example_speech_to_text_elevenlabs_file.py | ElevenLabs STT from audio file |
| example_speech_to_text_elevenlabs_audio.py | ElevenLabs STT from audio data |
Utilities (examples/utilities/)
| Example File | Description |
|---|---|
| example_format_text_for_speech.py | Text formatting for speech with abbreviation handling |
| example_play_audio.py | Audio playback and tone generation |
| example_record_audio.py | Microphone audio recording |
| example_get_media_type.py | Media type (MIME) utilities for FastAPI |
Workflows (examples/workflows/)
| Example File | Description |
|---|---|
| example_complete_voice_agent.py | Complete voice agent workflows |
Running Examples
# Text-to-Speech examples
python examples/text_to_speech/example_stream_tts.py
python examples/text_to_speech/example_stream_tts_elevenlabs.py
# Speech-to-Text examples
python examples/speech_to_text/example_speech_to_text.py
python examples/speech_to_text/example_speech_to_text_elevenlabs_file.py
# Utility examples
python examples/utilities/example_format_text_for_speech.py
python examples/utilities/example_record_audio.py
# Workflow examples
python examples/workflows/example_complete_voice_agent.py
For more details, see the Examples README.
API Reference
Constants
- SAMPLE_RATE: Default sample rate (24000 Hz)
- VOICES: List of available OpenAI voices
- ELEVENLABS_VOICES: Dictionary mapping friendly names to ElevenLabs voice IDs
- ELEVENLABS_VOICE_NAMES: List of available ElevenLabs voice names
- OPENAI_TTS_MODELS: List of available OpenAI TTS models
- ELEVENLABS_TTS_MODELS: List of available ElevenLabs TTS models
- VoiceType: Type alias for OpenAI voice options
Functions
format_text_for_speech(text: str) -> List[str]
Intelligently formats text into speech-friendly chunks by detecting sentence boundaries, handling abbreviations, and preserving natural pauses.
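As an illustration of sentence-boundary detection with abbreviation handling, here is a small standalone sketch. It is not the library's implementation: `split_into_sentences` and the abbreviation list are hypothetical, and the real function may chunk text differently.

```python
from typing import List

# Small illustrative abbreviation list; real text would need a longer one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc."}


def split_into_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' unless the token is a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        # Ignore closing quotes or brackets when checking for end punctuation.
        stripped = token.rstrip("\"')]")
        ends_sentence = stripped.endswith((".", "!", "?"))
        if ends_sentence and stripped.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences
```

Speaking one sentence per chunk lets a TTS engine place pauses at natural boundaries instead of mid-phrase.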
stream_tts(text_chunks, model, voice, stream_mode, response_format, return_generator)
Unified TTS function supporting both OpenAI and ElevenLabs. The model is given as "provider/model_name" (e.g., "openai/tts-1", "elevenlabs/eleven_multilingual_v2"). With return_generator=True it returns a generator for web streaming; otherwise it plays the audio directly.
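The "provider/model_name" convention is easy to validate before making a call. The helper below is a sketch under that assumption; `split_model` is a hypothetical name and does not reflect how `stream_tts` parses the string internally.

```python
from typing import Tuple


def split_model(model: str) -> Tuple[str, str]:
    """Split 'provider/model_name' into (provider, model_name), validating both parts."""
    provider, separator, model_name = model.partition("/")
    if not separator or not provider or not model_name:
        raise ValueError(f"expected 'provider/model_name', got {model!r}")
    return provider.lower(), model_name
```

Only the first slash is treated as the separator, so a model name that itself contains a slash stays intact.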
list_models() -> List[dict]
List all available TTS models with their providers. Returns list of dictionaries with model, provider, and model_name keys.
list_voices() -> List[dict]
List all available voices from all providers. Returns list of dictionaries with voice, provider, voice_id, and description keys.
stream_tts_openai(text_chunks, voice, model, stream_mode, response_format, return_generator)
OpenAI TTS with streaming support. Returns generator for web streaming or plays audio directly.
stream_tts_elevenlabs(text_chunks, voice_id, model_id, stability, similarity_boost, output_format, return_generator)
ElevenLabs TTS with advanced voice control and multiple output formats.
speech_to_text(audio_file_path, audio_data, sample_rate, model, language, prompt, response_format)
OpenAI Whisper transcription with support for files or numpy arrays.
speech_to_text_elevenlabs(audio_file_path, audio_data, sample_rate, realtime, model_id, ...)
ElevenLabs Speech-to-Text with support for both real-time (WebSocket) and non-real-time (file upload) modes. Supports speaker diarization, timestamps, and language detection.
record_audio(duration, sample_rate, channels) -> np.ndarray
Record audio from default microphone. Returns numpy array.
play_audio(audio_data: np.ndarray)
Play audio data using sounddevice.
get_media_type_for_format(output_format: str) -> str
Get MIME type for audio format (useful for FastAPI).
Classes
StreamingTTSCallback
Real-time TTS callback for agent streaming outputs. Automatically detects complete sentences and converts them to speech.
Methods:
- __call__(chunk: str): Process a streaming text chunk
- flush(): Speak any remaining buffered text
Use Cases
Conversational AI Assistants
Build voice-enabled chatbots and virtual assistants with natural, real-time speech synthesis.
Agent Narration
Provide audio feedback for long-running agent tasks, making agent behavior transparent and engaging.
Voice-Enabled Analytics
Create voice interfaces for data analysis, trading systems, and business intelligence tools.
Real-Time Transcription
Transcribe meetings, interviews, and conversations with high accuracy using Whisper.
Multi-Modal Applications
Combine voice input/output with visual interfaces for rich, interactive experiences.
Configuration
Environment Variables
# Required for OpenAI TTS and STT
OPENAI_API_KEY=your-key-here
# Required only if you use ElevenLabs
ELEVENLABS_API_KEY=your-key-here
# Required only if you use Groq
GROQ_API_KEY=your-key-here
Voice Selection
OpenAI Voices:
alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
ElevenLabs Voices:
- Professional: rachel, nicole, grace
- Expressive: domi, elli, bella
- Deep: antoni, josh, clyde
- And 20+ more (see ELEVENLABS_VOICE_NAMES)
Contributing
Voice-Agents is built by the community, for the community. We welcome contributions!
How to Contribute
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
Development Setup
git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e ".[dev]"
pre-commit install
Code Standards
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Include docstrings for all public APIs
- Write tests for new features
License
This project is licensed under the Apache License. See the LICENSE file for details.
Acknowledgments
- Built by Swarms Corporation
- Part of the Swarms ecosystem
- Powered by OpenAI, ElevenLabs, and Groq APIs
Support & Community
- Documentation: GitHub Repository
- Swarms Documentation: docs.swarms.world
- Swarms Community: Discord
- Issues: GitHub Issues
Made by Swarms Corporation