
A comprehensive Python library for building production-ready voice agents with multi-provider support. It features real-time streaming TTS/STT; integration with OpenAI, ElevenLabs, and Groq; audio processing; and conversational AI capabilities.


🗣️ Voice-Agents

Enterprise-Grade Voice Agent Framework


🏠 Swarms Website • 📚 Documentation • 📦 Examples


License: MIT Python 3.10+ GitHub stars


Overview

Voice-Agents is a production-ready Python library for building enterprise-grade voice-enabled agentic applications. Built by Swarms Corporation, it provides seamless integration with multiple TTS/STT providers including OpenAI, ElevenLabs, and Groq, with real-time streaming capabilities optimized for agent-based architectures.

Voice-Agents delivers the infrastructure required to build conversational agentic assistants, voice-enabled agents, and real-time audio processing systems, enabling rapid deployment from prototype to production.

Built by Swarms Corporation

Voice-Agents is part of the Swarms ecosystem, an enterprise-grade, production-ready multi-agent orchestration framework. Learn more at swarms.ai and docs.swarms.world.


Features

Core Capabilities

| Feature | Description |
| --- | --- |
| Multi-Provider TTS Support | Seamlessly switch between OpenAI, ElevenLabs, and Groq |
| Real-Time Streaming | Low-latency audio streaming for live agent interactions |
| Speech-to-Text | High-accuracy transcription using OpenAI Whisper |
| Audio Processing | Built-in utilities for recording, playback, and format conversion |
| Production-Ready | Enterprise-grade error handling, authentication, and logging |

Advanced Features

| Feature | Description |
| --- | --- |
| Streaming Callbacks | Real-time TTS callbacks for agent streaming outputs |
| Multiple Audio Formats | Support for PCM, MP3, Opus, AAC, FLAC, and more |
| Voice Customization | 10+ OpenAI voices and 30+ ElevenLabs voices |
| Sentence Detection | Intelligent text formatting for natural speech pauses |
| FastAPI Integration | Generator-based streaming for web applications |
| Type Safety | Full type hints and Literal types for better IDE support |

Installation

Basic Installation

pip install voice-agents

Development Installation

git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e .

Requirements

  • Python 3.10+
  • API keys for your chosen providers:
    • OpenAI API key (for TTS and Whisper STT)
    • ElevenLabs API key (optional, for ElevenLabs TTS)
    • Groq API key (optional, for Groq)

Quick Start

Environment Setup

Create a .env file or set environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export ELEVENLABS_API_KEY="your-elevenlabs-api-key"  # Optional
export GROQ_API_KEY="your-groq-api-key"              # Optional
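
Before making any provider call, it can help to check which keys are actually present. The following is a minimal sketch, not part of the Voice-Agents API: the helper names `configured_providers` and `require_provider` are hypothetical, and only the three variable names come from the setup above.

```python
import os

# Environment variable names are taken from the setup above.
# The helper functions below are illustrative and not part of voice_agents.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "groq": "GROQ_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        name
        for name, var in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]


def require_provider(name, env=None):
    """Fail early with a clear message instead of on the first API call."""
    var = PROVIDER_KEYS[name]  # KeyError for a provider this sketch does not know
    if name not in configured_providers(env):
        raise RuntimeError(f"{var} is not set")


if __name__ == "__main__":
    print("Configured providers:", configured_providers())
```

Calling `require_provider("openai")` at startup surfaces a missing key immediately rather than in the middle of a conversation.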

Basic Text-to-Speech

from voice_agents import stream_tts, format_text_for_speech

# Format text for natural speech
text = "Hello! This is a voice agent speaking. How can I help you today?"
chunks = format_text_for_speech(text)

# Convert to speech and play
stream_tts(chunks, model="openai/tts-1", voice="alloy")

Speech-to-Text

from voice_agents import speech_to_text, record_audio

# Record audio from microphone
audio = record_audio(duration=5.0, sample_rate=16000)

# Transcribe to text
transcription = speech_to_text(audio_data=audio, sample_rate=16000)
print(f"Transcribed: {transcription}")

Core Functions

Text-to-Speech (OpenAI)

from voice_agents import stream_tts, format_text_for_speech, VOICES

# Available voices: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
text_chunks = format_text_for_speech("Your text here")

# Basic usage - plays audio
stream_tts(text_chunks, model="openai/tts-1", voice="nova")

# Streaming mode for real-time processing
stream_tts(
    text_chunks,
    model="openai/tts-1",
    voice="alloy",
    stream_mode=True,  # Process chunks as they arrive
    response_format="pcm"
)

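# For FastAPI/web streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/audio")  # Route path is illustrative
def audio_endpoint():
    # return_generator=True returns a generator instead of playing audio locally
    generator = stream_tts(
        text_chunks,
        voice="alloy",
        return_generator=True
    )
    return StreamingResponse(generator, media_type="audio/pcm")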

Text-to-Speech (ElevenLabs)

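from voice_agents import (
    stream_tts_elevenlabs,
    format_text_for_speech,
    ELEVENLABS_VOICE_NAMES,
)

# Prepare the text chunks used by the calls below
text_chunks = format_text_for_speech("Your text here")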

# Available voices: rachel, domi, bella, antoni, elli, josh, and 25+ more
print(f"Available voices: {ELEVENLABS_VOICE_NAMES}")

# Basic usage
stream_tts_elevenlabs(
    text_chunks,
    voice_id="rachel",  # Use friendly name or voice ID
    model_id="eleven_multilingual_v2",
    stability=0.5,
    similarity_boost=0.75
)

# High-quality streaming for web
generator = stream_tts_elevenlabs(
    text_chunks,
    voice_id="domi",
    output_format="mp3_44100_128",  # Recommended for web
    return_generator=True
)

Speech-to-Text

from voice_agents import speech_to_text, record_audio

# From audio file
transcription = speech_to_text(
    audio_file_path="recording.wav",
    model="whisper-1",
    language="en",  # Optional: auto-detect if None
    response_format="text"
)

# From numpy array (recorded audio)
audio = record_audio(duration=5.0, sample_rate=16000)
transcription = speech_to_text(
    audio_data=audio,
    sample_rate=16000,
    prompt="This is a technical conversation about AI"  # Optional context
)

# Get structured output
result = speech_to_text(
    audio_file_path="meeting.mp3",
    response_format="verbose_json"  # Returns detailed metadata
)

Audio Recording

from voice_agents import record_audio

# Record 5 seconds of audio
audio = record_audio(duration=5.0, sample_rate=16000, channels=1)

# Use with speech-to-text
from voice_agents import speech_to_text
text = speech_to_text(audio_data=audio, sample_rate=16000)

Streaming TTS Callback for Agents

from voice_agents import StreamingTTSCallback

# Create callback for real-time agent responses
tts_callback = StreamingTTSCallback(
    voice="alloy",
    model="openai/tts-1",
    min_sentence_length=10  # Minimum chars before speaking
)

# Use with any streaming text generator
# (`some_agent` is a placeholder for your own agent or LLM client)
def agent_stream():
    for chunk in some_agent.generate():
        tts_callback(chunk)  # Automatically speaks complete sentences
    tts_callback.flush()  # Speak any remaining text

Audio Format Utilities

from voice_agents import get_media_type_for_format

# Get MIME type for FastAPI
media_type = get_media_type_for_format("mp3_44100_128")
# Returns: "audio/mpeg"

media_type = get_media_type_for_format("pcm_44100")
# Returns: "audio/pcm"

Swarms Integration

Voice-Agents is designed to work seamlessly with Swarms, the enterprise-grade multi-agent orchestration framework.

Complete Example: Voice-Enabled Trading Agent

from swarms import Agent
from voice_agents import StreamingTTSCallback

# Initialize the Swarms agent
agent = Agent(
    agent_name="Quantitative-Trading-Agent",
    agent_description="Advanced quantitative trading and algorithmic analysis agent",
    model_name="gpt-4",
    dynamic_temperature_enabled=True,
    max_loops=1,
    dynamic_context_window=True,
    top_p=None,
    streaming_on=True,
    interactive=False,
)

# Create the streaming TTS callback
tts_callback = StreamingTTSCallback(voice="alloy", model="openai/tts-1")

# Run the agent with streaming TTS callback
out = agent.run(
    task="What are the top five best energy stocks across nuclear, solar, gas, and other energy sources?",
    streaming_callback=tts_callback,
)

# Flush any remaining text in the buffer
tts_callback.flush()

print(out)

Examples

The examples/ directory contains comprehensive examples demonstrating all features of Voice-Agents, organized into logical categories. See the Examples README for detailed documentation.

Examples by Category

Text-to-Speech (examples/text_to_speech/)

| Example File | Description |
| --- | --- |
| example_stream_tts.py | Unified TTS with OpenAI models, list_models() |
| example_stream_tts_elevenlabs.py | ElevenLabs TTS, unified and direct functions |
| example_streaming_tts_callback.py | StreamingTTSCallback for real-time TTS |
| example_voice_selection.py | Voice selection with list_voices() |

Speech-to-Text (examples/speech_to_text/)

| Example File | Description |
| --- | --- |
| example_speech_to_text.py | OpenAI Whisper transcription |
| example_speech_to_text_elevenlabs_file.py | ElevenLabs STT from audio file |
| example_speech_to_text_elevenlabs_audio.py | ElevenLabs STT from audio data |

Utilities (examples/utilities/)

| Example File | Description |
| --- | --- |
| example_format_text_for_speech.py | Text formatting for speech with abbreviation handling |
| example_play_audio.py | Audio playback and tone generation |
| example_record_audio.py | Microphone audio recording |
| example_get_media_type.py | Media type (MIME) utilities for FastAPI |

Workflows (examples/workflows/)

| Example File | Description |
| --- | --- |
| example_complete_voice_agent.py | Complete voice agent workflows |

Running Examples

# Text-to-Speech examples
python examples/text_to_speech/example_stream_tts.py
python examples/text_to_speech/example_stream_tts_elevenlabs.py

# Speech-to-Text examples
python examples/speech_to_text/example_speech_to_text.py
python examples/speech_to_text/example_speech_to_text_elevenlabs_file.py

# Utility examples
python examples/utilities/example_format_text_for_speech.py
python examples/utilities/example_record_audio.py

# Workflow examples
python examples/workflows/example_complete_voice_agent.py

For more details, see the Examples README.


API Reference

Constants

  • SAMPLE_RATE: Default sample rate (24000 Hz)
  • VOICES: List of available OpenAI voices
  • ELEVENLABS_VOICES: Dictionary mapping friendly names to ElevenLabs voice IDs
  • ELEVENLABS_VOICE_NAMES: List of available ElevenLabs voice names
  • OPENAI_TTS_MODELS: List of available OpenAI TTS models
  • ELEVENLABS_TTS_MODELS: List of available ElevenLabs TTS models
  • VoiceType: Type alias for OpenAI voice options

Functions

format_text_for_speech(text: str) -> List[str]

Intelligently formats text into speech-friendly chunks by detecting sentence boundaries, handling abbreviations, and preserving natural pauses.
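
To make the idea of sentence-boundary detection with abbreviation handling concrete, here is a minimal sketch. It is not the library's implementation: the function name `split_into_sentences` and the abbreviation list are assumptions chosen for illustration.

```python
import re
from typing import List

# Hypothetical abbreviation list; the library's real list is not documented here.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc.", "vs."}


def split_into_sentences(text: str) -> List[str]:
    """Split on ., ! or ? followed by whitespace, skipping known abbreviations."""
    chunks: List[str] = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real sentence boundary, keep accumulating
        chunks.append(candidate)
        start = match.end()
    # Whatever follows the last boundary is emitted as a final chunk
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_sentences("Hello! Dr. Smith is here. How are you?"):
        print(chunk)
```

Keeping each chunk to one sentence lets a TTS engine start speaking the first sentence while later ones are still being synthesized.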

stream_tts(text_chunks, model, voice, stream_mode, response_format, return_generator)

Unified TTS function supporting both OpenAI and ElevenLabs. Model format: "provider/model_name" (e.g., "openai/tts-1", "elevenlabs/eleven_multilingual_v2"). Returns generator for web streaming or plays audio directly.
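
The "provider/model_name" convention can be parsed with a single split on the first slash. The sketch below illustrates that idea only; `parse_model` and `SUPPORTED_PROVIDERS` are hypothetical names, and how `stream_tts` validates its input internally is not stated here.

```python
from typing import Tuple

# Providers named in the description above; this tuple is illustrative only.
SUPPORTED_PROVIDERS = ("openai", "elevenlabs")


def parse_model(model: str) -> Tuple[str, str]:
    """Split 'provider/model_name' into its two parts and validate the provider."""
    provider, separator, model_name = model.partition("/")
    if not separator or not provider or not model_name:
        raise ValueError(f"Expected 'provider/model_name', got {model!r}")
    provider = provider.lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unknown provider {provider!r}")
    return provider, model_name


if __name__ == "__main__":
    print(parse_model("openai/tts-1"))
    print(parse_model("elevenlabs/eleven_multilingual_v2"))
```

Using `partition` rather than `split` keeps any further slashes inside the model name intact.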

list_models() -> List[dict]

List all available TTS models with their providers. Returns list of dictionaries with model, provider, and model_name keys.

list_voices() -> List[dict]

List all available voices from all providers. Returns list of dictionaries with voice, provider, voice_id, and description keys.

stream_tts_openai(text_chunks, voice, model, stream_mode, response_format, return_generator)

OpenAI TTS with streaming support. Returns generator for web streaming or plays audio directly.

stream_tts_elevenlabs(text_chunks, voice_id, model_id, stability, similarity_boost, output_format, return_generator)

ElevenLabs TTS with advanced voice control and multiple output formats.

speech_to_text(audio_file_path, audio_data, sample_rate, model, language, prompt, response_format)

OpenAI Whisper transcription with support for files or numpy arrays.

speech_to_text_elevenlabs(audio_file_path, audio_data, sample_rate, realtime, model_id, ...)

ElevenLabs Speech-to-Text with support for both real-time (WebSocket) and non-real-time (file upload) modes. Supports speaker diarization, timestamps, and language detection.

record_audio(duration, sample_rate, channels) -> np.ndarray

Record audio from default microphone. Returns numpy array.
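
Raw samples usually need a container format before they can be saved or uploaded. The following sketch shows one common way to encode mono float samples as a 16-bit PCM WAV file in memory, using only the standard library. It assumes samples in the range [-1.0, 1.0], which may not match the dtype `record_audio` returns, and `floats_to_wav_bytes` is a hypothetical helper rather than a library function.

```python
import io
import struct
import wave
from typing import Iterable


def floats_to_wav_bytes(samples: Iterable[float], sample_rate: int = 16000) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clip out-of-range values so they cannot overflow the 16-bit range
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 0.0], sample_rate=16000)
    print(len(wav_bytes), "bytes")
```

A one-dimensional numpy array can be passed directly, since the function only iterates over its values.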

play_audio(audio_data: np.ndarray)

Play audio data using sounddevice.

get_media_type_for_format(output_format: str) -> str

Get MIME type for audio format (useful for FastAPI).
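
A lookup like this can be sketched as a table keyed on the codec prefix of the format string. Only the "mp3" and "pcm" results below come from the examples in this README; the other entries, the fallback value, and the name `media_type_for` are assumptions for illustration.

```python
# Codec prefix -> MIME type. "mp3" and "pcm" match the documented examples;
# the remaining entries are assumptions based on commonly used media types.
CODEC_MEDIA_TYPES = {
    "mp3": "audio/mpeg",
    "pcm": "audio/pcm",
    "opus": "audio/opus",
    "aac": "audio/aac",
    "flac": "audio/flac",
    "wav": "audio/wav",
}


def media_type_for(output_format: str, default: str = "application/octet-stream") -> str:
    """Map a format such as 'mp3_44100_128' to a MIME type via its codec prefix."""
    codec = output_format.lower().split("_", 1)[0]
    return CODEC_MEDIA_TYPES.get(codec, default)


if __name__ == "__main__":
    print(media_type_for("mp3_44100_128"))
    print(media_type_for("pcm_44100"))
```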

Classes

StreamingTTSCallback

Real-time TTS callback for agent streaming outputs. Automatically detects complete sentences and converts them to speech.

Methods:

  • __call__(chunk: str): Process streaming text chunk
  • flush(): Speak any remaining buffered text
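
The buffering behavior described above can be sketched without any audio backend. This is an illustration of the technique, not the library's code: `SentenceBuffer` and its `speak` parameter are hypothetical, and `min_sentence_length` mirrors the parameter shown in the earlier usage example.

```python
import re
from typing import Callable


class SentenceBuffer:
    """Accumulate streamed text and pass complete sentences to `speak`."""

    _BOUNDARY = re.compile(r"[.!?]+\s+")

    def __init__(self, speak: Callable[[str], None], min_sentence_length: int = 10):
        self.speak = speak
        self.min_sentence_length = min_sentence_length
        self.buffer = ""

    def __call__(self, chunk: str) -> None:
        self.buffer += chunk
        search_from = 0
        while True:
            match = self._BOUNDARY.search(self.buffer, search_from)
            if match is None:
                break  # no complete sentence yet, wait for more text
            sentence = self.buffer[:match.end()].strip()
            if len(sentence) < self.min_sentence_length:
                search_from = match.end()  # too short, merge with the next one
                continue
            self.speak(sentence)
            self.buffer = self.buffer[match.end():]
            search_from = 0

    def flush(self) -> None:
        """Speak whatever is left, even without closing punctuation."""
        remaining = self.buffer.strip()
        if remaining:
            self.speak(remaining)
        self.buffer = ""


if __name__ == "__main__":
    callback = SentenceBuffer(print)
    for piece in ["Hi. The market ", "opened higher today. Energy", " led"]:
        callback(piece)
    callback.flush()
```

Injecting `speak` as a plain callable keeps the buffering logic testable; in a real agent it would be replaced by a call into a TTS function.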

Use Cases

Conversational AI Assistants

Build voice-enabled chatbots and virtual assistants with natural, real-time speech synthesis.

Agent Narration

Provide audio feedback for long-running agent tasks, making agent behavior transparent and engaging.

Voice-Enabled Analytics

Create voice interfaces for data analysis, trading systems, and business intelligence tools.

Real-Time Transcription

Transcribe meetings, interviews, and conversations with high accuracy using Whisper.

Multi-Modal Applications

Combine voice input/output with visual interfaces for rich, interactive experiences.


Configuration

Environment Variables

# Required for OpenAI TTS and Whisper STT
OPENAI_API_KEY=your-key-here

# Optional: needed only when using ElevenLabs
ELEVENLABS_API_KEY=your-key-here

# Optional: needed only when using Groq
GROQ_API_KEY=your-key-here

Voice Selection

OpenAI Voices:

  • alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer

ElevenLabs Voices:

  • Professional: rachel, nicole, grace
  • Expressive: domi, elli, bella
  • Deep: antoni, josh, clyde
  • And 20+ more (see ELEVENLABS_VOICE_NAMES)

Contributing

Voice-Agents is built by the community, for the community. We welcome contributions!

How to Contribute

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Commit your changes: git commit -m 'Add amazing feature'
  5. Push to the branch: git push origin feature/amazing-feature
  6. Open a Pull Request

Development Setup

git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e ".[dev]"
pre-commit install

Code Standards

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Include docstrings for all public APIs
  • Write tests for new features

License

This project is licensed under the Apache License - see the LICENSE file for details.



