A comprehensive Python library for building production-ready voice agents with multi-provider support. Features real-time streaming TTS/STT, OpenAI, ElevenLabs, and Groq integration, audio processing, and seamless conversational AI capabilities.

🗣️ Voice-Agents

Enterprise-Grade Voice Agent Framework

🏠 Swarms Website • 📚 Documentation • 📦 Examples

Overview

Voice-Agents is a production-ready Python library for building enterprise-grade voice-enabled agentic applications. Built by Swarms Corporation, it provides seamless integration with multiple TTS/STT providers including OpenAI, ElevenLabs, and Groq, with real-time streaming capabilities optimized for agent-based architectures.

Voice-Agents delivers the infrastructure required to build conversational agentic assistants, voice-enabled agents, and real-time audio processing systems, enabling rapid deployment from prototype to production.

Built by Swarms Corporation

Voice-Agents is part of the Swarms ecosystem—the enterprise-grade, production-ready multi-agent orchestration framework. Learn more at swarms.ai and docs.swarms.world.

Features

Core Capabilities

Feature	Description
Multi-Provider TTS Support	Seamlessly switch between OpenAI, ElevenLabs, and Groq
Real-Time Streaming	Low-latency audio streaming for live agent interactions
Speech-to-Text	High-accuracy transcription using OpenAI Whisper
Audio Processing	Built-in utilities for recording, playback, and format conversion
Production-Ready	Enterprise-grade error handling, authentication, and logging

Advanced Features

Feature	Description
Streaming Callbacks	Real-time TTS callbacks for agent streaming outputs
Multiple Audio Formats	Support for PCM, MP3, Opus, AAC, FLAC, and more
Voice Customization	10+ OpenAI voices and 30+ ElevenLabs voices
Sentence Detection	Intelligent text formatting for natural speech pauses
FastAPI Integration	Generator-based streaming for web applications
Type Safety	Full type hints and Literal types for better IDE support

Installation

Basic Installation

pip install voice-agents

Development Installation

git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e .

Requirements

Python 3.10+
API keys for your chosen providers:
- OpenAI API key (for TTS and Whisper STT)
- ElevenLabs API key (optional, for ElevenLabs TTS and STT)
- Groq API key (optional, for Groq integration)

Quick Start

Environment Setup

Create a .env file or set environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export ELEVENLABS_API_KEY="your-elevenlabs-api-key"  # Optional
export GROQ_API_KEY="your-groq-api-key"              # Optional
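
It can help to fail fast when a key is missing rather than discover it on the first API call. The sketch below is not part of Voice-Agents; `missing_keys` and `check_environment` are hypothetical helpers, and treating only `OPENAI_API_KEY` as required is an assumption based on the comments above.

```python
import os

# Key names come from the setup above. Which ones are mandatory depends
# on the providers you actually use; this split is an assumption.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
OPTIONAL_KEYS = ("ELEVENLABS_API_KEY", "GROQ_API_KEY")


def missing_keys(names, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


def check_environment(env=None):
    """Raise if a required key is missing; return the optional keys that are absent."""
    missing = missing_keys(REQUIRED_KEYS, env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return missing_keys(OPTIONAL_KEYS, env)
```

Calling `check_environment()` once at startup gives a single clear error instead of a provider-specific authentication failure later on.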

Basic Text-to-Speech

from voice_agents import stream_tts, format_text_for_speech

# Format text for natural speech
text = "Hello! This is a voice agent speaking. How can I help you today?"
chunks = format_text_for_speech(text)

# Convert to speech and play
stream_tts(chunks, model="openai/tts-1", voice="alloy")

Speech-to-Text

from voice_agents import speech_to_text, record_audio

# Record audio from microphone
audio = record_audio(duration=5.0, sample_rate=16000)

# Transcribe to text
transcription = speech_to_text(audio_data=audio, sample_rate=16000)
print(f"Transcribed: {transcription}")

Core Functions

Text-to-Speech (OpenAI)

from voice_agents import stream_tts, format_text_for_speech, VOICES

# Available voices: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
text_chunks = format_text_for_speech("Your text here")

# Basic usage - plays audio
stream_tts(text_chunks, model="openai/tts-1", voice="nova")

# Streaming mode for real-time processing
stream_tts(
    text_chunks,
    model="openai/tts-1",
    voice="alloy",
    stream_mode=True,  # Process chunks as they arrive
    response_format="pcm"
)

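# For FastAPI/web streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/audio")  # route path is illustrative
def audio_endpoint():
    # return_generator=True yields audio chunks instead of playing them
    generator = stream_tts(
        text_chunks,
        model="openai/tts-1",
        voice="alloy",
        response_format="pcm",  # keep in sync with media_type below
        return_generator=True,
    )
    return StreamingResponse(generator, media_type="audio/pcm")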

Text-to-Speech (ElevenLabs)

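from voice_agents import (
    stream_tts_elevenlabs,
    format_text_for_speech,
    ELEVENLABS_VOICE_NAMES,
)

# Prepare the text chunks used by the calls below
text_chunks = format_text_for_speech("Your text here")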

# Available voices: rachel, domi, bella, antoni, elli, josh, and 25+ more
print(f"Available voices: {ELEVENLABS_VOICE_NAMES}")

# Basic usage
stream_tts_elevenlabs(
    text_chunks,
    voice_id="rachel",  # Use friendly name or voice ID
    model_id="eleven_multilingual_v2",
    stability=0.5,
    similarity_boost=0.75
)

# High-quality streaming for web
generator = stream_tts_elevenlabs(
    text_chunks,
    voice_id="domi",
    output_format="mp3_44100_128",  # Recommended for web
    return_generator=True
)

Speech-to-Text

from voice_agents import speech_to_text, record_audio

# From audio file
transcription = speech_to_text(
    audio_file_path="recording.wav",
    model="whisper-1",
    language="en",  # Optional: auto-detect if None
    response_format="text"
)

# From numpy array (recorded audio)
audio = record_audio(duration=5.0, sample_rate=16000)
transcription = speech_to_text(
    audio_data=audio,
    sample_rate=16000,
    prompt="This is a technical conversation about AI"  # Optional context
)

# Get structured output
result = speech_to_text(
    audio_file_path="meeting.mp3",
    response_format="verbose_json"  # Returns detailed metadata
)

Audio Recording

from voice_agents import record_audio

# Record 5 seconds of audio
audio = record_audio(duration=5.0, sample_rate=16000, channels=1)

# Use with speech-to-text
from voice_agents import speech_to_text
text = speech_to_text(audio_data=audio, sample_rate=16000)

Streaming TTS Callback for Agents

from voice_agents import StreamingTTSCallback

# Create callback for real-time agent responses
tts_callback = StreamingTTSCallback(
    voice="alloy",
    model="openai/tts-1",
    min_sentence_length=10  # Minimum chars before speaking
)

# Use with any streaming text generator
# (`some_agent` is a placeholder for your own agent or LLM client)
def agent_stream():
    for chunk in some_agent.generate():
        tts_callback(chunk)  # Automatically speaks complete sentences
    tts_callback.flush()  # Speak any remaining text

Audio Format Utilities

from voice_agents import get_media_type_for_format

# Get MIME type for FastAPI
media_type = get_media_type_for_format("mp3_44100_128")
# Returns: "audio/mpeg"

media_type = get_media_type_for_format("pcm_44100")
# Returns: "audio/pcm"

Swarms Integration

Voice-Agents is designed to work seamlessly with Swarms, the enterprise-grade multi-agent orchestration framework.

Complete Example: Voice-Enabled Trading Agent

from swarms import Agent
from voice_agents import StreamingTTSCallback

# Initialize the Swarms agent
agent = Agent(
    agent_name="Quantitative-Trading-Agent",
    agent_description="Advanced quantitative trading and algorithmic analysis agent",
    model_name="gpt-4",
    dynamic_temperature_enabled=True,
    max_loops=1,
    dynamic_context_window=True,
    top_p=None,
    streaming_on=True,
    interactive=False,
)

# Create the streaming TTS callback
tts_callback = StreamingTTSCallback(voice="alloy", model="openai/tts-1")

# Run the agent with streaming TTS callback
out = agent.run(
    task="What are the top five best energy stocks across nuclear, solar, gas, and other energy sources?",
    streaming_callback=tts_callback,
)

# Flush any remaining text in the buffer
tts_callback.flush()

print(out)

Examples

The examples/ directory contains comprehensive examples demonstrating all features of Voice-Agents, organized into logical categories. See the Examples README for detailed documentation.

Examples by Category

Text-to-Speech (`examples/text_to_speech/`)

Example File	Description
`example_stream_tts.py`	Unified TTS with OpenAI models, `list_models()`
`example_stream_tts_elevenlabs.py`	ElevenLabs TTS, unified and direct functions
`example_streaming_tts_callback.py`	StreamingTTSCallback for real-time TTS
`example_voice_selection.py`	Voice selection with `list_voices()`

Speech-to-Text (`examples/speech_to_text/`)

Example File	Description
`example_speech_to_text.py`	OpenAI Whisper transcription
`example_speech_to_text_elevenlabs_file.py`	ElevenLabs STT from audio file
`example_speech_to_text_elevenlabs_audio.py`	ElevenLabs STT from audio data

Utilities (`examples/utilities/`)

Example File	Description
`example_format_text_for_speech.py`	Text formatting for speech with abbreviation handling
`example_play_audio.py`	Audio playback and tone generation
`example_record_audio.py`	Microphone audio recording
`example_get_media_type.py`	Media type (MIME) utilities for FastAPI

Workflows (`examples/workflows/`)

Example File	Description
`example_complete_voice_agent.py`	Complete voice agent workflows

Running Examples

# Text-to-Speech examples
python examples/text_to_speech/example_stream_tts.py
python examples/text_to_speech/example_stream_tts_elevenlabs.py

# Speech-to-Text examples
python examples/speech_to_text/example_speech_to_text.py
python examples/speech_to_text/example_speech_to_text_elevenlabs_file.py

# Utility examples
python examples/utilities/example_format_text_for_speech.py
python examples/utilities/example_record_audio.py

# Workflow examples
python examples/workflows/example_complete_voice_agent.py

For more details, see the Examples README.

API Reference

Constants

SAMPLE_RATE: Default sample rate (24000 Hz)
VOICES: List of available OpenAI voices
ELEVENLABS_VOICES: Dictionary mapping friendly names to ElevenLabs voice IDs
ELEVENLABS_VOICE_NAMES: List of available ElevenLabs voice names
OPENAI_TTS_MODELS: List of available OpenAI TTS models
ELEVENLABS_TTS_MODELS: List of available ElevenLabs TTS models
VoiceType: Type alias for OpenAI voice options

Functions

`format_text_for_speech(text: str) -> List[str]`

Intelligently formats text into speech-friendly chunks by detecting sentence boundaries, handling abbreviations, and preserving natural pauses.
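
To make the idea of sentence-boundary chunking concrete, here is a minimal sketch of one way to do it. It is not the library's implementation: `split_into_sentences` and the abbreviation list are hypothetical, and the real function may handle many more cases.

```python
from typing import List

# A small, illustrative abbreviation list; a real formatter needs many more.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e.", "etc.", "vs."}


def split_into_sentences(text: str) -> List[str]:
    """Split text after tokens ending in ., ! or ?, skipping known abbreviations."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences
```

A known limitation of this approach is that an abbreviation at the very end of a sentence (for example "etc.") will not trigger a split.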

`stream_tts(text_chunks, model, voice, stream_mode, response_format, return_generator)`

Unified TTS function supporting both OpenAI and ElevenLabs. The model is given as "provider/model_name" (e.g., "openai/tts-1", "elevenlabs/eleven_multilingual_v2"). With return_generator=True it returns a generator of audio chunks for web streaming; otherwise it plays the audio directly.
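
The "provider/model_name" convention can be validated before any request is made. The following is a sketch only: `parse_model` and `ModelSpec` are hypothetical names, and the provider set is limited to the two providers named in the description above.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    provider: str
    model_name: str


# Providers named for stream_tts in this README; the real set may differ.
KNOWN_PROVIDERS = {"openai", "elevenlabs"}


def parse_model(model: str) -> ModelSpec:
    """Split 'provider/model_name' at the first slash and validate both parts."""
    provider, sep, model_name = model.partition("/")
    if not sep or not provider or not model_name:
        raise ValueError(f"Expected 'provider/model_name', got {model!r}")
    provider = provider.lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"Unknown provider {provider!r} in {model!r}")
    return ModelSpec(provider, model_name)
```

Splitting at the first slash only means a model name that itself contains a slash is preserved intact.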

`list_models() -> List[dict]`

List all available TTS models with their providers. Returns list of dictionaries with model, provider, and model_name keys.
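
Given the documented shape of each entry (model, provider, and model_name keys), the result is easy to group by provider. This sketch uses a hand-written sample list in place of a real `list_models()` call; `group_by_provider` is a hypothetical helper, and the two sample entries only reuse model names that appear elsewhere in this README.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_provider(models: List[dict]) -> Dict[str, List[str]]:
    """Map each provider to the model names it offers, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in models:
        grouped[entry["provider"]].append(entry["model_name"])
    return dict(grouped)


# Illustrative stand-in for the list returned by list_models().
sample_models = [
    {"model": "openai/tts-1", "provider": "openai", "model_name": "tts-1"},
    {
        "model": "elevenlabs/eleven_multilingual_v2",
        "provider": "elevenlabs",
        "model_name": "eleven_multilingual_v2",
    },
]

models_by_provider = group_by_provider(sample_models)
```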

`list_voices() -> List[dict]`

List all available voices from all providers. Returns list of dictionaries with voice, provider, voice_id, and description keys.

`stream_tts_openai(text_chunks, voice, model, stream_mode, response_format, return_generator)`

OpenAI TTS with streaming support. With return_generator=True it returns a generator of audio chunks for web streaming; otherwise it plays the audio directly.

`stream_tts_elevenlabs(text_chunks, voice_id, model_id, stability, similarity_boost, output_format, return_generator)`

ElevenLabs TTS with advanced voice control and multiple output formats.

`speech_to_text(audio_file_path, audio_data, sample_rate, model, language, prompt, response_format)`

OpenAI Whisper transcription with support for files or numpy arrays.

`speech_to_text_elevenlabs(audio_file_path, audio_data, sample_rate, realtime, model_id, ...)`

ElevenLabs Speech-to-Text with support for both real-time (WebSocket) and non-real-time (file upload) modes. Supports speaker diarization, timestamps, and language detection.

`record_audio(duration, sample_rate, channels) -> np.ndarray`

Record audio from default microphone. Returns numpy array.
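
Raw samples usually need a container such as WAV before they can be saved or uploaded to a transcription service. How the library handles this internally is not documented here, so the sketch below is only an illustration using the standard library; `floats_to_wav_bytes` is a hypothetical helper, and it assumes mono float samples in the range -1.0 to 1.0 (a NumPy array can be passed via `audio.flatten().tolist()`).

```python
import io
import struct
import wave
from typing import Sequence


def floats_to_wav_bytes(samples: Sequence[float], sample_rate: int = 16000) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as an in-memory 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        pcm += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()
```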

`play_audio(audio_data: np.ndarray)`

Play audio data using sounddevice.

`get_media_type_for_format(output_format: str) -> str`

Get MIME type for audio format (useful for FastAPI).
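
A lookup like this can be approximated by keying on the codec prefix of the format string. The sketch below is not the library's code: `media_type_for` is a hypothetical name, the first two table rows mirror the return values shown earlier in this README, and the remaining rows are common MIME types added as assumptions.

```python
# Codec prefix -> MIME type. "mp3" and "pcm" match the README's examples;
# the other rows are assumptions based on commonly used MIME types.
_MIME_BY_CODEC = {
    "mp3": "audio/mpeg",
    "pcm": "audio/pcm",
    "wav": "audio/wav",
    "flac": "audio/flac",
    "aac": "audio/aac",
}


def media_type_for(output_format: str, default: str = "application/octet-stream") -> str:
    """Return a MIME type for formats such as 'mp3_44100_128' or 'pcm_44100'."""
    codec = output_format.lower().split("_", 1)[0]  # text before the first underscore
    return _MIME_BY_CODEC.get(codec, default)
```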

Classes

`StreamingTTSCallback`

Real-time TTS callback for agent streaming outputs. Automatically detects complete sentences and converts them to speech.

Methods:

__call__(chunk: str): Process streaming text chunk
flush(): Speak any remaining buffered text
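
The buffering behaviour described above can be sketched without any audio dependencies. This is an illustration of the general technique, not the class's actual implementation: `SentenceBuffer` is a hypothetical name, `speak` stands in for the TTS call, and the boundary rule (`.`, `!` or `?` followed by whitespace) is an assumption.

```python
import re
from typing import Callable


class SentenceBuffer:
    """Collects streamed text and hands complete sentences to `speak`."""

    # A sentence ends at ., ! or ? followed by whitespace.
    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __init__(self, speak: Callable[[str], None], min_sentence_length: int = 10):
        self.speak = speak
        self.min_sentence_length = min_sentence_length
        self._buffer = ""

    def __call__(self, chunk: str) -> None:
        self._buffer += chunk
        # The last piece may still be growing, so it stays in the buffer.
        *complete, self._buffer = self._BOUNDARY.split(self._buffer)
        pending = ""
        for sentence in complete:
            pending = f"{pending} {sentence}".strip()
            if len(pending) >= self.min_sentence_length:
                self.speak(pending)
                pending = ""
        if pending:
            # Too short to speak alone: merge it back in front of the buffer.
            self._buffer = f"{pending} {self._buffer}"

    def flush(self) -> None:
        """Speak whatever is left, even without closing punctuation."""
        text = self._buffer.strip()
        self._buffer = ""
        if text:
            self.speak(text)
```

Passing `print` or a list's `append` method as `speak` makes it easy to check the chunking before wiring in real audio output.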

Use Cases

Conversational AI Assistants

Build voice-enabled chatbots and virtual assistants with natural, real-time speech synthesis.

Agent Narration

Provide audio feedback for long-running agent tasks, making agent behavior transparent and engaging.

Voice-Enabled Analytics

Create voice interfaces for data analysis, trading systems, and business intelligence tools.

Real-Time Transcription

Transcribe meetings, interviews, and conversations with high accuracy using Whisper.

Multi-Modal Applications

Combine voice input/output with visual interfaces for rich, interactive experiences.

Configuration

Environment Variables

# Required for OpenAI TTS and Whisper STT
OPENAI_API_KEY=your-key-here

# Optional: only needed when using ElevenLabs
ELEVENLABS_API_KEY=your-key-here

# Optional: only needed when using Groq
GROQ_API_KEY=your-key-here

Voice Selection

OpenAI Voices:

alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer

ElevenLabs Voices:

Professional: rachel, nicole, grace
Expressive: domi, elli, bella
Deep: antoni, josh, clyde
And 20+ more (see ELEVENLABS_VOICE_NAMES)
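
Validating a voice name up front gives a clearer error than a failed API call. The sketch below is not part of the library: `resolve_voice` is a hypothetical helper, and the list simply copies the OpenAI voices named above (in practice the `VOICES` constant could be used instead).

```python
import difflib
from typing import Sequence

# Copied from the OpenAI voice list above.
OPENAI_VOICES = [
    "alloy", "ash", "ballad", "coral", "echo",
    "fable", "nova", "onyx", "sage", "shimmer",
]


def resolve_voice(name: str, available: Sequence[str] = OPENAI_VOICES) -> str:
    """Return the normalized voice name, or raise with close-match suggestions."""
    candidate = name.strip().lower()
    if candidate in available:
        return candidate
    suggestions = difflib.get_close_matches(candidate, available, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown voice {name!r}.{hint}")
```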

Contributing

Voice-Agents is built by the community, for the community. We welcome contributions!

How to Contribute

1. Fork the repository
2. Create a feature branch: git checkout -b feature/amazing-feature
3. Make your changes and add tests
4. Commit your changes: git commit -m 'Add amazing feature'
5. Push to the branch: git push origin feature/amazing-feature
6. Open a Pull Request

Development Setup

git clone https://github.com/The-Swarm-Corporation/Voice-Agents.git
cd Voice-Agents
pip install -e ".[dev]"
pre-commit install

Code Standards

- Follow PEP 8 style guidelines
- Add type hints to all functions
- Include docstrings for all public APIs
- Write tests for new features

License

This project is licensed under the Apache License - see the LICENSE file for details.

Acknowledgments

Built by Swarms Corporation
Part of the Swarms ecosystem
Powered by OpenAI, ElevenLabs, and Groq APIs

Support & Community

Documentation: GitHub Repository
Swarms Documentation: docs.swarms.world
Swarms Community: Discord
Issues: GitHub Issues

Made by Swarms Corporation

Website • Documentation • GitHub

Release history

Version	Released
0.2.0 (this version)	Dec 29, 2025
0.1.9	Dec 29, 2025
0.1.8	Dec 29, 2025
0.1.7	Dec 29, 2025
0.1.6	Dec 27, 2025
0.1.4	Dec 27, 2025
0.1.3	Dec 27, 2025
0.1.0	Dec 27, 2025

Download files

File	Type	Size	Uploaded
voice_agents-0.2.0.tar.gz	Source distribution	40.5 kB	Dec 29, 2025
voice_agents-0.2.0-py3-none-any.whl	Built distribution (Python 3)	39.0 kB	Dec 29, 2025

Both files were uploaded via poetry/2.1.3 CPython/3.12.3 Darwin/24.5.0, without Trusted Publishing.

Hashes for voice_agents-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d09b44f7f7f1576934e7178c097d9962f036f639012511ad9cf68b333e07f67d`
MD5	`86c2df068e2bf5e83c9c967fffb7f05f`
BLAKE2b-256	`f705a10e57ef55aab47c18c213e9044ceb14cf752221fd94926b8704e7bd0c32`

Hashes for voice_agents-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9258cb22e12ad60141a9b773b76c67ca154c1d43666271b976b4cfceb4009bba`
MD5	`800f70448c01137b2c061610f071c1ba`
BLAKE2b-256	`f229ee42f9d0e61e2b6250d0c4aa6a0893e454450a31ddf580fbd809dce8f34b`
