Azure AI VoiceLive Client Library for Python (Microsoft Corporation)
Project description
Azure AI VoiceLive client library for Python
This package provides a real-time, speech-to-speech client for Azure AI VoiceLive. It opens a WebSocket session to stream microphone audio to the service and receive typed server events (including audio) for responsive, interruptible conversations.
Status: General Availability (GA). This is a stable release suitable for production use.
Important: As of version 1.0.0, this SDK is async-only. The synchronous API has been removed to focus exclusively on async patterns. All examples and samples use async/await syntax.
Getting started
Prerequisites
- Python 3.9+
- An Azure subscription
- A VoiceLive resource and endpoint
- A working microphone and speakers/headphones if you run the voice samples
Install
Install the stable GA version:
# Base install (core client only)
python -m pip install azure-ai-voicelive
# For asynchronous streaming (uses aiohttp)
python -m pip install "azure-ai-voicelive[aiohttp]"
# For voice samples (includes audio processing)
python -m pip install "azure-ai-voicelive[aiohttp]" pyaudio python-dotenv
The SDK provides async-only WebSocket connections using aiohttp for optimal performance and reliability.
Authenticate
You can authenticate with an API key or an Azure Active Directory (AAD) token.
API Key Authentication (Quick Start)
Set environment variables in a .env file or directly in your environment:
# In your .env file or environment variables
AZURE_VOICELIVE_API_KEY="your-api-key"
AZURE_VOICELIVE_ENDPOINT="your-endpoint"
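A small helper can read and validate these variables before you open a connection, so misconfiguration fails fast with a clear message. The helper name and return shape below are illustrative, not part of the SDK:

```python
import os

def load_voicelive_config() -> dict:
    """Read VoiceLive settings from the environment; fail fast if any are missing.

    Illustrative helper, not part of the azure-ai-voicelive package.
    """
    config = {
        "api_key": os.environ.get("AZURE_VOICELIVE_API_KEY"),
        "endpoint": os.environ.get("AZURE_VOICELIVE_ENDPOINT"),
    }
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing VoiceLive settings: {', '.join(missing)}")
    return config
```

If you keep the settings in a .env file, call `dotenv.load_dotenv()` (from python-dotenv) before this helper so the variables are present in `os.environ`.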
Then, use the key in your code:
import asyncio
from azure.core.credentials import AzureKeyCredential
from azure.ai.voicelive.aio import connect

async def main():
    async with connect(
        endpoint="your-endpoint",
        credential=AzureKeyCredential("your-api-key"),
        model="gpt-4o-realtime-preview",
    ) as connection:
        # Your async code here
        pass

asyncio.run(main())
AAD Token Authentication
For production applications, AAD authentication is recommended:
import asyncio
from azure.identity.aio import DefaultAzureCredential
from azure.ai.voicelive.aio import connect

async def main():
    credential = DefaultAzureCredential()
    async with connect(
        endpoint="your-endpoint",
        credential=credential,
        model="gpt-4o-realtime-preview",
    ) as connection:
        # Your async code here
        pass

asyncio.run(main())
Key concepts
- VoiceLiveConnection – Manages an active async WebSocket connection to the service
- Session Management – Configure conversation parameters:
- SessionResource – Update session parameters (voice, formats, VAD) with async methods
- RequestSession – Strongly-typed session configuration
- ServerVad – Configure voice activity detection
- AzureStandardVoice – Configure voice settings
- Audio Handling:
- InputAudioBufferResource – Manage audio input to the service with async methods
- OutputAudioBufferResource – Control audio output from the service with async methods
- Conversation Management:
- ResponseResource – Create or cancel model responses with async methods
- ConversationResource – Manage conversation items with async methods
- Error Handling:
- ConnectionError – Base exception for WebSocket connection errors
- ConnectionClosed – Raised when WebSocket connection is closed
- Strongly-Typed Events – Process service events with type safety:
SESSION_UPDATED, RESPONSE_AUDIO_DELTA, RESPONSE_DONE, INPUT_AUDIO_BUFFER_SPEECH_STARTED, INPUT_AUDIO_BUFFER_SPEECH_STOPPED, ERROR, and more
Examples
Basic Voice Assistant (Featured Sample)
The Basic Voice Assistant sample demonstrates full-featured voice interaction with:
- Real-time speech streaming
- Server-side voice activity detection
- Interruption handling
- High-quality audio processing
# Run the basic voice assistant sample
# Requires [aiohttp] for async
python samples/basic_voice_assistant_async.py
# With custom parameters
python samples/basic_voice_assistant_async.py --model gpt-4o-realtime-preview --voice alloy --instructions "You're a helpful assistant"
Minimal example
import asyncio
from azure.core.credentials import AzureKeyCredential
from azure.ai.voicelive.aio import connect
from azure.ai.voicelive.models import (
    RequestSession, Modality, InputAudioFormat, OutputAudioFormat, ServerVad, ServerEventType
)

API_KEY = "your-api-key"
ENDPOINT = "wss://your-endpoint.com/openai/realtime"
MODEL = "gpt-4o-realtime-preview"

async def main():
    async with connect(
        endpoint=ENDPOINT,
        credential=AzureKeyCredential(API_KEY),
        model=MODEL,
    ) as conn:
        session = RequestSession(
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions="You are a helpful assistant.",
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=300,
                silence_duration_ms=500,
            ),
        )
        await conn.session.update(session=session)

        # Process events
        async for evt in conn:
            print(f"Event: {evt.type}")
            if evt.type == ServerEventType.RESPONSE_DONE:
                break

asyncio.run(main())
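The minimal example only consumes events; to hold a conversation you also stream captured PCM16 audio to the service's input audio buffer (see InputAudioBufferResource above). A pure-Python helper for slicing raw microphone audio into fixed-duration frames before sending is sketched below; the 24 kHz mono default is an assumption, so match whatever sample rate your session actually uses:

```python
def pcm16_frames(raw: bytes, sample_rate: int = 24000, frame_ms: int = 20) -> list:
    """Slice raw mono PCM16 audio into fixed-duration frames for streaming.

    Frame size = samples per frame * 2 bytes per PCM16 sample.
    The 24 kHz default is an assumption, not an SDK constant.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2
    return [raw[i:i + frame_bytes] for i in range(0, len(raw), frame_bytes)]
```

Each resulting chunk can then be appended to the input audio buffer inside your async loop; the last frame may be shorter than the rest.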
Available Voice Options
Azure Neural Voices
# Use Azure Neural voices
voice_config = AzureStandardVoice(
    name="en-US-AvaNeural",  # Or another voice name
    type="azure-standard"
)
Popular voices include:
- en-US-AvaNeural – Female, natural and professional
- en-US-JennyNeural – Female, conversational
- en-US-GuyNeural – Male, professional
OpenAI Voices
# Use OpenAI voices (as string)
voice_config = "alloy" # Or another OpenAI voice
Available OpenAI voices:
- alloy – Versatile, neutral
- echo – Precise, clear
- fable – Animated, expressive
- onyx – Deep, authoritative
- nova – Warm, conversational
- shimmer – Optimistic, friendly
Handling Events
async for event in connection:
    if event.type == ServerEventType.SESSION_UPDATED:
        print(f"Session ready: {event.session.id}")
        # Start audio capture
    elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
        print("User started speaking")
        # Stop playback and cancel any current response
    elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
        # Play the audio chunk
        audio_bytes = event.delta
    elif event.type == ServerEventType.ERROR:
        print(f"Error: {event.error.message}")
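As the if/elif chain grows, a dispatch table keeps handlers tidy. The sketch below uses plain string event types and a stand-in Event class so it runs without the SDK; in real code you would key the table on ServerEventType values and pass the SDK's typed event objects:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Event:
    """Stand-in for a service event; the real SDK yields typed event objects."""
    type: str
    payload: dict = field(default_factory=dict)

def make_dispatcher(handlers: Dict[str, Callable[[Event], None]]) -> Callable[[Event], None]:
    """Build a dispatch function that routes each event to its registered handler.

    Events with no registered handler are silently ignored.
    """
    def dispatch(event: Event) -> None:
        handler = handlers.get(event.type)
        if handler is not None:
            handler(event)
    return dispatch
```

Registering one handler per event type also makes each handler independently testable, which an inline if/elif chain does not.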
Troubleshooting
Connection Issues
- WebSocket connection errors (1006/timeout): Verify AZURE_VOICELIVE_ENDPOINT, network rules, and that your credential has access.
- Missing WebSocket dependencies: If you see import errors, make sure you have installed the package: pip install "azure-ai-voicelive[aiohttp]"
- Auth failures: For API key, double-check AZURE_VOICELIVE_API_KEY. For AAD, ensure the identity is authorized.
Audio Device Issues
- No microphone/speaker detected: Check device connections and permissions. On headless CI environments, audio samples can't run.
- Audio library installation problems: On Linux/macOS you may need PortAudio:

# Debian/Ubuntu
sudo apt-get install -y portaudio19-dev libasound2-dev

# macOS (Homebrew)
brew install portaudio
Enable Verbose Logging
import logging
logging.basicConfig(level=logging.DEBUG)
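If full DEBUG output from every library is too noisy, you can scope verbose logging to the Azure SDK only. The "azure" logger name is the conventional root logger used across Azure SDK for Python packages:

```python
import logging

# Root logger: keep third-party noise at warnings and above.
logging.basicConfig(level=logging.WARNING)

# Azure SDK loggers: capture full detail for troubleshooting.
logging.getLogger("azure").setLevel(logging.DEBUG)
```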
Next steps
- Run the featured sample:
  - Try samples/basic_voice_assistant_async.py for a complete voice assistant implementation
- Customize your implementation:
  - Experiment with different voices and parameters
  - Add custom instructions for specialized assistants
  - Integrate with your own audio capture/playback systems
- Advanced scenarios:
  - Add function calling support
  - Implement tool usage
  - Create multi-turn conversations with history
- Explore other samples:
  - Check the samples/ directory for specialized examples
  - See samples/README.md for a full list of samples
Contributing
This project follows the Azure SDK guidelines. If you'd like to contribute:
- Fork the repo and create a feature branch
- Run linters and tests locally
- Submit a pull request with a clear description of the change
Release notes
Changelogs are available in the package directory.
License
This project is released under the MIT License.
Release History
1.1.0 (2025-11-03)
Features Added
- Added support for Agent configuration through the new AgentConfig model
- Added agent field to ResponseSession model to support agent-based conversations
- The AgentConfig model includes properties for agent type, name, description, agent_id, and thread_id
1.1.0b1 (2025-10-06)
Features Added
- AgentConfig Support: Re-introduced AgentConfig functionality with enhanced capabilities:
  - AgentConfig model added back to public API with full import and export support
  - agent field re-added to ResponseSession model for session-level agent configuration
  - Updated cross-language package mappings to include AgentConfig support
  - Provides foundation for advanced agent configuration scenarios
1.0.0 (2025-10-01)
Features Added
- Enhanced WebSocket Connection Options: Significantly improved WebSocket connection configuration with transport-agnostic design:
  - Added new timeout configuration options: receive_timeout, close_timeout, and handshake_timeout for fine-grained control
  - Enhanced compression parameter to support both boolean and integer types for advanced zlib window configuration
  - Added vendor_options parameter for implementation-specific options passthrough (escape hatch for advanced users)
  - Improved documentation with clearer descriptions for all connection parameters
  - Better support for common aliases from other WebSocket ecosystems (max_size, ping_interval, etc.)
  - More robust option mapping with proper type conversion and safety checks
- Enhanced Type Safety: Improved type safety for content parts with proper enum usage:
  - InputAudioContentPart, InputTextContentPart, and OutputTextContentPart now use ContentPartType enum values instead of string literals
  - Better IntelliSense support and compile-time type checking for content part discriminators
Breaking Changes
- Improved Naming Conventions: Updated model and enum names for better clarity and consistency:
  - OAIVoice enum renamed to OpenAIVoiceName for more descriptive naming
  - ToolChoiceObject model renamed to ToolChoiceSelection for better semantic meaning
  - ToolChoiceFunctionObject model renamed to ToolChoiceFunctionSelection for consistency
  - Updated type unions and imports to reflect the new naming conventions
  - Cross-language package mappings updated to maintain compatibility across SDKs
- Session Model Architecture: Separated ResponseSession and RequestSession models for better design clarity:
  - ResponseSession no longer inherits from RequestSession and now inherits directly from _Model
  - All session configuration fields are now explicitly defined in ResponseSession instead of being inherited
  - This provides clearer separation of concerns between request and response session configurations
  - May affect type checking and code that relied on the previous inheritance relationship
- Model Cleanup: Removed unused AgentConfig model and related fields from the public API:
  - AgentConfig class has been completely removed from imports and exports
  - agent field removed from ResponseSession model (including constructor parameter)
  - Updated cross-language package mappings to reflect the removal
- Model Naming Convention Update: Renamed EOUDetection to EouDetection for better naming consistency:
  - Class name changed from EOUDetection to EouDetection
  - All inheritance relationships updated: AzureSemanticDetection, AzureSemanticDetectionEn, and AzureSemanticDetectionMultilingual now inherit from EouDetection
  - Type annotations updated in AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, and ServerVad classes
  - Import statements and exports updated to reflect the new naming
- Enhanced Content Part Type Safety: Content part discriminators now use enum values instead of string literals:
  - InputAudioContentPart.type now uses ContentPartType.INPUT_AUDIO instead of "input_audio"
  - InputTextContentPart.type now uses ContentPartType.INPUT_TEXT instead of "input_text"
  - OutputTextContentPart.type now uses ContentPartType.TEXT instead of "text"
Other Changes
- Initial GA release
1.0.0b5 (2025-09-26)
Features Added
- Enhanced Semantic Detection Type Safety: Added new EouThresholdLevel enum for better type safety in end-of-utterance detection:
  - LOW for low sensitivity threshold level
  - MEDIUM for medium sensitivity threshold level
  - HIGH for high sensitivity threshold level
  - DEFAULT for default sensitivity threshold level
- Improved Semantic Detection Configuration: Enhanced semantic detection classes with better type annotations:
  - threshold_level parameter now supports both string values and the EouThresholdLevel enum
  - Cleaner type definitions for AzureSemanticDetection, AzureSemanticDetectionEn, and AzureSemanticDetectionMultilingual
  - Improved documentation for threshold level parameters
- Comprehensive Unit Test Suite: Added extensive unit test coverage with 200+ test cases covering:
  - All enum types and their functionality
  - Model creation, validation, and serialization
  - Async connection functionality with proper mocking
  - Client event handling and workflows
  - Voice configuration across all supported types
  - Message handling with content part hierarchy
  - Integration scenarios and real-world usage patterns
  - Recent changes validation and backwards compatibility
- API Version Update: Updated to API version 2025-10-01 (from 2025-05-01-preview)
- Enhanced Type Safety: Added new AzureVoiceType enum with values for better Azure voice type categorization:
  - AZURE_CUSTOM for custom voice configurations
  - AZURE_STANDARD for standard voice configurations
  - AZURE_PERSONAL for personal voice configurations
- Improved Message Handling: Added MessageRole enum for better role type safety in message items
- Enhanced Model Documentation: Comprehensive documentation improvements across all models:
  - Added detailed docstrings for model classes and their parameters
  - Enhanced enum value documentation with descriptions
  - Improved type annotations and parameter descriptions
- Enhanced Semantic Detection: Added improved configuration options for all semantic detection classes:
  - Added threshold_level parameter with options: "low", "medium", "high", "default" (recommended over deprecated threshold)
  - Added timeout_ms parameter for timeout configuration in milliseconds (recommended over deprecated timeout)
- Video Background Support: Added new Background model for video background customization:
  - Support for solid color backgrounds in hex format (e.g., #00FF00FF)
  - Support for image URL backgrounds
  - Mutually exclusive color and image URL options
- Enhanced Video Parameters: Extended VideoParams model with:
  - background parameter for configuring video backgrounds using the new Background model
  - gop_size parameter for Group of Pictures (GOP) size control, affecting compression efficiency and seeking performance
- Improved Type Safety: Added TurnDetectionType enum for better type safety and IntelliSense support
- Package Structure Modernization: Simplified package initialization with namespace package support
- Enhanced Error Handling: Added ConnectionError and ConnectionClosed exception classes to the async API for better WebSocket error management
Breaking Changes
- Cross-Language Package Identity Update: Updated package ID from VoiceLive to VoiceLive.WebSocket for better cross-language consistency
- Model Refactoring:
  - Renamed UserContentPart to MessageContentPart for clearer content part hierarchy
  - All message items now require a content field with a list of MessageContentPart objects
  - OutputTextContentPart now inherits from MessageContentPart instead of being standalone
- Enhanced Type Safety:
  - Azure voice classes now use AzureVoiceType enum discriminators instead of string literals
  - Message role discriminators now use MessageRole enum values for better type safety
- Removed Deprecated Parameters: Completely removed deprecated parameters from semantic detection classes:
  - Removed threshold parameter from all semantic detection classes (AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual)
  - Removed timeout parameter from all semantic detection classes
  - Users must now use the threshold_level and timeout_ms parameters respectively
- Removed Synchronous API: Completely removed synchronous WebSocket operations to focus exclusively on async patterns:
  - Removed the sync connect() function and sync VoiceLiveConnection class from the main patch implementation
  - Removed the sync basic_voice_assistant.py sample (only the async version remains)
  - Simplified the sync patch to a minimal structure with empty exports
  - All functionality is now available only through async patterns
- Updated Dependencies: Modified package dependencies to reflect the async-only architecture:
  - Moved aiohttp>=3.9.0,<4.0.0 from optional to required dependency
  - Removed the websockets optional dependency as the sync API no longer exists
  - Removed optional dependency groups websockets, aiohttp, and all-websockets
- Model Rename:
  - Renamed AudioInputTranscriptionSettings to AudioInputTranscriptionOptions for consistency with naming conventions
  - Renamed AzureMultilingualSemanticVad to AzureSemanticVadMultilingual for naming consistency with other multilingual variants
- Enhanced Type Safety: Turn detection discriminator types now use enum values instead of string literals for better type safety
Bug Fixes
- Serialization Improvements: Fixed type casting issue in serialization utilities for better enum handling and type safety
Other Changes
- Testing Infrastructure: Added comprehensive unit test suite with extensive coverage:
- 8 main test files with 200+ individual test methods
- Tests for all enums, models, async operations, client events, voice configurations, and message handling
- Integration tests covering real-world scenarios and recent changes
- Proper mocking for async WebSocket connections
- Backwards compatibility validation
- Test coverage for all recent changes and enhancements
- API Documentation: Updated API view properties to reflect model structure changes, new enums, and cross-language package identity
- Documentation Updates: Comprehensive updates to all markdown documentation:
- Updated README.md to reflect async-only nature with updated examples and installation instructions
- Updated samples README.md to remove sync sample references
- Enhanced BASIC_VOICE_ASSISTANT.md with comprehensive async implementation guide
- Added MIGRATION_GUIDE.md for users upgrading from previous versions
1.0.0b4 (2025-09-19)
Features Added
- Personal Voice Models: Added PersonalVoiceModels enum with support for DragonLatestNeural, PhoenixLatestNeural, and PhoenixV2Neural models
- Enhanced Animation Support: Added comprehensive server event classes for animation blendshapes and viseme handling:
  - ServerEventResponseAnimationBlendshapeDelta and ServerEventResponseAnimationBlendshapeDone
  - ServerEventResponseAnimationVisemeDelta and ServerEventResponseAnimationVisemeDone
- Audio Timestamp Events: Added ServerEventResponseAudioTimestampDelta and ServerEventResponseAudioTimestampDone for better audio timing control
- Improved Error Handling: Added ErrorResponse class for better error management
- Enhanced Base Classes: Added ConversationItemBase and SessionBase for better code organization and inheritance
- Token Usage Improvements: Renamed Usage to TokenUsage for better clarity
- Audio Format Improvements: Reorganized audio format enums with separate InputAudioFormat and OutputAudioFormat enums for better clarity
- Enhanced Output Audio Format Support: Added more granular output audio format options including specific sampling rates (8kHz, 16kHz) for PCM16
Breaking Changes
- Model Cleanup: Removed experimental classes AzurePlatformVoice, LLMVoice, AzureSemanticVadServer, InputAudio, NoTurnDetection, and ToolChoiceFunctionObjectFunction
- Class Rename: Renamed the Usage class to TokenUsage for better clarity
- Enum Reorganization:
  - Replaced the AudioFormat enum with separate InputAudioFormat and OutputAudioFormat enums
  - Removed the Phi4mmVoice enum
  - Removed the EMOTION value from the AnimationOutputType enum
  - Removed the IN_PROGRESS value from the ItemParamStatus enum
- Server Events: Removed RESPONSE_EMOTION_HYPOTHESIS from the ServerEventType enum
Other Changes
- Package Structure: Simplified package initialization with namespace package support
- Sample Updates: Improved basic voice assistant samples
- Code Optimization: Streamlined model definitions with significant code reduction
- API Configuration: Updated API view properties for better tooling support
1.0.0b3 (2025-09-17)
Features Added
- Transcription Improvement: Added phrase list support
- New Voice Types: Added AzurePlatformVoice and LLMVoice classes
- Enhanced Speech Detection: Added AzureSemanticVadServer class
- Improved Function Calling: Enhanced the async function calling sample with better error handling
- English-Specific Detection: Added AzureSemanticDetectionEn class for optimized English-only semantic end-of-utterance detection
- English-Specific Voice Activity Detection: Added AzureSemanticVadEn class for enhanced English-only voice activity detection
Breaking Changes
- Transcription: Removed custom_model and enabled from AudioInputTranscriptionSettings
- Async Authentication: Fixed credential handling for async scenarios
- Model Serialization: Improved error handling and deserialization
Other Changes
- Code Modernization: Updated type annotations throughout
1.0.0b2 (2025-09-10)
Features Added
- Added async function calling support
Bugs Fixed
- Fixed function calling: ensure FunctionCallOutputItem.output is properly serialized as a JSON string before sending to the service
1.0.0b1 (2025-08-28)
Features Added
- Added WebSocket connection support through connect().
- Added VoiceLiveConnection for managing WebSocket connections.
- Added models of the Voice Live preview.
- Added WebSocket-based examples in the samples directory.
Other Changes
- Initial preview release.
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file azure_ai_voicelive-1.1.0.tar.gz.
File metadata
- Download URL: azure_ai_voicelive-1.1.0.tar.gz
- Upload date:
- Size: 127.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9398d0a3ad8a3c43844e89dfb8c61a39422f294770820f405b114c4d752d3f43 |
| MD5 | 0cac5648939b0f303f99ef7876595931 |
| BLAKE2b-256 | 4146e304076e2bdca64a3b77bf9b6c79b8b4f29b994cc22196ff8c72b93faf09 |
File details
Details for the file azure_ai_voicelive-1.1.0-py3-none-any.whl.
File metadata
- Download URL: azure_ai_voicelive-1.1.0-py3-none-any.whl
- Upload date:
- Size: 83.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: RestSharp/106.13.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29f2ab8bef67dd41cddafb0239f059351ddb44a7d95bcc4e706516e0c2bfdcc4 |
| MD5 | 708260e78ea3c0152adb67918250c62e |
| BLAKE2b-256 | a44b48a81dae63b3fa1208603a71bffda884bc9cec46f04d223e867efd132357 |