
Microsoft Azure AI VoiceLive Client Library for Python

Project description

Azure AI VoiceLive client library for Python

This package provides a real-time, speech-to-speech client for Azure AI VoiceLive. It opens a WebSocket session to stream microphone audio to the service and receive typed server events (including audio) for responsive, interruptible conversations.

Status: General Availability (GA). This is a stable release suitable for production use.

Important: As of version 1.0.0, this SDK is async-only. The synchronous API has been removed to focus exclusively on async patterns. All examples and samples use async/await syntax.


Getting started

Prerequisites

  • Python 3.9+
  • An Azure subscription
  • A VoiceLive resource and endpoint
  • A working microphone and speakers/headphones if you run the voice samples

Install

Install the stable GA version:

# Base install (core client only)
python -m pip install azure-ai-voicelive

# For asynchronous streaming (uses aiohttp)
python -m pip install "azure-ai-voicelive[aiohttp]"

# For voice samples (includes audio processing)
python -m pip install "azure-ai-voicelive[aiohttp]" pyaudio python-dotenv

The SDK provides async-only WebSocket connections using aiohttp for optimal performance and reliability.

Authenticate

You can authenticate with an API key or an Azure Active Directory (AAD) token.

API Key Authentication (Quick Start)

Set environment variables in a .env file or directly in your environment:

# In your .env file or environment variables
AZURE_VOICELIVE_API_KEY="your-api-key"
AZURE_VOICELIVE_ENDPOINT="your-endpoint"

Then, use the key in your code:

import asyncio
from azure.core.credentials import AzureKeyCredential
from azure.ai.voicelive.aio import connect

async def main():
    async with connect(
        endpoint="your-endpoint",
        credential=AzureKeyCredential("your-api-key"),
        model="gpt-4o-realtime-preview"
    ) as connection:
        # Your async code here
        pass

asyncio.run(main())

AAD Token Authentication

For production applications, AAD authentication is recommended:

import asyncio
from azure.identity.aio import DefaultAzureCredential
from azure.ai.voicelive.aio import connect

async def main():
    credential = DefaultAzureCredential()
    
    async with connect(
        endpoint="your-endpoint",
        credential=credential,
        model="gpt-4o-realtime-preview"
    ) as connection:
        # Your async code here
        pass

asyncio.run(main())

Key concepts

  • VoiceLiveConnection – Manages an active async WebSocket connection to the service
  • Session Management – Configure conversation parameters:
    • SessionResource – Update session parameters (voice, formats, VAD) with async methods
    • RequestSession – Strongly-typed session configuration
    • ServerVad – Configure voice activity detection
    • AzureStandardVoice – Configure voice settings
  • Audio Handling:
    • InputAudioBufferResource – Manage audio input to the service with async methods
    • OutputAudioBufferResource – Control audio output from the service with async methods
  • Conversation Management:
    • ResponseResource – Create or cancel model responses with async methods
    • ConversationResource – Manage conversation items with async methods
  • Error Handling:
    • ConnectionError – Base exception for WebSocket connection errors
    • ConnectionClosed – Raised when WebSocket connection is closed
  • Strongly-Typed Events – Process service events with type safety:
    • SESSION_UPDATED, RESPONSE_AUDIO_DELTA, RESPONSE_DONE
    • INPUT_AUDIO_BUFFER_SPEECH_STARTED, INPUT_AUDIO_BUFFER_SPEECH_STOPPED
    • ERROR, and more

Examples

Basic Voice Assistant (Featured Sample)

The Basic Voice Assistant sample demonstrates full-featured voice interaction with:

  • Real-time speech streaming
  • Server-side voice activity detection
  • Interruption handling
  • High-quality audio processing
# Run the basic voice assistant sample
# Requires [aiohttp] for async
python samples/basic_voice_assistant_async.py

# With custom parameters
python samples/basic_voice_assistant_async.py --model gpt-4o-realtime-preview --voice alloy --instructions "You're a helpful assistant"
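The interruption handling the sample implements can be sketched as a small state machine: when the service reports that the user started speaking while assistant audio is playing, playback stops and the in-flight response is cancelled. The sketch below is illustrative only — the `InterruptionHandler` class is a hypothetical stand-in for your audio layer, not the SDK's API, and the event-type strings mirror the `ServerEventType` values described later in this document.

```python
class InterruptionHandler:
    """Tracks whether a response is playing and reacts to user barge-in."""

    def __init__(self):
        self.playing = False
        self.cancelled = []  # ids of responses cancelled by barge-in (for illustration)

    def on_event(self, event_type, response_id=None):
        if event_type == "response.audio.delta":
            # Assistant audio is streaming; mark playback active.
            self.playing = True
        elif event_type == "input_audio_buffer.speech_started" and self.playing:
            # User barged in: stop playback and record the cancelled response.
            self.playing = False
            self.cancelled.append(response_id)
        elif event_type == "response.done":
            self.playing = False

handler = InterruptionHandler()
handler.on_event("response.audio.delta", response_id="resp_1")
handler.on_event("input_audio_buffer.speech_started", response_id="resp_1")
print(handler.playing)    # False
print(handler.cancelled)  # ['resp_1']
```

In a real assistant, the barge-in branch would also flush the local playback buffer and send a cancel request for the active response over the connection.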

Minimal example

import asyncio
from azure.core.credentials import AzureKeyCredential
from azure.ai.voicelive.aio import connect
from azure.ai.voicelive.models import (
    RequestSession, Modality, InputAudioFormat, OutputAudioFormat, ServerVad, ServerEventType
)

API_KEY = "your-api-key"
ENDPOINT = "wss://your-endpoint.com/openai/realtime"
MODEL = "gpt-4o-realtime-preview"

async def main():
    async with connect(
        endpoint=ENDPOINT,
        credential=AzureKeyCredential(API_KEY),
        model=MODEL,
    ) as conn:
        session = RequestSession(
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions="You are a helpful assistant.",
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            turn_detection=ServerVad(
                threshold=0.5, 
                prefix_padding_ms=300, 
                silence_duration_ms=500
            ),
        )
        await conn.session.update(session=session)

        # Process events
        async for evt in conn:
            print(f"Event: {evt.type}")
            if evt.type == ServerEventType.RESPONSE_DONE:
                break

asyncio.run(main())

Available Voice Options

Azure Neural Voices

# Use Azure Neural voices
voice_config = AzureStandardVoice(
    name="en-US-AvaNeural",  # Or another voice name
    type="azure-standard"
)

Popular voices include:

  • en-US-AvaNeural - Female, natural and professional
  • en-US-JennyNeural - Female, conversational
  • en-US-GuyNeural - Male, professional

OpenAI Voices

# Use OpenAI voices (as string)
voice_config = "alloy"  # Or another OpenAI voice

Available OpenAI voices:

  • alloy - Versatile, neutral
  • echo - Precise, clear
  • fable - Animated, expressive
  • onyx - Deep, authoritative
  • nova - Warm, conversational
  • shimmer - Optimistic, friendly

Handling Events

async for event in connection:
    if event.type == ServerEventType.SESSION_UPDATED:
        print(f"Session ready: {event.session.id}")
        # Start audio capture
        
    elif event.type == ServerEventType.INPUT_AUDIO_BUFFER_SPEECH_STARTED:
        print("User started speaking")
        # Stop playback and cancel any current response
        
    elif event.type == ServerEventType.RESPONSE_AUDIO_DELTA:
        # Play the audio chunk
        audio_bytes = event.delta
        
    elif event.type == ServerEventType.ERROR:
        print(f"Error: {event.error.message}")
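The if/elif chain above reads well for a handful of events; as handlers multiply, a dispatch table keeps the loop body flat. This sketch uses plain dicts and hypothetical handler bodies rather than the SDK's typed events, so it only illustrates the pattern:

```python
# Dispatch-table variant of the if/elif chain (illustrative only).
# In real code the keys would be ServerEventType enum values and the
# values would call into your session/audio layers.
handlers = {
    "session.updated": lambda e: f"Session ready: {e['session_id']}",
    "response.audio.delta": lambda e: f"audio chunk: {len(e['delta'])} bytes",
    "error": lambda e: f"Error: {e['message']}",
}

def dispatch(event):
    """Look up a handler for the event type; ignore unknown events."""
    handler = handlers.get(event["type"])
    return handler(event) if handler else None

print(dispatch({"type": "error", "message": "boom"}))  # Error: boom
```

Unknown event types fall through silently here, which matches the usual advice for evolving event streams: new server event types should not crash existing clients.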

Troubleshooting

Connection Issues

  • WebSocket connection errors (1006/timeout):
    Verify AZURE_VOICELIVE_ENDPOINT, network rules, and that your credential has access.

  • Missing WebSocket dependencies:
    If you see import errors, make sure you have installed the package with its async extra: pip install "azure-ai-voicelive[aiohttp]"

  • Auth failures:
    For API key, double-check AZURE_VOICELIVE_API_KEY. For AAD, ensure the identity is authorized.

Audio Device Issues

  • No microphone/speaker detected:
    Check device connections and permissions. On headless CI environments, audio samples can't run.

  • Audio library installation problems:
    On Linux/macOS you may need PortAudio:

    # Debian/Ubuntu
    sudo apt-get install -y portaudio19-dev libasound2-dev
    # macOS (Homebrew)
    brew install portaudio
    

Enable Verbose Logging

import logging
logging.basicConfig(level=logging.DEBUG)
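`basicConfig(level=logging.DEBUG)` turns on debug output for every logger in the process. To scope verbosity to the Azure SDK alone, you can raise the level on the `azure` logger namespace instead (the `azure` logger name is the Azure SDK for Python convention):

```python
import logging

# Scope debug output to Azure SDK loggers only; the root logger keeps
# its default level, so third-party libraries stay quiet.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("azure").setLevel(logging.DEBUG)
```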

Next steps

  1. Run the featured sample:

    • Try samples/basic_voice_assistant_async.py for a complete voice assistant implementation
  2. Customize your implementation:

    • Experiment with different voices and parameters
    • Add custom instructions for specialized assistants
    • Integrate with your own audio capture/playback systems
  3. Advanced scenarios:

    • Add function calling support
    • Implement tool usage
    • Create multi-turn conversations with history
  4. Explore other samples:

    • Check the samples/ directory for specialized examples
    • See samples/README.md for a full list of samples

Contributing

This project follows the Azure SDK guidelines. If you'd like to contribute:

  1. Fork the repo and create a feature branch
  2. Run linters and tests locally
  3. Submit a pull request with a clear description of the change

Release notes

Changelogs are available in the package directory.


License

This project is released under the MIT License.

Release History

1.1.0 (2025-11-03)

Features Added

  • Added support for Agent configuration through the new AgentConfig model
  • Added agent field to ResponseSession model to support agent-based conversations
  • The AgentConfig model includes properties for agent type, name, description, agent_id, and thread_id

1.1.0b1 (2025-10-06)

Features Added

  • AgentConfig Support: Re-introduced AgentConfig functionality with enhanced capabilities:
    • AgentConfig model added back to public API with full import and export support
    • agent field re-added to ResponseSession model for session-level agent configuration
    • Updated cross-language package mappings to include AgentConfig support
    • Provides foundation for advanced agent configuration scenarios

1.0.0 (2025-10-01)

Features Added

  • Enhanced WebSocket Connection Options: Significantly improved WebSocket connection configuration with transport-agnostic design:
    • Added new timeout configuration options: receive_timeout, close_timeout, and handshake_timeout for fine-grained control
    • Enhanced compression parameter to support both boolean and integer types for advanced zlib window configuration
    • Added vendor_options parameter for implementation-specific options passthrough (escape hatch for advanced users)
    • Improved documentation with clearer descriptions for all connection parameters
    • Better support for common aliases from other WebSocket ecosystems (max_size, ping_interval, etc.)
    • More robust option mapping with proper type conversion and safety checks
  • Enhanced Type Safety: Improved type safety for content parts with proper enum usage:
    • InputAudioContentPart, InputTextContentPart, and OutputTextContentPart now use ContentPartType enum values instead of string literals
    • Better IntelliSense support and compile-time type checking for content part discriminators

Breaking Changes

  • Improved Naming Conventions: Updated model and enum names for better clarity and consistency:
    • OAIVoice enum renamed to OpenAIVoiceName for more descriptive naming
    • ToolChoiceObject model renamed to ToolChoiceSelection for better semantic meaning
    • ToolChoiceFunctionObject model renamed to ToolChoiceFunctionSelection for consistency
    • Updated type unions and imports to reflect the new naming conventions
    • Cross-language package mappings updated to maintain compatibility across SDKs
  • Session Model Architecture: Separated ResponseSession and RequestSession models for better design clarity:
    • ResponseSession no longer inherits from RequestSession and now inherits directly from _Model
    • All session configuration fields are now explicitly defined in ResponseSession instead of being inherited
    • This provides clearer separation of concerns between request and response session configurations
    • May affect type checking and code that relied on the previous inheritance relationship
  • Model Cleanup: Removed unused AgentConfig model and related fields from the public API:
    • AgentConfig class has been completely removed from imports and exports
    • agent field removed from ResponseSession model (including constructor parameter)
    • Updated cross-language package mappings to reflect the removal
  • Model Naming Convention Update: Renamed EOUDetection to EouDetection for better naming consistency:
    • Class name changed from EOUDetection to EouDetection
    • All inheritance relationships updated: AzureSemanticDetection, AzureSemanticDetectionEn, and AzureSemanticDetectionMultilingual now inherit from EouDetection
    • Type annotations updated in AzureSemanticVad, AzureSemanticVadEn, AzureSemanticVadMultilingual, and ServerVad classes
    • Import statements and exports updated to reflect the new naming
  • Enhanced Content Part Type Safety: Content part discriminators now use enum values instead of string literals:
    • InputAudioContentPart.type now uses ContentPartType.INPUT_AUDIO instead of "input_audio"
    • InputTextContentPart.type now uses ContentPartType.INPUT_TEXT instead of "input_text"
    • OutputTextContentPart.type now uses ContentPartType.TEXT instead of "text"

Other Changes

  • Initial GA release

1.0.0b5 (2025-09-26)

Features Added

  • Enhanced Semantic Detection Type Safety: Added new EouThresholdLevel enum for better type safety in end-of-utterance detection:
    • LOW for low sensitivity threshold level
    • MEDIUM for medium sensitivity threshold level
    • HIGH for high sensitivity threshold level
    • DEFAULT for default sensitivity threshold level
  • Improved Semantic Detection Configuration: Enhanced semantic detection classes with better type annotations:
    • threshold_level parameter now supports both string values and EouThresholdLevel enum
    • Cleaner type definitions for AzureSemanticDetection, AzureSemanticDetectionEn, and AzureSemanticDetectionMultilingual
    • Improved documentation for threshold level parameters
  • Comprehensive Unit Test Suite: Added extensive unit test coverage with 200+ test cases covering:
    • All enum types and their functionality
    • Model creation, validation, and serialization
    • Async connection functionality with proper mocking
    • Client event handling and workflows
    • Voice configuration across all supported types
    • Message handling with content part hierarchy
    • Integration scenarios and real-world usage patterns
    • Recent changes validation and backwards compatibility
  • API Version Update: Updated to API version 2025-10-01 (from 2025-05-01-preview)
  • Enhanced Type Safety: Added new AzureVoiceType enum with values for better Azure voice type categorization:
    • AZURE_CUSTOM for custom voice configurations
    • AZURE_STANDARD for standard voice configurations
    • AZURE_PERSONAL for personal voice configurations
  • Improved Message Handling: Added MessageRole enum for better role type safety in message items
  • Enhanced Model Documentation: Comprehensive documentation improvements across all models:
    • Added detailed docstrings for model classes and their parameters
    • Enhanced enum value documentation with descriptions
    • Improved type annotations and parameter descriptions
  • Enhanced Semantic Detection: Added improved configuration options for all semantic detection classes:
    • Added threshold_level parameter with options: "low", "medium", "high", "default" (recommended over deprecated threshold)
    • Added timeout_ms parameter for timeout configuration in milliseconds (recommended over deprecated timeout)
  • Video Background Support: Added new Background model for video background customization:
    • Support for solid color backgrounds in hex format (e.g., #00FF00FF)
    • Support for image URL backgrounds
    • Mutually exclusive color and image URL options
  • Enhanced Video Parameters: Extended VideoParams model with:
    • background parameter for configuring video backgrounds using the new Background model
    • gop_size parameter for Group of Pictures (GOP) size control, affecting compression efficiency and seeking performance
  • Improved Type Safety: Added TurnDetectionType enum for better type safety and IntelliSense support
  • Package Structure Modernization: Simplified package initialization with namespace package support
  • Enhanced Error Handling: Added ConnectionError and ConnectionClosed exception classes to the async API for better WebSocket error management
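The string-or-enum pattern used by `threshold_level` above can be sketched with a str-backed enum. `EouThresholdLevel` below is a local stand-in defined for illustration, not imported from the SDK; the SDK's own enum may differ in detail:

```python
from enum import Enum

class EouThresholdLevel(str, Enum):
    """Stand-in for the SDK's end-of-utterance threshold enum."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    DEFAULT = "default"

def normalize_threshold(level):
    """Accept either a plain string or an enum member.

    Because the enum is str-backed, Enum's value lookup coerces both
    forms to the same member, so downstream code sees one type.
    """
    return EouThresholdLevel(level)

print(normalize_threshold("medium") is EouThresholdLevel.MEDIUM)  # True
print(normalize_threshold(EouThresholdLevel.HIGH).value)          # high
```

This is why the changelog can advertise "supports both string values and `EouThresholdLevel` enum" without two code paths: normalization happens once at the boundary.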

Breaking Changes

  • Cross-Language Package Identity Update: Updated package ID from VoiceLive to VoiceLive.WebSocket for better cross-language consistency
  • Model Refactoring:
    • Renamed UserContentPart to MessageContentPart for clearer content part hierarchy
    • All message items now require a content field with list of MessageContentPart objects
    • OutputTextContentPart now inherits from MessageContentPart instead of being standalone
  • Enhanced Type Safety:
    • Azure voice classes now use AzureVoiceType enum discriminators instead of string literals
    • Message role discriminators now use MessageRole enum values for better type safety
  • Removed Deprecated Parameters: Completely removed deprecated parameters from semantic detection classes:
    • Removed threshold parameter from all semantic detection classes (AzureSemanticDetection, AzureSemanticDetectionEn, AzureSemanticDetectionMultilingual)
    • Removed timeout parameter from all semantic detection classes
    • Users must now use threshold_level and timeout_ms parameters respectively
  • Removed Synchronous API: Completely removed synchronous WebSocket operations to focus exclusively on async patterns:
    • Removed sync connect() function and sync VoiceLiveConnection class from main patch implementation
    • Removed sync basic_voice_assistant.py sample (only async version remains)
    • Simplified sync patch to minimal structure with empty exports
    • All functionality now available only through async patterns
  • Updated Dependencies: Modified package dependencies to reflect async-only architecture:
    • Moved aiohttp>=3.9.0,<4.0.0 from optional to required dependency
    • Removed websockets optional dependency as sync API no longer exists
    • Removed optional dependency groups websockets, aiohttp, and all-websockets
  • Model Rename:
    • Renamed AudioInputTranscriptionSettings to AudioInputTranscriptionOptions for consistency with naming conventions
    • Renamed AzureMultilingualSemanticVad to AzureSemanticVadMultilingual for naming consistency with other multilingual variants
  • Enhanced Type Safety: Turn detection discriminator types now use enum values instead of string literals for better type safety

Bug Fixes

  • Serialization Improvements: Fixed type casting issue in serialization utilities for better enum handling and type safety

Other Changes

  • Testing Infrastructure: Added comprehensive unit test suite with extensive coverage:
    • 8 main test files with 200+ individual test methods
    • Tests for all enums, models, async operations, client events, voice configurations, and message handling
    • Integration tests covering real-world scenarios and recent changes
    • Proper mocking for async WebSocket connections
    • Backwards compatibility validation
    • Test coverage for all recent changes and enhancements
  • API Documentation: Updated API view properties to reflect model structure changes, new enums, and cross-language package identity
  • Documentation Updates: Comprehensive updates to all markdown documentation:
    • Updated README.md to reflect async-only nature with updated examples and installation instructions
    • Updated samples README.md to remove sync sample references
    • Enhanced BASIC_VOICE_ASSISTANT.md with comprehensive async implementation guide
    • Added MIGRATION_GUIDE.md for users upgrading from previous versions

1.0.0b4 (2025-09-19)

Features Added

  • Personal Voice Models: Added PersonalVoiceModels enum with support for DragonLatestNeural, PhoenixLatestNeural, and PhoenixV2Neural models
  • Enhanced Animation Support: Added comprehensive server event classes for animation blendshapes and viseme handling:
    • ServerEventResponseAnimationBlendshapeDelta and ServerEventResponseAnimationBlendshapeDone
    • ServerEventResponseAnimationVisemeDelta and ServerEventResponseAnimationVisemeDone
  • Audio Timestamp Events: Added ServerEventResponseAudioTimestampDelta and ServerEventResponseAudioTimestampDone for better audio timing control
  • Improved Error Handling: Added ErrorResponse class for better error management
  • Enhanced Base Classes: Added ConversationItemBase and SessionBase for better code organization and inheritance
  • Token Usage Improvements: Renamed Usage to TokenUsage for better clarity
  • Audio Format Improvements: Reorganized audio format enums with separate InputAudioFormat and OutputAudioFormat enums for better clarity
  • Enhanced Output Audio Format Support: Added more granular output audio format options including specific sampling rates (8kHz, 16kHz) for PCM16
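For sizing buffers against the PCM16 output formats above, the arithmetic is simple: PCM16 carries 2 bytes per sample per channel. The 24 kHz entry below is a common realtime-audio default included as an assumption, not an SDK constant:

```python
def pcm16_bytes_per_second(sample_rate_hz, channels=1):
    """PCM16 data rate: 2 bytes per sample, per channel."""
    return sample_rate_hz * channels * 2

# The 8 kHz and 16 kHz rates are the granular options mentioned above;
# 24 kHz is an assumed default for comparison.
for rate in (8_000, 16_000, 24_000):
    print(rate, "Hz ->", pcm16_bytes_per_second(rate), "bytes/second")
```

So a 20 ms playout chunk at 16 kHz mono is 640 bytes — useful when choosing buffer sizes for the audio callbacks in the voice samples.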

Breaking Changes

  • Model Cleanup: Removed experimental classes AzurePlatformVoice, LLMVoice, AzureSemanticVadServer, InputAudio, NoTurnDetection, and ToolChoiceFunctionObjectFunction
  • Class Rename: Renamed Usage class to TokenUsage for better clarity
  • Enum Reorganization:
    • Replaced AudioFormat enum with separate InputAudioFormat and OutputAudioFormat enums
    • Removed Phi4mmVoice enum
    • Removed EMOTION value from AnimationOutputType enum
    • Removed IN_PROGRESS value from ItemParamStatus enum
  • Server Events: Removed RESPONSE_EMOTION_HYPOTHESIS from ServerEventType enum

Other Changes

  • Package Structure: Simplified package initialization with namespace package support
  • Sample Updates: Improved basic voice assistant samples
  • Code Optimization: Streamlined model definitions with significant code reduction
  • API Configuration: Updated API view properties for better tooling support

1.0.0b3 (2025-09-17)

Features Added

  • Transcription improvement: Added phrase list
  • New Voice Types: Added AzurePlatformVoice and LLMVoice classes
  • Enhanced Speech Detection: Added AzureSemanticVadServer class
  • Improved Function Calling: Enhanced async function calling sample with better error handling
  • English-Specific Detection: Added AzureSemanticDetectionEn class for optimized English-only semantic end-of-utterance detection
  • English-Specific Voice Activity Detection: Added AzureSemanticVadEn class for enhanced English-only voice activity detection

Breaking Changes

  • Transcription: Removed custom_model and enabled from AudioInputTranscriptionSettings.
  • Async Authentication: Fixed credential handling for async scenarios
  • Model Serialization: Improved error handling and deserialization

Other Changes

  • Code Modernization: Updated type annotations throughout

1.0.0b2 (2025-09-10)

Features Added

  • Async function call

Bugs Fixed

  • Fixed function calling: ensure FunctionCallOutputItem.output is properly serialized as a JSON string before sending to the service.
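The fix above boils down to one rule: a function-call result must cross the wire as a JSON string, not a raw object. A minimal sketch of that serialization step — the item shape here is hypothetical, not the SDK's actual `FunctionCallOutputItem`:

```python
import json

def build_function_output(call_id, result):
    """Wrap a tool result for sending back to the service.

    Strings pass through unchanged; anything else is serialized to a
    JSON string first, which is the behavior the bug fix guarantees.
    """
    output = result if isinstance(result, str) else json.dumps(result)
    return {"type": "function_call_output", "call_id": call_id, "output": output}

item = build_function_output("call_123", {"temperature": 72, "unit": "F"})
print(type(item["output"]).__name__)  # str
```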

1.0.0b1 (2025-08-28)

Features Added

  • Added WebSocket connection support through connect().
  • Added VoiceLiveConnection for managing WebSocket connections.
  • Added models of Voice Live preview.
  • Added WebSocket-based examples in the samples directory.

Other Changes

  • Initial preview release.

Download files

Source distribution: azure_ai_voicelive-1.1.0.tar.gz (127.4 kB)

  • SHA256: 9398d0a3ad8a3c43844e89dfb8c61a39422f294770820f405b114c4d752d3f43
  • MD5: 0cac5648939b0f303f99ef7876595931
  • BLAKE2b-256: 4146e304076e2bdca64a3b77bf9b6c79b8b4f29b994cc22196ff8c72b93faf09

Built distribution: azure_ai_voicelive-1.1.0-py3-none-any.whl (83.2 kB, Python 3)

  • SHA256: 29f2ab8bef67dd41cddafb0239f059351ddb44a7d95bcc4e706516e0c2bfdcc4
  • MD5: 708260e78ea3c0152adb67918250c62e
  • BLAKE2b-256: a44b48a81dae63b3fa1208603a71bffda884bc9cec46f04d223e867efd132357

Both files were uploaded via RestSharp/106.13.0.0, without Trusted Publishing.
