# Speechmatics Voice SDK
Python SDK for building voice-enabled applications on the Speechmatics Real-Time API, optimized for conversational AI, voice agents, transcription services, and real-time captioning.
## Table of Contents
- What is the Voice SDK?
- Installation
- Quick Start
- Configuration
- Event Messages
- Common Usage Patterns
- Environment Variables
- Examples
- SDK Class Reference
- Requirements
- Documentation
- License
## What is the Voice SDK?

The Voice SDK is a higher-level abstraction built on top of the Speechmatics Real-Time API (`speechmatics-rt`). While the Real-Time API provides raw transcription events (words and utterances), the Voice SDK adds:
- Intelligent Segmentation - Groups words into meaningful speech segments per speaker
- Turn Detection - Automatically detects when speakers finish their turns using adaptive or ML-based methods
- Speaker Management - Focus on or ignore specific speakers in multi-speaker scenarios
- Preset Configurations - Ready-to-use configs for common use cases (conversation, note-taking, captions)
- Simplified Event Handling - Receive clean, structured segments instead of raw word-level events
### When to Use the Voice SDK vs the Real-Time API
Use Voice SDK when:
- You are building conversational AI or voice agents
- You need automatic turn detection
- You want speaker-focused transcription
- You need ready-to-use presets for common scenarios
Use Real-Time API when:
- You only need raw, word-level events
- You are building custom segmentation / aggregation logic
- You want fine-grained control over every event
## Installation

```bash
# Standard installation
pip install speechmatics-voice

# With VAD and SMART_TURN (ML-based turn detection)
pip install speechmatics-voice[smart]
```
Note: Some features require additional ML dependencies (ONNX runtime, transformers). If not installed, these features will be unavailable and a warning will be shown.
### Using Docker

If you use the Voice SDK inside a Docker container and need the smart features (SMART_TURN), add the following script to your image so the ML models are included at build time rather than downloaded at runtime.
"""
Download the Voice SDK required models during the build process.
"""
from speechmatics.voice import SileroVAD, SmartTurnDetector
def load_models():
SileroVAD.download_model()
SmartTurnDetector.download_model()
if __name__ == "__main__":
load_models()
Then, in your Dockerfile, include the following:

```dockerfile
COPY ./models.py models.py
RUN uv run models.py
```
This copies the script and runs it as part of the build.
## Quick Start

### Basic Example

A simple example that prints complete sentences as they are finalized, with each speaker identified by a distinct ID.
```python
import asyncio
import os

from speechmatics.rt import Microphone
from speechmatics.voice import VoiceAgentClient, AgentServerMessageType


async def main():
    """Stream microphone audio to the Speechmatics Voice Agent using the 'scribe' preset."""
    # Audio configuration
    SAMPLE_RATE = 16000  # Hz
    CHUNK_SIZE = 160     # Samples per read
    PRESET = "scribe"    # Configuration preset

    # Create client with preset
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset=PRESET,
    )

    # Print finalised segments of speech with speaker ID
    @client.on(AgentServerMessageType.ADD_SEGMENT)
    def on_segment(message):
        for segment in message["segments"]:
            speaker = segment["speaker_id"]
            text = segment["text"]
            print(f"{speaker}: {text}")

    # Set up the microphone
    mic = Microphone(SAMPLE_RATE, CHUNK_SIZE)
    if not mic.start():
        print("Error: Microphone not available")
        return

    # Connect to the Voice Agent
    await client.connect()

    # Stream microphone audio (interruptible from the keyboard)
    try:
        while True:
            audio_chunk = await mic.read(CHUNK_SIZE)
            if not audio_chunk:
                break  # Microphone stopped producing data
            await client.send_audio(audio_chunk)
    except KeyboardInterrupt:
        pass
    finally:
        await client.disconnect()


if __name__ == "__main__":
    asyncio.run(main())
```
### Configuring a Voice Agent Client

When creating a `VoiceAgentClient`, there are several ways to configure it:

- **Presets** - optimised configurations for common use cases. These require no further configuration.
```python
# Low latency preset - for fast responses (may split speech into smaller segments)
client = VoiceAgentClient(api_key=api_key, preset="fast")

# Conversation preset - for natural dialogue
client = VoiceAgentClient(api_key=api_key, preset="adaptive")

# Advanced conversation with ML turn detection
client = VoiceAgentClient(api_key=api_key, preset="smart_turn")

# External end of turn preset - endpointing handled by the client
client = VoiceAgentClient(api_key=api_key, preset="external")

# Scribe preset - for note-taking
client = VoiceAgentClient(api_key=api_key, preset="scribe")

# Captions preset - for live captioning
client = VoiceAgentClient(api_key=api_key, preset="captions")

# To view all available presets, use:
presets = VoiceAgentConfigPreset.list_presets()
```
- **Custom Configuration** - for more control, you can specify custom configuration in a `VoiceAgentConfig` object.
```python
from speechmatics.voice import VoiceAgentClient, VoiceAgentConfig, EndOfUtteranceMode

# Define your custom configuration
config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)

client = VoiceAgentClient(api_key=api_key, config=config)
```
- **Custom Configuration with Overlays** - you can use presets as a starting point, then customize with overlays.
```python
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Use a preset with custom overrides
config = VoiceAgentConfigPreset.SCRIBE(
    VoiceAgentConfig(
        language="es",
        max_delay=0.8,
    )
)
```
Note: If no config or preset is provided, the client defaults to the `external` preset.
### Configuration Serialization
It can also be useful to export and import configuration as JSON:
```python
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Export a preset to JSON
config_json = VoiceAgentConfigPreset.SCRIBE().to_json()

# Load from JSON
config = VoiceAgentConfig.from_json(config_json)

# Or create from a JSON string
config = VoiceAgentConfig.from_json('{"language": "en", "enable_diarization": true}')
```
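This makes it easy to persist a configuration between sessions. A minimal sketch (the file name is illustrative):

```python
from pathlib import Path

from speechmatics.voice import VoiceAgentConfig, VoiceAgentConfigPreset

CONFIG_PATH = Path("voice_agent_config.json")  # illustrative location

# Save the preset once...
CONFIG_PATH.write_text(VoiceAgentConfigPreset.SCRIBE().to_json())

# ...and restore it in a later session
config = VoiceAgentConfig.from_json(CONFIG_PATH.read_text())
```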
## Configuration

### Basic Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | `str` | `"en"` | Language code for transcription (e.g., `"en"`, `"es"`, `"fr"`). See supported languages. |
| `operating_point` | `OperatingPoint` | `ENHANCED` | Balance accuracy vs. latency. Options: `STANDARD` or `ENHANCED`. |
| `domain` | `str` | `None` | Domain-specific model (e.g., `"finance"`, `"medical"`). See supported languages and domains. |
| `output_locale` | `str` | `None` | Output locale for formatting (e.g., `"en-GB"`, `"en-US"`). See supported languages and locales. |
| `max_delay` | `float` | `0.7` | Maximum transcription delay for word emission, in seconds. |
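As a sketch, these parameters combine into a single `VoiceAgentConfig` (the values, and the assumption that `OperatingPoint` is importable from `speechmatics.voice`, are illustrative):

```python
from speechmatics.voice import OperatingPoint, VoiceAgentConfig  # OperatingPoint import path assumed

# Enhanced accuracy, UK English formatting, words emitted within 0.7 s
config = VoiceAgentConfig(
    language="en",
    operating_point=OperatingPoint.ENHANCED,
    output_locale="en-GB",
    max_delay=0.7,
)
```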
### Turn Detection Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `end_of_utterance_mode` | `EndOfUtteranceMode` | `FIXED` | Controls how turn endings are detected:<br>- `FIXED` - uses a fixed silence threshold. Fast, but may split slow speech.<br>- `ADAPTIVE` - adjusts the delay based on speech rate, pauses, and disfluencies. Best for natural conversation.<br>- `EXTERNAL` - manual control via `client.finalize()`. For custom turn logic. |
| `end_of_utterance_silence_trigger` | `float` | `0.2` | Silence duration in seconds that triggers a turn end (also the basis of the adaptive delay). |
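For natural conversation, the two parameters are typically used together; a brief sketch:

```python
from speechmatics.voice import EndOfUtteranceMode, VoiceAgentConfig

# ADAPTIVE endpointing stretches or shrinks the 0.2 s base silence trigger
# according to speech rate, pauses, and disfluencies
config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    end_of_utterance_silence_trigger=0.2,
)
```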
### Speaker Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_diarization` | `bool` | `False` | Enable speaker diarization to identify and label different speakers. |
| `speaker_sensitivity` | `float` | `0.5` | Diarization sensitivity between 0.0 and 1.0. Higher values detect more speakers. |
| `max_speakers` | `int` | `None` | Limit the maximum number of speakers to detect. |
| `prefer_current_speaker` | `bool` | `False` | Give extra weight to the current speaker when grouping words. |
| `speaker_config` | `SpeakerFocusConfig` | `SpeakerFocusConfig()` | Configure speaker focus/ignore rules. |
| `known_speakers` | `list[SpeakerIdentifier]` | `[]` | Pre-enrolled speaker identifiers for speaker identification. |
#### Usage Examples

Using `speaker_config`, you can focus on specific speakers while keeping words from others, or ignore specific speakers entirely.
```python
from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode

# Focus only on specific speakers, but keep words from other speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
)

# Ignore specific speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        ignore_speakers=["S3"],
    ),
)
```
Using `known_speakers`, you can supply pre-enrolled speaker identifiers to recognise specific speakers.
```python
from speechmatics.voice import SpeakerIdentifier

# Use known speakers from a previous session
config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"]),
    ],
)
```
### Language & Vocabulary

| Parameter | Type | Default | Description |
|---|---|---|---|
| `additional_vocab` | `list[AdditionalVocabEntry]` | `[]` | Custom vocabulary for domain-specific terms. |
| `punctuation_overrides` | `dict` | `None` | Custom punctuation rules. |
#### Usage Examples

Using `additional_vocab`, you can supply a list of domain-specific terms.
```python
from speechmatics.voice import AdditionalVocabEntry

config = VoiceAgentConfig(
    language="en",
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
)
```
### Audio Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `sample_rate` | `int` | `16000` | Audio sample rate in Hz. |
| `audio_encoding` | `AudioEncoding` | `PCM_S16LE` | Audio encoding format. |
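If your audio source does not match the defaults, declare its format up front. A sketch for telephony-style 8 kHz input (assuming `AudioEncoding` is importable from `speechmatics.voice`):

```python
from speechmatics.voice import AudioEncoding, VoiceAgentConfig  # AudioEncoding import path assumed

# 8 kHz signed 16-bit little-endian PCM
config = VoiceAgentConfig(
    sample_rate=8000,
    audio_encoding=AudioEncoding.PCM_S16LE,
)
```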
### Advanced Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `transcription_update_preset` | `TranscriptionUpdatePreset` | `COMPLETE` | Controls when to emit updates: `COMPLETE`, `COMPLETE_PLUS_TIMING`, `WORDS`, `WORDS_PLUS_TIMING`, or `TIMING`. |
| `speech_segment_config` | `SpeechSegmentConfig` | `SpeechSegmentConfig()` | Fine-tune segment generation and post-processing. |
| `smart_turn_config` | `SmartTurnConfig` | `None` | Configure SMART_TURN behavior (buffer length, threshold). |
| `include_results` | `bool` | `False` | Include word-level timing data in segments. |
| `include_partials` | `bool` | `True` | Include interim (lower-confidence) words in emitted segments. Set to `False` for final-only output. |
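For example, a final-only stream that still carries word-level timing can be requested with two flags; a minimal sketch:

```python
from speechmatics.voice import VoiceAgentConfig

config = VoiceAgentConfig(
    include_partials=False,  # drop interim (lower-confidence) words
    include_results=True,    # attach word-level timing data to segments
)
```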
## Event Messages

The Voice SDK emits real-time, structured events as a session progresses, identified by `AgentServerMessageType`.

These events fall into three main categories:

- Core Events - high-level session and transcription updates.
- Speaker Events - detected speech activity.
- Additional Events - detailed, low-level events.

To handle events, register a callback with the `@client.on()` decorator or the `client.on()` method.
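Both registration styles are equivalent. A sketch of the method form, which is handy when handlers are defined elsewhere or need to be detached later:

```python
from speechmatics.voice import AgentServerMessageType

def on_turn_end(message):
    print("Turn ended at", message["metadata"]["end_time"])

client.on(AgentServerMessageType.END_OF_TURN, on_turn_end)   # attach
client.off(AgentServerMessageType.END_OF_TURN, on_turn_end)  # detach when no longer needed
```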
Note: The payloads shown below are the actual message payloads from the Voice SDK. When using the CLI example with `--output-file`, messages also include a `ts` timestamp field (e.g., `"ts": "2025-11-11 23:18:35.909"`), which is added by the CLI for logging purposes and is not part of the SDK payload.
### High-Level Overview

#### Core Events

| Event | Description | Notes / Purpose |
|---|---|---|
| `RECOGNITION_STARTED` | Fired when a transcription session starts | Contains session ID and language pack info |
| `ADD_PARTIAL_SEGMENT` | Emitted continuously during speech | Provides interim, real-time transcription text |
| `ADD_SEGMENT` | Fired when a segment is finalized | Provides stable, final transcription text |
| `END_OF_TURN` | Fired when a speaker's turn ends | Depends on `end_of_utterance_mode`; useful for turn tracking |
#### Speaker Events

| Event | When it fires | Purpose |
|---|---|---|
| `SPEAKER_STARTED` | Voice detected | Marks start of speech |
| `SPEAKER_ENDED` | Silence detected | Marks end of speech |
| `SPEAKERS_RESULT` | Enrollment completes | Provides speaker IDs and labels |
#### Additional Events

| Event | When it fires | Purpose |
|---|---|---|
| `START_OF_TURN` | New turn begins | Optional, low-level event for turn tracking |
| `END_OF_TURN_PREDICTION` | Predicts turn completion | Fires before `END_OF_TURN` in adaptive mode |
| `END_OF_UTTERANCE` | Silence threshold reached | Low-level STT engine trigger |
| `ADD_PARTIAL_TRANSCRIPT` | Word-level partial transcript | Legacy; use `ADD_PARTIAL_SEGMENT` instead |
| `ADD_TRANSCRIPT` | Word-level final transcript | Legacy; use `ADD_SEGMENT` instead |
### Core Events - Examples and Payloads

#### RECOGNITION_STARTED
```python
@client.on(AgentServerMessageType.RECOGNITION_STARTED)
def on_started(message):
    session_id = message["id"]
    language = message["language_pack_info"]["language_description"]
    print(f"Session {session_id} started - Language: {language}")
```
Payload:
```json
{
  "message": "RecognitionStarted",
  "id": "a8779b0b-a238-43de-8211-c70f5fcbe191",
  "orchestrator_version": "2025.08.29127+289170c022.HEAD",
  "language_pack_info": {
    "language_description": "English",
    "word_delimiter": " ",
    "writing_direction": "left-to-right",
    "itn": true,
    "adapted": false
  }
}
```
#### ADD_PARTIAL_SEGMENT
```python
@client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
def on_partial(message):
    for segment in message["segments"]:
        print(f"[INTERIM] {segment['speaker_id']}: {segment['text']}")
```
Payload:
```json
{
  "message": "AddPartialSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-11-11T23:18:37.189+00:00",
      "language": "en",
      "text": "Welcome to",
      "metadata": {
        "start_time": 1.28,
        "end_time": 1.6
      }
    }
  ],
  "metadata": {
    "start_time": 1.28,
    "end_time": 1.6,
    "processing_time": 0.307
  }
}
```
Fields:
- `speaker_id` - Speaker label (e.g., `"S1"`, `"S2"`)
- `is_active` - `true` if the speaker is in focus (based on `speaker_config`)
- `text` - Current partial transcription text
- `metadata.start_time` - Segment start time (seconds since session start)
- `metadata.end_time` - Segment end time (seconds since session start)

The top-level `metadata` contains the same timing plus `processing_time`.
#### ADD_SEGMENT
```python
@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    for segment in message["segments"]:
        speaker = segment["speaker_id"]
        text = segment["text"]
        start = message["metadata"]["start_time"]
        print(f"[{start:.2f}s] {speaker}: {text}")
```
Payload:
```json
{
  "message": "AddSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-11-11T23:18:37.189+00:00",
      "language": "en",
      "text": "Welcome to Speechmatics.",
      "metadata": {
        "start_time": 1.28,
        "end_time": 8.04
      }
    }
  ],
  "metadata": {
    "start_time": 1.28,
    "end_time": 8.04,
    "processing_time": 0.187
  }
}
```
#### END_OF_TURN
```python
@client.on(AgentServerMessageType.END_OF_TURN)
def on_turn_end(message):
    duration = message["metadata"]["end_time"] - message["metadata"]["start_time"]
    print(f"Turn ended (duration: {duration:.2f}s)")
```
Payload:
```json
{
  "message": "EndOfTurn",
  "turn_id": 0,
  "metadata": {
    "start_time": 1.28,
    "end_time": 8.04
  }
}
```
### Speaker Events - Examples and Payloads

#### SPEAKER_STARTED
```python
@client.on(AgentServerMessageType.SPEAKER_STARTED)
def on_speaker_start(message):
    speaker = message["speaker_id"]
    time = message["time"]
    print(f"{speaker} started speaking at {time}s")
```
Payload:
```json
{
  "message": "SpeakerStarted",
  "is_active": true,
  "speaker_id": "S1",
  "time": 1.28
}
```
#### SPEAKER_ENDED
```python
@client.on(AgentServerMessageType.SPEAKER_ENDED)
def on_speaker_end(message):
    speaker = message["speaker_id"]
    time = message["time"]
    print(f"{speaker} stopped speaking at {time}s")
```
Payload:
```json
{
  "message": "SpeakerEnded",
  "is_active": false,
  "speaker_id": "S1",
  "time": 2.64
}
```
#### SPEAKERS_RESULT
```python
# Listen for the result
@client.on(AgentServerMessageType.SPEAKERS_RESULT)
def on_speakers(message):
    for speaker in message["speakers"]:
        print(f"Speaker {speaker['label']}: {speaker['speaker_identifiers']}")

# Request speaker IDs at the end of the session
await client.send_message({"message": AgentClientMessageType.GET_SPEAKERS, "final": True})

# Request speaker IDs now
await client.send_message({"message": AgentClientMessageType.GET_SPEAKERS})
```
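A common pattern is to capture these identifiers at the end of a session and replay them as `known_speakers` in the next one; a sketch (assuming `AgentClientMessageType` is importable from `speechmatics.voice`):

```python
from speechmatics.voice import (
    AgentClientMessageType,  # import path assumed
    AgentServerMessageType,
    SpeakerIdentifier,
)

enrolled: list[SpeakerIdentifier] = []

def on_speakers(message):
    for speaker in message["speakers"]:
        enrolled.append(
            SpeakerIdentifier(
                label=speaker["label"],
                speaker_identifiers=speaker["speaker_identifiers"],
            )
        )

client.once(AgentServerMessageType.SPEAKERS_RESULT, on_speakers)

# Request the final identifiers before closing the session,
# then pass `enrolled` as known_speakers in the next VoiceAgentConfig
await client.send_message({"message": AgentClientMessageType.GET_SPEAKERS, "final": True})
```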
## Common Usage Patterns

### Simple Transcription

```python
client = VoiceAgentClient(api_key=api_key, preset="scribe")

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    for segment in message["segments"]:
        print(f"{segment['speaker_id']}: {segment['text']}")
```
### Conversational AI with Turn Detection

```python
config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)
client = VoiceAgentClient(api_key=api_key, config=config)

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    user_text = message["segments"][0]["text"]
    # Process user input

@client.on(AgentServerMessageType.END_OF_TURN)
def on_turn_end(message):
    # User finished speaking - generate AI response
    pass
```
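Segments can arrive several times within a turn, so one approach (an assumption, not something the SDK prescribes) is to buffer them until `END_OF_TURN` fires:

```python
turn_buffer: list[str] = []

@client.on(AgentServerMessageType.ADD_SEGMENT)
def collect_segment(message):
    for segment in message["segments"]:
        turn_buffer.append(segment["text"])

@client.on(AgentServerMessageType.END_OF_TURN)
def respond(message):
    user_utterance = " ".join(turn_buffer)
    turn_buffer.clear()
    # Hand user_utterance to your LLM / dialogue manager here
    print(f"USER: {user_utterance}")
```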
### Live Captions with Timestamps

```python
client = VoiceAgentClient(api_key=api_key, preset="captions")

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    start_time = message["metadata"]["start_time"]
    for segment in message["segments"]:
        print(f"[{start_time:.1f}s] {segment['text']}")
```
### Speaker Identification

```python
from speechmatics.voice import SpeakerIdentifier

# Use known speakers from a previous session
known_speakers = [
    SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
    SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"]),
]

config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=known_speakers,
)
client = VoiceAgentClient(api_key=api_key, config=config)

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    for segment in message["segments"]:
        # Will show "Alice" or "Bob" instead of "S1", "S2"
        print(f"{segment['speaker_id']}: {segment['text']}")
```
### Manual Turn Control

```python
config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.EXTERNAL,
)
client = VoiceAgentClient(api_key=api_key, config=config)

# Manually trigger turn end
await client.finalize(end_of_turn=True)
```
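In EXTERNAL mode the service never ends a turn on its own, so the client decides when a turn is over. A push-to-talk style sketch (`session_active`, `get_audio`, and `button_released` are hypothetical stand-ins for your own I/O):

```python
# Stream audio while our own endpointing logic decides when the turn is over
while session_active():                          # hypothetical session loop
    await client.send_audio(await get_audio())   # hypothetical audio source
    if button_released():                        # hypothetical push-to-talk trigger
        # Flush pending segments and emit END_OF_TURN
        await client.finalize(end_of_turn=True)
```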
### Focus on a Specific Speaker

```python
from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode

config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1"],  # Only emit S1's speech
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
)
client = VoiceAgentClient(api_key=api_key, config=config)

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    # Only S1's segments will appear here
    for segment in message["segments"]:
        if segment["is_active"]:
            print(f"{segment['text']}")

# Dynamically change the focused speaker during the session
await client.update_diarization_config(
    SpeakerFocusConfig(
        focus_speakers=["S2"],  # Switch focus to S2
        focus_mode=SpeakerFocusMode.RETAIN,
    )
)
```
## Environment Variables

- `SPEECHMATICS_API_KEY` - Your Speechmatics API key (required)
- `SPEECHMATICS_RT_URL` - Custom WebSocket endpoint (optional)
- `SMART_TURN_MODEL_PATH` - Path for the SMART_TURN ONNX model cache (optional)
- `SMART_TURN_HF_URL` - Override the SMART_TURN model download URL (optional)
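With `SPEECHMATICS_API_KEY` exported, the client picks it up automatically; a brief sketch:

```python
from speechmatics.voice import VoiceAgentClient

# api_key defaults to SPEECHMATICS_API_KEY; SPEECHMATICS_RT_URL, if set,
# overrides the WebSocket endpoint the same way
client = VoiceAgentClient(preset="scribe")
```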
## Examples

See the `examples/voice/` directory for complete working examples:

- `simple/` - Basic microphone transcription
- `scribe/` - Note-taking with custom vocabulary
- `cli/` - Full-featured CLI with all options
## SDK Class Reference

### VoiceAgentClient
```python
class VoiceAgentClient:
    def __init__(
        self,
        auth: Optional[AuthBase] = None,
        api_key: Optional[str] = None,
        url: Optional[str] = None,
        app: Optional[str] = None,
        config: Optional[VoiceAgentConfig] = None,
        preset: Optional[str] = None,
    ):
        """Create a Voice Agent client.

        Args:
            auth: Authentication instance (optional)
            api_key: Speechmatics API key (defaults to SPEECHMATICS_API_KEY env var)
            url: Custom WebSocket URL (defaults to SPEECHMATICS_RT_URL env var)
            app: Optional application name for the endpoint URL
            config: Voice Agent configuration (optional)
            preset: Preset name ("scribe", "fast", etc.) (optional)
        """

    async def connect(self) -> None:
        """Connect to the Speechmatics service.

        Establishes the WebSocket connection and starts the transcription session.
        Must be called before sending audio.
        """

    async def disconnect(self) -> None:
        """Disconnect from the service.

        Closes the WebSocket connection and cleans up resources.
        """

    async def send_audio(self, payload: bytes) -> None:
        """Send audio data for transcription.

        Args:
            payload: Audio data as bytes
        """

    async def update_diarization_config(self, config: SpeakerFocusConfig) -> None:
        """Update the diarization configuration during a session.

        Args:
            config: New speaker focus configuration
        """

    async def finalize(self, end_of_turn: bool = False) -> None:
        """Finalize segments and optionally trigger end of turn.

        Args:
            end_of_turn: Whether to emit an end-of-turn message (default: False)
        """

    async def send_message(self, message: dict) -> None:
        """Send a control message to the service.

        Args:
            message: Control message dictionary
        """

    def on(self, event: AgentServerMessageType, callback: Callable) -> None:
        """Register an event handler.

        Args:
            event: Event type to listen for
            callback: Function to call when the event occurs
        """

    def once(self, event: AgentServerMessageType, callback: Callable) -> None:
        """Register a one-time event handler.

        Args:
            event: Event type to listen for
            callback: Function to call once when the event occurs
        """

    def off(self, event: AgentServerMessageType, callback: Callable) -> None:
        """Unregister an event handler.

        Args:
            event: Event type
            callback: Function to remove
        """
```
## Requirements

- Python 3.9+
- A Speechmatics API key (available from the Speechmatics Portal)
## Documentation

## License