Real-time voice assistant built on OpenAI's Realtime API
rtvoice
A Python library for building real-time voice agents powered by the OpenAI Realtime API. It handles the full session lifecycle — microphone input, WebSocket streaming, turn detection, tool calling, and audio playback — so you can focus on what your agent does, not how it talks.
Installation
pip install rtvoice[audio]
Requires Python 3.13+ and an OPENAI_API_KEY environment variable (or pass api_key= directly).
Quickstart
import asyncio
from rtvoice import RealtimeAgent

async def main():
    agent = RealtimeAgent(
        instructions="You are Jarvis, a concise and helpful voice assistant.",
    )
    await agent.run()

asyncio.run(main())
Run it, speak into your microphone, and the agent responds through your speakers. Press Ctrl+C to end the session.
Table of Contents
- Tool calling
- Supervisor
- Conversation seeds
- Lifecycle listener
- Custom audio devices
- Turn detection
- Voice and model
- Recording
- Token tracking
- Inactivity timeout
- Azure OpenAI
Tool calling
Basic tools
Create a Tools instance, decorate functions with @tools.action(description), then pass the instance to RealtimeAgent. Both async and regular def functions are supported.
import asyncio
from rtvoice import RealtimeAgent, Tools

tools = Tools()

@tools.action("Get the current weather for a given city")
async def get_weather(city: str) -> str:
    return f"It's 18°C and partly cloudy in {city}."

async def main():
    agent = RealtimeAgent(
        instructions="Answer weather questions using get_weather.",
        tools=tools,
    )
    await agent.run()

asyncio.run(main())
Parameter types are inferred from the function signature and included in the schema sent to the model. All parameters without a default value are marked required.
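To make the inference rule above concrete, here is a minimal standard-library sketch of how a schema could be derived from a signature. It is not rtvoice's actual implementation: `build_schema`, `_JSON_TYPES`, and `get_forecast` are hypothetical names, and the fallback to "string" for unknown annotations is an assumption made for illustration.

```python
import inspect
from typing import Any, Callable

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_schema(func: Callable[..., Any]) -> dict[str, Any]:
    """Derive a JSON-Schema-style parameter object from a function signature."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # A parameter without a default value is treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


async def get_forecast(city: str, days: int = 3) -> str:
    return f"{days}-day forecast for {city}"


schema = build_schema(get_forecast)
print(schema["required"])  # ['city']
```

Here `days` has a default, so only `city` ends up in the required list.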
Pydantic model tools
For richer schemas, register a Pydantic model with param_model=. The model fields become the tool parameters, and the function receives a validated model instance.
from typing import Literal
from pydantic import BaseModel, Field
from rtvoice import Tools

tools = Tools()

class CalendarSearchParams(BaseModel):
    query: str = Field(description="What to search for")
    date: str | None = Field(default=None, description="Optional ISO date filter")
    limit: int = Field(default=5, description="Maximum number of matches")
    source: Literal["work", "personal"] = "work"

@tools.action(
    "Search calendar events",
    param_model=CalendarSearchParams,
)
async def search_calendar(params: CalendarSearchParams) -> str:
    # calendar stands for your own calendar client; it is not part of rtvoice
    return await calendar.search(
        query=params.query,
        date=params.date,
        limit=params.limit,
        source=params.source,
    )
Nested Pydantic models, typed lists, enums, literals, defaults, and Field(description=...) are included in the generated tool schema.
Long-running tools
Set holding_instruction to have the assistant speak a phrase while the tool runs. The agent will say it immediately after calling the tool, before the result arrives.
@tools.action(
    "Search the web for a query",
    holding_instruction="Let me search that for you, give me a moment.",
)
async def search_web(query: str) -> str:
    # do_search stands for your own search function
    result = await do_search(query)
    return result
Optionally add result_instruction to tell the model how to present the result once the tool returns:
@tools.action(
    "Fetch the latest headlines",
    holding_instruction="Fetching the news...",
    result_instruction="Summarise the headlines in two sentences.",
)
async def get_headlines() -> str: ...
Status templates
status is a spoken update for tools registered with param_model=. Use {field_name} placeholders from the Pydantic model — rtvoice validates them at registration time.
class PlaySongParams(BaseModel):
    song: str = Field(description="Song title")

@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status="Playing {song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"
status can also be a callable that receives the validated Pydantic model and returns a string dynamically.
@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status=lambda params: f"Playing {params.song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"
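The registration-time placeholder check mentioned for string templates can be illustrated with the standard library alone. This is a sketch of the idea, not rtvoice's code: `template_fields` and `validate_status` are hypothetical helpers, and the `fields` set stands in for the field names of a Pydantic model such as PlaySongParams.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_status(template: str, allowed: set[str]) -> None:
    """Raise ValueError if the template references an unknown field."""
    unknown = template_fields(template) - allowed
    if unknown:
        raise ValueError(f"Unknown placeholder(s): {sorted(unknown)}")


fields = {"song"}  # stand-in for the field names of PlaySongParams
validate_status("Playing {song} now.", fields)  # passes silently

try:
    validate_status("Playing {artist} now.", fields)
except ValueError as exc:
    print(exc)  # Unknown placeholder(s): ['artist']
```

Checking at registration time means a typo in a placeholder fails at import, not in the middle of a live voice session.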
Context injection
Any tool parameter typed as Inject[T] is filled automatically by the framework — the model never sees it and does not need to supply a value. Three types are injectable:
| Type | What it provides |
|---|---|
| Inject[EventBus] | Internal event bus |
| Inject[ConversationHistory] | Full conversation so far |
| Inject[YourContextType] | Your custom context= object |
from rtvoice import Tools, Inject
from rtvoice.tools import ToolContext
from rtvoice.conversation import ConversationHistory

tools = Tools()

@tools.action("Summarise the conversation so far")
async def summarise(
    history: Inject[ConversationHistory],
) -> str:
    text = history.format()
    # llm stands for your own summarisation client
    return await llm.summarise(text)
Custom application context
Pass any object as context= on RealtimeAgent. It is then injectable in every tool via Inject[YourType].
from dataclasses import dataclass
from rtvoice import RealtimeAgent, Tools, Inject

@dataclass
class AppState:
    user_name: str
    premium: bool

tools = Tools()

@tools.action("Greet the user by name")
async def greet(state: Inject[AppState]) -> str:
    tier = "premium" if state.premium else "free"
    return f"Hello {state.user_name}, you are on the {tier} plan."

agent = RealtimeAgent(
    instructions="Greet the user when asked.",
    tools=tools,
    context=AppState(user_name="Alice", premium=True),
)
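For readers curious how type-based injection of this kind can work in principle, the sketch below resolves marked parameters from a provider table before calling the tool. It is an illustration under assumptions, not rtvoice's mechanism: the `Inject` alias defined here is a local stand-in built on `typing.Annotated`, and `call_with_injection` and `providers` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Callable, TypeVar, get_args, get_origin, get_type_hints

T = TypeVar("T")
# Local stand-in: Inject[X] expands to Annotated[X, "inject"].
Inject = Annotated[T, "inject"]


def call_with_injection(
    func: Callable[..., Any],
    model_args: dict[str, Any],
    providers: dict[type, Any],
) -> Any:
    """Call func, filling Inject[...] parameters from providers keyed by type."""
    hints = get_type_hints(func, include_extras=True)
    kwargs = dict(model_args)
    for name, hint in hints.items():
        if name == "return":
            continue
        # Only parameters carrying the "inject" marker are filled by the framework.
        if get_origin(hint) is Annotated and "inject" in get_args(hint)[1:]:
            kwargs[name] = providers[get_args(hint)[0]]
    return func(**kwargs)


@dataclass
class AppState:
    user_name: str


def greet(greeting: str, state: Inject[AppState]) -> str:
    return f"{greeting}, {state.user_name}!"


result = call_with_injection(greet, {"greeting": "Hello"}, {AppState: AppState("Alice")})
print(result)  # Hello, Alice!
```

The model-supplied arguments (`greeting`) and the injected ones (`state`) stay separate, which is why an injected parameter never needs to appear in the tool schema.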
Supervisor
Delegate complex, multi-step tasks to an LLM-driven supervisor. The voice agent hands off, speaks a holding phrase, and presents the result when done.
from rtvoice import RealtimeAgent, Supervisor, Tools
from rtvoice.llm import ChatOpenAI

tools = Tools()

@tools.action("Book a restaurant table")
async def book_table(
    restaurant: str,
    date: str,
    time: str,
    party_size: int,
) -> str:
    return f"Booked for {party_size} at {restaurant} on {date} at {time}."

supervisor = Supervisor(
    description="Books restaurant tables on behalf of the user.",
    holding_instruction="I'm checking availability, just a moment.",
    instructions="Use book_table to complete booking requests. Call done() when finished.",
    tools=tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
)

agent = RealtimeAgent(
    instructions="Delegate restaurant bookings to the supervisor.",
    supervisor=supervisor,
)
How it works: the realtime agent registers the Supervisor as a callable supervisor tool. When invoked, the supervisor runs its own agentic loop (tool calls → LLM → tool calls …) until it either calls done() or needs a clarification from the user via clarify(). Clarifications are automatically routed back through the voice agent and the loop resumes.
Supervisor parameters:
| Parameter | Description |
|---|---|
| description | Shown to the realtime model to decide when to delegate |
| instructions | System prompt for the supervisor's own LLM loop |
| llm | ChatOpenAI(model=...) or any ChatModel implementation |
| tools | Tools instance with the actions the supervisor may call |
| holding_instruction | Spoken while the supervisor works |
| result_instructions | Tells the realtime model how to present the result |
| handoff_instructions | Extra guidance appended to the tool description |
| max_iterations | Loop iteration cap (default: 10) |
| context | Arbitrary object injectable inside supervisor tools |
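The loop described under "How it works", together with the max_iterations cap, can be pictured with a toy version that swaps the LLM for a scripted planner. Everything here is hypothetical (`run_supervisor`, `scripted_plan`, the tuple-based decision format), and clarify() routing is left out; it only shows the shape of a tool-call loop that ends on done() or on the iteration cap.

```python
from typing import Any, Callable

# A decision is either ("call", tool_name, kwargs) or ("done", final_text).
Plan = Callable[[list[tuple[str, Any]]], tuple]


def run_supervisor(plan: Plan, tools: dict[str, Callable[..., Any]], max_iterations: int = 10) -> str:
    """Alternate between planning and tool calls until the planner signals done."""
    history: list[tuple[str, Any]] = []
    for _ in range(max_iterations):
        decision = plan(history)
        if decision[0] == "done":
            return decision[1]
        _, name, kwargs = decision
        # Record each tool result so the next planning step can see it.
        history.append((name, tools[name](**kwargs)))
    raise RuntimeError("max_iterations reached without done()")


def scripted_plan(history: list[tuple[str, Any]]) -> tuple:
    # Stand-in for the LLM: call one tool, then finish with its result.
    if not history:
        return ("call", "book_table", {"restaurant": "Luigi's", "party_size": 2})
    return ("done", history[-1][1])


def book_table(restaurant: str, party_size: int) -> str:
    return f"Booked for {party_size} at {restaurant}."


answer = run_supervisor(scripted_plan, {"book_table": book_table})
print(answer)  # Booked for 2 at Luigi's.
```

The cap matters because a planner that never signals completion would otherwise keep the voice agent waiting indefinitely.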
Conversation seeds
Pre-fill the session with synthetic conversation history before the microphone opens. The model will behave as if those exchanges already happened.
from rtvoice import RealtimeAgent, ConversationSeed, SeedMessage

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    conversation_seed=ConversationSeed(
        messages=[
            SeedMessage.user("My name is Alice and I prefer short answers."),
            SeedMessage.assistant("Got it, Alice. I'll keep things brief."),
        ]
    ),
)
Use ConversationSeed.from_pairs() for a more concise form when you have multiple user/assistant exchanges:
seed = ConversationSeed.from_pairs(
    ("My name is Alice.", "Nice to meet you, Alice."),
    ("I prefer short answers.", "Understood, I'll be brief."),
)
Lifecycle listener
Subclass AgentListener and pass it to RealtimeAgent to hook into session events. Override only the methods you care about — all are async no-ops by default.
from rtvoice import RealtimeAgent, AgentListener

class MyListener(AgentListener):
    async def on_agent_starting(self) -> None:
        print("Agent is starting up...")

    async def on_agent_session_connected(self) -> None:
        print("WebSocket connected, ready to talk.")

    async def on_user_transcript(self, transcript: str) -> None:
        print(f"User said: {transcript}")

    async def on_assistant_transcript(self, transcript: str) -> None:
        print(f"Assistant replied: {transcript}")

    async def on_agent_stopped(self) -> None:
        print("Session ended.")

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    listener=MyListener(),
)
All available callbacks:
| Method | When it fires |
|---|---|
| on_agent_starting() | Before any I/O or WebSocket setup |
| on_agent_session_connected() | WebSocket session established |
| on_agent_stopped() | Agent fully shut down |
| on_user_started_speaking() | VAD detected speech start |
| on_user_stopped_speaking() | VAD detected speech end |
| on_user_transcript(transcript) | Finalised user transcript (requires transcription_model) |
| on_assistant_started_responding() | Assistant began streaming audio |
| on_assistant_stopped_responding() | Assistant finished streaming audio |
| on_assistant_transcript(transcript) | Full assistant response text |
| on_assistant_transcript_delta(delta) | Incremental assistant text chunk (requires "text" in output_modalities) |
| on_agent_interrupted() | User interrupted the assistant mid-response |
| on_agent_error(error) | Session or API error |
| on_supervisor_started() | The supervisor began running |
| on_supervisor_finished() | The supervisor finished |
| on_user_inactivity_countdown(remaining_seconds) | Fires each second before inactivity timeout |
Custom audio devices
Implement AudioInputDevice or AudioOutputDevice from rtvoice.audio to replace the default microphone or speaker — useful for telephony, file playback, testing, or embedded hardware.
Custom input
from collections.abc import AsyncIterator
from rtvoice import RealtimeAgent
from rtvoice.audio import AudioInputDevice

class CustomMicrophone(AudioInputDevice):
    def __init__(self):
        self._active = False

    async def start(self) -> None:
        self._active = True
        # open your audio source here

    async def stop(self) -> None:
        self._active = False
        # release resources here

    async def stream_chunks(self) -> AsyncIterator[bytes]:
        while self._active:
            # _read_pcm_chunk is a placeholder for your own read logic
            chunk = await self._read_pcm_chunk()  # raw 16-bit PCM, 24 kHz mono
            yield chunk

    @property
    def is_active(self) -> bool:
        return self._active

agent = RealtimeAgent(
    instructions="...",
    audio_input=CustomMicrophone(),
)
Custom output
from rtvoice import RealtimeAgent
from rtvoice.audio import AudioOutputDevice

class CustomSpeaker(AudioOutputDevice):
    def __init__(self):
        self._playing = False

    async def start(self) -> None:
        self._playing = True

    async def stop(self) -> None:
        self._playing = False

    async def play_chunk(self, chunk: bytes) -> None:
        # write raw 16-bit PCM audio to your sink
        # (_write_to_device is a placeholder for your own write logic)
        await self._write_to_device(chunk)

    async def clear_buffer(self) -> None:
        # discard buffered audio (called on user interruption)
        # (_flush is a placeholder for your own flush logic)
        await self._flush()

    @property
    def is_playing(self) -> bool:
        return self._playing

agent = RealtimeAgent(
    instructions="...",
    audio_output=CustomSpeaker(),
)
Audio format: 16-bit PCM, 24 kHz, mono in both directions.
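When writing or testing a custom device, it helps to know how many bytes a chunk of a given duration occupies in this format and to have a synthetic signal to feed in. The sketch below uses only the standard library; `chunk_size_bytes` and `sine_chunk` are hypothetical helpers, and little-endian byte order is an assumption, since the text above does not state endianness.

```python
import math
import struct

SAMPLE_RATE = 24_000   # Hz, per the format stated above
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes needed to hold duration_ms of audio in this format."""
    return SAMPLE_RATE * duration_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS


def sine_chunk(freq_hz: float, duration_ms: int, amplitude: float = 0.3) -> bytes:
    """Generate a signed 16-bit sine tone (little-endian is an assumption)."""
    n = SAMPLE_RATE * duration_ms // 1000
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


print(chunk_size_bytes(100))        # 4800
print(len(sine_chunk(440.0, 100)))  # 4800
```

A tone generator like this could stand in for `_read_pcm_chunk` while exercising a custom input device without real hardware.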
Turn detection
Control when the model decides the user has finished speaking.
Semantic VAD (default)
Waits for a semantically complete thought. Less likely to cut off mid-sentence.
from rtvoice import RealtimeAgent, SemanticVAD, SemanticEagerness

agent = RealtimeAgent(
    instructions="...",
    turn_detection=SemanticVAD(eagerness=SemanticEagerness.LOW),
)
SemanticEagerness values: LOW, MEDIUM, HIGH, AUTO (default).
Server VAD
Energy-based: triggers on silence duration. More predictable latency.
from rtvoice import RealtimeAgent, ServerVAD

agent = RealtimeAgent(
    instructions="...",
    turn_detection=ServerVAD(
        threshold=0.5,            # energy threshold 0–1
        prefix_padding_ms=300,    # audio kept before speech onset
        silence_duration_ms=500,  # silence needed to commit end-of-turn
    ),
)
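To build intuition for how threshold and silence_duration_ms interact, here is a toy end-of-turn detector over per-frame levels. The real detection runs on the server and is not reproduced here; `rms_level` and `end_of_turn_index` are hypothetical helpers, prefix_padding_ms is not modeled, and little-endian samples are assumed.

```python
import struct
from typing import Iterable, Optional


def rms_level(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit PCM frame, scaled to the range 0-1."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return (sum(s * s for s in samples) / n) ** 0.5 / 32768


def end_of_turn_index(
    levels: Iterable[float],
    frame_ms: int,
    threshold: float = 0.5,
    silence_duration_ms: int = 500,
) -> Optional[int]:
    """Return the frame index at which the turn would be committed, or None."""
    silent_ms = 0
    speaking = False
    for i, level in enumerate(levels):
        if level >= threshold:
            speaking = True   # speech resets the silence counter
            silent_ms = 0
        elif speaking:
            silent_ms += frame_ms
            if silent_ms >= silence_duration_ms:
                return i
    return None


levels = [0.1, 0.8, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print(end_of_turn_index(levels, frame_ms=100))  # 7
```

Lowering silence_duration_ms makes the turn commit sooner at the cost of cutting in during short pauses; with 200 ms the same input commits at index 4.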
Voice and model
from rtvoice import RealtimeAgent, AssistantVoice, RealtimeModel

agent = RealtimeAgent(
    model=RealtimeModel.GPT_REALTIME,  # or GPT_REALTIME_MINI, GPT_REALTIME_1_5
    voice=AssistantVoice.CORAL,
    speech_speed=1.2,                  # 0.25–1.5, default 1.0
    instructions="...",
)
Available voices: ALLOY, ASH, BALLAD, CORAL, ECHO, FABLE, ONYX, NOVA, SAGE, SHIMMER, VERSE, CEDAR, MARIN.
Recording
Save the raw session audio to a file:
agent = RealtimeAgent(
    instructions="...",
    recording_path="session.pcm",
)
result = await agent.run()
print(result.recording_path)  # Path to the saved file
The returned AgentResult also contains result.turns — a list of ConversationTurn objects with role and text for every exchange.
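Most audio players cannot open a headerless PCM file directly, so wrapping the recording in a WAV container is a common next step. The sketch below assumes the recording uses the same 16-bit, 24 kHz, mono format described under "Custom audio devices"; that is an assumption about the file, and `pcm_to_wav` is a hypothetical helper. The demo writes one second of silence to a temporary file instead of reading a real session.

```python
import tempfile
import wave
from pathlib import Path


def pcm_to_wav(pcm_path: Path, wav_path: Path, sample_rate: int = 24_000) -> None:
    """Wrap headerless 16-bit mono PCM in a WAV container."""
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)           # mono (assumed)
        wav.setsampwidth(2)           # 2 bytes = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    pcm = Path(tmp) / "session.pcm"
    pcm.write_bytes(b"\x00\x00" * 24_000)  # one second of silence
    wav_out = Path(tmp) / "session.wav"
    pcm_to_wav(pcm, wav_out)
    with wave.open(str(wav_out), "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()

print(duration_s)  # 1.0
```

For a real session, the call would be `pcm_to_wav(Path("session.pcm"), Path("session.wav"))`.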
Inactivity timeout
Automatically stop the agent after a period of user silence:
agent = RealtimeAgent(
    instructions="...",
    inactivity_timeout_enabled=True,
    inactivity_timeout_seconds=30.0,
    listener=MyListener(),  # on_user_inactivity_countdown fires each second 5→1
)
The countdown fires through AgentListener.on_user_inactivity_countdown(remaining_seconds) — useful for playing a "still there?" prompt before the session closes.
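The timing of such a countdown can be pictured with a small asyncio sketch: wait out most of the timeout, then invoke a callback once per tick for the final five ticks. This is not how rtvoice implements it; `inactivity_watch`, `countdown_from`, and `tick_s` are hypothetical, and resetting the timer when the user speaks is not modeled.

```python
import asyncio
from typing import Awaitable, Callable


async def inactivity_watch(
    timeout_s: float,
    on_countdown: Callable[[int], Awaitable[None]],
    countdown_from: int = 5,
    tick_s: float = 1.0,
) -> None:
    """Wait out the quiet period, then fire on_countdown for the last few ticks."""
    await asyncio.sleep(max(timeout_s - countdown_from * tick_s, 0))
    for remaining in range(countdown_from, 0, -1):
        await on_countdown(remaining)
        await asyncio.sleep(tick_s)


seen: list[int] = []


async def record(remaining: int) -> None:
    seen.append(remaining)


# A short tick keeps the demo fast; a real countdown would tick once per second.
asyncio.run(inactivity_watch(timeout_s=0.05, on_countdown=record, tick_s=0.01))
print(seen)  # [5, 4, 3, 2, 1]
```

A listener callback with the same shape as `record` is where a "still there?" prompt could be triggered, for example when `remaining` equals 5.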
Azure OpenAI
Pass an AzureOpenAIProvider instead of the default OpenAI provider:
from rtvoice import RealtimeAgent, AzureOpenAIProvider

agent = RealtimeAgent(
    instructions="...",
    provider=AzureOpenAIProvider(
        azure_endpoint="https://your-resource.openai.azure.com",
        azure_deployment="gpt-4o-realtime-preview",
        api_version="2024-12-17",
        api_key="...",  # or omit to use AZURE_OPENAI_API_KEY
    ),
)