Real-time voice assistant built on OpenAI's Realtime API

rtvoice

A Python library for building real-time voice agents powered by the OpenAI Realtime API. It handles the full session lifecycle — microphone input, WebSocket streaming, turn detection, tool calling, and audio playback — so you can focus on what your agent does, not how it talks.


Installation

pip install "rtvoice[audio]"

Requires Python 3.13+ and an OPENAI_API_KEY environment variable (or pass api_key= directly).


Quickstart

import asyncio
from rtvoice import RealtimeAgent

async def main():
    agent = RealtimeAgent(
        instructions="You are Jarvis, a concise and helpful voice assistant.",
    )
    await agent.run()

asyncio.run(main())

Run it, speak into your microphone, and the agent responds through your speakers. Press Ctrl+C to end the session.


Tool calling

Basic tools

Create a Tools instance, decorate functions with @tools.action(description), then pass the instance to RealtimeAgent. Both async def and plain def functions are supported.

import asyncio
from rtvoice import RealtimeAgent, Tools

tools = Tools()

@tools.action("Get the current weather for a given city")
async def get_weather(city: str) -> str:
    return f"It's 18°C and partly cloudy in {city}."

async def main():
    agent = RealtimeAgent(
        instructions="Answer weather questions using get_weather.",
        tools=tools,
    )
    await agent.run()

asyncio.run(main())

Parameter types are inferred from the function signature and included in the schema sent to the model. All parameters without a default value are marked required.
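
To make the inference rule concrete, here is a minimal sketch of how a parameter schema could be derived from a function signature with the standard library. It is not rtvoice's implementation: build_parameter_schema, _JSON_TYPES, and get_forecast are hypothetical names used only for illustration, and the type mapping covers just a few primitives.

```python
import inspect
from typing import Any, Callable

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_parameter_schema(func: Callable[..., Any]) -> dict[str, Any]:
    """Derive a JSON-Schema-like parameter description from a signature."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown annotations fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        # Parameters without a default value are marked required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


async def get_forecast(city: str, days: int = 3) -> str:
    return f"Forecast for {city} over {days} days."


schema = build_parameter_schema(get_forecast)
print(schema)
```

Here city ends up in the required list because it has no default, while days is optional.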

Pydantic model tools

For richer schemas, register a Pydantic model with param_model=. The model fields become the tool parameters, and the function receives a validated model instance.

from typing import Literal

from pydantic import BaseModel, Field
from rtvoice import Tools

tools = Tools()

class CalendarSearchParams(BaseModel):
    query: str = Field(description="What to search for")
    date: str | None = Field(default=None, description="Optional ISO date filter")
    limit: int = Field(default=5, description="Maximum number of matches")
    source: Literal["work", "personal"] = "work"

@tools.action(
    "Search calendar events",
    param_model=CalendarSearchParams,
)
async def search_calendar(params: CalendarSearchParams) -> str:
    # `calendar` stands for your own calendar client; it is not imported from rtvoice.
    return await calendar.search(
        query=params.query,
        date=params.date,
        limit=params.limit,
        source=params.source,
    )

Nested Pydantic models, typed lists, enums, literals, defaults, and Field(description=...) are included in the generated tool schema.

Long-running tools

Set holding_instruction to have the assistant speak a phrase while the tool runs. The agent will say it immediately after calling the tool, before the result arrives.

@tools.action(
    "Search the web for a query",
    holding_instruction="Let me search that for you, give me a moment.",
)
async def search_web(query: str) -> str:
    # `do_search` stands for your own (possibly slow) search coroutine.
    return await do_search(query)

Optionally add result_instruction to tell the model how to present the result once the tool returns:

@tools.action(
    "Fetch the latest headlines",
    holding_instruction="Fetching the news...",
    result_instruction="Summarise the headlines in two sentences.",
)
async def get_headlines() -> str: ...

Status templates

status defines a spoken update for tools registered with param_model=. Use {field_name} placeholders that match fields of the Pydantic model; rtvoice validates them at registration time.

class PlaySongParams(BaseModel):
    song: str = Field(description="Song title")

@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status="Playing {song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"
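
Registration-time validation of placeholders can be pictured roughly as follows. This is a sketch under assumptions, not the library's code: validate_status_template is a hypothetical helper, and the set of field names is passed in by hand instead of being read from a Pydantic model.

```python
from string import Formatter


def validate_status_template(template: str, field_names: set[str]) -> None:
    """Raise ValueError if the template uses a placeholder that is not a known field."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    used = {name for _, name, _, _ in Formatter().parse(template) if name}
    unknown = used - field_names
    if unknown:
        raise ValueError(f"Unknown placeholders: {sorted(unknown)}")


# A template that only uses known fields passes silently.
validate_status_template("Playing {song} now.", {"song"})

# A template with a misspelled or missing field fails early.
try:
    validate_status_template("Playing {artist} now.", {"song"})
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Failing at registration time means a typo in a template surfaces when the module is imported, not in the middle of a live voice session.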

status can also be a callable that receives the validated Pydantic model and returns a string dynamically.

@tools.action(
    "Play a song by name",
    param_model=PlaySongParams,
    status=lambda params: f"Playing {params.song} now.",
)
async def play_song(params: PlaySongParams) -> str:
    await music_player.play(params.song)
    return f"Now playing: {params.song}"

Context injection

Any tool parameter typed as Inject[T] is filled automatically by the framework — the model never sees it and does not need to supply a value. Three types are injectable:

| Type | What it provides |
| --- | --- |
| Inject[EventBus] | Internal event bus |
| Inject[ConversationHistory] | Full conversation so far |
| Inject[YourContextType] | Your custom context= object |

from rtvoice import Tools, Inject
from rtvoice.tools import ToolContext
from rtvoice.conversation import ConversationHistory

tools = Tools()

@tools.action("Summarise the conversation so far")
async def summarise(
    history: Inject[ConversationHistory],
) -> str:
    text = history.format()
    # `llm` stands for your own summarisation client; it is not imported from rtvoice.
    return await llm.summarise(text)
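
One way to picture how injected parameters can be filled without the model seeing them is the sketch below, built on typing.Annotated. It is an illustration only: this Inject class, _InjectMarker, call_with_injection, and History are all hypothetical stand-ins defined here, and rtvoice's own mechanism may work differently.

```python
import inspect
from typing import Annotated, Any, Callable, get_args, get_origin, get_type_hints


class _InjectMarker:
    """Sentinel stored in Annotated metadata to flag injected parameters."""


class Inject:
    """In this sketch, Inject[T] expands to Annotated[T, _InjectMarker]."""

    def __class_getitem__(cls, item: type) -> Any:
        return Annotated[item, _InjectMarker]


def call_with_injection(
    func: Callable[..., Any],
    model_args: dict[str, Any],
    providers: dict[type, Any],
) -> Any:
    """Call func with model-supplied arguments plus framework-supplied ones."""
    hints = get_type_hints(func, include_extras=True)
    kwargs = dict(model_args)
    for name in inspect.signature(func).parameters:
        hint = hints.get(name)
        if get_origin(hint) is Annotated and _InjectMarker in get_args(hint)[1:]:
            # The first Annotated argument is the requested type T.
            kwargs[name] = providers[get_args(hint)[0]]
    return func(**kwargs)


class History:
    def __init__(self, lines: list[str]) -> None:
        self.lines = lines


def count_turns(prefix: str, history: Inject[History]) -> str:
    return f"{prefix}: {len(history.lines)}"


result = call_with_injection(
    count_turns,
    model_args={"prefix": "turns"},
    providers={History: History(["hi", "hello"])},
)
print(result)
```

Only prefix would appear in a schema shown to the model; history is resolved by type from the providers mapping.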

Custom application context

Pass any object as context= on RealtimeAgent. It is then injectable in every tool via Inject[YourType].

from dataclasses import dataclass
from rtvoice import RealtimeAgent, Tools, Inject

@dataclass
class AppState:
    user_name: str
    premium: bool

tools = Tools()

@tools.action("Greet the user by name")
async def greet(state: Inject[AppState]) -> str:
    tier = "premium" if state.premium else "free"
    return f"Hello {state.user_name}, you are on the {tier} plan."

agent = RealtimeAgent(
    instructions="Greet the user when asked.",
    tools=tools,
    context=AppState(user_name="Alice", premium=True),
)

Subagents

Delegate complex, multi-step tasks to a dedicated LLM-driven subagent. The voice agent hands off, speaks a holding phrase, and presents the result when done.

from rtvoice import RealtimeAgent, SubAgent, Tools
from rtvoice.llm import ChatOpenAI

tools = Tools()

@tools.action("Book a restaurant table")
async def book_table(
    restaurant: str,
    date: str,
    time: str,
    party_size: int,
) -> str:
    return f"Booked for {party_size} at {restaurant} on {date} at {time}."

booking_agent = SubAgent(
    name="Booking Assistant",
    description="Books restaurant tables on behalf of the user.",
    holding_instruction="I'm checking availability, just a moment.",
    instructions="Use book_table to complete booking requests. Call done() when finished.",
    tools=tools,
    llm=ChatOpenAI(model="gpt-4o-mini"),
)

agent = RealtimeAgent(
    instructions="Delegate restaurant bookings to the Booking Assistant.",
    subagents=[booking_agent],
)

How it works: the realtime agent registers each SubAgent as a callable tool. When invoked, the subagent runs its own agentic loop (tool calls → LLM → tool calls …) until it either calls done() or needs a clarification from the user via clarify(). Clarifications are automatically routed back through the voice agent and the loop resumes.
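
The loop described above can be sketched with a scripted stand-in for the LLM. Everything here is hypothetical (Step, run_subagent_loop, scripted_decide, and the return strings); it only illustrates the control flow of tool calls, clarify(), done(), and an iteration cap.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One decision of the stand-in 'LLM': a tool name, "clarify", or "done"."""
    action: str
    args: dict = field(default_factory=dict)


def run_subagent_loop(
    decide: Callable[[list[str]], Step],
    tools: dict[str, Callable[..., str]],
    ask_user: Callable[[str], str],
    max_iterations: int = 10,
) -> str:
    observations: list[str] = []
    for _ in range(max_iterations):
        step = decide(observations)
        if step.action == "done":
            return step.args["result"]
        if step.action == "clarify":
            # The question is routed to the user and the answer fed back in.
            observations.append(ask_user(step.args["question"]))
        else:
            observations.append(tools[step.action](**step.args))
    return "Stopped: iteration limit reached."


def scripted_decide(observations: list[str]) -> Step:
    if not observations:
        return Step("clarify", {"question": "For how many people?"})
    if len(observations) == 1:
        return Step("book_table", {"party_size": int(observations[0])})
    return Step("done", {"result": observations[-1]})


outcome = run_subagent_loop(
    decide=scripted_decide,
    tools={"book_table": lambda party_size: f"Booked for {party_size}."},
    ask_user=lambda question: "4",
)
print(outcome)
```

The iteration cap guards against a model that never calls done(), which is the role max_iterations plays in the parameter list.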

SubAgent parameters:

| Parameter | Description |
| --- | --- |
| name | Unique name; becomes the tool name the realtime model calls |
| description | Shown to the realtime model to decide when to delegate |
| instructions | System prompt for the subagent's own LLM loop |
| llm | ChatOpenAI(model=...) or any ChatModel implementation |
| tools | Tools instance with the actions the subagent may call |
| mcp_servers | MCP servers to connect to during prewarm |
| holding_instruction | Spoken while the subagent works |
| result_instructions | Tells the realtime model how to present the result |
| handoff_instructions | Extra guidance appended to the tool description |
| max_iterations | Loop iteration cap (default: 10) |
| context | Arbitrary object injectable inside subagent tools |

MCP servers

Connect any MCP-compatible tool server via MCPServerStdio. Tools are discovered automatically during startup.

from rtvoice import MCPServerStdio, RealtimeAgent

agent = RealtimeAgent(
    instructions="You can read and write files in /tmp.",
    mcp_servers=[
        MCPServerStdio(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
        )
    ],
)

For heavy tool sets, attach the MCP server to a SubAgent instead. This keeps the realtime model's tool list short and avoids latency on every turn:

research_agent = SubAgent(
    name="Researcher",
    description="Searches the web and reads URLs.",
    instructions="Use the available tools to answer research questions.",
    llm=ChatOpenAI(model="gpt-4o"),
    mcp_servers=[
        MCPServerStdio(command="uvx", args=["mcp-server-fetch"]),
    ],
)

agent = RealtimeAgent(
    instructions="Delegate research tasks to the Researcher.",
    subagents=[research_agent],
)

Conversation seeds

Pre-fill the session with synthetic conversation history before the microphone opens. The model will behave as if those exchanges already happened.

from rtvoice import RealtimeAgent, ConversationSeed, SeedMessage

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    conversation_seed=ConversationSeed(
        messages=[
            SeedMessage.user("My name is Alice and I prefer short answers."),
            SeedMessage.assistant("Got it, Alice. I'll keep things brief."),
        ]
    ),
)

Use ConversationSeed.from_pairs() for a more concise form when you have multiple user/assistant exchanges:

seed = ConversationSeed.from_pairs(
    ("My name is Alice.", "Nice to meet you, Alice."),
    ("I prefer short answers.", "Understood, I'll be brief."),
)

Lifecycle listener

Subclass AgentListener and pass it to RealtimeAgent to hook into session events. Override only the methods you care about — all are async no-ops by default.

from rtvoice import RealtimeAgent, AgentListener

class MyListener(AgentListener):
    async def on_agent_starting(self) -> None:
        print("Agent is starting up...")

    async def on_agent_session_connected(self) -> None:
        print("WebSocket connected, ready to talk.")

    async def on_user_transcript(self, transcript: str) -> None:
        print(f"User said: {transcript}")

    async def on_assistant_transcript(self, transcript: str) -> None:
        print(f"Assistant replied: {transcript}")

    async def on_agent_stopped(self) -> None:
        print("Session ended.")

agent = RealtimeAgent(
    instructions="You are a helpful assistant.",
    listener=MyListener(),
)

All available callbacks:

| Method | When it fires |
| --- | --- |
| on_agent_starting() | Before any I/O or WebSocket setup |
| on_agent_session_connected() | WebSocket session established |
| on_agent_stopped() | Agent fully shut down |
| on_user_started_speaking() | VAD detected speech start |
| on_user_stopped_speaking() | VAD detected speech end |
| on_user_transcript(transcript) | Finalised user transcript (requires transcription_model) |
| on_assistant_started_responding() | Assistant began streaming audio |
| on_assistant_stopped_responding() | Assistant finished streaming audio |
| on_assistant_transcript(transcript) | Full assistant response text |
| on_assistant_transcript_delta(delta) | Incremental assistant text chunk (requires "text" in output_modalities) |
| on_agent_interrupted() | User interrupted the assistant mid-response |
| on_agent_error(error) | Session or API error |
| on_subagent_started(agent_name) | A subagent began running |
| on_subagent_finished(agent_name) | A subagent finished |
| on_user_inactivity_countdown(remaining_seconds) | Fires each second before inactivity timeout |

Custom audio devices

Implement AudioInputDevice or AudioOutputDevice from rtvoice.audio to replace the default microphone or speaker — useful for telephony, file playback, testing, or embedded hardware.

Custom input

from collections.abc import AsyncIterator
from rtvoice import RealtimeAgent
from rtvoice.audio import AudioInputDevice

class CustomMicrophone(AudioInputDevice):
    def __init__(self):
        self._active = False

    async def start(self) -> None:
        self._active = True
        # open your audio source here

    async def stop(self) -> None:
        self._active = False
        # release resources here

    async def stream_chunks(self) -> AsyncIterator[bytes]:
        while self._active:
            chunk = await self._read_pcm_chunk()  # raw 16-bit PCM, 24 kHz mono
            yield chunk

    @property
    def is_active(self) -> bool:
        return self._active

agent = RealtimeAgent(
    instructions="...",
    audio_input=CustomMicrophone(),
)

Custom output

from rtvoice import RealtimeAgent
from rtvoice.audio import AudioOutputDevice

class CustomSpeaker(AudioOutputDevice):
    def __init__(self):
        self._playing = False

    async def start(self) -> None:
        self._playing = True

    async def stop(self) -> None:
        self._playing = False

    async def play_chunk(self, chunk: bytes) -> None:
        # write raw 16-bit PCM audio to your sink
        await self._write_to_device(chunk)

    async def clear_buffer(self) -> None:
        # discard buffered audio (called on user interruption)
        await self._flush()

    @property
    def is_playing(self) -> bool:
        return self._playing

agent = RealtimeAgent(
    instructions="...",
    audio_output=CustomSpeaker(),
)

Audio format: 16-bit PCM, 24 kHz, mono in both directions.
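
When writing a custom device it helps to know how many bytes a chunk of a given duration occupies in this format. The sketch below does that arithmetic and generates a test tone; chunk_size_bytes and sine_chunk are hypothetical helpers, and little-endian sample order is an assumption not stated above.

```python
import math
import struct

SAMPLE_RATE = 24_000   # Hz, from the format stated above
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono


def chunk_size_bytes(duration_ms: int) -> int:
    """Number of bytes needed for duration_ms of audio in this format."""
    return SAMPLE_RATE * duration_ms // 1000 * BYTES_PER_SAMPLE * CHANNELS


def sine_chunk(frequency_hz: float, duration_ms: int, amplitude: float = 0.3) -> bytes:
    """Generate a signed 16-bit sine tone (little-endian assumed) as test input."""
    n_samples = SAMPLE_RATE * duration_ms // 1000
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return struct.pack(f"<{n_samples}h", *samples)


print(chunk_size_bytes(20))         # bytes in a 20 ms chunk
print(len(sine_chunk(440.0, 20)))   # same size, filled with a 440 Hz tone
```

One second of audio is 24,000 samples at 2 bytes each, so 48,000 bytes; a 20 ms chunk is 960 bytes.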


Turn detection

Control when the model decides the user has finished speaking.

Semantic VAD (default)

Waits for a semantically complete thought. Less likely to cut off mid-sentence.

from rtvoice import RealtimeAgent, SemanticVAD, SemanticEagerness

agent = RealtimeAgent(
    instructions="...",
    turn_detection=SemanticVAD(eagerness=SemanticEagerness.LOW),
)

SemanticEagerness values: LOW, MEDIUM, HIGH, AUTO (default).

Server VAD

Energy-based: triggers on silence duration. More predictable latency.

from rtvoice import RealtimeAgent, ServerVAD

agent = RealtimeAgent(
    instructions="...",
    turn_detection=ServerVAD(
        threshold=0.5,           # energy threshold 0–1
        prefix_padding_ms=300,   # audio kept before speech onset
        silence_duration_ms=500, # silence needed to commit end-of-turn
    ),
)

Voice and model

from rtvoice import RealtimeAgent, AssistantVoice, RealtimeModel

agent = RealtimeAgent(
    model=RealtimeModel.GPT_REALTIME,       # or GPT_REALTIME_MINI, GPT_REALTIME_1_5
    voice=AssistantVoice.CORAL,
    speech_speed=1.2,                       # 0.25–1.5, default 1.0
    instructions="...",
)

Available voices: ALLOY, ASH, BALLAD, CORAL, ECHO, FABLE, ONYX, NOVA, SAGE, SHIMMER, VERSE, CEDAR, MARIN.


Recording

Save the raw session audio to a file:

agent = RealtimeAgent(
    instructions="...",
    recording_path="session.pcm",
)

result = await agent.run()
print(result.recording_path)   # Path to the saved file

The returned AgentResult also contains result.turns — a list of ConversationTurn objects with role and text for every exchange.
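
Most players cannot open headerless PCM directly. Assuming the recording uses the same 16-bit, 24 kHz, mono format described under Custom audio devices (an assumption, since the recording layout is not specified here), it can be wrapped in a WAV container with the standard library. pcm_to_wav is a hypothetical helper.

```python
import wave
from pathlib import Path


def pcm_to_wav(pcm_path: Path, wav_path: Path, sample_rate: int = 24_000) -> None:
    """Wrap headerless 16-bit mono PCM in a WAV container."""
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_path.read_bytes())
```

Usage would look like pcm_to_wav(Path("session.pcm"), Path("session.wav")) after the session has ended.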


Token tracking

AgentResult.token_usage is populated automatically after every agent.run() call. It covers all LLM activity in the session: realtime turns, transcription, and any subagent LLM calls.

result = await agent.run()
usage = result.token_usage

print(f"Total cost: ${usage.cost.total_usd:.4f}")
print(f"Input tokens: {usage.usage.input_tokens}")
print(f"Output tokens: {usage.usage.output_tokens}")
print(f"Cached input tokens: {usage.usage.cached_input_tokens}")

Per-model breakdown

for model_summary in result.token_usage.by_model:
    print(f"{model_summary.model}: ${model_summary.cost.total_usd:.4f}")
    print(f"  audio in/out: {model_summary.usage.input_audio_tokens} / {model_summary.usage.output_audio_tokens}")
    print(f"  text  in/out: {model_summary.usage.input_text_tokens} / {model_summary.usage.output_text_tokens}")

TokenUsageSummary fields

| Field | Type | Description |
| --- | --- | --- |
| usage | TokenUsageBreakdown | Aggregated token counts across all calls |
| cost | TokenUsageCost | Aggregated cost in USD |
| by_model | list[TokenUsageModelSummary] | Per-model breakdown with the same usage and cost |
| records | list[TokenUsageRecord] | Raw per-call records (source, model, usage, cost) |
| has_unpriced_usage | bool | True if any call used a model not in the pricing catalog |

TokenUsageBreakdown fields: input_tokens, cached_input_tokens, output_tokens, total_tokens, input_text_tokens, cached_input_text_tokens, output_text_tokens, input_audio_tokens, cached_input_audio_tokens, output_audio_tokens, input_image_tokens, cached_input_image_tokens, duration_seconds.

TokenUsageCost fields: input_usd, cached_input_usd, output_usd, duration_usd, total_usd.

Pricing catalog

Built-in prices are included for: gpt-realtime, gpt-realtime-mini, gpt-realtime-1.5, gpt-4o, gpt-4o-mini, gpt-5.4, gpt-5.4-mini, gpt-5.5, gpt-4o-transcribe, gpt-4o-mini-transcribe, whisper-1. If a model is not in the catalog, its tokens are still counted but costs show as 0.0 and has_unpriced_usage is set to True.
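
The fallback behaviour for unknown models can be sketched as below. The prices, the PRICES table, CostResult, and price_calls are all made up for illustration and do not reflect rtvoice's catalog or real OpenAI pricing.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; not a real catalog.
PRICES = {"example-model": {"input": 2.0, "output": 8.0}}


@dataclass
class CostResult:
    total_usd: float
    has_unpriced_usage: bool


def price_calls(calls: list[tuple[str, int, int]]) -> CostResult:
    """Sum costs over (model, input_tokens, output_tokens) records."""
    total = 0.0
    unpriced = False
    for model, input_tokens, output_tokens in calls:
        price = PRICES.get(model)
        if price is None:
            # Unknown model: contributes 0.0 to the cost but is flagged.
            unpriced = True
            continue
        total += input_tokens / 1_000_000 * price["input"]
        total += output_tokens / 1_000_000 * price["output"]
    return CostResult(total_usd=total, has_unpriced_usage=unpriced)


summary = price_calls([
    ("example-model", 500_000, 250_000),
    ("unknown-model", 1_000, 1_000),
])
print(summary)
```

Checking has_unpriced_usage before trusting total_usd avoids reporting a cost that silently omits part of the session.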


Inactivity timeout

Automatically stop the agent after a period of user silence:

agent = RealtimeAgent(
    instructions="...",
    inactivity_timeout_enabled=True,
    inactivity_timeout_seconds=30.0,
    listener=MyListener(),   # on_user_inactivity_countdown fires each second 5→1
)

The countdown fires through AgentListener.on_user_inactivity_countdown(remaining_seconds) — useful for playing a "still there?" prompt before the session closes.
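
The timing of such a countdown can be sketched with asyncio. This is not how rtvoice schedules it internally; inactivity_countdown, countdown_from, and tick_seconds are hypothetical, and a real implementation would cancel the task whenever the user speaks.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def inactivity_countdown(
    timeout_seconds: float,
    countdown_from: int,
    on_countdown: Callable[[int], Awaitable[None]],
    tick_seconds: float = 1.0,
) -> None:
    # Wait silently until only the countdown window remains.
    await asyncio.sleep(max(timeout_seconds - countdown_from * tick_seconds, 0))
    for remaining in range(countdown_from, 0, -1):
        await on_countdown(remaining)
        await asyncio.sleep(tick_seconds)


def demo() -> list[int]:
    """Run a sped-up countdown and return the values passed to the callback."""
    seen: list[int] = []

    async def record(remaining: int) -> None:
        seen.append(remaining)

    # Ticks are shortened to 10 ms so the demo finishes quickly.
    asyncio.run(inactivity_countdown(0.06, 5, record, tick_seconds=0.01))
    return seen


collected = demo()
print(collected)
```

With a 30-second timeout and a five-step countdown at one-second ticks, the callback would first fire after 25 seconds of silence.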


Azure OpenAI

Pass an AzureOpenAIProvider instead of the default OpenAI provider:

from rtvoice import AzureOpenAIProvider, RealtimeAgent

agent = RealtimeAgent(
    instructions="...",
    provider=AzureOpenAIProvider(
        azure_endpoint="https://your-resource.openai.azure.com",
        azure_deployment="gpt-4o-realtime-preview",
        api_version="2024-12-17",
        api_key="...",          # or omit to use AZURE_OPENAI_API_KEY
    ),
)
