Real-time voice assistant built on OpenAI's Realtime API

Project description

rtvoice

A Python framework for building voice agents powered by OpenAI's Realtime API. Components communicate over a central event bus, and dedicated watchdogs handle audio I/O, interruptions, tool calls, inactivity timeouts, history, and recording.

Overview

rtvoice provides a robust foundation for creating interactive voice agents with real-time audio streaming, function calling, interruption handling, and conversation management.

Architecture

graph TB
    subgraph "User Interface"
        MIC[Microphone Input]
        SPEAKER[Speaker Output]
    end

    subgraph "rtvoice Agent"
        AGENT[Agent]
        BUS[Event Bus]
        WS[WebSocket Client]

        subgraph "Watchdogs"
            AUDIO_IN[Audio Input Watchdog]
            AUDIO_OUT[Audio Output Watchdog]
            RT[Realtime Watchdog]
            TRUNC[Truncation Watchdog]
            INT[Interruption Watchdog]
            TIMEOUT[Inactivity Watchdog]
            TRANS[Transcription Watchdog]
            TOOLS[Tool Calling Watchdog]
            HIST[History Watchdog]
            REC[Recording Watchdog]
        end

        TOOLS_REG[Tool Registry]
    end

    subgraph "OpenAI Realtime API"
        OAI[OpenAI WebSocket]
    end

    MIC -->|Audio Stream| AUDIO_IN
    AUDIO_IN -->|Base64 Chunks| BUS

    BUS -->|Events| RT
    RT -->|API Messages| WS
    WS <-->|WebSocket| OAI

    WS -->|Server Events| BUS
    BUS -->|Audio Deltas| AUDIO_OUT
    AUDIO_OUT -->|PCM Audio| SPEAKER

    BUS -.->|Subscribe| INT
    BUS -.->|Subscribe| TRUNC
    BUS -.->|Subscribe| TIMEOUT
    BUS -.->|Subscribe| TRANS
    BUS -.->|Subscribe| TOOLS
    BUS -.->|Subscribe| HIST
    BUS -.->|Subscribe| REC

    TOOLS -->|Execute| TOOLS_REG

    INT -->|Cancel Response| WS
    TRUNC -->|Truncate Message| WS
    TIMEOUT -->|Stop Agent| AGENT

    style AGENT fill:#e1f5ff
    style BUS fill:#fff4e1
    style OAI fill:#e8f5e9

Features

Core Capabilities

✅ Real-time Audio Streaming: Bidirectional audio using OpenAI's Realtime API
✅ Event-Driven Architecture: Clean separation of concerns with EventBus pattern
✅ Interruption Handling: Natural conversation flow with mid-response interruption
✅ Tool Calling: Function execution with automatic parameter validation
✅ Transcription: Optional speech-to-text for both user and assistant
✅ Recording: Full conversation recording support
✅ Inactivity Timeout: Automatic session management

Advanced Features

✅ Message Truncation: Precise conversation state management during interruptions
✅ Speech Speed Control: Dynamic adjustment of assistant response speed
✅ Volume Control: Runtime audio output adjustment
✅ Conversation History: Complete turn-by-turn conversation tracking
✅ Custom Audio Devices: Pluggable audio input/output interfaces

Basic Usage

import asyncio
from rtvoice import Agent, Tools
from rtvoice.views import AssistantVoice, RealtimeModel

async def main():
    tools = Tools()

    @tools.action("Get weather information")
    async def get_weather(location: str) -> str:
        return f"Weather in {location}: Sunny, 22°C"

    # Initialize agent
    agent = Agent(
        instructions="You are a helpful voice assistant.",
        model=RealtimeModel.GPT_REALTIME_MINI,
        voice=AssistantVoice.MARIN,
        speech_speed=1.0,
        tools=tools,
        api_key="your-openai-api-key"
    )

    # Start conversation
    async with agent:
        # Agent runs until user says "stop" or timeout occurs
        pass

    # Get conversation history
    history = await agent.stop()
    print(f"Conversation had {len(history.conversation_turns)} turns")

if __name__ == "__main__":
    asyncio.run(main())

Architecture Overview

Event Bus Pattern

All components communicate through a central EventBus, enabling:

Loose coupling: Components don't directly depend on each other
Easy testing: Mock events for unit tests
Clear event flow: All interactions are traceable
Extensibility: Add new watchdogs without modifying existing code
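
As a rough illustration of the pattern described above, here is a minimal sketch of a type-keyed publish/subscribe bus. `MiniEventBus` and `SpeechStarted` are hypothetical names made up for this example; this is not rtvoice's actual `EventBus` implementation.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[None]]


@dataclass
class SpeechStarted:
    """Hypothetical event type used only for this illustration."""
    audio_start_ms: int


class MiniEventBus:
    """Toy type-keyed publish/subscribe bus (not rtvoice's EventBus)."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Handler]] = {}

    def subscribe(self, event_type: type, handler: Handler) -> None:
        # Register an async handler for one concrete event class.
        self._handlers.setdefault(event_type, []).append(handler)

    async def dispatch(self, event: Any) -> None:
        # Deliver the event to every handler registered for its exact type.
        for handler in list(self._handlers.get(type(event), [])):
            await handler(event)


async def demo() -> list[int]:
    seen: list[int] = []

    async def on_speech(event: SpeechStarted) -> None:
        seen.append(event.audio_start_ms)

    bus = MiniEventBus()
    bus.subscribe(SpeechStarted, on_speech)
    await bus.dispatch(SpeechStarted(audio_start_ms=120))
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because publishers only know the bus and the event class, a test can dispatch a hand-built event and assert on what the handler recorded, without any audio hardware or network connection.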

Watchdog Pattern

Specialized watchdogs monitor and react to specific events:

Watchdog	Responsibility
AudioInputWatchdog	Streams microphone audio to the API
AudioOutputWatchdog	Plays assistant audio responses
LifecycleWatchdog	Manages WebSocket communication with OpenAI
InterruptionWatchdog	Handles user interruptions during responses
TruncationWatchdog	Manages conversation state during interruptions
ToolCallingWatchdog	Executes function calls from the assistant
TranscriptionWatchdog	Tracks speech-to-text output
TimeoutWatchdog	Monitors user inactivity and triggers shutdown
HistoryWatchdog	Maintains conversation turn history
RecordingWatchdog	Records full conversation audio
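
To make the idea of a watchdog more concrete, the following sketch shows one way an inactivity timeout could be built on `asyncio`. `InactivityTimer` and its methods are hypothetical and only illustrate the reset-on-activity logic; rtvoice's `TimeoutWatchdog` may work differently.

```python
import asyncio
from typing import Awaitable, Callable


class InactivityTimer:
    """Hypothetical helper: awaits on_timeout after timeout_s without activity."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], Awaitable[None]]) -> None:
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._task = None  # the currently running countdown task, if any

    def touch(self) -> None:
        """Record activity: cancel the pending countdown and start a new one."""
        self.cancel()
        self._task = asyncio.create_task(self._countdown())

    def cancel(self) -> None:
        """Stop the countdown without firing the timeout callback."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _countdown(self) -> None:
        await asyncio.sleep(self._timeout_s)
        await self._on_timeout()


async def demo() -> list[str]:
    log: list[str] = []

    async def stop_agent() -> None:
        log.append("stopped")

    timer = InactivityTimer(0.2, stop_agent)
    timer.touch()
    await asyncio.sleep(0.05)
    timer.touch()               # activity restarts the countdown
    await asyncio.sleep(0.05)
    log.append("still running")
    await asyncio.sleep(0.4)    # no further activity, so the timeout fires
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In an event-driven setup, a watchdog of this kind would call `touch()` from its handlers for speech or audio events and dispatch a stop command from the timeout callback.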

Event Flow Example

sequenceDiagram
    participant User
    participant AudioIn
    participant EventBus
    participant Realtime
    participant OpenAI
    participant AudioOut
    participant Interruption

    User->>AudioIn: Speaks
    AudioIn->>EventBus: InputAudioBufferAppendEvent
    EventBus->>Realtime: Handle audio event
    Realtime->>OpenAI: Send audio (WebSocket)
    OpenAI->>Realtime: ResponseOutputAudioDeltaEvent
    Realtime->>EventBus: Dispatch audio delta
    EventBus->>AudioOut: Play audio chunk
    AudioOut->>User: Hears response

    Note over User,AudioOut: User interrupts
    User->>AudioIn: Speaks again
    AudioIn->>EventBus: InputAudioBufferSpeechStartedEvent
    EventBus->>Interruption: Detect interruption
    Interruption->>OpenAI: ResponseCancelEvent
    Interruption->>OpenAI: OutputAudioBufferClearEvent

Key Events

Server Events (from OpenAI)

ResponseCreatedEvent                    # Assistant starts responding
ResponseOutputAudioDeltaEvent          # Audio chunk received
ResponseDoneEvent                      # Response completed
InputAudioBufferSpeechStartedEvent     # User starts speaking
InputAudioBufferSpeechStoppedEvent     # User stops speaking
FunctionCallItem                       # Tool call requested
InputAudioTranscriptionCompleted       # Transcription ready

Client Events (to OpenAI)

InputAudioBufferAppendEvent            # Send audio to API
ResponseCancelEvent                    # Cancel current response
ResponseCreateEvent                    # Request new response
SessionUpdateEvent                     # Update session config
ConversationItemTruncateEvent         # Truncate conversation item
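
The client event class names above appear to correspond to event types of OpenAI's Realtime API. Assuming that mapping, the sketch below builds the JSON messages for three of them by hand. The helper function names are hypothetical, and the 24 kHz mono PCM16 format and `content_index` of 0 are assumptions about the session configuration, not values taken from rtvoice.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # assumed: 24 kHz mono PCM16 audio
BYTES_PER_SAMPLE = 2     # 16-bit samples


def append_audio_message(pcm_chunk: bytes) -> str:
    """Build an input_audio_buffer.append message carrying base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def cancel_response_message() -> str:
    """Build a response.cancel message to stop the in-progress response."""
    return json.dumps({"type": "response.cancel"})


def truncate_message(item_id: str, played_bytes: int) -> str:
    """Build conversation.item.truncate so the stored item matches what was heard."""
    # Convert the number of bytes already played into milliseconds of audio.
    audio_end_ms = played_bytes * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,  # assumed: the audio is the item's first content part
        "audio_end_ms": audio_end_ms,
    })
```

The truncate message is what keeps the conversation state honest after an interruption: only the audio the user actually heard remains in the assistant's turn.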

Internal Events

AgentStartedEvent                      # Agent initialization complete
AgentStoppedEvent                      # Agent shutdown initiated
StopAgentCommand                       # Trigger agent shutdown
UserInactivityTimeoutEvent            # Timeout occurred

Built-in Tools

The framework includes several built-in tools:

@tools.action("Get the current local time")
def get_current_time() -> str:
    """Returns current time in HH:MM:SS format"""

@tools.action("Adjust volume level")
async def adjust_volume(level: float) -> str:
    """Set audio output volume (0.0-1.0)"""

@tools.action("Change assistant's talking speed")
async def change_assistant_response_speed(instructions: str) -> str:
    """Adjust speech speed with 'faster' or 'slower'"""

@tools.action("Stop the assistant run")
async def stop_assistant_run() -> str:
    """End the conversation"""

Custom Tools

Add your own tools easily:

from typing import Annotated
from rtvoice import Tools

tools = Tools()

@tools.action("Search for information")
async def search(
    query: Annotated[str, "The search query"],
    max_results: Annotated[int, "Maximum number of results"] = 5
) -> str:
    # Your implementation
    results = await search_database(query, limit=max_results)
    return f"Found {len(results)} results"

@tools.action(
    description="Book a restaurant reservation",
    response_instruction="Confirm the booking details to the user"
)
async def book_restaurant(
    restaurant: Annotated[str, "Restaurant name"],
    time: Annotated[str, "Reservation time (HH:MM)"],
    guests: Annotated[int, "Number of guests"]
) -> str:
    # Your booking logic
    return f"Booked table for {guests} at {restaurant} for {time}"
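
For readers curious how `Annotated` hints like the ones above can become a function-calling schema, here is a minimal sketch using only the standard library. `build_parameter_schema` and `JSON_TYPES` are hypothetical names; how rtvoice actually derives and validates tool parameters is not shown in this document.

```python
import inspect
from typing import Annotated, Any, get_args, get_origin, get_type_hints

# Mapping from Python types to JSON Schema type names (illustrative subset).
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_parameter_schema(func) -> dict[str, Any]:
    """Derive a JSON-Schema-style parameter description from a function signature."""
    hints = get_type_hints(func, include_extras=True)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        hint = hints.get(name, str)
        description = None
        if get_origin(hint) is Annotated:
            # Annotated[T, "text"] carries the base type first, then the metadata.
            base, *metadata = get_args(hint)
            description = next((m for m in metadata if isinstance(m, str)), None)
        else:
            base = hint
        prop: dict[str, Any] = {"type": JSON_TYPES.get(base, "string")}
        if description:
            prop["description"] = description
        properties[name] = prop
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


async def search(
    query: Annotated[str, "The search query"],
    max_results: Annotated[int, "Maximum number of results"] = 5,
) -> str:
    return f"{query}:{max_results}"


schema = build_parameter_schema(search)
```

Here `schema["required"]` contains only `query`, because `max_results` has a default value.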

Configuration

Agent Parameters

Agent(
    instructions: str = "",                          # System prompt
    model: RealtimeModel = RealtimeModel.GPT_REALTIME_MINI,
    voice: AssistantVoice = AssistantVoice.MARIN,   # Voice selection
    speech_speed: float = 1.0,                       # 0.5 - 1.5
    transcription_model: TranscriptionModel | None = None,
    tools: Tools | None = None,
    recording_output_path: str | None = None,
    api_key: str | None = None,
    audio_input: AudioInputDevice | None = None,
    audio_output: AudioOutputDevice | None = None,
)
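
The `speech_speed` comment above gives a range of 0.5 to 1.5, and the built-in tools accept "faster"/"slower" instructions and a volume between 0.0 and 1.0. The sketch below shows one way such inputs could be clamped to their ranges. The function names and the 0.1 step size are assumptions for illustration, not rtvoice behavior.

```python
SPEED_MIN, SPEED_MAX = 0.5, 1.5  # range documented for speech_speed
SPEED_STEP = 0.1                 # assumed step size, not taken from rtvoice


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def next_speed(current: float, instruction: str) -> float:
    """Return the new speed for a 'faster' or 'slower' instruction."""
    word = instruction.strip().lower()
    if word == "faster":
        delta = SPEED_STEP
    elif word == "slower":
        delta = -SPEED_STEP
    else:
        raise ValueError(f"expected 'faster' or 'slower', got {instruction!r}")
    # Round to avoid accumulating floating-point noise across repeated steps.
    return round(clamp(current + delta, SPEED_MIN, SPEED_MAX), 2)


def normalize_volume(level: float) -> float:
    """Clamp a requested volume to the documented 0.0-1.0 range."""
    return clamp(level, 0.0, 1.0)
```

Clamping rather than raising on out-of-range values is a design choice that suits voice interaction, where the model may request "faster" several times in a row.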

Available Voices

AssistantVoice.ALLOY
AssistantVoice.ECHO
AssistantVoice.SHIMMER
AssistantVoice.ASH
AssistantVoice.BALLAD
AssistantVoice.CORAL
AssistantVoice.SAGE
AssistantVoice.VERSE
AssistantVoice.MARIN (default)

Available Models

RealtimeModel.GPT_REALTIME - gpt-4o-realtime-preview-2024-12-17
RealtimeModel.GPT_REALTIME_MINI - gpt-4o-mini-realtime-preview-2024-12-17 (default)

Custom Audio Devices

Implement custom audio sources/outputs:

from collections.abc import AsyncIterator

from rtvoice import Agent
from rtvoice.audio.devices import AudioInputDevice, AudioOutputDevice

class CustomMicrophone(AudioInputDevice):
    _active: bool = False  # Backs is_active; toggled by start() and stop()

    async def start(self) -> None:
        # Initialize your audio source, then mark the device as active
        self._active = True

    async def stop(self) -> None:
        # Cleanup; this also ends the stream_chunks() loop
        self._active = False

    async def stream_chunks(self) -> AsyncIterator[bytes]:
        while self.is_active:
            chunk = await self.read_audio()  # Your implementation
            yield chunk

    @property
    def is_active(self) -> bool:
        return self._active

class CustomSpeaker(AudioOutputDevice):
    async def start(self) -> None:
        # Initialize audio output
        pass

    async def stop(self) -> None:
        # Cleanup
        pass

    async def play_chunk(self, chunk: bytes) -> None:
        # Play audio chunk
        pass

    async def clear_buffer(self) -> None:
        # Clear any queued audio
        pass

# Use custom devices
agent = Agent(
    audio_input=CustomMicrophone(),
    audio_output=CustomSpeaker()
)
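
For testing without a microphone, an input source can replay audio from memory. The sketch below is a duck-typed stand-in that uses the same method names as the input device above but deliberately does not import rtvoice, so it runs on its own. `BufferSource` and its `chunk_size` parameter are hypothetical.

```python
import asyncio
from collections.abc import AsyncIterator


class BufferSource:
    """Hypothetical in-memory audio source with the input device's method names."""

    def __init__(self, pcm: bytes, chunk_size: int = 4_800) -> None:
        self._pcm = pcm
        self._chunk_size = chunk_size
        self._active = False

    async def start(self) -> None:
        self._active = True

    async def stop(self) -> None:
        self._active = False

    @property
    def is_active(self) -> bool:
        return self._active

    async def stream_chunks(self) -> AsyncIterator[bytes]:
        # Yield fixed-size slices until the buffer is exhausted or stop() is called.
        for offset in range(0, len(self._pcm), self._chunk_size):
            if not self._active:
                break
            yield self._pcm[offset:offset + self._chunk_size]
            await asyncio.sleep(0)  # let other tasks run between chunks


async def collect() -> list[int]:
    source = BufferSource(b"\x00" * 10_000, chunk_size=4_800)
    await source.start()
    sizes = [len(chunk) async for chunk in source.stream_chunks()]
    await source.stop()
    return sizes


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

To use something like this with the agent, the class would need to subclass `AudioInputDevice` as in the example above.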

Advanced Usage

Access Event Bus

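agent = Agent(...)

# Subscribe to events.
# InputAudioTranscriptionCompleted is one of the server event classes listed
# under Key Events; import it from the package before running this snippet.
async def on_transcription(event):
    print(f"Transcribed: {event.transcript}")

agent.event_bus.subscribe(
    InputAudioTranscriptionCompleted,
    on_transcription
)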

Recording Conversations

agent = Agent(
    recording_output_path="conversations/session_001.wav"
)
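
As background on what writing such a recording involves, here is a minimal sketch that stores raw PCM16 chunks in a WAV container using the standard library's `wave` module. `write_wav` is a hypothetical helper, and the 24 kHz mono format is an assumption; this is not rtvoice's `RecordingWatchdog`.

```python
import io
import wave
from collections.abc import Iterable

SAMPLE_RATE_HZ = 24_000  # assumed sample rate of the audio stream
CHANNELS = 1             # assumed mono
SAMPLE_WIDTH_BYTES = 2   # 16-bit PCM


def write_wav(target, pcm_chunks: Iterable[bytes]) -> None:
    """Write raw PCM16 chunks to target (a path or a writable binary file object)."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)


if __name__ == "__main__":
    buffer = io.BytesIO()
    silence = b"\x00\x00" * 240  # 10 ms of silence at 24 kHz
    write_wav(buffer, [silence, silence])
    print(len(buffer.getvalue()), "bytes written")
```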

Conversation History

history = await agent.stop()

for turn in history.conversation_turns:
    print(f"{turn.role}: {turn.content}")

Requirements

Python 3.13+
OpenAI API key with Realtime API access
PyAudio for audio I/O
WebSockets for API communication
Pydantic for data validation

Installation from Source

git clone https://github.com/yourusername/rtvoice.git
cd rtvoice
uv pip install -e .

Environment Variables

# Required
OPENAI_API_KEY=your-api-key-here

# Optional
RTVOICE_LOG_LEVEL=DEBUG
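
The sketch below shows one way an application could read these two variables at startup. `load_settings` is a hypothetical helper, and the `INFO` fallback is an assumption; how rtvoice itself interprets `RTVOICE_LOG_LEVEL` is not documented here.

```python
import logging
import os
from collections.abc import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the API key and log level from environment-style variables."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")

    # Fall back to INFO when the optional variable is missing (assumed default).
    level_name = env.get("RTVOICE_LOG_LEVEL", "INFO").upper()
    level = logging.getLevelName(level_name)  # returns an int for known level names
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {level_name!r}")

    return {"api_key": api_key, "log_level": level}


if __name__ == "__main__":
    settings = load_settings()
    logging.basicConfig(level=settings["log_level"])
```

Reading the key from the environment keeps it out of source code; the resulting value can then be passed to `Agent(api_key=...)`.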

Release history

Version	Released
0.5.0	Apr 10, 2026
0.4.0	Mar 12, 2026
0.3.0	Mar 1, 2026
0.2.0	Mar 1, 2026
0.1.8	Mar 1, 2026
0.1.7	Mar 1, 2026
0.1.6	Mar 1, 2026
0.1.5	Feb 28, 2026
0.1.4	Feb 28, 2026
0.1.3	Feb 27, 2026
0.1.2	Feb 26, 2026 (this version)
0.1.0	Feb 22, 2026

