Provider-agnostic speech stack for speech-to-speech applications

Converse Framework

Provider-agnostic speech stack for speech-to-speech applications.

Install
- Missing dependency behavior
- Python version compatibility
Quick Start
- Provider status semantics
Turn Latency and Metrics
Recipes
Runtime Provider Updates
WebSocket Session Helper
Examples
- Text chat
- Voice chat
Framework / App Boundary
- Transport boundary
Status

Install

pip install converse-framework

The base install pulls in only numpy. Real VAD / ASR / LLM / TTS providers live behind optional extras:

pip install converse-framework[silero]          # Silero VAD
pip install converse-framework[faster-whisper]  # faster-whisper ASR
pip install converse-framework[whisper-cpp]     # whisper.cpp HTTP ASR
pip install converse-framework[audio-cpp]       # audio.cpp HTTP ASR + TTS
pip install converse-framework[llamacpp]        # llama.cpp HTTP LLM
pip install converse-framework[openai-compat]   # OpenAI-compatible LLM + ASR + TTS (Ollama, Groq, Kokoro-FastAPI, ...)
pip install converse-framework[kokoro]          # Kokoro ONNX TTS
pip install converse-framework[pocket-tts]      # Pocket TTS
pip install converse-framework[all]             # everything

Missing dependency behavior

If a config requests a provider whose heavy backend is not installed, build_provider (and therefore build_provider_bundle) returns an UnavailableProvider sentinel for that slot instead of raising a bare ImportError. The sentinel's status.message always names the provider that was missing and includes the pip install extra to fix it. The mapping is owned by converse_framework.providers.unavailable.EXTRA_HINTS and exposed as extra_hint_for(kind, name), which returns the extra name (e.g. "converse-framework[silero]") when one is known and None otherwise.

from converse_framework import extra_hint_for
from converse_framework.providers.unavailable import UnavailableProvider

print(extra_hint_for("vad", "silero"))          # converse-framework[silero]
print(extra_hint_for("asr", "faster-whisper"))  # converse-framework[faster-whisper]
print(extra_hint_for("vad", "made-up"))         # None

p = UnavailableProvider("vad", "silero")
print(p.status.message)
# Provider 'silero' (vad) is not available. Install the required extra
# with `pip install converse-framework[silero]`.

is_provider_available(kind, name) is the companion check: it returns True only when the provider's heavy dependency is importable, so you can fail fast before handing the config to a pipeline. UnavailableProvider is a real implementation of all four provider protocols, so the rest of the pipeline keeps running (turns fail with a clear RuntimeError when the broken provider is actually invoked) and the consumer can decide whether to prompt for the install or fall back to a different provider.
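
The sentinel pattern described above can be illustrated without the framework installed. The sketch below is a stand-in, not the framework's UnavailableProvider: the names MissingProviderSentinel, SentinelStatus, HINTS, and invoke are assumptions made for illustration, and only the message wording follows the example output shown earlier.

```python
# Illustrative stand-in only; the real class lives at
# converse_framework.providers.unavailable.UnavailableProvider.
from dataclasses import dataclass

# Hypothetical hint table that mirrors the idea behind EXTRA_HINTS.
HINTS = {("vad", "silero"): "converse-framework[silero]"}


@dataclass(frozen=True)
class SentinelStatus:
    ready: bool
    message: str


class MissingProviderSentinel:
    """Occupies a provider slot and fails only when it is actually used."""

    def __init__(self, kind: str, name: str) -> None:
        self.kind = kind
        self.name = name

    @property
    def status(self) -> SentinelStatus:
        message = f"Provider {self.name!r} ({self.kind}) is not available."
        extra = HINTS.get((self.kind, self.name))
        if extra is not None:
            message += f" Install the required extra with `pip install {extra}`."
        return SentinelStatus(ready=False, message=message)

    def invoke(self, *args, **kwargs):
        # Real work raises a clear error instead of a bare ImportError.
        raise RuntimeError(self.status.message)


sentinel = MissingProviderSentinel("vad", "silero")
print(sentinel.status.message)
```

Because the sentinel only raises when invoked, a consumer can build the whole bundle, inspect the status messages, and decide whether to prompt for the install or fall back before any turn runs.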

Python version compatibility

The base package supports Python 3.11 and newer. Each extra has its own constraints (the table below mirrors the markers in pyproject.toml):

Extra	Python	Notes
base	3.11+	`numpy>=2.0` is the only required runtime dependency.
`silero`	3.11+	`silero-vad` + `onnxruntime`. No known upper bound.
`faster-whisper`	3.11+	The `nvidia-cublas-cu12` wheel dependency is pinned to Windows.
`llamacpp`	3.11+	`httpx` itself supports 3.9+, so 3.11+ is the only constraint.
`openai-compat`	3.11+	Only needs `httpx`. Talks to any OpenAI-compatible server.
`whisper-cpp`	3.11+	Only needs `httpx`, which supports 3.9+.
`audio-cpp`	3.11+	Only needs `httpx`. Talks to a user-managed `audiocpp_server`.
`kokoro`	3.11 to <3.14	`kokoro-onnx` 0.5.0 requires Python <3.14. The wheel build fails fast on 3.14+.
`pocket-tts`	3.11+	No known upper bound.

The kokoro extra is the only one with an upper-bound marker today. If you are on Python 3.14+ and need a TTS provider, use pocket-tts, audio-cpp, or a mock provider. New providers should add their own python_version markers in pyproject.toml when their backend has a known limit.

Quick Start

from converse_framework import build_provider_bundle

config = {
    "vad": {"provider": "mock"},
    "asr": {"provider": "mock"},
    "llm": {"provider": "mock"},
    "tts": {"provider": "mock"},
}

bundle = build_provider_bundle(config)
print(bundle.statuses())

import converse_framework only needs numpy to be installed — heavy provider backends are loaded lazily through the registry.

Provider status semantics

Every provider exposes a status property (cached state, no I/O), a lightweight probe_status() method (import checks, HTTP reachability — does not load models), and a load_status() method (may load or initialise heavy resources before returning).

Call probe_status() to check readiness without side effects — it is safe for status screens and health checks:

import asyncio

# Probe without loading models
results = asyncio.run(bundle.probe_statuses())
for kind, status in results.items():
    print(f"{kind}: ready={status.ready} level={status.status_level}")
    if status.voices:
        print(f"  voices={[v.id for v in status.voices]}")

Call load_status() when you need the definitive picture — it may trigger model downloads or initialise GPU resources:

results = asyncio.run(bundle.load_statuses())

The status_level field distinguishes "ready", "configured", "loading", "error", and "unavailable". The old check_status() is kept for backward compatibility and behaves the same as probe_status() for providers that implement it.

Turn Latency and Metrics

PipelineConfig.first_chunk_chars controls the eager first TTS flush. It defaults to 40, so the first audio can start at the first comma or short opening clause; later chunks return to the normal tts_chunk_chars threshold. Set it to 0 to restore the previous single-threshold behavior:

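from converse_framework import PipelineConfig, SpeechPipeline

# bundle and sink are built as in the Quick Start and Recipes sections.
config = PipelineConfig(
    first_chunk_chars=40,
    tts_chunk_chars=120,
)
pipeline = SpeechPipeline(providers=bundle, sink=sink, config=config)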

Every turn emits a turn.metrics event immediately before turn.finished:

{
  "type": "turn.metrics",
  "payload": {
    "mode": "chat",
    "turn_id": 12,
    "asr_ms": 184,
    "llm_first_token_ms": 327,
    "tts_first_chunk_ms": 511,
    "total_ms": 842
  }
}

The stage fields are turn-relative latency checkpoints. A field is null when that stage was not reached, such as ASR during a text turn or TTS when no first audio chunk was produced before the logical turn completed. This summary avoids reconstructing latency from the individual asr.final, llm.first_token, and tts.first_chunk events.
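
Since the fields are checkpoints measured from the start of the turn, the time spent between stages is the difference between consecutive checkpoints that were reached. The sketch below works on a plain dict shaped like the payload above; the helper name checkpoint_gaps and the assumption that the checkpoints occur in the listed order are illustrative, not framework API.

```python
# Checkpoint order is assumed to follow the order of the fields in the payload.
CHECKPOINTS = ("asr_ms", "llm_first_token_ms", "tts_first_chunk_ms", "total_ms")


def checkpoint_gaps(payload: dict) -> dict[str, int]:
    """Return the time elapsed between consecutive reached checkpoints."""
    gaps: dict[str, int] = {}
    previous = 0
    for field in CHECKPOINTS:
        value = payload.get(field)
        if value is None:  # stage was not reached in this turn
            continue
        gaps[field] = value - previous
        previous = value
    return gaps


event = {
    "type": "turn.metrics",
    "payload": {
        "mode": "chat",
        "turn_id": 12,
        "asr_ms": 184,
        "llm_first_token_ms": 327,
        "tts_first_chunk_ms": 511,
        "total_ms": 842,
    },
}
print(checkpoint_gaps(event["payload"]))
# {'asr_ms': 184, 'llm_first_token_ms': 143, 'tts_first_chunk_ms': 184, 'total_ms': 331}
```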

Recipes

The recipes below are short, self-contained scripts that exercise the public API. They all run with the base install (numpy + the framework) unless a snippet is explicitly marked as requiring an extra.

Minimal mock text pipeline

build_provider_bundle returns a fully-mock provider bundle and SpeechPipeline runs an end-to-end text turn against it. QueueEventSink captures every event the pipeline emits so the script can assert or print them.

import asyncio

from converse_framework import (
    PipelineConfig,
    QueueEventSink,
    SpeechPipeline,
    build_provider_bundle,
)


async def main():
    queue: asyncio.Queue = asyncio.Queue()
    sink = QueueEventSink(queue)
    pipeline = SpeechPipeline(
        providers=build_provider_bundle(
            {
                "vad": {"provider": "mock"},
                "asr": {"provider": "mock"},
                "llm": {"provider": "mock"},
                "tts": {"provider": "mock"},
            }
        ),
        sink=sink,
        config=PipelineConfig(tts_chunk_chars=80),
    )

    await pipeline.handle_text_turn("Hello, mock pipeline.")
    # Let the TTS streaming task finish, then drain the captured events.
    await asyncio.sleep(0.5)
    types = [queue.get_nowait()["type"] for _ in range(queue.qsize())]
    print(types)


asyncio.run(main())

Audio frame to utterance collector to pipeline

parse_audio_frame validates a wire payload and turns it into an AudioFrame. AudioUtteranceCollector runs VAD on the frame, applies the rejection gates, and on vad.speech_end hands the assembled PCM bytes to its utterance_callback. The recipe wires that callback into SpeechPipeline.handle_audio_turn. The in-process VAD below fires vad.speech_start on the first frame and vad.speech_end on the third so the collector has something to dispatch — the framework's own MockVADProvider returns no events and is not useful for this path.

import asyncio
import base64

from converse_framework.audio_utils import AudioFrameStats, parse_audio_frame
from converse_framework.events import QueueEventSink
from converse_framework.pipeline import PipelineConfig, SpeechPipeline
from converse_framework.protocols import (
    ProviderCapabilities,
    ProviderStatus,
    VADEvent,
)
from converse_framework.registry import build_provider_bundle
from converse_framework.utterance_collector import (
    AudioUtteranceCollector,
    UtteranceCollectorConfig,
)


class ScriptedVAD:
    """A tiny in-process VAD: start on frame 0, end on frame 2."""

    def __init__(self) -> None:
        self._count = 0

    @property
    def status(self) -> ProviderStatus:
        return ProviderStatus(
            name="scripted",
            kind="vad",
            ready=True,
            message="Scripted VAD fires start at frame 0 and end at frame 2.",
            capabilities=ProviderCapabilities(),
        )

    async def check_status(self) -> ProviderStatus:
        return self.status

    async def process_frame(self, frame):
        self._count += 1
        events: list[VADEvent] = []
        if self._count == 1:
            events.append(VADEvent(type="vad.speech_start", probability=1.0, audio_ms=30))
        if self._count == 3:
            events.append(VADEvent(type="vad.speech_end", probability=1.0, audio_ms=90))
        return events


async def main():
    queue: asyncio.Queue = asyncio.Queue()
    sink = QueueEventSink(queue)
    bundle = build_provider_bundle(
        {
            "vad": {"provider": "mock"},
            "asr": {"provider": "mock"},
            "llm": {"provider": "mock"},
            "tts": {"provider": "mock"},
        }
    )
    pipeline = SpeechPipeline(providers=bundle, sink=sink, config=PipelineConfig(tts_chunk_chars=80))

    cfg = UtteranceCollectorConfig(
        sample_rate=16000,
        channels=1,
        frame_ms=30,
        # Disable the rejection gates -- this recipe shows the wiring
        # from frame to pipeline, not the collector's silence handling.
        min_speech_duration_ms=0,
        reject_low_energy_rms=0,
        reject_utterance_rms=0,
        trim_silence_rms=0,
    )
    stats = AudioFrameStats(
        expected_sample_rate=16000,
        expected_channels=1,
        expected_frame_ms=30,
    )

    async def on_utterance(pcm: bytes, sample_rate: int, mode: str) -> None:
        await pipeline.handle_audio_turn(pcm, sample_rate, mode=mode)

    collector = AudioUtteranceCollector(
        vad_provider=ScriptedVAD(),
        event_sink=sink,
        utterance_callback=on_utterance,
        config=cfg,
    )

    # Three 30 ms frames of silence (16 kHz mono -> 480 samples -> 960 bytes).
    silence = base64.b64encode(b"\x00\x00" * 480).decode("ascii")
    for seq in range(3):
        frame = parse_audio_frame(
            {
                "data": silence,
                "sample_rate": 16000,
                "channels": 1,
                "frame_ms": 30,
                "sequence": seq,
                "encoding": "pcm_s16le",
            },
            stats,
        )
        await collector.ingest_frame(frame)

    await pipeline.cancel_tts("done")
    await asyncio.sleep(0.3)
    types = [queue.get_nowait()["type"] for _ in range(queue.qsize())]
    print(types)


asyncio.run(main())

Custom provider registration

register_provider adds a new (kind, name) pair to the registry by import string. build_provider_bundle then resolves the name on demand and instantiates the class. is_provider_available is the companion probe — it returns True only when the underlying module can be imported, which is the safe check before handing the config to a pipeline. The recipe points the new name at the framework's own mock VAD so it runs against the base install; replace the import string with your own my_pkg.providers:MyVADProvider to register a real implementation.

from converse_framework.registry import (
    build_provider_bundle,
    is_provider_available,
    register_provider,
)

# Register a custom VAD name. Replace the import string with your own
# `my_pkg.providers:MyVADProvider` to wire up a real implementation.
register_provider(
    "vad",
    "my-vad",
    "converse_framework.providers.mock:MockVADProvider",
)

bundle = build_provider_bundle(
    {
        "vad": {"provider": "my-vad"},
        "asr": {"provider": "mock"},
        "llm": {"provider": "mock"},
        "tts": {"provider": "mock"},
    }
)
print(bundle.vad.status.provider_id)        # "mock" (the registered class)
print(is_provider_available("vad", "my-vad"))  # True

Custom event sink

SpeechPipeline accepts any EventSink subclass. The recipe prints each event as it fires, which is handy when you are wiring up a new transport and want to see the wire shape without standing up a queue.

import asyncio

from converse_framework import (
    EventSink,
    PipelineConfig,
    SpeechPipeline,
    build_provider_bundle,
)


class PrintSink(EventSink):
    """Minimal sink that prints each event as it fires."""

    async def emit(self, event_type, **payload):
        keys = ", ".join(payload) or "-"
        print(f"[event] {event_type} ({keys})")


async def main():
    sink = PrintSink()
    pipeline = SpeechPipeline(
        providers=build_provider_bundle(
            {
                "vad": {"provider": "mock"},
                "asr": {"provider": "mock"},
                "llm": {"provider": "mock"},
                "tts": {"provider": "mock"},
            }
        ),
        sink=sink,
        config=PipelineConfig(tts_chunk_chars=80),
    )
    await pipeline.handle_text_turn("Hello, custom sink.")
    # Let the TTS streaming task finish before the loop exits.
    await asyncio.sleep(0.5)


asyncio.run(main())

Browser playback (JS reference client)

The framework ships a vanilla JavaScript / Web Audio reference client at converse_framework/js/tts-audio-player.js that turns the framework's tts.audio events into sound without bundling a build step. It builds AudioBuffers directly from PCM s16le bytes (avoiding decodeAudioData on tiny chunks) and coalesces consecutive events within a short window before scheduling, which is the same fix that resolved Pocket TTS choppiness in the reference harness.

<script src="converse_framework/js/tts-audio-player.js"></script>
<script>
  const player = new TtsAudioPlayer({ coalesceMs: 80 });
  ws.addEventListener('message', (ev) => {
    const event = JSON.parse(ev.data);
    if (event.type === 'tts.audio') player.onEvent(event);
  });
  // when the conversation ends:
  player.close();
</script>

The reference client handles the most common case (mono / stereo PCM s16le with explicit sample rate, channels, and final flag) and ignores anything that is not pcm_s16le with a console warning. Drop the file into your static assets directory; no npm / bundler required.

Browser microphone capture (JS reference client)

The framework ships a vanilla JavaScript microphone capture class at converse_framework/js/mic-frame-sender.js. It uses getUserMedia and an AudioWorkletNode (with inline blob-URL processor, falling back to ScriptProcessorNode) to deliver 16-bit PCM s16le frames at a configurable interval:

<script src="converse_framework/js/mic-frame-sender.js"></script>
<script>
  const ws = new WebSocket("ws://localhost:8000/ws");
  const mic = new MicFrameSender({
    webSocket: ws,
    sampleRate: 16000,
    channels: 1,
    frameMs: 30,
    frameFormat: "binary-v1", // optional; JSON/base64 remains the default
    onLevel: (db) => console.log("mic level", db.toFixed(1)),
  });
  mic.start(); // begins capture after user gesture
</script>

Binary microphone packets use a versioned format while control messages and outgoing framework events remain JSON. Existing clients need no changes because the sender defaults to the original JSON/base64 audio.frame envelope.

The binary v1 packet is a 16-byte network-order header, followed by the UTF-8 mode and raw PCM s16le bytes:

Bytes	Field
0–1	ASCII magic `CF`
2	Version (`1`)
3	Message kind (`1` for microphone audio)
4–7	Unsigned 32-bit frame sequence
8–11	Unsigned 32-bit sample rate
12	Channel count
13–14	Unsigned 16-bit frame duration in milliseconds
15	UTF-8 mode byte length
16…	Mode bytes, then PCM s16le frame bytes

A zero-length mode uses WebSocketSessionConfig.default_mode. The server validates the version, packet kind, audio shape, UTF-8 mode, and exact PCM byte count before the frame reaches the utterance collector.
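
For clients that are not written in JavaScript, the layout in the table can be reproduced with Python's struct module. This is a minimal sketch based only on the table above: the function names are hypothetical, and the PCM length check (samples per frame times channels times two bytes) is an assumption about what "exact PCM byte count" means, not the server's actual validation code.

```python
import struct

# Network byte order, no padding: magic, version, kind, sequence,
# sample rate, channel count, frame duration (ms), mode byte length.
HEADER = struct.Struct("!2sBBIIBHB")  # 16 bytes in total


def pack_mic_packet(sequence: int, sample_rate: int, channels: int,
                    frame_ms: int, pcm: bytes, mode: str = "") -> bytes:
    """Build a binary v1 microphone packet from raw PCM s16le bytes."""
    mode_bytes = mode.encode("utf-8")
    if len(mode_bytes) > 255:
        raise ValueError("mode must fit in a single length byte")
    header = HEADER.pack(b"CF", 1, 1, sequence, sample_rate,
                         channels, frame_ms, len(mode_bytes))
    return header + mode_bytes + pcm


def unpack_mic_packet(packet: bytes) -> dict:
    """Parse a binary v1 packet and apply the basic shape checks."""
    (magic, version, kind, sequence, sample_rate,
     channels, frame_ms, mode_len) = HEADER.unpack_from(packet)
    if magic != b"CF" or version != 1 or kind != 1:
        raise ValueError("not a binary v1 microphone packet")
    mode_end = HEADER.size + mode_len
    mode = packet[HEADER.size:mode_end].decode("utf-8")
    pcm = packet[mode_end:]
    # Assumed rule: samples per frame * channels * 2 bytes per sample.
    expected = sample_rate * frame_ms // 1000 * channels * 2
    if len(pcm) != expected:
        raise ValueError(f"expected {expected} PCM bytes, got {len(pcm)}")
    return {"sequence": sequence, "sample_rate": sample_rate,
            "channels": channels, "frame_ms": frame_ms,
            "mode": mode, "pcm": pcm}


packet = pack_mic_packet(7, 16000, 1, 30, b"\x00\x00" * 480, mode="chat")
print(HEADER.size, len(packet), unpack_mic_packet(packet)["mode"])
```

A 30 ms frame at 16 kHz mono carries 480 samples, so the packet above is 16 header bytes, 4 mode bytes, and 960 PCM bytes.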

A composed client at converse_framework/js/browser-voice-client.js combines MicFrameSender, TtsAudioPlayer, and an optional SpeakerEchoGuard (see converse_framework/js/speaker-echo-guard.js) into a single class with automatic WebSocket event dispatch.

Mobile microphone access requires additional HTTPS / tunnel setup (see next section).

Mobile Browser Microphone Testing

Browser microphone capture (via getUserMedia) requires a secure context — HTTPS, localhost, or 127.0.0.1. This is not a framework limitation; it is a browser security requirement.

Local desktop development: localhost is always considered secure. A plain ws://localhost:8000/ws works with no extra setup.

Same-LAN testing (desktop): a plain ws://<lan-ip>:8000/ws connection is accepted by desktop browsers, because the secure-context check applies to the page that calls getUserMedia, not to the WebSocket itself. Serve the HTML page itself over HTTPS to keep mobile browsers happy (see below).

Mobile device on same LAN: a plain http://<lan-ip> page is rejected by mobile browsers when it calls getUserMedia. You need either a tunnel that provides HTTPS or a locally trusted certificate.

Option 1 — Cloudflare Tunnel (recommended for testing)

Install cloudflared (winget install cloudflare.cloudflared on Windows, brew install cloudflare/cloudflare/cloudflared on macOS, or download from the Cloudflare Zero Trust dashboard).

Start your server on port 8000:

uvicorn converse_framework.examples.websocket_voice_chat:create_app --factory

Run the tunnel:

cloudflared tunnel --url http://localhost:8000

Cloudflare prints a public https://<random>.trycloudflare.com URL.
Open that URL on your mobile device. Change the WebSocket URL in your client to wss://<random>.trycloudflare.com/ws.

Option 2 — ngrok

Install ngrok from https://ngrok.com/download.
Start your server on port 8000.
Tunnel:
```
ngrok http 8000
```
Use the generated https://<random>.ngrok-free.app URL.
WebSocket URL: wss://<random>.ngrok-free.app/ws.

Option 3 — Local trusted certificate (advanced)

Use mkcert to create a trusted CA-signed cert for your LAN IP:

# Install mkcert once
brew install mkcert  # macOS
winget install mkcert  # Windows (or scoop install mkcert)
mkcert -install

# Create a cert for your LAN IP, e.g. 192.168.1.42
mkcert 192.168.1.42 localhost 127.0.0.1

# Run uvicorn with the generated key/cert files
uvicorn converse_framework.examples.websocket_voice_chat:create_app --factory \
    --ssl-keyfile ./192.168.1.42-key.pem \
    --ssl-certfile ./192.168.1.42.pem

The page and WebSocket are now served over https://192.168.1.42:8000 and wss://192.168.1.42:8000/ws respectively. The mkcert root CA must also be installed on the mobile device (see the mkcert docs for Android/iOS instructions).

Summary of WebSocket URL forms

Scenario	Page URL	WebSocket URL
Desktop localhost	`http://localhost:8000`	`ws://localhost:8000/ws`
Desktop same LAN	`http://<lan-ip>:8000`	`ws://<lan-ip>:8000/ws`
Mobile via tunnel	`https://<tunnel>/`	`wss://<tunnel>/ws`
Mobile via local cert	`https://<lan-ip>:8000`	`wss://<lan-ip>:8000/ws`
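
Every row in the table follows the same rule: the WebSocket URL reuses the page's host and port, swaps http for ws (or https for wss), and uses the /ws path. The sketch below expresses that rule in Python for test scripts; websocket_url is a hypothetical helper, and a browser client would apply the same mapping in JavaScript.

```python
from urllib.parse import urlsplit, urlunsplit

# A secure page needs a secure socket; a plain page uses a plain socket.
SCHEME_MAP = {"http": "ws", "https": "wss"}


def websocket_url(page_url: str, path: str = "/ws") -> str:
    """Derive the WebSocket URL that matches a page URL from the table."""
    parts = urlsplit(page_url)
    return urlunsplit((SCHEME_MAP[parts.scheme], parts.netloc, path, "", ""))


print(websocket_url("http://localhost:8000"))
print(websocket_url("https://192.168.1.42:8000"))
```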

Wrap an external CLI as a provider

When the engine you want to use is only available as a CLI binary (whisper-cli, whisper.cpp/main, the Vosk CLI, …), the framework's converse_framework.examples.subprocess_provider shows the pattern. The class shells out to a configured binary, writes a WAV header followed by the caller's PCM s16le body to the subprocess's stdin, and yields the subprocess's stdout as a single final transcript event.

from converse_framework.examples.subprocess_provider import (
    SubprocessASRProvider,
)

provider = SubprocessASRProvider({
    "binary": "whisper-cli",
    "model": "ggml-small.en.bin",
    "command_template": ["-m", "{model}", "-f", "-"],
    "timeout_s": 120,
})
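# Then plug it into a ProviderBundle. "subprocess" is not a built-in
# provider name, so register the class first (see note below):
from converse_framework.registry import build_provider_bundle, register_provider

register_provider(
    "asr",
    "subprocess",
    "converse_framework.examples.subprocess_provider:SubprocessASRProvider",
)
bundle = build_provider_bundle(
    {
        "vad": {"provider": "mock"},
        "asr": {"provider": "subprocess"},   # see note below
        "llm": {"provider": "mock"},
        "tts": {"provider": "mock"},
    },
)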

SubprocessASRProvider is shipped as a recipe (not a registered provider) because it is generic: copy the class, point it at your binary of choice, and register it with register_provider("asr", "my-name", "my.module:MySubprocessProvider"). The example also ships a fake-echo script (--use-fake-echo) that lets the driver run end-to-end in CI without installing any real ASR.

OpenAI-compatible endpoints (LLM, ASR, TTS)

The openai-compatible provider name (requires the openai-compat extra) is registered for all three inference kinds and talks to any server that implements the matching OpenAI endpoint:

Kind	Endpoint	Works with
`llm`	`/v1/chat/completions`	Ollama, LM Studio, vLLM, llama.cpp, Groq, OpenRouter, Together, OpenAI
`asr`	`/v1/audio/transcriptions`	OpenAI Whisper, Groq hosted Whisper, `speaches` / faster-whisper-server
`tts`	`/v1/audio/speech`	OpenAI TTS, Kokoro-FastAPI, openedai-speech

from converse_framework import build_provider_bundle

bundle = build_provider_bundle(
    {
        "vad": {"provider": "mock"},
        "asr": {
            "provider": "openai-compatible",
            "base_url": "https://api.groq.com/openai",
            "model": "whisper-large-v3",
            "api_key": "gsk_...",
        },
        "llm": {
            "provider": "openai-compatible",
            "base_url": "http://localhost:11434",  # e.g. Ollama; no /v1 suffix
            "model": "llama3.2",                   # "auto" = first listed model
        },
        "tts": {
            "provider": "openai-compatible",
            "base_url": "http://localhost:8880",   # e.g. Kokoro-FastAPI
            "model": "kokoro",
            "voice": "af_heart",
        },
    }
)

All three accept base_url (must not include the /v1 path segment -- the providers append the versioned paths themselves), an optional api_key sent as an Authorization: Bearer header, and timeout_s. The servers are managed externally; the framework never starts them.

Kind-specific notes:

LLM -- model defaults to "auto", which resolves to the first entry reported by /v1/models; hosted services list many models, so set it explicitly for anything other than a single-model local server. Shares its implementation with the llamacpp provider (which also accepts api_key, matching llama.cpp's --api-key option); the difference is that llamacpp probes the llama.cpp-native /health endpoint first, while openai-compatible checks /v1/models directly.
ASR -- uploads the utterance as a multipart WAV, so it works with remote/hosted servers (unlike the audio-cpp ASR provider, which passes a server-local file path). Optional language and temperature are forwarded as form fields.
TTS -- requests response_format: "wav" and yields the decoded PCM as a single final chunk, so the server must support WAV output (OpenAI, Kokoro-FastAPI, and openedai-speech all do). voice is required by OpenAI; some local servers have a default. Optional speed is forwarded.
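
The base_url rule is easy to get wrong when copying a URL from a provider's documentation, which often includes the /v1 suffix. The sketch below shows how the three endpoint paths from the table combine with a base URL; endpoint_url is a hypothetical helper written for illustration, and rejecting a trailing /v1 is an assumption about a sensible check, not documented framework behavior.

```python
# Endpoint paths copied from the table above.
ENDPOINTS = {
    "llm": "/v1/chat/completions",
    "asr": "/v1/audio/transcriptions",
    "tts": "/v1/audio/speech",
}


def endpoint_url(kind: str, base_url: str) -> str:
    """Join a base URL (without /v1) and the versioned endpoint path."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        raise ValueError("base_url must not include the /v1 path segment")
    return base + ENDPOINTS[kind]


print(endpoint_url("asr", "https://api.groq.com/openai"))
print(endpoint_url("llm", "http://localhost:11434/"))
```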

Pocket TTS voice listing and configuration

Pocket TTS supports listing available voices and changing the voice or other options at runtime via TTSProvider.configure (introduced in protocol v0.2). All variants return a ProviderConfigResult with changed and requires_reload flags.

List voices without importing the heavy ONNX backend:

from converse_framework.providers.pocket_tts import PocketTTSProvider

provider = PocketTTSProvider({"voice": "azelma"})
voices = provider.list_voices()
for v in voices:
    print(f"{v.id}: {v.name} ({v.gender}, {v.language})")
    # e.g. "azelma: Azelma (Female, en)"

Change voice (clears only the voice cache, preserves the loaded model):

result = provider.configure(voice="anna")
print(result.changed, result.requires_reload)
# True False: the model stays loaded, only the voice state is reloaded

Change quantization or temperature (clears both model and voice, requiring a full reload on next synthesis):

result = provider.configure(quantize=True)
print(result.requires_reload)
# True — both _model and _voice_state cleared

Change max_tokens or coalesce_ms without unloading:

result = provider.configure(max_tokens=250, coalesce_ms=120)
print(result.requires_reload)
# False — values stored, no cache invalidated

ProviderBundle.replace() and pipeline.update_providers() (see the Runtime Provider Updates section) work with any TTS provider including Pocket TTS.

CUDA DLL helper (Windows)

On Windows, NVIDIA wheel packages like nvidia-cublas-cu12 install DLLs under site-packages/nvidia/<package>/bin/, but C extension libraries such as CTranslate2 may not search those directories automatically. The framework ships a CUDA DLL discovery helper at converse_framework/cuda_utils.py that finds them and adds them to the DLL search path.

from converse_framework.cuda_utils import (
    add_nvidia_dll_directories,
    discover_nvidia_dll_dirs,
    format_nvidia_dll_diagnostic,
)

# Add all discovered NVIDIA DLL directories to the search path.
# Keep the handles alive for the lifetime of the process.
dll_handles = add_nvidia_dll_directories()

# Print a diagnostic string for debugging:
print(format_nvidia_dll_diagnostic())

The helper searches nvidia/cublas/bin, nvidia/cudnn/bin, nvidia/cusparse/bin, nvidia/cusolver/bin, and nvidia/curand/bin inside site-packages. It is Windows-only (no-op on other platforms) and best-effort — failures are logged, not raised.

FasterWhisperASRProvider calls add_nvidia_dll_directories() automatically inside _ensure_model() when the config option auto_cuda_dll_dirs is True (the default). Disable with:

provider = FasterWhisperASRProvider({
    "model": "large-v3-turbo",
    "device": "cuda",
    "auto_cuda_dll_dirs": False,  # disable auto-discovery
})

Runtime Provider Updates

The framework supports swapping providers at runtime without recreating the pipeline or collector. This is useful for settings UIs that let users change TTS voice, VAD model, or ASR backend without restarting the conversation.

ProviderBundle.replace()

ProviderBundle.replace creates a new bundle with specific providers swapped out by keyword argument, inheriting the rest from the original bundle. It is a no-side-effect, no-copy operation: the caller owns the lifecycle of the old providers.

from converse_framework import build_provider_bundle, build_provider

bundle = build_provider_bundle({
    "vad": {"provider": "mock"},
    "asr": {"provider": "mock"},
    "llm": {"provider": "mock"},
    "tts": {"provider": "mock"},
})

new_tts = build_provider("tts", "mock", {"first_chunk_delay_ms": 500})
new_bundle = bundle.replace(tts=new_tts)
# new_bundle.tts is the new provider; vad/asr/llm are unchanged.
# bundle is unaffected.

Multiple providers can be replaced at once:

replaced = bundle.replace(vad=new_vad, tts=new_tts)

ProviderBundle.unload_replaced()

ProviderBundle.unload_replaced compares two bundles by identity and calls unload() on every provider that differs. Providers with the same identity reference are left untouched.

old_bundle = build_provider_bundle(config)
new_bundle = old_bundle.replace(tts=new_tts)
await ProviderBundle.unload_replaced(old_bundle, new_bundle)
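
Comparison by identity means a provider counts as replaced only when the two bundles hold different objects in the same slot, regardless of whether the objects compare equal. The sketch below illustrates the idea with plain stand-in objects; the Bundle dataclass and replaced_providers function are hypothetical and do not reproduce the framework's ProviderBundle.

```python
from dataclasses import dataclass, replace

KINDS = ("vad", "asr", "llm", "tts")


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for a four-slot provider bundle."""
    vad: object
    asr: object
    llm: object
    tts: object


def replaced_providers(old: Bundle, new: Bundle) -> list:
    """Return the old providers whose slot now holds a different object."""
    return [
        getattr(old, kind)
        for kind in KINDS
        if getattr(old, kind) is not getattr(new, kind)  # identity, not equality
    ]


old = Bundle(vad=object(), asr=object(), llm=object(), tts=object())
new = replace(old, tts=object())  # only the TTS slot changes
stale = replaced_providers(old, new)
print(len(stale), stale[0] is old.tts)
```

Only the providers in the returned list would need unloading; the three shared providers remain in use by the new bundle.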

SpeechPipeline.update_providers()

SpeechPipeline.update_providers is the safe way to swap providers on an active pipeline. It cancels in-flight TTS synthesis by default (so the next turn picks up the new provider), swaps the bundle, and emits a providers.updated event with the serialized statuses of the new bundle. Conversation history is not cleared.

import asyncio

from converse_framework import (
    PipelineConfig, QueueEventSink, SpeechPipeline,
    build_provider_bundle,
)

queue = asyncio.Queue()
pipeline = SpeechPipeline(
    providers=build_provider_bundle(initial_config),
    sink=QueueEventSink(queue),
    config=PipelineConfig(),
)

new_bundle = build_provider_bundle(updated_config)
await pipeline.update_providers(new_bundle, reason="settings_change")
# pipeline.providers is now new_bundle
# TTS was cancelled if it was playing
# providers.updated event was emitted

AudioUtteranceCollector.update_vad_provider()

AudioUtteranceCollector.update_vad_provider swaps the VAD provider that drives utterance boundary detection. It raises RuntimeError if the collector is currently recording an utterance, to avoid corrupting in-flight VAD state. The pre-speech buffer is cleared on swap so stale audio from the old VAD is not passed to the new one.

new_vad = SileroVADProvider({"speech_threshold": 0.6})
collector.update_vad_provider(new_vad)

End-to-end pattern

A typical settings-update flow combines all the pieces:

# 1. Build the new bundle
new_bundle = bundle.replace(tts=new_tts)

# 2. Probe without loading models
probe_results = await new_bundle.probe_statuses()

# 3. On user confirmation, swap in the pipeline
await pipeline.update_providers(new_bundle)

# 4. Swap the VAD in the collector (separate because the
#    collector and pipeline are independent components)
if "vad" in updated:
    collector.update_vad_provider(new_bundle.vad)

# 5. Old providers are unloaded in the background by
#    pipeline.update_providers().

WebSocket Session Helper

The framework provides a reusable WebSocketSession that handles the common message-dispatch loop for browser-based voice apps. It owns the transport, sink, provider bundle, pipeline, collector, and frame stats, and routes seven JSON control/message types plus versioned binary microphone packets without requiring the application to copy the recipe state machine.

Built-in message types:

Binary v1 packet: raw PCM s16le microphone frame forwarded directly to the utterance collector.
audio.frame: legacy JSON/base64 microphone frame; still fully supported.
text.turn: text conversation turn.
conversation.clear: clears per-mode conversation history.
tts.cancel: cancels in-flight TTS synthesis.
status.request: emits probe/check/load status (kind selected by the probe / check / load flag in the payload).
settings.update: delegated to an optional WebSocketSessionHooks callback.
providers.reload: swaps the provider bundle and optionally reloads the VAD provider, with before / after hooks.

Unknown message types fall through to the optional on_unknown_message hook or emit a turn.error event.
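
The routing rule described above (binary packets to the collector path, known JSON types to their handlers, everything else to the hook or a turn.error) can be sketched in a few lines. This is an illustration of the dispatch idea only: route_message, the handler table, the text.turn payload key, and the shape of the turn.error payload are all assumptions, not the WebSocketSession implementation.

```python
from typing import Any, Callable, Optional

Handler = Callable[[Any], Any]


def route_message(
    message: Any,
    handlers: dict[str, Handler],
    binary_handler: Handler,
    on_unknown: Optional[Handler] = None,
) -> Any:
    """Route one decoded WebSocket message to the matching handler."""
    if isinstance(message, (bytes, bytearray)):
        return binary_handler(bytes(message))
    message_type = message.get("type")
    handler = handlers.get(message_type)
    if handler is not None:
        return handler(message)
    if on_unknown is not None:
        return on_unknown(message)
    # Assumed error shape; the real turn.error payload may differ.
    return {
        "type": "turn.error",
        "payload": {"message": f"unknown message type {message_type!r}"},
    }


# Hypothetical handlers: echo the text of a text.turn, measure binary frames.
handlers = {"text.turn": lambda m: ("text", m["payload"]["text"])}
print(route_message({"type": "text.turn", "payload": {"text": "hi"}}, handlers, len))
print(route_message(b"\x00\x01", handlers, len))
print(route_message({"type": "nope"}, handlers, len)["type"])
```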

Configuration and hooks are supplied via:

WebSocketSessionConfig: provider config, collector config, pipeline config, default mode, auto-probe on reload.
WebSocketSessionHooks: optional async callbacks for unknown messages, settings updates, status requests, provider reload lifecycle, and event monitoring.

The session class lives at converse_framework.session and is not imported from the top-level __init__.py to keep lightweight imports for apps that do not use it.

Usage sketch:

import json

from converse_framework.session import (
    WebSocketSession,
    WebSocketSessionConfig,
    WebSocketSessionHooks,
)

async def on_settings_update(session, payload):
    print("settings updated", payload)


async def on_event(session, event):
    print("event", event.type)


hooks = WebSocketSessionHooks(
    on_settings_update=on_settings_update,
    on_event=on_event,
)
session = WebSocketSession(
    transport=your_transport,  # placeholder: the app-owned Transport implementation
    config=WebSocketSessionConfig(
        provider_config={
            "vad": {"provider": "mock"},
            "asr": {"provider": "mock"},
            "llm": {"provider": "mock"},
            "tts": {"provider": "mock"},
        },
    ),
    hooks=hooks,
)

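async def serve(websocket):
    # `serve` is an illustrative wrapper name. `websocket` is the app's accepted
    # connection; the packet shape below assumes an ASGI-style receive(), as in
    # Starlette/FastAPI.
    while True:
        packet = await websocket.receive()
        if packet.get("type") == "websocket.disconnect":
            break
        if packet.get("bytes") is not None:
            # Binary frame: pass the raw microphone packet through unchanged.
            message = packet["bytes"]
        elif packet.get("text") is not None:
            # Text frame: decode the JSON control message.
            message = json.loads(packet["text"])
        else:
            continue
        await session.handle_message(message)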
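
The packet handling in the receive loop above can be factored into a small pure function, which makes it testable without a socket. This is a sketch under assumptions: normalize_packet is a hypothetical helper name, and dropping malformed or non-object JSON (rather than raising) is a design choice made for the example, not documented framework behavior.

```python
import json
from typing import Optional, Union


def normalize_packet(packet: dict) -> Optional[Union[bytes, dict]]:
    """Return bytes for a binary frame, a dict for a JSON text frame, else None."""
    if packet.get("bytes") is not None:
        return packet["bytes"]
    text = packet.get("text")
    if text is None:
        return None
    try:
        decoded = json.loads(text)
    except json.JSONDecodeError:
        # Assumed policy: skip malformed JSON instead of tearing down the loop.
        return None
    # Control messages are JSON objects; anything else is ignored here.
    return decoded if isinstance(decoded, dict) else None
```

In the loop, a None result maps to the existing `continue` branch, and any other value is passed to session.handle_message().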

Examples

Text chat (automated-test covered)

Run a real text conversation against SpeechPipeline using only the framework's public API. No FastAPI, no WebSocket, no profile files.

python -m converse_framework.examples.text_chat

Try a real provider by passing overrides (the matching extra must be installed):

python -m converse_framework.examples.text_chat \
    --provider asr=faster-whisper \
    --provider llm=llamacpp \
    --provider tts=kokoro

The driver behind the CLI is converse_framework.examples.text_chat.run_text_chat, which is what the test suite exercises.
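
As an illustration of how repeated --provider slot=name overrides like the ones above could be folded into a provider config, here is a minimal argparse sketch. It is not the example's actual argument parser: parse_provider_overrides, the SLOTS tuple, and the default of "mock" for every slot are assumptions. The {"slot": {"provider": name}} shape follows the provider_config shown in the WebSocket session usage sketch.

```python
import argparse

SLOTS = ("vad", "asr", "llm", "tts")


def parse_provider_overrides(argv):
    """Build a provider config dict from repeated --provider SLOT=NAME flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--provider", action="append", default=[], metavar="SLOT=NAME",
        help="override one provider slot; may be given more than once",
    )
    args = parser.parse_args(argv)

    # Assumed default: every slot starts on the mock provider.
    config = {slot: {"provider": "mock"} for slot in SLOTS}
    for item in args.provider:
        slot, sep, name = item.partition("=")
        if not sep or slot not in SLOTS or not name:
            parser.error(f"expected SLOT=NAME with SLOT in {SLOTS}, got {item!r}")
        config[slot] = {"provider": name}
    return config
```

For example, `parse_provider_overrides(["--provider", "llm=llamacpp"])` changes only the llm slot and leaves the others untouched.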

Voice chat (manual)

The voice example wires an AudioUtteranceCollector to the pipeline and feeds it PCM frames. It is a manual example — you supply a WAV file (or replace the source with a microphone capture) and the script drives the conversation. It is intentionally not covered by the automated tests because it depends on platform audio I/O.

# With real providers installed
python -m converse_framework.examples.voice_chat --input path/to/16k_mono.wav

# Or run the same flow with mock providers to validate the path
python -m converse_framework.examples.voice_chat --mock --input path/to/16k_mono.wav
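
To show what "feeding PCM frames from a WAV file" involves, the sketch below reads a 16 kHz mono 16-bit WAV with the standard-library wave module and yields fixed-length frames. It is not the voice example's own code: iter_pcm_frames, write_silence_wav, and the 20 ms frame length are assumptions, and how frames are handed to the collector is left out because that call is not shown in this section.

```python
import os
import tempfile
import wave
from typing import Iterator, Tuple


def iter_pcm_frames(path: str, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield raw s16le PCM chunks of frame_ms milliseconds each."""
    with wave.open(path, "rb") as wav:
        if (wav.getnchannels(), wav.getsampwidth(), wav.getframerate()) != (1, 2, 16000):
            raise ValueError("expected 16 kHz mono 16-bit PCM")
        samples_per_frame = wav.getframerate() * frame_ms // 1000
        while True:
            chunk = wav.readframes(samples_per_frame)
            if not chunk:
                break
            yield chunk  # the final chunk may be shorter than the others


def write_silence_wav(path: str, seconds: int = 1, rate: int = 16000) -> None:
    """Write a silent test file so the reader can be tried without a recording."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


def demo() -> Tuple[int, int]:
    """Return (frame count, bytes per frame) for one second of silence."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "silence.wav")
        write_silence_wav(path, seconds=1)
        frames = list(iter_pcm_frames(path))
    return len(frames), len(frames[0])
```

At 16 kHz, a 20 ms frame holds 320 samples, or 640 bytes of 16-bit audio, so one second of input produces 50 frames.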

Framework / App Boundary

The framework owns the provider-agnostic speech stack:

Provider protocols (VADProvider, ASRProvider, LLMProvider, TTSProvider).
Audio frame parsing, PCM conversion, metering, and silence trimming.
Event sink API and the wire shape used by the browser UI.
SpeechPipeline turn orchestration (ASR → LLM → TTS, streaming chunks, cancellation, barge-in).
AudioUtteranceCollector (VAD-driven utterance collection).
A lazy provider registry and the optional concrete providers behind extras.
WebSocketSession (optional reusable message-dispatch loop).
Browser JS helpers (mic-frame-sender.js, speaker-echo-guard.js, browser-voice-client.js, tts-audio-player.js).
CUDA DLL discovery helper (cuda_utils).

As of v0.3 the framework also provides: safe provider-swap mechanics (ProviderBundle.replace(), pipeline.update_providers(), collector.update_vad_provider()); first-class provider configuration (configure(), list_voices()); lifecycle events (provider.loading, provider.loaded, provider.error); OpenAI-compatible inference providers; eager first-chunk TTS; per-turn latency metrics; and opt-in binary microphone frames.

The framework does not own the application. The following stay in the consumer app (e.g. the reference harness):

FastAPI app, REST endpoints, WebSocket handler.
Profile files and runtime settings persistence.
Character card parsing and first-message seeding.
Companion mode policy and memory store.
TTS preset manager and provider settings UX.
The WebSocket transport itself.

Transport boundary

The framework defines a generic Transport protocol and ships a QueueTransport for tests. The consumer app owns the real WebSocket transport — WebSocketTransport (or equivalent) lives in the app, not in the framework, so the framework never takes a hard dependency on FastAPI. The reference harness exposes conversational_harness.transport.WebSocketTransport for that purpose.
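
The boundary described above is a structural-typing pattern: the framework depends on a protocol, and the app supplies whatever object satisfies it. The sketch below shows the idea with typing.Protocol and an in-memory queue. The names SketchTransport and InMemoryTransport and the send_json / send_bytes method names are assumptions for illustration. The framework's real Transport protocol and QueueTransport may expose different methods.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class SketchTransport(Protocol):
    """Hypothetical transport shape; the real protocol may differ."""

    async def send_json(self, payload: dict) -> None: ...

    async def send_bytes(self, data: bytes) -> None: ...


class InMemoryTransport:
    """Test double: records outgoing messages instead of writing to a socket."""

    def __init__(self) -> None:
        self.outbox: asyncio.Queue = asyncio.Queue()

    async def send_json(self, payload: dict) -> None:
        await self.outbox.put(payload)

    async def send_bytes(self, data: bytes) -> None:
        await self.outbox.put(data)


async def emit(transport: SketchTransport) -> None:
    # Framework-side code only sees the protocol, never a web framework type.
    await transport.send_json({"type": "provider.loaded"})
    await transport.send_bytes(b"\x01\x02")


async def demo():
    transport = InMemoryTransport()  # created inside the running event loop
    await emit(transport)
    items = [transport.outbox.get_nowait() for _ in range(transport.outbox.qsize())]
    return transport, items


if __name__ == "__main__":
    print(asyncio.run(demo())[1])
```

An app-side WebSocket transport would implement the same two coroutines by delegating to its web framework's socket object, which is why the framework itself needs no FastAPI import.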

Status

The current package metadata is v0.3.0. The automated test surfaces are:

Surface	Tests
Python 3.11 / 3.12 / 3.13	312 pytest tests
Browser JavaScript helpers	61 assertions

Run them locally:

# Framework (run from the package root)
python -m pytest

# Browser helpers
node tests/js/test_helpers.mjs
node tests/js/test_speaker_echo_guard.mjs

Project details

Release history

Version	Released
0.3.1 (this version)	Jul 10, 2026
0.3.0	Jul 9, 2026
0.2.3	Jul 3, 2026
0.2.2	Jun 4, 2026
0.2.0	Jun 3, 2026

Download files

Source Distribution

converse_framework-0.3.1.tar.gz (176.4 kB)

Uploaded Jul 10, 2026 Source

Built Distribution

converse_framework-0.3.1-py3-none-any.whl (115.3 kB)

Uploaded Jul 10, 2026 Python 3

File details

Details for the file converse_framework-0.3.1.tar.gz.

File metadata

Download URL: converse_framework-0.3.1.tar.gz
Upload date: Jul 10, 2026
Size: 176.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for converse_framework-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`a35759f9adb7722da4e5f63960204322b4142ec3d7f662a3d4ba5e545de02f17`
MD5	`6d25319e681f59b57549a54541340f57`
BLAKE2b-256	`8861f9631c5ba4e23764a0ff778487b16959f6896fc96d9208311b5fa58b5d1a`

Provenance

The following attestation bundles were made for converse_framework-0.3.1.tar.gz:

Publisher: publish.yml on thomas9120/Converse-Framework

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: converse_framework-0.3.1.tar.gz
- Subject digest: a35759f9adb7722da4e5f63960204322b4142ec3d7f662a3d4ba5e545de02f17
- Sigstore transparency entry: 2139233178
- Sigstore integration time: Jul 10, 2026
Source repository:
- Permalink: thomas9120/Converse-Framework@7b5efbdf0e6ba734cc27cb028f4ee516cb6abb25
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/thomas9120
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7b5efbdf0e6ba734cc27cb028f4ee516cb6abb25
- Trigger Event: release

File details

Details for the file converse_framework-0.3.1-py3-none-any.whl.

File metadata

Download URL: converse_framework-0.3.1-py3-none-any.whl
Upload date: Jul 10, 2026
Size: 115.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for converse_framework-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cf2be6210600812c6113526be839f9c6c59a72a103ebaf9efae94df3d8d3353e`
MD5	`e706c25ac0e1a7d8d7fc54cd62f91828`
BLAKE2b-256	`785dda8f81f0d260d9d23a8c1aa53a157c17a692b596dea22e667cc64a2409a1`

Provenance

The following attestation bundles were made for converse_framework-0.3.1-py3-none-any.whl:

Publisher: publish.yml on thomas9120/Converse-Framework

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: converse_framework-0.3.1-py3-none-any.whl
- Subject digest: cf2be6210600812c6113526be839f9c6c59a72a103ebaf9efae94df3d8d3353e
- Sigstore transparency entry: 2139233212
- Sigstore integration time: Jul 10, 2026
Source repository:
- Permalink: thomas9120/Converse-Framework@7b5efbdf0e6ba734cc27cb028f4ee516cb6abb25
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/thomas9120
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7b5efbdf0e6ba734cc27cb028f4ee516cb6abb25
- Trigger Event: release
