Lightweight turn detection library for conversational AI

Vogent Turn

Fast and accurate turn detection for voice AI

Multimodal turn detection that combines audio intonation and text context to accurately determine when a speaker has finished their turn in a conversation.

Technical Report

Model Weights

HF Space

Key Features

  • Multimodal: Uses both audio (Whisper encoder) and text (SmolLM) for context-aware predictions
  • Fast: Optimized with torch.compile for low-latency inference
  • Easy to Use: Simple Python API with just a few lines of code
  • Production-Ready: Batched inference, model caching, and comprehensive error handling

Architecture

  • Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
  • Text Model: SmolLM-135M (12 layers, ~80M parameters)
  • Classifier: Binary classification (turn complete / turn incomplete)

The model projects audio embeddings into the LLM's input space and processes them together with conversation context for turn detection.
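The projection step can be sketched with toy tensors. The dimensions below (1500 audio frames, a 384-dim Whisper-Tiny hidden size, a 576-dim SmolLM hidden size) are illustrative assumptions; the actual shapes and projector live in vogent_turn/smollm_whisper.py:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
N_AUDIO_FRAMES = 1500   # Whisper encoder output frames (see diagram below)
D_AUDIO = 384           # assumed Whisper-Tiny hidden size
D_TEXT = 576            # assumed SmolLM-135M hidden size

rng = np.random.default_rng(0)
audio_embeds = rng.standard_normal((N_AUDIO_FRAMES, D_AUDIO))
text_embeds = rng.standard_normal((12, D_TEXT))  # 12 context tokens

# A linear projector maps audio embeddings into the LLM's input space...
W_proj = rng.standard_normal((D_AUDIO, D_TEXT)) * 0.02
projected_audio = audio_embeds @ W_proj          # (1500, 576)

# ...then audio and text embeddings are concatenated along the sequence
# axis and fed to the LLM, whose output drives the classification head.
llm_input = np.concatenate([projected_audio, text_embeds], axis=0)
print(llm_input.shape)  # (1512, 576)
```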


Installation

Using pip

pip install vogent-turn

Using uv

uv init
uv add vogent-turn

From Source (Traditional)

git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
pip install -e .

From Source (with uv - Recommended for Development)

uv is a fast Python package manager. If you have uv installed:

git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Create virtual environment and install
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

Requirements

  • See pyproject.toml for the full list

Quick Start

Python Library

from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Confidence: {result['prob_endpoint']:.1%}")

CLI Tool

# Basic usage (sample rate automatically detected from file)
vogent-turn-predict speech.wav \
  --prev "What is your phone number" \
  --curr "My number is 804"

Note: Sample rate is automatically detected from the audio file. Audio will be resampled to 16kHz internally if needed.


API Reference

TurnDetector

Main class for turn detection inference.

Constructor

detector = TurnDetector(
    model_name="vogent/Vogent-Turn-80M",  # HuggingFace model ID
    revision="main",                      # Model revision
    device=None,                          # "cuda", "cpu", or None (auto)
    compile_model=True,                   # Use torch.compile for speed
    warmup=True,                          # Run a warmup pass at init (see Quick Start)
)

predict()

Detect if the current speaker has finished their turn.

result = detector.predict(
    audio,                    # np.ndarray: (n_samples,) mono float32
    prev_line="",             # str: Previous speaker's text (optional)
    curr_line="",             # str: Current speaker's text (optional)
    sample_rate=None,         # int: Sample rate in Hz (recommended to specify, otherwise 16kHz is assumed)
    return_probs=False        # bool: Return probabilities
)

Note: The model operates at 16kHz internally. If you provide audio at a different sample rate, it will be automatically resampled (requires librosa). If no sample rate is specified, 16kHz is assumed with a warning.
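For illustration, a crude linear-interpolation resampler shows what "resampled to 16kHz" means in practice. This sketch is not the library's implementation (which relies on librosa, a higher-quality resampler):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Rough linear-interpolation resampler, for illustration only."""
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    # Map both signals onto [0, 1) and interpolate the input at the
    # output's sample positions.
    x_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_out, x_in, audio).astype(np.float32)

audio_8k = np.zeros(8000, dtype=np.float32)   # 1 second at 8 kHz
audio_16k = resample_linear(audio_8k, 8000)
print(len(audio_16k))  # 16000
```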

Returns:

  • If return_probs=False: bool (True = turn complete, False = continue)
  • If return_probs=True: dict with keys:
    • is_endpoint: bool
    • prob_endpoint: float (0-1)
    • prob_continue: float (0-1)

predict_batch()

Process multiple audio samples efficiently in a single batch.

results = detector.predict_batch(
    audio_batch,              # list[np.ndarray]: List of audio arrays
    context_batch=None,       # list[dict]: List of context dicts with 'prev_line' and 'curr_line'
    sample_rate=None,         # int: Sample rate in Hz (applies to all audio)
    return_probs=False        # bool: Return probabilities
)

Note: All audio samples in the batch must have the same sample rate. Audio will be automatically resampled to 16kHz if a different rate is specified.

Returns:

  • List of predictions (same format as predict() depending on return_probs)
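A sketch of assembling predict_batch inputs. The detector call itself is commented out because it downloads model weights; the input shapes and context keys follow the signature above:

```python
import numpy as np

sr = 16000
# Two mono float32 clips in [-1.0, 1.0]; real audio would come from files.
audio_batch = [
    np.zeros(sr * 2, dtype=np.float32),          # 2 s of silence
    np.zeros(int(sr * 1.5), dtype=np.float32),   # 1.5 s of silence
]
context_batch = [
    {"prev_line": "What is your phone number", "curr_line": "My number is 804"},
    {"prev_line": "How are you doing today", "curr_line": "I'm doing great thanks"},
]
assert len(audio_batch) == len(context_batch)

# from vogent_turn import TurnDetector
# detector = TurnDetector(compile_model=True, warmup=True)
# results = detector.predict_batch(
#     audio_batch, context_batch=context_batch, sample_rate=sr, return_probs=True
# )
# for ctx, res in zip(context_batch, results):
#     print(ctx["curr_line"], "->", res["is_endpoint"])
```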

Audio Requirements

  • Sample rate: 16kHz
  • Channels: Mono
  • Format: float32 numpy array
  • Range: [-1.0, 1.0]
  • Duration: Up to 8 seconds (longer audio will be truncated)
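A small helper (hypothetical, not part of the vogent_turn API) that coerces raw PCM into this format, assuming integer input is standard PCM and stereo input has shape (n_samples, n_channels):

```python
import numpy as np

MAX_SECONDS = 8
TARGET_SR = 16000

def prepare_audio(audio: np.ndarray, sample_rate: int = TARGET_SR) -> np.ndarray:
    """Coerce raw PCM into mono float32 in [-1.0, 1.0], at most 8 seconds.
    Illustrative helper; the library performs similar handling internally."""
    audio = np.asarray(audio)
    if np.issubdtype(audio.dtype, np.integer):
        audio = audio / float(np.iinfo(audio.dtype).max)   # int PCM -> [-1, 1]
    if audio.ndim == 2:
        audio = audio.mean(axis=1)                         # downmix to mono
    audio = audio.astype(np.float32)
    return audio[: MAX_SECONDS * sample_rate]              # truncate to 8 s

stereo_int16 = np.zeros((TARGET_SR * 10, 2), dtype=np.int16)  # 10 s stereo
mono = prepare_audio(stereo_int16)
print(mono.dtype, mono.shape)  # float32 (128000,)
```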

Text Context Format

The model uses conversation context to improve predictions:

  • prev_line: What the previous speaker said (e.g., a question)
  • curr_line: What the current speaker is saying (e.g., their response)

For best performance, do not include terminal punctuation (periods, etc.).

Example:

result = detector.predict(
    audio,
    prev_line="How are you doing today",
    curr_line="I'm doing great thanks"
)
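Since trailing punctuation should be omitted, a tiny helper (illustrative, not part of the library) can normalize transcript lines before passing them in:

```python
def strip_terminal_punct(line: str) -> str:
    """Drop trailing punctuation and whitespace per the guidance above."""
    return line.rstrip(" .!?,;:")

print(strip_terminal_punct("How are you doing today?"))  # How are you doing today
print(strip_terminal_punct("I'm doing great, thanks."))  # I'm doing great, thanks
```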

Model Details

Multimodal Architecture

Audio (16kHz) ─────> Whisper Encoder ─> Audio Embeddings (1500D)
                                              |
                                              v
                                        Audio Projector
                                              |
                                              v
Text Context ─────> SmolLM Tokenizer ─> Text Embeddings (variable length)
                                              |
                                              v
                              [Audio Embeds + Text Embeds] ─> SmolLM
                                              |
                                              v
                                      Classification Head
                                              |
                                              v
                                    [Endpoint / Continue]

Training Data

The model is trained on conversational audio with labeled turn boundaries. It learns to detect:

  • Prosodic cues: Pitch, intonation, pauses
  • Semantic cues: Completeness of thought, question-answer patterns
  • Contextual cues: Conversation flow and expectations

Examples

Sample scripts can be found in the examples/ directory:

  • python3 examples/basic_usage.py: downloads an audio file and runs the turn detector.
  • python3 examples/batch_processing.py: downloads two audio files and runs the turn detector on a batched input.
  • request_batcher.py: a sample implementation of a thread that continuously receives and batches requests (e.g. in a production setting).


Development

Project Structure

vogent-turn/                    # Project root
├── pyproject.toml              # Package configuration and dependencies
├── vogent_turn/                # Python package
│   ├── __init__.py             # Package exports
│   ├── inference.py            # Main TurnDetector class
│   ├── predict.py              # CLI tool
│   ├── smollm_whisper.py       # Model architecture
│   └── whisper.py              # Whisper components
└── examples/                   # Usage examples
    ├── basic_usage.py
    └── batch_processing.py

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Citation

If you use this library in your research, please cite:

@software{vogent_turn,
  title = {Vogent Turn: Multimodal Turn Detection for Conversational AI},
  author = {Vogent},
  year = {2024},
  url = {https://github.com/vogent/vogent-turn}
}

License

Inference code is open-source under Apache 2.0. Model weights are under a modified Apache 2.0 license with stricter attribution requirements for certain types of usage.


Changelog

v0.1.0 (2025-10-19)

  • Initial release
  • Multimodal turn detection with Whisper + SmolLM
  • Python library and CLI tool
  • torch.compile optimization for fast inference
