Speaker diarization — who spoke when. Rust + ONNX, no Python runtime overhead.

polyvoice

Speaker diarization for Rust — who spoke when, without Python.

Silero VAD + WeSpeaker embeddings + AHC clustering in a single Pipeline::run() call.

CLI Demo

Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s -  7.60s
        SPEAKER_01: 8.10s - 14.10s

Quick start

1. Add the dependency

[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }

2. Download models

bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/

3. Run the pipeline

use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}

Python

pip install polyvoice

Or build from source:

cd python
maturin develop --release

Usage:

import polyvoice

pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")

for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

CLI

cargo install polyvoice --features cli

polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
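
For readers unfamiliar with RTTM, the sketch below formats speaker turns as standard 10-field RTTM SPEAKER records. The Turn struct and the to_rttm_line helper are hypothetical names made up for this illustration, and the exact spacing and precision of the CLI's own RTTM output may differ.

```rust
/// A speaker turn. Hypothetical stand-in type, not the library's own struct.
struct Turn {
    speaker: String,
    start: f64,
    end: f64,
}

/// Format one turn as an RTTM `SPEAKER` record:
/// type, file id, channel, onset, duration, <NA>, <NA>, speaker name, <NA>, <NA>.
fn to_rttm_line(file_id: &str, turn: &Turn) -> String {
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        file_id,
        turn.start,
        turn.end - turn.start, // RTTM stores duration, not end time
        turn.speaker
    )
}

fn main() {
    let turns = vec![
        Turn { speaker: "SPEAKER_00".to_string(), start: 0.10, end: 7.60 },
        Turn { speaker: "SPEAKER_01".to_string(), start: 8.10, end: 14.10 },
    ];
    for turn in &turns {
        println!("{}", to_rttm_line("meeting", turn));
    }
}
```

RTTM is the format most DER scoring tools expect, which is why it is worth knowing that each record carries an onset and a duration rather than a start and an end.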

How it works

WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
|  Silero VAD |---->| WeSpeaker        |---->|   AHC   |---> Speaker turns
|  (speech    |     | ResNet34         |     | cluster |
|   regions)  |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                     fbank + CMVN           cosine similarity
                     lock-free pool         threshold merging

VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
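
To make the clustering step concrete, here is a minimal, standard-library-only sketch of agglomerative clustering with a cosine similarity threshold. It is not polyvoice's implementation: the function names are invented for this example, and centroid linkage is an assumption, since the linkage used by the library is not stated here.

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Mean of the embeddings that belong to one cluster.
fn centroid(embeddings: &[Vec<f32>], members: &[usize]) -> Vec<f32> {
    let mut c = vec![0.0f32; embeddings[0].len()];
    for &i in members {
        for (acc, v) in c.iter_mut().zip(&embeddings[i]) {
            *acc += v;
        }
    }
    for acc in c.iter_mut() {
        *acc /= members.len() as f32;
    }
    c
}

/// Assumed centroid-linkage AHC: repeatedly merge the most similar pair of
/// clusters until no pair reaches `threshold`. Returns one label per embedding.
fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Start with every embedding in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let s = cosine(
                    &centroid(embeddings, &clusters[i]),
                    &centroid(embeddings, &clusters[j]),
                );
                if best.map_or(true, |(_, _, b)| s > b) {
                    best = Some((i, j, s));
                }
            }
        }
        match best {
            Some((i, j, s)) if s >= threshold => {
                let merged = clusters.remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            _ => break, // nothing left that is similar enough
        }
    }
    let mut labels = vec![0usize; embeddings.len()];
    for (label, members) in clusters.iter().enumerate() {
        for &i in members {
            labels[i] = label;
        }
    }
    labels
}

fn main() {
    // Toy 2-d "embeddings": two points near each axis.
    let embeddings = vec![
        vec![1.0, 0.0],
        vec![0.9, 0.1],
        vec![0.0, 1.0],
        vec![0.1, 0.9],
    ];
    println!("{:?}", ahc(&embeddings, 0.45));
}
```

A lower threshold merges more aggressively and yields fewer speakers; a higher one keeps more clusters apart. This naive version recomputes centroids on every pass, which is fine for illustration but not for long recordings.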

Comparison with pyannote

                     polyvoice                 pyannote
Language             Rust                      Python
Runtime              ONNX Runtime              PyTorch
GIL-free             Yes                       No
Binary size          ~30 MB (with models)      ~2 GB (torch + models)
Deploy               Single binary / C FFI     Python env + pip
Concurrent sessions  Lock-free session pool    Thread-limited
Streaming            OnlineDiarizer built-in   Third-party wrappers

pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.

Minimum Supported Rust Version (MSRV)

1.85 (Rust 2024 edition).

Accuracy (DER benchmarks)

Evaluated with 0.25s collar on standard diarization benchmarks:

VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)

System                         DER    Miss   FA     Confusion  Speed
polyvoice (AHC, t=0.45, me=2)  ~15%   3.9%   3.2%   7.9%       10.6x RT (CPU)
pyannote 3.0                   ~11%   -      -      -          ~1x RT (GPU)

AMI (16 meetings, 9 hours — meeting room recordings)

System                         DER    Miss   FA     Confusion  Speed
polyvoice (AHC, t=0.45, me=2)  ~23%   15.4%  3.5%   4.1%       7x RT (CPU)
pyannote 3.0                   ~18%   -      -      -          ~1x RT (GPU)
Simple i-vector + AHC          ~33%   -      -      -          -

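On these benchmarks polyvoice's DER is roughly 4 to 5 points above pyannote 3.0 (~15% vs ~11% on VoxConverse, ~23% vs ~18% on AMI), while running 7x to 10.6x faster than real time on CPU alone: no GPU, no Python, ~30 MB total. The accuracy gap comes from pyannote's neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.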
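
As a reading aid for the tables, DER is the sum of missed speech, false alarm, and speaker confusion, each expressed as a share of total reference speech time. The sketch below simply re-adds the components reported above; the DerParts struct is a hypothetical name used only for this illustration.

```rust
/// Error components as percentages of total reference speech time.
/// Hypothetical helper type for this example only.
struct DerParts {
    miss: f64,
    false_alarm: f64,
    confusion: f64,
}

impl DerParts {
    /// DER = miss + false alarm + confusion.
    fn der(&self) -> f64 {
        self.miss + self.false_alarm + self.confusion
    }
}

fn main() {
    // Component values copied from the polyvoice rows of the tables above.
    let voxconverse = DerParts { miss: 3.9, false_alarm: 3.2, confusion: 7.9 };
    let ami = DerParts { miss: 15.4, false_alarm: 3.5, confusion: 4.1 };
    println!("VoxConverse DER: {:.1}%", voxconverse.der());
    println!("AMI DER:         {:.1}%", ami.der());
}
```

The breakdown shows where the errors come from: on VoxConverse speaker confusion is the largest component, while on AMI missed speech dominates.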

# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test

bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4

Features

  • Pipeline API: Pipeline::run() for one-call diarization with VAD + embeddings + clustering.
  • Online & Offline: OnlineDiarizer for real-time streaming, OfflineDiarizer for batch files.
  • ONNX-powered: WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
  • Lock-free session pool: crossbeam-queue backed pool for concurrent ONNX inference.
  • Silero VAD: integrated voice activity detection with stateful LSTM context.
  • Overlap detection: find regions where multiple speakers talk simultaneously.
  • Word alignment: assign speaker IDs to transcript words by timestamp.
  • Python bindings: pip install polyvoice, 3-line API via PyO3/maturin.
  • CLI: polyvoice diarize meeting.wav with text/json/rttm output.
  • C FFI: drop-in .so/.dylib/.dll for Go, Node.js, C++ callers.
  • Safety verified: Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
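
To illustrate the idea behind word alignment, the sketch below assigns each transcript word to the speaker turn it overlaps the most in time. This shows the general technique only, not polyvoice's API: the Turn and Word structs, the assign_speakers function, and the maximum-overlap rule are all assumptions made for this example.

```rust
/// Hypothetical stand-in types for this example.
struct Turn {
    speaker: usize,
    start: f64,
    end: f64,
}

struct Word {
    text: &'static str,
    start: f64,
    end: f64,
}

/// Length in seconds of the intersection of two intervals (0.0 if disjoint).
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// For each word, pick the speaker whose turn overlaps it the most.
/// Words that overlap no turn get `None`.
fn assign_speakers(words: &[Word], turns: &[Turn]) -> Vec<Option<usize>> {
    words
        .iter()
        .map(|w| {
            let mut best: Option<(usize, f64)> = None;
            for t in turns {
                let ov = overlap(w.start, w.end, t.start, t.end);
                if ov > 0.0 && best.map_or(true, |(_, b)| ov > b) {
                    best = Some((t.speaker, ov));
                }
            }
            best.map(|(speaker, _)| speaker)
        })
        .collect()
}

fn main() {
    let turns = vec![
        Turn { speaker: 0, start: 0.10, end: 7.60 },
        Turn { speaker: 1, start: 8.10, end: 14.10 },
    ];
    let words = vec![
        Word { text: "hello", start: 0.5, end: 0.9 },
        Word { text: "um", start: 7.7, end: 8.0 },
        Word { text: "thanks", start: 7.5, end: 8.3 },
    ];
    for (w, s) in words.iter().zip(assign_speakers(&words, &turns)) {
        println!("{:>8}: {:?}", w.text, s);
    }
}
```

Words that fall entirely in a gap between turns come back as None here; a real system might instead attach them to the nearest turn.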

Configuration

use polyvoice::{DiarizationConfig, VadConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.45,          // cosine similarity threshold
    max_speakers: 64,         // hard speaker limit
    window_secs: 1.5,         // analysis window
    hop_secs: 0.75,           // sliding step
    min_speech_secs: 0.25,    // discard shorter segments
    max_gap_secs: 0.5,        // merge same-speaker gaps under 500 ms
    min_turn_duration_secs: 1.0,  // filter turns shorter than 1s
    min_embeddings_per_speaker: 2, // merge speakers with <2 embeddings
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,          // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,           // speech probability threshold
    min_silence_ms: 300.0,    // minimum silence to split segments
};
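
The max_gap_secs and min_turn_duration_secs comments describe two post-processing steps. The sketch below shows one plausible reading of them on plain data: first merge consecutive turns of the same speaker when the gap is under max_gap_secs, then drop turns shorter than the minimum duration. The Turn struct, the post_process function, and the merge-then-filter order are assumptions for illustration; the library's behavior may differ.

```rust
/// Hypothetical stand-in type for this example.
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: usize,
    start: f64,
    end: f64,
}

/// Merge same-speaker turns separated by less than `max_gap_secs`,
/// then discard turns shorter than `min_turn_secs`.
/// Assumes `turns` is sorted by start time.
fn post_process(turns: &[Turn], max_gap_secs: f64, min_turn_secs: f64) -> Vec<Turn> {
    let mut merged: Vec<Turn> = Vec::new();
    for t in turns {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == t.speaker && t.start - prev.end < max_gap_secs {
                // Same speaker and a short gap: extend the previous turn.
                prev.end = prev.end.max(t.end);
                continue;
            }
        }
        merged.push(t.clone());
    }
    merged.retain(|t| t.end - t.start >= min_turn_secs);
    merged
}

fn main() {
    let turns = vec![
        Turn { speaker: 0, start: 0.0, end: 0.8 },
        Turn { speaker: 0, start: 1.0, end: 2.0 }, // 0.2 s gap: merged with previous
        Turn { speaker: 1, start: 2.2, end: 2.6 }, // 0.4 s long: filtered out
        Turn { speaker: 0, start: 3.5, end: 5.0 },
    ];
    for t in post_process(&turns, 0.5, 1.0) {
        println!("speaker {}: {:.2}s - {:.2}s", t.speaker, t.start, t.end);
    }
}
```

Merging before filtering matters in this sketch: the 0.8 s turn at the start would be dropped on its own, but survives because it is first joined to its neighbor.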

Streaming (real-time)

use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback (placeholder chunk: 4800 samples = 300 ms at 16 kHz):
let chunk = vec![0.0f32; 4800];
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
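
Audio callbacks rarely deliver buffers of a convenient size. If you prefer to feed fixed-size chunks such as the 4800-sample chunk above, a small re-chunking buffer is one way to do it. The Chunker type below is a hypothetical helper, not part of polyvoice, and whether feed() needs fixed-size input at all is not stated here.

```rust
/// Accumulates samples of arbitrary length and hands back fixed-size chunks.
/// Hypothetical helper for this example only.
struct Chunker {
    buf: Vec<f32>,
    chunk_len: usize,
}

impl Chunker {
    fn new(chunk_len: usize) -> Self {
        assert!(chunk_len > 0, "chunk_len must be positive");
        Chunker { buf: Vec::new(), chunk_len }
    }

    /// Append new samples and return every complete chunk now available.
    fn push(&mut self, samples: &[f32]) -> Vec<Vec<f32>> {
        self.buf.extend_from_slice(samples);
        let mut out = Vec::new();
        while self.buf.len() >= self.chunk_len {
            // split_off keeps the first chunk_len samples in buf and returns the tail.
            let rest = self.buf.split_off(self.chunk_len);
            out.push(std::mem::replace(&mut self.buf, rest));
        }
        out
    }

    /// Number of samples waiting for the next complete chunk.
    fn buffered(&self) -> usize {
        self.buf.len()
    }
}

fn main() {
    let mut chunker = Chunker::new(4800); // 300 ms at 16 kHz
    for callback_len in [3000usize, 3000, 9600] {
        let incoming = vec![0.0f32; callback_len];
        let chunks = chunker.push(&incoming);
        // Each complete chunk could be passed to the diarizer here.
        println!(
            "got {} samples -> {} chunk(s), {} buffered",
            callback_len,
            chunks.len(),
            chunker.buffered()
        );
    }
}
```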

Verification

Check                     Tool
Unsafe memory safety      Miri (nightly CI)
Concurrency correctness   Loom model-checking
Input fuzzing             cargo-fuzz (4 targets)
API stability             cargo-semver-checks
Cross-platform            Ubuntu, macOS, Windows CI
Dependency audit          cargo-audit

Roadmap

  • WeSpeaker + ECAPA-TDNN ONNX extractors
  • Silero VAD integration
  • Agglomerative hierarchical clustering (AHC)
  • Pipeline API (VAD + embeddings + AHC)
  • C FFI bindings
  • Miri / Loom / fuzz verification
  • Cross-platform CI
  • Python bindings (PyO3 / maturin)
  • CLI tool (polyvoice diarize / download-models)
  • DER benchmarks on AMI (~23%) and VoxConverse (~15%), 0.25s collar
  • Spectral clustering backend (experimental)
  • Merge-small-speakers post-processing
  • PLDA scoring backend

Contributing

See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

License

MIT
