Speaker diarization — who spoke when. Rust + ONNX, no Python runtime overhead.

polyvoice

Speaker diarization for Rust — who spoke when, without Python.

Silero VAD + WeSpeaker embeddings + agglomerative hierarchical clustering (AHC) in a single Pipeline::run() call.

Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s -  7.60s
        SPEAKER_01: 8.10s - 14.10s

Quick start

1. Add the dependency

[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }

2. Download models

bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/

3. Run the pipeline

use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}

Python

cd python
maturin develop --release

PyPI package coming soon.

import polyvoice

pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")

for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

CLI

cargo install polyvoice --features cli

polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
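For readers unfamiliar with the rttm output format: RTTM is a plain-text format with one speaker turn per line and ten space-separated fields. The sketch below formats a turn in the standard RTTM layout. The function name `rttm_line`, the fixed channel `1`, and the three-decimal precision are illustrative assumptions, not polyvoice's actual writer, so the CLI's exact output may differ.

```rust
/// Format one speaker turn as an RTTM line (hypothetical helper, not part of polyvoice).
///
/// Field order: type, file id, channel, onset (s), duration (s),
/// orthography, speaker type, speaker name, confidence, lookahead.
/// Fields that do not apply to diarization are written as `<NA>`.
fn rttm_line(file_id: &str, speaker: &str, start: f64, end: f64) -> String {
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        file_id,
        start,
        end - start, // RTTM stores duration, not end time
        speaker
    )
}
```

Note that RTTM stores onset and duration rather than start and end, so a turn from 0.10 s to 7.60 s is written with a duration of 7.500.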

How it works

WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
|  Silero VAD |---->| WeSpeaker        |---->|   AHC   |---> Speaker turns
|  (speech    |     | ResNet34         |     | cluster |
|   regions)  |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                     fbank + CMVN           cosine similarity
                     lock-free pool         threshold merging

VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
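To make the clustering step concrete, here is a minimal, dependency-free sketch of threshold-based agglomerative clustering over cosine similarity. It is not polyvoice's implementation: the function names, the centroid linkage, and the rule "merge while the best similarity is at or above the threshold" are assumptions chosen for illustration.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Mean of the embeddings whose indices are listed in `members` (must be non-empty).
fn centroid(embeddings: &[Vec<f32>], members: &[usize]) -> Vec<f32> {
    let dim = embeddings[members[0]].len();
    let mut mean = vec![0.0f32; dim];
    for &i in members {
        for (m, v) in mean.iter_mut().zip(&embeddings[i]) {
            *m += v;
        }
    }
    let n = members.len() as f32;
    mean.iter_mut().for_each(|m| *m /= n);
    mean
}

/// Start with one cluster per embedding, then repeatedly merge the most
/// similar pair of clusters until no pair reaches `threshold`.
/// Returns one cluster label per input embedding.
fn cluster_by_threshold(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        // Find the most similar pair that is at or above the threshold.
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let sim = cosine_similarity(
                    &centroid(embeddings, &clusters[i]),
                    &centroid(embeddings, &clusters[j]),
                );
                if sim >= threshold && best.map_or(true, |(_, _, s)| sim > s) {
                    best = Some((i, j, sim));
                }
            }
        }
        match best {
            Some((i, j, _)) => {
                let merged = clusters.remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            None => break, // no pair is similar enough: clustering is done
        }
    }
    let mut labels = vec![0usize; embeddings.len()];
    for (label, members) in clusters.iter().enumerate() {
        for &i in members {
            labels[i] = label;
        }
    }
    labels
}
```

Under these assumptions, a higher threshold yields more, smaller clusters (more speakers) and a lower threshold merges more aggressively. Recomputing centroids on every pass keeps the sketch short at the cost of speed; a production implementation would cache them.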

Comparison with pyannote

|                     | polyvoice               | pyannote               |
|---------------------|-------------------------|------------------------|
| Language            | Rust                    | Python                 |
| Runtime             | ONNX Runtime            | PyTorch                |
| GIL-free            | Yes                     | No                     |
| Binary size         | ~30 MB (with models)    | ~2 GB (torch + models) |
| Deploy              | Single binary / C FFI   | Python env + pip       |
| Concurrent sessions | Lock-free session pool  | Thread-limited         |
| Streaming           | OnlineDiarizer built-in | Third-party wrappers   |

pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.

Minimum Supported Rust Version (MSRV)

1.85 (Rust 2024 edition).

Accuracy (DER benchmarks)

Evaluated with 0.25s collar on standard diarization benchmarks:

VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)

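| System                 | DER   | Miss | FA   | Confusion | Speed           |
|------------------------|-------|------|------|-----------|-----------------|
| polyvoice (AHC, t=0.4) | 16.4% | 3.9% | 3.2% | 9.3%      | 10.6x RT (CPU)  |
| pyannote 3.0           | ~11%  | -    | -    | -         | ~1x RT (GPU)    |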

AMI (16 meetings, 9 hours — meeting room recordings)

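| System                 | DER   | Miss  | FA   | Confusion | Speed        |
|------------------------|-------|-------|------|-----------|--------------|
| polyvoice (AHC, t=0.4) | 27.5% | 17.7% | 2.2% | 7.6%      | 7x RT (CPU)  |
| pyannote 3.0           | ~18%  | -     | -    | -         | ~1x RT (GPU) |
| Simple i-vector + AHC  | ~33%  | -     | -    | -         | -            |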

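polyvoice's DER is roughly 1.5x that of pyannote 3.0 on both benchmarks (16.4% vs ~11% on VoxConverse, 27.5% vs ~18% on AMI), while it runs 7x to 10.6x faster than real time on CPU alone, against ~1x real time on GPU for pyannote: no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.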
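For reference, DER is the sum of the three error columns in the tables above, each expressed as a fraction of total reference speech time: 3.9% + 3.2% + 9.3% = 16.4% on VoxConverse and 17.7% + 2.2% + 7.6% = 27.5% on AMI. A minimal sketch of that calculation follows; the struct and function names are hypothetical and are not taken from polyvoice-bench.

```rust
/// Error durations in seconds, accumulated over a whole test set
/// (hypothetical type for illustration).
struct ErrorDurations {
    /// Reference speech that the system labeled as non-speech.
    missed: f64,
    /// Non-speech that the system labeled as speech.
    false_alarm: f64,
    /// Speech attributed to the wrong speaker.
    confusion: f64,
    /// Total duration of reference speech (the denominator).
    reference_speech: f64,
}

/// Diarization error rate as a fraction in [0, inf): it can exceed 1.0
/// when false alarms are large relative to the reference speech.
fn der(e: &ErrorDurations) -> f64 {
    (e.missed + e.false_alarm + e.confusion) / e.reference_speech
}
```

As an example with made-up durations, 39 s missed, 32 s false alarm, and 93 s confusion over 1000 s of reference speech give a DER of 0.164, matching the proportions of the VoxConverse row.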

# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test

bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4

Features

  • Pipeline API: Pipeline::run() for one-call diarization with VAD + embeddings + clustering.
  • Online & Offline: OnlineDiarizer for real-time streaming, OfflineDiarizer for batch files.
  • ONNX-powered: WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
  • Lock-free session pool: crossbeam-queue backed pool for concurrent ONNX inference.
  • Silero VAD: integrated voice activity detection with stateful LSTM context.
  • Overlap detection: find regions where multiple speakers talk simultaneously.
  • Word alignment: assign speaker IDs to transcript words by timestamp.
  • Python bindings: pip install polyvoice, 3-line API via PyO3/maturin.
  • CLI: polyvoice diarize meeting.wav with text/json/rttm output.
  • C FFI: drop-in .so/.dylib/.dll for Go, Node.js, C++ callers.
  • Safety verified: Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
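As an illustration of what word alignment by timestamp involves, the sketch below gives each transcript word the speaker of the turn it overlaps most in time. The `Turn` and `Word` types and the `assign_speakers` function are hypothetical stand-ins written for this example; polyvoice's own types and its tie-breaking rules may differ.

```rust
/// A diarized speaker turn, in seconds (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: String,
    start: f64,
    end: f64,
}

/// A transcript word with timestamps, in seconds (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    start: f64,
    end: f64,
}

/// Length of the intersection of two time intervals (0.0 if disjoint).
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// For each word, return the speaker of the turn with the largest overlap,
/// or `None` when the word falls entirely outside every turn.
fn assign_speakers<'a>(words: &[Word], turns: &'a [Turn]) -> Vec<Option<&'a str>> {
    words
        .iter()
        .map(|w| {
            let mut best: Option<(&'a str, f64)> = None;
            for t in turns {
                let ov = overlap(w.start, w.end, t.start, t.end);
                if ov > 0.0 && best.map_or(true, |(_, b)| ov > b) {
                    best = Some((t.speaker.as_str(), ov));
                }
            }
            best.map(|(speaker, _)| speaker)
        })
        .collect()
}
```

Using maximum overlap rather than the word's start time handles words that straddle a speaker change. Words that fall in a gap between turns stay unassigned here; a caller could instead fall back to the nearest turn.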

Configuration

use polyvoice::{DiarizationConfig, VadConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.4,           // cosine similarity threshold
    max_speakers: 64,         // hard speaker limit
    window_secs: 1.5,         // analysis window
    hop_secs: 0.75,           // sliding step
    min_speech_secs: 0.25,    // discard shorter segments
    max_gap_secs: 0.5,        // merge same-speaker gaps under 500 ms
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,          // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,           // speech probability threshold
    min_silence_ms: 300.0,    // minimum silence to split segments
};
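To relate these durations to sample counts: at 16 kHz a 1.5 s window is 24,000 samples, a 0.75 s hop is 12,000 samples (50% overlap between consecutive windows), and a 512-sample VAD frame is 32 ms. The helpers below sketch that arithmetic. They are hypothetical and not part of the polyvoice API, and dropping the trailing partial window is an assumption; the crate may pad or handle it differently.

```rust
/// Convert a duration in seconds to a whole number of samples.
fn secs_to_samples(secs: f64, sample_rate: u32) -> usize {
    (secs * sample_rate as f64).round() as usize
}

/// Half-open `(start, end)` sample ranges of every full analysis window
/// that fits in `total_samples`. A trailing partial window is dropped.
fn window_bounds(total_samples: usize, window: usize, hop: usize) -> Vec<(usize, usize)> {
    if window == 0 || hop == 0 || total_samples < window {
        return Vec::new();
    }
    (0..=total_samples - window)
        .step_by(hop)
        .map(|start| (start, start + window))
        .collect()
}
```

With the values above, 3 s of audio (48,000 samples) produces three windows starting at samples 0, 12,000, and 24,000, so each window yields one embedding and neighbouring embeddings share half their audio.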

Streaming (real-time)

use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback. Here 0.3 s of silence (4800 samples at 16 kHz)
// stands in for a real audio chunk:
let chunk = vec![0.0f32; 4800];
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}

Verification

| Check                   | Tool                      |
|-------------------------|---------------------------|
| Unsafe memory safety    | Miri (nightly CI)         |
| Concurrency correctness | Loom model-checking       |
| Input fuzzing           | cargo-fuzz (4 targets)    |
| API stability           | cargo-semver-checks       |
| Cross-platform          | Ubuntu, macOS, Windows CI |
| Dependency audit        | cargo-audit               |

Roadmap

  • [x] WeSpeaker + ECAPA-TDNN ONNX extractors
  • [x] Silero VAD integration
  • [x] Agglomerative hierarchical clustering (AHC)
  • [x] Pipeline API (VAD + embeddings + AHC)
  • [x] C FFI bindings
  • [x] Miri / Loom / fuzz verification
  • [x] Cross-platform CI
  • [x] Python bindings (PyO3 / maturin)
  • [x] CLI tool (polyvoice diarize / download-models)
  • [x] DER benchmarks on AMI (27.5%) and VoxConverse (16.4%), 0.25s collar
  • [ ] Spectral clustering backend
  • [ ] PLDA scoring backend

Contributing

See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

License

MIT
