Speaker diarization — who spoke when. Rust + ONNX, no Python runtime overhead.
polyvoice
Speaker diarization for Rust — who spoke when, without Python.
Silero VAD + WeSpeaker embeddings + AHC clustering in a single
Pipeline::run() call.
Input: 14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s - 7.60s
SPEAKER_01: 8.10s - 14.10s
Quick start
1. Add the dependency
[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }
2. Download models
bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/
3. Run the pipeline
use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
Python
cd python
maturin develop --release
PyPI package coming soon.
import polyvoice
pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")
for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
CLI
cargo install polyvoice --features cli
polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
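For readers unfamiliar with RTTM, the sketch below formats one speaker turn as a standard RTTM SPEAKER record (ten space-separated fields, with onset and duration in seconds). The Turn struct and to_rttm_line function are hypothetical names for illustration; the exact text the polyvoice CLI emits may differ in field precision or file ID.

```rust
/// One speaker turn, in seconds. Hypothetical type for this sketch,
/// not polyvoice's own struct.
pub struct Turn {
    pub speaker: String,
    pub start: f64,
    pub end: f64,
}

/// Formats a turn as one RTTM `SPEAKER` record:
/// type, file ID, channel, onset, duration, then placeholder fields
/// around the speaker name.
pub fn to_rttm_line(file_id: &str, turn: &Turn) -> String {
    let duration = turn.end - turn.start;
    format!(
        "SPEAKER {} 1 {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        file_id, turn.start, duration, turn.speaker
    )
}
```

RTTM stores a duration rather than an end time, so consumers reading it back need to add onset and duration to recover the turn's end.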
How it works
WAV / PCM audio (16 kHz mono)
        |
        v
+-------------+     +------------------+     +---------+
| Silero VAD  |---->| WeSpeaker        |---->| AHC     |---> Speaker turns
| (speech     |     | ResNet34         |     | cluster |
|  regions)   |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                      fbank + CMVN           cosine similarity
                      lock-free pool         threshold merging
VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
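The clustering step can be illustrated with a minimal sketch of agglomerative clustering over cosine similarity. This is not polyvoice's implementation: the average-linkage rule, the function names, and the O(n^3) search are assumptions made to keep the example short. It repeatedly merges the two most similar clusters until no pair reaches the threshold.

```rust
/// Cosine similarity between two vectors; 0.0 if either has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Average pairwise similarity between two clusters (assumed linkage rule).
fn linkage(embs: &[Vec<f32>], a: &[usize], b: &[usize]) -> f32 {
    let mut total = 0.0;
    for &i in a {
        for &j in b {
            total += cosine_similarity(&embs[i], &embs[j]);
        }
    }
    total / (a.len() * b.len()) as f32
}

/// Returns one cluster label per embedding.
pub fn ahc(embs: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Start with every embedding in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..embs.len()).map(|i| vec![i]).collect();
    loop {
        // Find the most similar pair of clusters.
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let s = linkage(embs, &clusters[i], &clusters[j]);
                if best.map_or(true, |(_, _, b)| s > b) {
                    best = Some((i, j, s));
                }
            }
        }
        match best {
            // Merge while the best pair is at least as similar as the threshold.
            Some((i, j, s)) if s >= threshold => {
                let merged = clusters.remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            _ => break,
        }
    }
    let mut labels = vec![0; embs.len()];
    for (label, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = label;
        }
    }
    labels
}
```

In this sketch a lower threshold merges more aggressively and yields fewer speakers, while a higher one splits speakers more readily.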
Comparison with pyannote
| | polyvoice | pyannote |
|---|---|---|
| Language | Rust | Python |
| Runtime | ONNX Runtime | PyTorch |
| GIL-free | Yes | No |
| Binary size | ~30 MB (with models) | ~2 GB (torch + models) |
| Deploy | Single binary / C FFI | Python env + pip |
| Concurrent sessions | Lock-free session pool | Thread-limited |
| Streaming | OnlineDiarizer built-in | Third-party wrappers |
pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.
Minimum Supported Rust Version (MSRV)
1.85 (Rust 2024 edition).
Accuracy (DER benchmarks)
Evaluated with 0.25s collar on standard diarization benchmarks:
VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)
| System | DER | Miss | FA | Confusion | Speed |
|---|---|---|---|---|---|
| polyvoice (AHC, t=0.4) | 16.4% | 3.9% | 3.2% | 9.3% | 10.6x RT (CPU) |
| pyannote 3.0 | ~11% | — | — | — | ~1x RT (GPU) |
AMI (16 meetings, 9 hours — meeting room recordings)
| System | DER | Miss | FA | Confusion | Speed |
|---|---|---|---|---|---|
| polyvoice (AHC, t=0.4) | 27.5% | 17.7% | 2.2% | 7.6% | 7x RT (CPU) |
| pyannote 3.0 | ~18% | — | — | — | ~1x RT (GPU) |
| Simple i-vector + AHC | ~33% | — | — | — | — |
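On these benchmarks polyvoice's DER is roughly 5 points (VoxConverse) to 10 points (AMI) higher than the approximate pyannote 3.0 figures, while it processes audio at 7x to 10.6x real time on CPU alone: no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.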
# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test
bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4
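In both polyvoice rows above, the Miss, FA and Confusion columns sum to the DER column (3.9 + 3.2 + 9.3 = 16.4 and 17.7 + 2.2 + 7.6 = 27.5), which matches the usual definition of diarization error rate as total error time over total reference speech time. The helper below is a sketch of that arithmetic only; der_percent is a hypothetical name, and how polyvoice-bench applies the collar and scores overlapping speech is not shown here.

```rust
/// Diarization error rate in percent.
/// All arguments are durations in the same unit (for example seconds).
/// Returns None when there is no reference speech to normalize by.
pub fn der_percent(
    missed: f64,
    false_alarm: f64,
    confusion: f64,
    reference_speech: f64,
) -> Option<f64> {
    if reference_speech <= 0.0 {
        return None;
    }
    Some(100.0 * (missed + false_alarm + confusion) / reference_speech)
}
```

Because the denominator is reference speech time rather than file length, DER can exceed 100% when false alarms are large.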
Features
- Pipeline API: Pipeline::run() for one-call diarization with VAD + embeddings + clustering.
- Online & Offline: OnlineDiarizer for real-time streaming, OfflineDiarizer for batch files.
- ONNX-powered: WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
- Lock-free session pool: crossbeam-queue-backed pool for concurrent ONNX inference.
- Silero VAD: integrated voice activity detection with stateful LSTM context.
- Overlap detection: find regions where multiple speakers talk simultaneously.
- Word alignment: assign speaker IDs to transcript words by timestamp.
- Python bindings: pip install polyvoice, 3-line API via PyO3/maturin.
- CLI: polyvoice diarize meeting.wav with text/json/rttm output.
- C FFI: drop-in .so/.dylib/.dll for Go, Node.js, C++ callers.
- Safety verified: Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
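To make the word alignment feature concrete, here is a minimal sketch that gives each transcript word the speaker whose turn overlaps it the most in time. The Turn and Word types, the assign_speakers function, and the largest-overlap rule are all assumptions for illustration; polyvoice's own alignment API and tie-breaking may differ.

```rust
/// A speaker turn in seconds (hypothetical type for this sketch).
pub struct Turn {
    pub speaker: String,
    pub start: f64,
    pub end: f64,
}

/// A transcript word with timestamps in seconds (hypothetical type).
pub struct Word {
    pub text: String,
    pub start: f64,
    pub end: f64,
}

/// Length of the intersection of two intervals, or 0.0 if they are disjoint.
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// For each word, returns the speaker of the turn with the largest overlap,
/// or None when the word falls entirely outside every turn.
pub fn assign_speakers<'a>(words: &[Word], turns: &'a [Turn]) -> Vec<Option<&'a str>> {
    words
        .iter()
        .map(|w| {
            turns
                .iter()
                .map(|t| (t, overlap(w.start, w.end, t.start, t.end)))
                .filter(|(_, o)| *o > 0.0)
                .max_by(|a, b| a.1.total_cmp(&b.1))
                .map(|(t, _)| t.speaker.as_str())
        })
        .collect()
}
```

With the two turns from the example at the top (0.10s to 7.60s and 8.10s to 14.10s), a word spanning 7.5s to 8.3s goes to the second speaker, and a word lying wholly in the gap gets no speaker.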
Configuration
use polyvoice::{DiarizationConfig, VadConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.4,        // cosine similarity threshold
    max_speakers: 64,      // hard speaker limit
    window_secs: 1.5,      // analysis window
    hop_secs: 0.75,        // sliding step
    min_speech_secs: 0.25, // discard shorter segments
    max_gap_secs: 0.5,     // merge same-speaker gaps under 500 ms
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,       // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,        // speech probability threshold
    min_silence_ms: 300.0, // minimum silence to split segments
};
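The effect of max_gap_secs and min_speech_secs can be sketched as a small post-processing pass over labeled segments. The Segment type and postprocess function are hypothetical, and the order used here (merge first, then discard short segments) is an assumption; the library may apply these settings at a different stage.

```rust
/// A labeled segment in seconds (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub speaker: usize,
    pub start: f64,
    pub end: f64,
}

/// Merges consecutive same-speaker segments separated by less than
/// `max_gap_secs`, then drops segments shorter than `min_speech_secs`.
pub fn postprocess(
    mut segs: Vec<Segment>,
    max_gap_secs: f64,
    min_speech_secs: f64,
) -> Vec<Segment> {
    segs.sort_by(|a, b| a.start.total_cmp(&b.start));
    let mut merged: Vec<Segment> = Vec::new();
    for seg in segs {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == seg.speaker && seg.start - prev.end < max_gap_secs {
                // Extend the previous segment instead of starting a new one.
                prev.end = prev.end.max(seg.end);
                continue;
            }
        }
        merged.push(seg);
    }
    merged.retain(|s| s.end - s.start >= min_speech_secs);
    merged
}
```

With the defaults shown above (0.5 and 0.25), two segments from the same speaker 300 ms apart become one turn, and an isolated 100 ms blip is dropped.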
Streaming (real-time)
use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback (placeholder chunk: 4800 samples = 300 ms at 16 kHz):
let chunk = vec![0.0f32; 4800];
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
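A streaming front end typically buffers incoming chunks until a full analysis window is available, then advances by the hop size. The sketch below shows that arithmetic for the window_secs and hop_secs values from the configuration section (1.5 s and 0.75 s at 16 kHz give 24000-sample windows every 12000 samples). WindowBuffer is a hypothetical helper; OnlineDiarizer's internal buffering is not documented here and may work differently.

```rust
/// Accumulates audio chunks and yields fixed-size, overlapping windows.
/// Hypothetical helper for illustration, not part of polyvoice.
pub struct WindowBuffer {
    buf: Vec<f32>,
    window: usize, // window length in samples
    hop: usize,    // step between window starts in samples
}

impl WindowBuffer {
    pub fn new(sample_rate: u32, window_secs: f32, hop_secs: f32) -> Self {
        let window = (window_secs * sample_rate as f32).round() as usize;
        let hop = (hop_secs * sample_rate as f32).round() as usize;
        // A zero hop would never consume samples; a hop larger than the
        // window would skip audio.
        assert!(hop > 0 && hop <= window, "hop must be in 1..=window");
        Self { buf: Vec::new(), window, hop }
    }

    /// Appends a chunk and returns every complete window now available.
    pub fn feed(&mut self, chunk: &[f32]) -> Vec<Vec<f32>> {
        self.buf.extend_from_slice(chunk);
        let mut out = Vec::new();
        while self.buf.len() >= self.window {
            out.push(self.buf[..self.window].to_vec());
            // Keep the tail so the next window overlaps this one.
            self.buf.drain(..self.hop);
        }
        out
    }
}
```

Under these assumptions, 300 ms chunks produce no window for the first four calls and one window on the fifth, which is one reason a streaming feed call can legitimately return nothing for early chunks.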
Verification
| Check | Tool |
|---|---|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
Roadmap
- WeSpeaker + ECAPA-TDNN ONNX extractors
- Silero VAD integration
- Agglomerative hierarchical clustering (AHC)
- Pipeline API (VAD + embeddings + AHC)
- C FFI bindings
- Miri / Loom / fuzz verification
- Cross-platform CI
- Python bindings (PyO3 / maturin)
- CLI tool (polyvoice diarize / download-models)
- DER benchmarks on AMI (27.5%) and VoxConverse (16.4%), 0.25s collar
- Spectral clustering backend
- PLDA scoring backend
Contributing
See CONTRIBUTING.md.
Changelog
See CHANGELOG.md.
License
MIT