Speaker diarization for Python — detect who spoke when in audio files. CPU-only, no GPU, no API keys, no account signup. Automatic speaker count detection.
diarize
Speaker diarization for Python — answers "who spoke when?" in any audio file.
Runs on CPU. No GPU, no API keys, no account signup. Apache 2.0 licensed.
pip install diarize
from diarize import diarize
result = diarize("meeting.wav")
for seg in result.segments:
    print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
~4.8% weighted DER on VoxConverse dev. Processes audio ~8x faster than real-time on CPU. Automatically detects the number of speakers.
Primary benchmark: VoxConverse. Preliminary AMI meeting-domain validation is in progress.
How diarize compares
| | diarize | pyannote (free) | pyannote (commercial) |
|---|---|---|---|
| License | Apache 2.0 | CC-BY-4.0 | Commercial |
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | ~4.8% | ~11.2% | ~8.5% |
| CPU speed (RTF) | 0.12 | 0.86 | — |
| Install | pip install diarize | pip install pyannote.audio | pip install pyannote.audio |
DER = Diarization Error Rate (lower is better). RTF = Real-Time Factor (lower is faster). pyannote numbers are self-reported from their benchmark page. The diarize number is from the VoxConverse dev evaluation described in benchmarks.
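As a rough illustration of how these two metrics are defined, the sketch below computes DER and RTF from raw durations. The function names and sample numbers are illustrative assumptions, not part of the diarize API or its benchmark scripts. The DER formula is the standard one: missed speech, false alarm, and speaker confusion, divided by total reference speech.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: summed error durations over total reference speech (seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def real_time_factor(processing_seconds, audio_seconds):
    """RTF: processing time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical numbers for illustration only, not benchmark data.
der = diarization_error_rate(missed=12.0, false_alarm=6.0, confusion=30.0, total_speech=1000.0)
rtf = real_time_factor(processing_seconds=72.0, audio_seconds=600.0)
speedup = 1.0 / rtf  # the inverse of RTF is the speed-up over real time

print(f"DER = {der:.1%}, RTF = {rtf:.2f}, speed-up = {speedup:.1f}x")
```

Read this way, an RTF of 0.12 and "about 8x faster than real time" are the same statement.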
Quick Start
from diarize import diarize
result = diarize("meeting.wav")
print(f"Found {result.num_speakers} speakers")
for seg in result.segments:
    print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
# Export to RTTM format
result.to_rttm("meeting.rttm")
Requires Python 3.9+. Supports WAV, MP3, FLAC, OGG, and other formats via soundfile/libsndfile.
diarize pins a compatible torch/torchaudio range during install, so no extra manual pinning is required.
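RTTM is a plain-text format with one line per speaker turn and space-separated fields. The sketch below reads such a line back, assuming the conventional field layout (record type, file ID, channel, onset, duration, and the speaker name in the eighth field). `parse_rttm_line`, `Turn`, and the sample line are illustrative and not part of the diarize API; the exact file ID and number formatting that `to_rttm` writes are not specified here.

```python
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    file_id: str
    start: float
    end: float
    speaker: str


def parse_rttm_line(line: str) -> Optional[Turn]:
    """Parse one RTTM line; return None for blank or non-SPEAKER lines."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "SPEAKER":
        return None
    onset = float(fields[3])
    duration = float(fields[4])
    return Turn(file_id=fields[1], start=onset, end=onset + duration, speaker=fields[7])


# Sample line written by hand for illustration, not actual diarize output.
sample = "SPEAKER meeting 1 0.500 3.700 <NA> <NA> SPEAKER_00 <NA> <NA>"
turn = parse_rttm_line(sample)
print(turn)
```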
📖 Full documentation — installation, API reference, architecture, benchmarks.
API
result = diarize("meeting.wav") # auto-detect speakers
result = diarize("call.mp3", num_speakers=2) # known speaker count
result = diarize("panel.flac", min_speakers=3, max_speakers=8)
result.segments # [Segment(start=0.5, end=4.2, speaker='SPEAKER_00'), ...]
result.num_speakers # 3
result.speakers # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02']
result.audio_duration # 324.5
result.to_rttm("output.rttm") # export to standard RTTM format
result.to_list() # export as list of dicts (JSON-serializable)
Each Segment has .start, .end, and .duration (in seconds) plus a .speaker label.
Full API reference: documentation
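A common next step is totalling speaking time per speaker. The sketch below does that using only the documented segment attributes. `Segment` here is a local stand-in dataclass so the example runs without diarize installed, and `talk_time` is a hypothetical helper; with the library, pass `result.segments` instead of the hand-written list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Local stand-in mirroring the documented .start, .end, and .speaker attributes."""
    start: float
    end: float
    speaker: str


def talk_time(segments):
    """Return total speaking time in seconds per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Hand-written segments for illustration; with diarize, use result.segments.
segments = [
    Segment(0.5, 4.5, "SPEAKER_00"),
    Segment(4.5, 9.0, "SPEAKER_01"),
    Segment(9.0, 11.0, "SPEAKER_00"),
]
totals = talk_time(segments)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f}s")
```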
How It Works
Four-stage pipeline, all CPU, all open-source:
- Silero VAD (MIT) — detects speech segments
- WeSpeaker ResNet34-LM (Apache 2.0) — extracts 256-dim speaker embeddings via ONNX
- GMM BIC + silhouette refinement — estimates the number of speakers
- Spectral Clustering (scikit-learn, BSD) + temporal smoothing — assigns speaker labels
Details: How It Works
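The temporal smoothing step is not described in detail here. One generic way to suppress short label jumps is a sliding majority vote over per-window cluster labels, sketched below. The function name, the window size, and the tie-breaking rule are assumptions for illustration and may differ from what diarize does internally.

```python
from collections import Counter


def smooth_labels(labels, window=3):
    """Replace each label with the majority label in a centered window.

    Ties keep the original label, so smoothing never invents a change.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i, current in enumerate(labels):
        # Votes are read from the unsmoothed input, not from earlier results.
        neighborhood = labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        top_label, top_count = counts.most_common(1)[0]
        # Keep the current label unless another label strictly outnumbers it.
        smoothed.append(top_label if top_count > counts[current] else current)
    return smoothed


# Per-window cluster labels with a single-window jump to speaker 1.
raw = [0, 0, 1, 0, 0, 2, 2, 2]
print(smooth_labels(raw))
```

A larger window removes longer spurious runs but also risks erasing genuinely short turns, so the window size is a trade-off rather than a fixed constant.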
Benchmarks
Evaluated on VoxConverse dev set (216 files, 1–20 speakers):
Diarization Error Rate (DER)
| System | Weighted DER | Notes |
|---|---|---|
| pyannote precision-2 | ~8.5% | Commercial license |
| diarize | ~4.8% | Apache 2.0, CPU-only, no API key |
| pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |
Speaker Count Estimation
| Metric | Result |
|---|---|
| Files | 216 |
| Exact match | 125/216 (58%) |
| Within ±1 | 178/216 (82%) |
Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass num_speakers when the count is known.
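For reference, the two speaker-count metrics in the table can be computed as below. This is a minimal sketch with hypothetical counts, not the benchmark script; `count_accuracy` is an illustrative name.

```python
def count_accuracy(predicted, reference):
    """Return (exact_match, within_one) fractions for paired speaker counts."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(reference)
    exact = sum(p == r for p, r in zip(predicted, reference))
    within_one = sum(abs(p - r) <= 1 for p, r in zip(predicted, reference))
    return exact / n, within_one / n


# Hypothetical counts for five files, not benchmark data.
predicted = [2, 3, 5, 4, 9]
reference = [2, 4, 5, 7, 12]
exact, within_one = count_accuracy(predicted, reference)
print(f"Exact match: {exact:.0%}, within ±1: {within_one:.0%}")
```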
Preliminary AMI meeting-domain check (16 Mix-Headset test files, 4–9 speakers):
| Metric | Result |
|---|---|
| Weighted DER | 14.96% |
| Speaker count exact match | 4/16 (25%) |
| Speaker count within ±1 | 8/16 (50%) |
AMI confirms that meeting-domain speaker counting is harder: the estimator often collapses 6+ speaker meetings to 4–5 speakers.
Full benchmark results, speed comparison, and methodology: benchmarks.
When to use something else
- You need commercial support or broad cross-dataset validation. pyannote's commercial model has published production-oriented benchmarks beyond this limited VoxConverse/AMI evaluation. If accuracy is the top priority and you have budget, compare on your own data.
- You need very stable speaker labels in transcripts. Temporal smoothing reduces short label jumps, but diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple SPEAKER_XX labels, especially on noisy real-world audio.
- Your audio has 8+ speakers. Automatic speaker count estimation degrades above 7 speakers. You can pass num_speakers explicitly, but test carefully.
- You need overlapping speech detection. diarize assigns each segment to one speaker. Overlapping speech is not modeled.
- You need GPU-accelerated throughput. diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
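If fragmented output is a concern for transcripts, a light post-processing pass can at least merge consecutive turns that carry the same label. The sketch below is one possible approach under stated assumptions: segments are plain `(start, end, speaker)` tuples sorted by start time, and `merge_adjacent` and `max_gap` are hypothetical names, not diarize features. It does not fix label switching, where one real speaker receives two different labels.

```python
def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds.

    `segments` is a list of (start, end, speaker) tuples sorted by start time.
    """
    merged = []
    for start, end, speaker in segments:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


turns = [
    (0.0, 2.0, "SPEAKER_00"),
    (2.2, 5.0, "SPEAKER_00"),   # short pause, same speaker: merged
    (5.1, 7.0, "SPEAKER_01"),
    (9.0, 10.0, "SPEAKER_01"),  # gap of 2.0 s exceeds max_gap: kept separate
]
print(merge_adjacent(turns))
```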
Roadmap
Current benchmarks include VoxConverse dev and preliminary AMI test validation. We are actively working on:
- Cross-dataset validation — DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
- Speaker count estimation benchmarks — comparison of speaker counting accuracy against other systems
- Broader system comparison — NeMo, WhisperX, and other diarization solutions
- Streaming / real-time diarization — live audio streams with real-time speaker detection
- Speaker identification — recognise known speakers across sessions using stored embeddings
Logging
diarize uses Python's standard logging module:
import logging
logging.basicConfig(level=logging.INFO)
License
Apache 2.0 License. See LICENSE for details.
All dependencies are permissively licensed:
- Silero VAD: MIT
- WeSpeaker: Apache 2.0
- scikit-learn: BSD
- PyTorch: BSD
Contributing
Contributions are welcome! Please open an issue or pull request on GitHub.