diarize
Speaker diarization for Python — answers "who spoke when?" in any audio file.
Runs on CPU. No GPU, no API keys, no account signup. Apache 2.0 licensed.
pip install diarize
from diarize import diarize
result = diarize("meeting.wav")
for seg in result.segments:
    print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
~10.8% DER on VoxConverse (lower than pyannote's free models). Processes audio ~8x faster than real-time on CPU. Automatically detects the number of speakers.
Benchmarked on a single dataset (VoxConverse). Cross-dataset validation is in progress.
How diarize compares
| | diarize | pyannote (free) | pyannote (commercial) |
|---|---|---|---|
| License | Apache 2.0 | CC-BY-4.0 | Commercial |
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse) | ~10.8% | ~11.2% | ~8.5% |
| CPU speed (RTF) | 0.12 | 0.86 | — |
| Install | pip install diarize | pip install pyannote.audio | pip install pyannote.audio |
DER = Diarization Error Rate (lower is better). RTF = Real-Time Factor (lower is faster). pyannote numbers are self-reported from their benchmark page. Full methodology: benchmarks.
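As a rough illustration of what the RTF figures mean in practice: RTF is processing time divided by audio duration, so its inverse is the speed-up over real time. The helper names below are illustrative and not part of diarize; the RTF values are the ones from the table.

```python
def speedup_over_realtime(rtf: float) -> float:
    """Return how many times faster than real time a system with this RTF runs."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock time from RTF = processing time / audio duration."""
    return audio_seconds * rtf


# RTF values taken from the comparison table.
for name, rtf in [("diarize", 0.12), ("pyannote (free, CPU)", 0.86)]:
    minutes = estimated_processing_seconds(3600, rtf) / 60
    print(f"{name}: {speedup_over_realtime(rtf):.1f}x real time, "
          f"~{minutes:.1f} min per hour of audio")
```

At RTF 0.12, one hour of audio works out to roughly 7 minutes of processing, which is where the "~8x faster than real-time" figure comes from. Actual times depend on the CPU.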
Quick Start
from diarize import diarize
result = diarize("meeting.wav")
print(f"Found {result.num_speakers} speakers")
for seg in result.segments:
    print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
# Export to RTTM format
result.to_rttm("meeting.rttm")
Requires Python 3.9+. Supports WAV, MP3, FLAC, OGG, and other formats via soundfile/libsndfile.
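For readers unfamiliar with RTTM, the export target used in the Quick Start: each speaker turn is one space-separated `SPEAKER` record with ten fields (type, file ID, channel, onset, duration, and speaker name, with `<NA>` in the unused slots). The sketch below formats and parses such a record by hand. It is not diarize's own writer; the file ID, channel number, and decimal precision that `to_rttm` actually emits are assumptions here.

```python
def rttm_line(file_id: str, start: float, end: float, speaker: str) -> str:
    """Format one speaker turn as a 10-field RTTM SPEAKER record."""
    duration = end - start
    return (f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")


def parse_rttm_line(line: str) -> dict:
    """Parse a SPEAKER record back into start, end, and speaker."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER record: {line!r}")
    onset, duration = float(fields[3]), float(fields[4])
    return {"file_id": fields[1], "start": onset,
            "end": onset + duration, "speaker": fields[7]}


line = rttm_line("meeting", 0.5, 4.2, "SPEAKER_00")
print(line)
print(parse_rttm_line(line))
```

Because RTTM stores onset and duration rather than start and end, the end time has to be recomputed when reading a file back.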
📖 Full documentation — installation, API reference, architecture, benchmarks.
API
result = diarize("meeting.wav") # auto-detect speakers
result = diarize("call.mp3", num_speakers=2) # known speaker count
result = diarize("panel.flac", min_speakers=3, max_speakers=8)
result.segments # [Segment(start=0.5, end=4.2, speaker='SPEAKER_00'), ...]
result.num_speakers # 3
result.speakers # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02']
result.audio_duration # 324.5
result.to_rttm("output.rttm") # export to standard RTTM format
result.to_list() # export as list of dicts (JSON-serializable)
Each Segment has .start, .end, and .duration (all in seconds), plus a .speaker label.
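A common follow-up is totalling talk time per speaker. The sketch below uses a stand-in `Segment` dataclass that mirrors the attributes listed above so it runs without diarize installed; `talk_time_by_speaker` is a hypothetical helper, not part of the library, and the segment values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in exposing the attributes the README lists for segments."""
    start: float
    end: float
    speaker: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def talk_time_by_speaker(segments):
    """Sum segment durations per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


# Hypothetical segments; with diarize installed, pass result.segments instead.
segments = [
    Segment(0.5, 4.2, "SPEAKER_00"),
    Segment(4.2, 9.0, "SPEAKER_01"),
    Segment(9.0, 12.5, "SPEAKER_00"),
]
for speaker, seconds in sorted(talk_time_by_speaker(segments).items()):
    print(f"{speaker}: {seconds:.1f}s")
```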
Full API reference: documentation
How It Works
Four-stage pipeline, all CPU, all open-source:
- Silero VAD (MIT) — detects speech segments
- WeSpeaker ResNet34-LM (Apache 2.0) — extracts 256-dim speaker embeddings via ONNX
- GMM BIC — estimates the number of speakers
- Spectral Clustering (scikit-learn, BSD) — assigns speaker labels
Details: How It Works
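To make the last two stages concrete, here is a minimal sketch of BIC-based speaker counting followed by spectral clustering, using scikit-learn on synthetic data. It is a generic illustration, not diarize's implementation: the covariance type, the cosine-similarity affinity, the candidate range, and the 4-dimensional toy embeddings (real ones are 256-dimensional) are all assumptions, and the function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture


def estimate_num_speakers(embeddings, max_speakers=6, seed=0):
    """Pick the component count whose Gaussian mixture has the lowest BIC."""
    best_k, best_bic = 1, np.inf
    for k in range(1, max_speakers + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              n_init=3, random_state=seed).fit(embeddings)
        bic = gmm.bic(embeddings)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


def cluster_speakers(embeddings, num_speakers, seed=0):
    """Label embeddings by spectral clustering on a cosine-similarity graph."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, 1.0)  # affinities must be non-negative
    model = SpectralClustering(n_clusters=num_speakers, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(affinity)


# Synthetic stand-in for speaker embeddings: 3 well-separated groups in 4-D.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(4)[:3]
embeddings = np.vstack([c + 0.3 * rng.standard_normal((100, 4)) for c in centers])

k = estimate_num_speakers(embeddings)
labels = cluster_speakers(embeddings, k)
print(f"estimated speakers: {k}, labels used: {sorted(set(labels.tolist()))}")
```

Splitting the problem this way means the speaker count can be supplied directly (as with `num_speakers`) and only the clustering step needs to run.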
Benchmarks
Evaluated on VoxConverse dev set (216 files, 1–20 speakers):
Diarization Error Rate (DER)
| System | Weighted DER | Notes |
|---|---|---|
| pyannote precision-2 | ~8.5% | Commercial license |
| diarize | ~10.8% | Apache 2.0, CPU-only, no API key |
| pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |
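For reference, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The sketch below shows that arithmetic and one plausible reading of "weighted": pooling error and speech time across files so longer files count proportionally more. That reading is an assumption, the per-file numbers are invented, and the benchmark's exact scoring settings are in the linked methodology, not here.

```python
from dataclasses import dataclass


@dataclass
class FileErrors:
    """Per-file error durations in seconds (hypothetical values)."""
    missed: float        # reference speech the system labelled as non-speech
    false_alarm: float   # non-speech the system labelled as speech
    confusion: float     # speech attributed to the wrong speaker
    total_speech: float  # total reference speech time

    @property
    def der(self) -> float:
        return (self.missed + self.false_alarm + self.confusion) / self.total_speech


def weighted_der(files):
    """Pool errors across files so longer files contribute proportionally more."""
    errors = sum(f.missed + f.false_alarm + f.confusion for f in files)
    speech = sum(f.total_speech for f in files)
    return errors / speech


files = [
    FileErrors(missed=2.0, false_alarm=1.0, confusion=3.0, total_speech=60.0),
    FileErrors(missed=10.0, false_alarm=5.0, confusion=30.0, total_speech=300.0),
]
mean_der = sum(f.der for f in files) / len(files)
print(f"per-file DER: {[round(f.der, 3) for f in files]}")
print(f"unweighted mean: {mean_der:.3f}, weighted: {weighted_der(files):.3f}")
```

In this toy case the weighted figure sits closer to the longer file's DER than the plain mean does.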
Speaker Count Estimation
| GT Speakers | Files | Exact Match | Within ±1 |
|---|---|---|---|
| 1 | 22 | 91% | 95% |
| 2 | 44 | 70% | 91% |
| 3 | 35 | 69% | 97% |
| 4 | 24 | 54% | 88% |
| 5 | 31 | 32% | 87% |
| 6–7 | 29 | 45% | 79% |
| 8+ | 31 | 0% | 26% |
| Overall | 216 | 51% | 81% |
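The Overall row is consistent with a file-weighted average of the rows above it. A quick check, using only the numbers in the table (the per-row percentages are already rounded, so the result is approximate):

```python
# (ground-truth speakers, files, exact match %, within ±1 %) from the table
rows = [
    ("1", 22, 91, 95),
    ("2", 44, 70, 91),
    ("3", 35, 69, 97),
    ("4", 24, 54, 88),
    ("5", 31, 32, 87),
    ("6-7", 29, 45, 79),
    ("8+", 31, 0, 26),
]


def file_weighted_average(rows, column):
    """Average a percentage column, weighting each row by its file count."""
    total_files = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_files


total_files = sum(row[1] for row in rows)
exact = file_weighted_average(rows, 2)
within_one = file_weighted_average(rows, 3)
print(f"{total_files} files: exact {exact:.1f}%, within ±1 {within_one:.1f}%")
```

The 8+ bucket (31 files) pulls the overall figures down noticeably, which is worth keeping in mind for recordings with few speakers.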
Full benchmark results, speed comparison, and methodology: benchmarks.
When to use something else
- You need <9% DER. pyannote's commercial model (precision-2) achieves ~8.5%. If accuracy is the top priority and you have budget, use that.
- Your audio has 8+ speakers. Automatic speaker count estimation degrades above 7 speakers. You can pass num_speakers explicitly, but test carefully.
- You need overlapping speech detection. diarize assigns each segment to one speaker. Overlapping speech is not modeled.
- You need GPU-accelerated throughput. diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
Roadmap
Current benchmarks are based on VoxConverse dev set only. We are actively working on:
- Cross-dataset validation — AMI, DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
- Speaker count estimation benchmarks — comparison of speaker counting accuracy against other systems
- Broader system comparison — NeMo, WhisperX, and other diarization solutions
- Streaming / real-time diarization — live audio streams with real-time speaker detection
- Speaker identification — recognise known speakers across sessions using stored embeddings
Logging
diarize uses Python's standard logging module:
import logging
logging.basicConfig(level=logging.INFO)
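If INFO on the root logger is too noisy, the standard logging module also allows raising verbosity for a single package. The sketch below assumes the library logs under a logger named "diarize" (the usual `logging.getLogger(__name__)` convention); the source does not confirm the logger name.

```python
import logging

# Keep other libraries at WARNING and add timestamps to every record.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: the package logs under a logger named "diarize".
package_logger = logging.getLogger("diarize")
package_logger.setLevel(logging.DEBUG)
```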
License
Apache 2.0 License. See LICENSE for details.
All dependencies are permissively licensed:
- Silero VAD: MIT
- WeSpeaker: Apache 2.0
- scikit-learn: BSD
- PyTorch: BSD
Contributing
Contributions are welcome! Please open an issue or pull request on GitHub.