Speaker identification powered by pyannote and resemblyzer

voicetag

Know who said what. Automatically.


What is voicetag?

voicetag is a Python library for speaker diarization and named speaker identification. It combines pyannote.audio for diarization with resemblyzer for speaker embeddings, giving you a single interface to answer: who is speaking, and when?

Enroll speakers once with a few audio samples, then identify them in any recording -- meetings, podcasts, interviews, phone calls.

Features

  • Dead-simple API -- enroll speakers and identify them in three lines of code
  • Language agnostic -- works with Hebrew, English, Mandarin, or any spoken language
  • Built-in overlap detection -- flags regions where multiple speakers talk simultaneously
  • Fast parallel processing -- concurrent embedding computation with configurable thread pools
  • CLI tool included -- enroll, identify, and manage profiles from the terminal
  • Save/load speaker profiles -- persist enrolled speakers to disk and reuse across sessions
  • Pydantic result models -- fully typed, validated, immutable result objects
  • Built-in transcription -- plug in OpenAI, Groq, Fireworks, Whisper, or Deepgram to get "who said what"

Quick Start

from voicetag import VoiceTag

vt = VoiceTag()
vt.enroll("Alice", ["alice1.wav", "alice2.wav"])
vt.enroll("Bob", ["bob1.wav"])

result = vt.identify("meeting.wav")
for segment in result.segments:
    print(f"{segment.speaker}: {segment.start:.1f}s - {segment.end:.1f}s")

Output:

Alice: 0.0s - 4.2s
Bob: 4.5s - 8.1s
Alice: 8.3s - 12.7s
UNKNOWN: 13.0s - 15.4s

Installation

pip install voicetag

For transcription support, install with a provider:

pip install voicetag[openai]    # OpenAI Whisper API
pip install voicetag[groq]      # Groq (fast Whisper)
pip install voicetag[whisper]   # Local Whisper (no API key needed)
pip install voicetag[deepgram]  # Deepgram
pip install voicetag[all-stt]   # All providers

voicetag requires access to the pyannote.audio speaker diarization model, which is gated behind a HuggingFace license agreement.

Prerequisites

  1. Accept the pyannote model licenses at:
  2. Create a HuggingFace token at huggingface.co/settings/tokens
  3. Set the token via environment variable or config:

export HF_TOKEN="hf_your_token_here"

Or pass it directly:

from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(hf_token="hf_your_token_here"))

GPU Acceleration (optional)

For faster processing on CUDA or Apple Silicon:

from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(device="cuda"))  # NVIDIA GPU
vt = VoiceTag(config=VoiceTagConfig(device="mps"))   # Apple Silicon

CLI Usage

voicetag ships with a full-featured command-line interface.

Enroll a speaker

voicetag enroll "Alice" alice_sample1.wav alice_sample2.wav
Enrolled speaker Alice from 2 sample(s).
Profiles saved to voicetag_profiles.json

Identify speakers in a recording

voicetag identify meeting.wav --threshold 0.8
Speaker Timeline -- meeting.wav
+---------+----------+----------+----------+------------+
| Speaker | Start    | End      | Duration | Confidence |
+---------+----------+----------+----------+------------+
| Alice   | 00:00.00 | 00:04.20 | 00:04.20 | 0.92       |
| Bob     | 00:04.50 | 00:08.10 | 00:03.60 | 0.87       |
| Alice   | 00:08.30 | 00:12.70 | 00:04.40 | 0.91       |
| UNKNOWN | 00:13.00 | 00:15.40 | 00:02.40 | --         |
+---------+----------+----------+----------+------------+

Summary
  Total duration:  00:15.400
  Speakers:        2
  Segments:        4
  Overlaps:        0

Save results to JSON:

voicetag identify meeting.wav --output results.json

Manage profiles

voicetag profiles list
voicetag profiles remove "Alice"

Transcribe (speaker + text)

voicetag transcribe meeting.wav --provider openai --language en
voicetag transcribe meeting.wav --provider groq --language he
voicetag transcribe meeting.wav --provider whisper --model base
Transcript -- meeting.wav
+---------+----------+----------+------------------------------------+
| Speaker | Start    | End      | Text                               |
+---------+----------+----------+------------------------------------+
| Alice   | 00:00.00 | 00:04.20 | Let's start the meeting            |
| Bob     | 00:04.50 | 00:08.10 | Sure, I have the agenda ready      |
| Alice   | 00:08.30 | 00:12.70 | Great, let's go through it         |
+---------+----------+----------+------------------------------------+

List available providers:

voicetag providers

All CLI options

voicetag --help
voicetag identify --help

Option             Description
-----------------  ---------------------------------------------------------------
--profiles PATH    Path to speaker profiles file (default: voicetag_profiles.json)
--output, -o PATH  Save results as JSON
--threshold FLOAT  Similarity threshold override (0.0-1.0)
--hf-token TEXT    HuggingFace API token
--device TEXT      Torch device: cpu, cuda, mps
--unknown-only     Skip speaker matching, just diarize

API Reference

VoiceTag

The main entry point. Wraps the full diarization + identification pipeline.

from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(...))

Method                                 Returns            Description
-------------------------------------  -----------------  -------------------------------------------------
enroll(name, audio_paths)              SpeakerProfile     Register a speaker from one or more audio files
identify(audio_path)                   DiarizationResult  Run full identification pipeline on an audio file
save(path)                             None               Save enrolled speaker profiles to disk
load(path)                             None               Load speaker profiles from disk
remove_speaker(name)                   None               Remove an enrolled speaker by name
enrolled_speakers                      list[str]          Property: list of enrolled speaker names
transcribe(audio_path, provider, ...)  TranscriptResult   Identify speakers and transcribe what they said

Transcription example

result = vt.transcribe("meeting.wav", provider="openai", language="en")

for seg in result.segments:
    print(f"[{seg.speaker}] {seg.text}")

# Full transcript
print(result.full_transcript)

# Group by speaker
for speaker, segments in result.by_speaker.items():
    print(f"\n{speaker}:")
    for seg in segments:
        print(f"  {seg.text}")

Supported providers: openai, groq, fireworks, whisper (local), deepgram

VoiceTagConfig

Configuration model (Pydantic v2, frozen/immutable).

config = VoiceTagConfig(
    hf_token="hf_...",          # HuggingFace token (or set HF_TOKEN env var)
    similarity_threshold=0.75,  # min cosine similarity for a match
    overlap_threshold=0.5,      # min overlap ratio to flag
    max_workers=4,              # parallel embedding threads
    min_segment_duration=0.5,   # discard segments shorter than this (seconds)
    device="cpu",               # "cpu", "cuda", or "mps"
)
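
Because the config is frozen, changing a setting means building a new object rather than mutating the existing one. The sketch below illustrates that pattern with a standard-library frozen dataclass as a stand-in. ConfigSketch and its validation rules are hypothetical and are not voicetag's actual VoiceTagConfig; with a real Pydantic v2 model, the equivalent copy step would typically be model_copy(update=...).

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for a frozen config; not voicetag's real class."""

    similarity_threshold: float = 0.75
    device: str = "cpu"

    def __post_init__(self):
        # Assumed validation rules, chosen to mirror the documented fields.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be within 0.0-1.0")
        if self.device not in {"cpu", "cuda", "mps"}:
            raise ValueError("device must be 'cpu', 'cuda', or 'mps'")


base = ConfigSketch()

# Derive a stricter config; the original object is left untouched.
strict = replace(base, similarity_threshold=0.9)

try:
    base.device = "cuda"  # assignment on a frozen instance is rejected
except FrozenInstanceError:
    print("config is immutable; build a new one instead")
```

Keeping the config immutable means one instance can be shared across worker threads without any risk of a setting changing mid-run.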

Result Models

DiarizationResult -- returned by identify():

Field            Type                                   Description
---------------  -------------------------------------  ------------------------------------
segments         list[SpeakerSegment | OverlapSegment]  Ordered timeline of speaker segments
audio_duration   float                                  Total audio length in seconds
num_speakers     int                                    Number of distinct speakers detected
processing_time  float                                  Wall-clock pipeline time in seconds

SpeakerSegment:

Field       Type   Description
----------  -----  ------------------------------------
speaker     str    Identified speaker name or "UNKNOWN"
start       float  Start time in seconds
end         float  End time in seconds
confidence  float  Cosine similarity score (0.0-1.0)
duration    float  Property: end - start

OverlapSegment:

Field     Type                Description
--------  ------------------  -----------------------------
speakers  list[str]           Names of overlapping speakers
start     float               Start time in seconds
end       float               End time in seconds
speaker   Literal["OVERLAP"]  Always "OVERLAP"
duration  float               Property: end - start
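
Since both segment types expose speaker, start, end, and duration, a timeline can be aggregated without checking which type each entry is. The sketch below totals talk time per label using a hypothetical Segment dataclass as a stand-in for the real result models; overlap regions simply accumulate under the "OVERLAP" key.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in exposing the fields shared by both segment types."""

    speaker: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def talk_time(segments):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


timeline = [
    Segment("Alice", 0.0, 4.2),
    Segment("Bob", 4.5, 8.1),
    Segment("OVERLAP", 8.1, 8.3),
    Segment("Alice", 8.3, 12.7),
]

for speaker, seconds in talk_time(timeline).items():
    print(f"{speaker}: {seconds:.1f}s")
```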

SpeakerProfile:

Field        Type         Description
-----------  -----------  -----------------------------------------
name         str          Speaker name
embedding    list[float]  256-dimensional mean embedding vector
num_samples  int          Number of audio files used for enrollment
created_at   datetime     UTC timestamp of enrollment
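
The embedding field is described as a mean vector, which suggests that the embeddings of the individual enrollment samples are averaged element by element. A minimal sketch of that averaging follows, using 3-dimensional toy vectors in place of 256-dimensional ones. The mean_embedding helper is hypothetical, and whether voicetag re-normalizes the result afterwards is not stated in this document.

```python
def mean_embedding(embeddings):
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Two toy "sample" embeddings for the same speaker.
samples = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
]
profile_vector = mean_embedding(samples)
print(profile_vector)
```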

Error Handling

All exceptions inherit from VoiceTagError:

from voicetag import VoiceTagError

try:
    result = vt.identify("audio.wav")
except VoiceTagError as e:
    print(f"Error: {e}")

Exception            When
-------------------  ------------------------------------------
VoiceTagConfigError  Invalid config or missing HuggingFace token
EnrollmentError      Enrollment fails (no audio, bad format)
DiarizationError     Pyannote processing failure
AudioLoadError       Audio file not found or unsupported format
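
Because every exception shares one base class, specific failures can be handled first, with the base class as a catch-all. The sketch below shows that ordering with locally defined stand-in classes, so it runs without voicetag installed. The identify_or_report helper is hypothetical; the real classes would be imported from voicetag instead, assuming the package exports them alongside VoiceTagError.

```python
class VoiceTagError(Exception):
    """Stand-in for the library's base exception."""


class AudioLoadError(VoiceTagError):
    """Stand-in: audio file not found or unsupported format."""


class DiarizationError(VoiceTagError):
    """Stand-in: diarization stage failed."""


def identify_or_report(identify, audio_path):
    """Call identify(audio_path); return (result, error_message)."""
    try:
        return identify(audio_path), None
    except AudioLoadError as exc:
        # Most specific handler first: likely a bad path or format.
        return None, f"could not read {audio_path}: {exc}"
    except VoiceTagError as exc:
        # Catch-all for any other library failure.
        return None, f"voicetag failed: {exc}"


def missing_file(path):
    raise AudioLoadError("file not found")


result, error = identify_or_report(missing_file, "audio.wav")
print(error)
```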

Real-World Use Cases

  • Podcasts -- automatically label host vs. guest segments for transcription
  • Interviews -- separate interviewer and interviewee speech for analysis
  • Meeting recordings -- identify who said what in team meetings, generate per-speaker summaries
  • Court recordings -- tag judge, attorney, and witness speech segments
  • Call centers -- distinguish agent from customer in call recordings for QA
  • Media monitoring -- track specific speakers across broadcast recordings

How It Works

voicetag runs a three-stage pipeline:

Audio File
    |
    v
1. DIARIZE (pyannote.audio)
   "When does each speaker talk?"
   -> segments: [(0.0-4.2, SPEAKER_00), (4.5-8.1, SPEAKER_01), ...]
    |
    v
2. EMBED (resemblyzer)
   "What does each speaker sound like?"
   -> 256-dim embedding vector per segment (computed in parallel)
    |
    v
3. MATCH (cosine similarity)
   "Which enrolled speaker does this sound like?"
   -> Alice (0.92), Bob (0.87), UNKNOWN (below threshold)
    |
    v
DiarizationResult with named speaker timeline

  1. Diarize -- pyannote.audio segments the audio into speaker turns with anonymous labels (SPEAKER_00, SPEAKER_01, etc.).
  2. Embed -- resemblyzer computes a 256-dimensional voice embedding for each segment, running in parallel via a thread pool.
  3. Match -- each embedding is compared against the enrolled speaker profiles using cosine similarity. Matches above the threshold are assigned the speaker's name; all others are labeled "UNKNOWN".
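
The matching step can be sketched in plain Python as follows. This is an illustration of threshold-based cosine matching in general, not voicetag's internal code: the cosine_similarity and match_speaker helpers are hypothetical, the vectors are 3-dimensional toys, and treating a score exactly equal to the threshold as a match is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def match_speaker(embedding, profiles, threshold=0.75):
    """Return (name, score) for the best profile, or ("UNKNOWN", score)."""
    best_name, best_score = "UNKNOWN", 0.0
    for name, reference in profiles.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return "UNKNOWN", best_score
    return best_name, best_score


profiles = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}

print(match_speaker([0.9, 0.1, 0.0], profiles))  # close to Alice's vector
print(match_speaker([0.5, 0.5, 0.7], profiles))  # close to neither profile
```

Raising the threshold trades missed matches for fewer false identifications, which is the knob that similarity_threshold and the CLI's --threshold option expose.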

Overlap detection runs in parallel with matching, identifying regions where two or more speakers talk simultaneously.
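
One common way to find such regions is a sweep over the start and end times of the speaker turns. The sketch below shows that generic technique under stated assumptions: find_overlaps is a hypothetical helper, voicetag's own overlap detection may work differently, and the overlap_threshold ratio is not modeled here.

```python
def find_overlaps(turns):
    """Return (start, end, speakers) regions where two or more speakers are active.

    turns: iterable of (start, end, speaker) tuples, times in seconds.
    """
    events = []
    for start, end, speaker in turns:
        events.append((start, 1, speaker))  # 1 = turn begins
        events.append((end, 0, speaker))    # 0 = turn ends
    # At equal timestamps, ends sort before starts, so turns that merely
    # touch (one ends exactly when the next begins) do not count as overlap.
    events.sort(key=lambda e: (e[0], e[1]))

    active = {}  # speaker -> number of currently open turns
    overlaps = []
    prev_time = None
    for time, kind, speaker in events:
        # The interval since the previous event had a fixed set of speakers.
        if prev_time is not None and time > prev_time and len(active) >= 2:
            overlaps.append((prev_time, time, sorted(active)))
        if kind == 1:
            active[speaker] = active.get(speaker, 0) + 1
        else:
            active[speaker] -= 1
            if active[speaker] == 0:
                del active[speaker]
        prev_time = time
    return overlaps


turns = [(0.0, 4.2, "Alice"), (3.8, 8.1, "Bob"), (8.1, 12.7, "Alice")]
print(find_overlaps(turns))
```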

Comparison

Feature                       voicetag        pyannote alone  WhisperX  Manual labeling
----------------------------  --------------  --------------  --------  ---------------
Speaker diarization           Yes             Yes             Yes       N/A
Named speaker identification  Yes             No              No        Yes
Overlap detection             Yes             Yes             No        Varies
CLI tool                      Yes             No              Yes       N/A
Save/load speaker profiles    Yes             N/A             N/A       N/A
Language agnostic             Yes             Yes             Yes       Yes
Typed result models           Yes (Pydantic)  No              No        N/A
Lines of code to identify     3               ~30             ~20       N/A

Configuration

VoiceTagConfig controls all tunable parameters:

Field                 Type           Default  Description
--------------------  -------------  -------  ---------------------------------------------------------
hf_token              Optional[str]  None     HuggingFace token. Falls back to HF_TOKEN env var.
similarity_threshold  float          0.75     Minimum cosine similarity for a match. Range: (0.0, 1.0).
overlap_threshold     float          0.5      Minimum overlap ratio to flag as overlapping speech.
max_workers           int            4        Thread count for parallel embedding computation.
min_segment_duration  float          0.5      Segments shorter than this (seconds) are discarded.
device                str            "cpu"    Torch device: "cpu", "cuda", or "mps".
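
To make max_workers and min_segment_duration concrete, here is a rough sketch of how two such settings could interact: short segments are dropped first, then the remaining ones are embedded on a bounded thread pool. The embed_segments function and the fake_embed callback are hypothetical, and the real pipeline's internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_segments(segments, embed_fn, max_workers=4, min_segment_duration=0.5):
    """Embed (start, end) segments in parallel, skipping very short ones."""
    kept = [
        (start, end)
        for start, end in segments
        if end - start >= min_segment_duration
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so each embedding
        # lines up with its segment in `kept`.
        embeddings = list(pool.map(embed_fn, kept))
    return list(zip(kept, embeddings))


def fake_embed(segment):
    """Placeholder for a real embedding call; returns a 1-element vector."""
    start, end = segment
    return [end - start]


segments = [(0.0, 4.2), (4.3, 4.5), (4.5, 8.1)]
for segment, vector in embed_segments(segments, fake_embed, max_workers=2):
    print(segment, vector)
```

Threads help here only when the per-segment work releases the GIL or waits on I/O, which is an assumption about the embedding backend rather than something this document states.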

Token resolution order:

  1. config.hf_token (explicit)
  2. HF_TOKEN environment variable
  3. Raise VoiceTagConfigError with a link to huggingface.co/settings/tokens
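
The resolution order above amounts to a short fallback chain. A minimal sketch, assuming a hypothetical resolve_hf_token helper and a locally defined stand-in for VoiceTagConfigError (the library's real implementation is not shown in this document):

```python
import os


class VoiceTagConfigError(Exception):
    """Stand-in for the library's exception of the same name."""


def resolve_hf_token(explicit_token=None, env=None):
    """Return the explicit token, else HF_TOKEN from the environment, else raise."""
    env = os.environ if env is None else env
    if explicit_token:  # an empty string is treated as "not set"
        return explicit_token
    token = env.get("HF_TOKEN")
    if token:
        return token
    raise VoiceTagConfigError(
        "No HuggingFace token found. Create one at huggingface.co/settings/tokens"
    )


# The explicit value wins even when the environment variable is set.
print(resolve_hf_token("hf_explicit", env={"HF_TOKEN": "hf_from_env"}))
```

Passing env as a parameter keeps the function easy to test without touching the real process environment.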

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines on setting up the development environment, running tests, and submitting pull requests.

License

MIT -- Copyright (c) 2026 voicetag contributors
