voicetag

Speaker identification powered by pyannote and resemblyzer

Know who said what. Automatically.

What is voicetag?

voicetag is a Python library for speaker diarization and named speaker identification. It combines pyannote.audio for diarization with resemblyzer for speaker embeddings, giving you a single interface to answer: who is speaking, and when?

Enroll speakers once with a few audio samples, then identify them in any recording -- meetings, podcasts, interviews, phone calls.

Features

Dead-simple API -- enroll speakers and identify them in three lines of code
Language agnostic -- works with Hebrew, English, Mandarin, or any spoken language
Built-in overlap detection -- flags regions where multiple speakers talk simultaneously
Fast parallel processing -- concurrent embedding computation with configurable thread pools
CLI tool included -- enroll, identify, and manage profiles from the terminal
Save/load speaker profiles -- persist enrolled speakers to disk and reuse across sessions
Pydantic result models -- fully typed, validated, immutable result objects
Built-in transcription -- plug in OpenAI, Groq, Fireworks, Whisper, or Deepgram to get "who said what"

Quick Start

from voicetag import VoiceTag

vt = VoiceTag()
vt.enroll("Alice", ["alice1.wav", "alice2.wav"])
vt.enroll("Bob", ["bob1.wav"])

result = vt.identify("meeting.wav")
for segment in result.segments:
    print(f"{segment.speaker}: {segment.start:.1f}s - {segment.end:.1f}s")

Output:

Alice: 0.0s - 4.2s
Bob: 4.5s - 8.1s
Alice: 8.3s - 12.7s
UNKNOWN: 13.0s - 15.4s

Installation

pip install voicetag

For transcription support, install with a provider:

# Quote the extras so shells such as zsh do not expand the brackets
pip install "voicetag[openai]"    # OpenAI Whisper API
pip install "voicetag[groq]"      # Groq (fast Whisper)
pip install "voicetag[whisper]"   # Local Whisper (no API key needed)
pip install "voicetag[deepgram]"  # Deepgram
pip install "voicetag[all-stt]"   # All providers

voicetag requires access to the pyannote.audio speaker diarization model, which is gated behind a HuggingFace license agreement.

Prerequisites

1. Accept the pyannote model licenses on the models' HuggingFace pages.
2. Create a HuggingFace token at huggingface.co/settings/tokens
3. Set the token via environment variable or config:

export HF_TOKEN="hf_your_token_here"

Or pass it directly:

from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(hf_token="hf_your_token_here"))

GPU Acceleration (optional)

For faster processing on CUDA or Apple Silicon:

vt = VoiceTag(config=VoiceTagConfig(device="cuda"))  # NVIDIA GPU
vt = VoiceTag(config=VoiceTagConfig(device="mps"))    # Apple Silicon

CLI Usage

voicetag ships with a full-featured command-line interface.

Enroll a speaker

voicetag enroll "Alice" alice_sample1.wav alice_sample2.wav

Enrolled speaker Alice from 2 sample(s).
Profiles saved to voicetag_profiles.json

Identify speakers in a recording

voicetag identify meeting.wav --threshold 0.8

Speaker Timeline -- meeting.wav
+---------+----------+----------+----------+------------+
| Speaker | Start    | End      | Duration | Confidence |
+---------+----------+----------+----------+------------+
| Alice   | 00:00.00 | 00:04.20 | 00:04.20 | 0.92       |
| Bob     | 00:04.50 | 00:08.10 | 00:03.60 | 0.87       |
| Alice   | 00:08.30 | 00:12.70 | 00:04.40 | 0.91       |
| UNKNOWN | 00:13.00 | 00:15.40 | 00:02.40 | --         |
+---------+----------+----------+----------+------------+

Summary
  Total duration:  00:15.400
  Speakers:        2
  Segments:        4
  Overlaps:        0

Save results to JSON:

voicetag identify meeting.wav --output results.json

Manage profiles

voicetag profiles list
voicetag profiles remove "Alice"

Transcribe (speaker + text)

voicetag transcribe meeting.wav --provider openai --language en
voicetag transcribe meeting.wav --provider groq --language he
voicetag transcribe meeting.wav --provider whisper --model base

Transcript -- meeting.wav
+---------+----------+----------+------------------------------------+
| Speaker | Start    | End      | Text                               |
+---------+----------+----------+------------------------------------+
| Alice   | 00:00.00 | 00:04.20 | Let's start the meeting            |
| Bob     | 00:04.50 | 00:08.10 | Sure, I have the agenda ready      |
| Alice   | 00:08.30 | 00:12.70 | Great, let's go through it         |
+---------+----------+----------+------------------------------------+

List available providers:

voicetag providers

All CLI options

voicetag --help
voicetag identify --help

| Option | Description |
|---|---|
| `--profiles PATH` | Path to speaker profiles file (default: `voicetag_profiles.json`) |
| `--output, -o PATH` | Save results as JSON |
| `--threshold FLOAT` | Similarity threshold override (0.0-1.0) |
| `--hf-token TEXT` | HuggingFace API token |
| `--device TEXT` | Torch device: `cpu`, `cuda`, `mps` |
| `--unknown-only` | Skip speaker matching, just diarize |

API Reference

`VoiceTag`

The main entry point. Wraps the full diarization + identification pipeline.

from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(...))

| Method | Returns | Description |
|---|---|---|
| `enroll(name, audio_paths)` | `SpeakerProfile` | Register a speaker from one or more audio files |
| `identify(audio_path)` | `DiarizationResult` | Run full identification pipeline on an audio file |
| `save(path)` | `None` | Save enrolled speaker profiles to disk |
| `load(path)` | `None` | Load speaker profiles from disk |
| `remove_speaker(name)` | `None` | Remove an enrolled speaker by name |
| `enrolled_speakers` | `list[str]` | Property: list of enrolled speaker names |
| `transcribe(audio_path, provider, ...)` | `TranscriptResult` | Identify speakers and transcribe what they said |

Transcription example

result = vt.transcribe("meeting.wav", provider="openai", language="en")

for seg in result.segments:
    print(f"[{seg.speaker}] {seg.text}")

# Full transcript
print(result.full_transcript)

# Group by speaker
for speaker, segments in result.by_speaker.items():
    print(f"\n{speaker}:")
    for seg in segments:
        print(f"  {seg.text}")

Supported providers: openai, groq, fireworks, whisper (local), deepgram

`VoiceTagConfig`

Configuration model (Pydantic v2, frozen/immutable).

config = VoiceTagConfig(
    hf_token="hf_...",          # HuggingFace token (or set HF_TOKEN env var)
    similarity_threshold=0.75,  # min cosine similarity for a match
    overlap_threshold=0.5,      # min overlap ratio to flag
    max_workers=4,              # parallel embedding threads
    min_segment_duration=0.5,   # discard segments shorter than this (seconds)
    device="cpu",               # "cpu", "cuda", or "mps"
)
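
As a rough illustration of what `max_workers` controls, the sketch below maps an embedding function over segments with a standard-library thread pool. It is not voicetag's internal code: `embed_segment` and `embed_all` are hypothetical names, and the "embedding" is a fake two-element vector so the example runs without any models.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_segment(segment):
    """Hypothetical stand-in for a real embedding call; returns a tiny fake vector."""
    start, end = segment
    return [end - start, start]


def embed_all(segments, max_workers=4):
    """Embed every segment concurrently using at most max_workers threads."""
    # executor.map preserves input order, so results line up with segments.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_segment, segments))


segments = [(0.0, 4.2), (4.5, 8.1), (8.3, 12.7)]
embeddings = embed_all(segments, max_workers=2)
print(len(embeddings))  # 3
```

Threads are a reasonable fit here on the assumption that the heavy work happens in native code that releases the GIL; a lower `max_workers` trades speed for less memory pressure.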

Result Models

DiarizationResult -- returned by identify():

| Field | Type | Description |
|---|---|---|
| `segments` | `list[SpeakerSegment \| OverlapSegment]` | Ordered timeline of speaker segments |
| `audio_duration` | `float` | Total audio length in seconds |
| `num_speakers` | `int` | Number of distinct speakers detected |
| `processing_time` | `float` | Wall-clock pipeline time in seconds |

SpeakerSegment:

| Field | Type | Description |
|---|---|---|
| `speaker` | `str` | Identified speaker name or `"UNKNOWN"` |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `confidence` | `float` | Cosine similarity score (0.0-1.0) |
| `duration` | `float` | Property: `end - start` |

OverlapSegment:

| Field | Type | Description |
|---|---|---|
| `speakers` | `list[str]` | Names of overlapping speakers |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `speaker` | `Literal["OVERLAP"]` | Always `"OVERLAP"` |
| `duration` | `float` | Property: `end - start` |
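
Because both segment types expose `speaker`, `start`, `end`, and `duration`, a mixed timeline can be processed uniformly. The sketch below totals speaking time per label. The `Segment` dataclass is a hypothetical stand-in for the real models, used only so the example runs without voicetag installed, and `talk_time` is an illustrative helper, not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Stand-in carrying the fields shared by SpeakerSegment and OverlapSegment."""
    speaker: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def talk_time(segments):
    """Sum seconds per label; "UNKNOWN" and "OVERLAP" are counted as their own labels."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


timeline = [
    Segment("Alice", 0.0, 4.0),
    Segment("Bob", 4.5, 8.0),
    Segment("OVERLAP", 8.0, 9.0),
    Segment("Alice", 9.0, 12.5),
]
totals = talk_time(timeline)
print(totals["Alice"])  # 7.5
```

With a real result, the same loop should work over `result.segments`, assuming the fields listed in the tables above.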

SpeakerProfile:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Speaker name |
| `embedding` | `list[float]` | 256-dimensional mean embedding vector |
| `num_samples` | `int` | Number of audio files used for enrollment |
| `created_at` | `datetime` | UTC timestamp of enrollment |
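
The "mean embedding" in `SpeakerProfile` can be pictured as an element-wise average of the per-sample embeddings. The sketch below shows that averaging step on toy vectors. `mean_embedding` is a hypothetical helper, and whether voicetag also normalizes the result is not stated here, so no normalization is applied.

```python
def mean_embedding(sample_embeddings):
    """Average several equal-length embedding vectors element by element."""
    if not sample_embeddings:
        raise ValueError("at least one sample embedding is required")
    dim = len(sample_embeddings[0])
    if any(len(vec) != dim for vec in sample_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(sample_embeddings)
    return [sum(vec[i] for vec in sample_embeddings) / count for i in range(dim)]


# Two toy 2-dimensional "embeddings" standing in for 256-dimensional ones.
profile_vector = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
print(profile_vector)  # [0.5, 0.5]
```

Averaging over several samples is why enrolling with more than one audio file tends to give a more stable profile than a single short clip.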

Error Handling

All exceptions inherit from VoiceTagError:

from voicetag import VoiceTagError

try:
    result = vt.identify("audio.wav")
except VoiceTagError as e:
    print(f"Error: {e}")

| Exception | When |
|---|---|
| `VoiceTagConfigError` | Invalid config or missing HuggingFace token |
| `EnrollmentError` | Enrollment fails (no audio, bad format) |
| `DiarizationError` | Pyannote processing failure |
| `AudioLoadError` | Audio file not found or unsupported format |

Real-World Use Cases

Podcasts -- automatically label host vs. guest segments for transcription
Interviews -- separate interviewer and interviewee speech for analysis
Meeting recordings -- identify who said what in team meetings, generate per-speaker summaries
Court recordings -- tag judge, attorney, and witness speech segments
Call centers -- distinguish agent from customer in call recordings for QA
Media monitoring -- track specific speakers across broadcast recordings

How It Works

voicetag runs a three-stage pipeline:

Audio File
    |
    v
1. DIARIZE (pyannote.audio)
   "When does each speaker talk?"
   -> segments: [(0.0-4.2, SPEAKER_00), (4.5-8.1, SPEAKER_01), ...]
    |
    v
2. EMBED (resemblyzer)
   "What does each speaker sound like?"
   -> 256-dim embedding vector per segment (computed in parallel)
    |
    v
3. MATCH (cosine similarity)
   "Which enrolled speaker does this sound like?"
   -> Alice (0.92), Bob (0.87), UNKNOWN (below threshold)
    |
    v
DiarizationResult with named speaker timeline

1. Diarize -- pyannote.audio segments the audio into speaker turns with anonymous labels (SPEAKER_00, SPEAKER_01, etc.).
2. Embed -- resemblyzer computes a 256-dimensional voice embedding for each segment, running in parallel via a thread pool.
3. Match -- each embedding is compared against enrolled speaker profiles using cosine similarity. Matches above the threshold are assigned the speaker's name; all others are labeled "UNKNOWN".
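
The matching stage can be sketched in a few lines of plain Python. This is an illustration of cosine-similarity matching under stated assumptions, not voicetag's actual implementation: `cosine_similarity` and `match_speaker` are hypothetical names, the vectors are 2-dimensional toys, and treating a score exactly equal to the threshold as a match is a guess.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, profiles, threshold=0.75):
    """Return (name, score) for the closest profile, or ("UNKNOWN", score)."""
    best_name, best_score = "UNKNOWN", -1.0
    for name, reference in profiles.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    # Assumption: a score equal to the threshold counts as a match.
    if best_score < threshold:
        return "UNKNOWN", best_score
    return best_name, best_score


profiles = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
print(match_speaker([0.9, 0.1], profiles)[0])  # Alice
print(match_speaker([1.0, 1.0], profiles)[0])  # UNKNOWN
```

Raising the threshold reduces false matches at the cost of more "UNKNOWN" segments; lowering it does the opposite.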

Overlap detection runs in parallel with matching, identifying regions where two or more speakers talk simultaneously.
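
One simple way to find such regions is to intersect every pair of speaker turns. The sketch below does that on `(start, end, speaker)` tuples. It is an assumption about the general technique rather than voicetag's code: `find_overlaps` is a hypothetical helper, it reports overlaps pairwise, and it does not model `overlap_threshold` or the parallel execution mentioned above.

```python
def find_overlaps(turns):
    """Return sorted (start, end, [speakers]) regions where two different speakers overlap."""
    overlaps = []
    for i, (start_a, end_a, speaker_a) in enumerate(turns):
        for start_b, end_b, speaker_b in turns[i + 1:]:
            start, end = max(start_a, start_b), min(end_a, end_b)
            # Strict inequality: turns that merely touch do not overlap.
            if start < end and speaker_a != speaker_b:
                overlaps.append((start, end, sorted({speaker_a, speaker_b})))
    return sorted(overlaps)


turns = [(0.0, 5.0, "Alice"), (4.0, 8.0, "Bob"), (8.0, 10.0, "Alice")]
print(find_overlaps(turns))  # [(4.0, 5.0, ['Alice', 'Bob'])]
```

The pairwise loop is quadratic in the number of turns, which is usually fine for a single recording; a sweep over sorted start and end events would scale better for very long timelines.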

Comparison

| Feature | voicetag | pyannote alone | WhisperX | Manual labeling |
|---|---|---|---|---|
| Speaker diarization | Yes | Yes | Yes | N/A |
| Named speaker identification | Yes | No | No | Yes |
| Overlap detection | Yes | Yes | No | Varies |
| CLI tool | Yes | No | Yes | N/A |
| Save/load speaker profiles | Yes | N/A | N/A | N/A |
| Language agnostic | Yes | Yes | Yes | Yes |
| Typed result models | Yes (Pydantic) | No | No | N/A |
| Lines of code to identify | 3 | ~30 | ~20 | N/A |

Configuration

VoiceTagConfig controls all tunable parameters:

| Field | Type | Default | Description |
|---|---|---|---|
| `hf_token` | `Optional[str]` | `None` | HuggingFace token. Falls back to `HF_TOKEN` env var. |
| `similarity_threshold` | `float` | `0.75` | Minimum cosine similarity for a match. Range: (0.0, 1.0). |
| `overlap_threshold` | `float` | `0.5` | Minimum overlap ratio to flag as overlapping speech. |
| `max_workers` | `int` | `4` | Thread count for parallel embedding computation. |
| `min_segment_duration` | `float` | `0.5` | Segments shorter than this (seconds) are discarded. |
| `device` | `str` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"`. |

Token resolution order:

1. config.hf_token (explicit)
2. HF_TOKEN environment variable
3. If neither is set, raise VoiceTagConfigError with a link to huggingface.co/settings/tokens
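
The resolution order above could be expressed roughly as follows. This is a minimal sketch, not the library's code: `resolve_hf_token` is a hypothetical function, `ConfigError` stands in for `VoiceTagConfigError`, and the `environ` parameter exists only so the behavior can be exercised without touching the real environment.

```python
import os


class ConfigError(Exception):
    """Stand-in for voicetag's VoiceTagConfigError."""


def resolve_hf_token(explicit_token=None, environ=None):
    """Prefer the explicit token, then HF_TOKEN from the environment, else raise."""
    env = os.environ if environ is None else environ
    token = explicit_token or env.get("HF_TOKEN")
    if not token:
        raise ConfigError(
            "No HuggingFace token found; create one at huggingface.co/settings/tokens"
        )
    return token


print(resolve_hf_token("hf_explicit", {"HF_TOKEN": "hf_from_env"}))  # hf_explicit
print(resolve_hf_token(None, {"HF_TOKEN": "hf_from_env"}))           # hf_from_env
```

Checking the explicit value first lets a script override a machine-wide `HF_TOKEN` without editing the environment.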

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines on setting up the development environment, running tests, and submitting pull requests.

License

Release history

0.2.0 (this version) -- Mar 16, 2026
0.1.1 -- Mar 16, 2026
0.1.0 -- Mar 16, 2026
