VoxID

Voice Identity Management Platform — a local-first Python library, CLI, and REST API for managing persistent voice identities across multiple TTS engines.

VoxID sits between your application and TTS engines. It introduces voice identities — named entities that own multiple voice styles, each backed by precomputed speaker embeddings, versioned on disk, and automatically selected based on text content.
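
As a rough mental model, the relationship between an identity and its styles can be sketched with plain dataclasses. This is an illustration only: the class and field names below (Identity, Style, embedding_version) are assumptions made for the example and are not VoxID's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class Style:
    """One register of a voice, e.g. 'conversational' (hypothetical type)."""
    id: str
    ref_audio: str              # path to the reference recording
    ref_text: str               # transcript of the reference recording
    embedding_version: int = 1  # assumed: incremented on re-enrollment


@dataclass
class Identity:
    """A named voice that owns several styles (hypothetical type)."""
    id: str
    name: str
    styles: dict[str, Style] = field(default_factory=dict)

    def add_style(self, style: Style) -> None:
        # Styles are keyed by their id, so adding the same id replaces it.
        self.styles[style.id] = style


alice = Identity(id="alice", name="Alice")
alice.add_style(
    Style(
        id="conversational",
        ref_audio="samples/alice_casual.wav",
        ref_text="This is how I normally speak in conversation.",
    )
)
```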

Features

  • Multi-style voice identities — named entities with multiple registers (conversational, technical, narration, emphatic), persisted as TOML + SafeTensors
  • Three-tier style routing — rule-based (~0ms) → semantic MLP classifier (~10ms) → centroid fallback (~15ms) with SQLite LRU cache
  • Engine-agnostic generation — single API across Qwen3-TTS, Fish Speech, CosyVoice2, IndexTTS-2, and Chatterbox
  • Segment-level routing — long-form text is split at prosodic boundaries, each segment routed independently with smoothing to prevent style thrashing
  • Context-aware generation — rolling-window context tracking for prosodic continuity across long documents with SSML conditioning and adaptive pause durations
  • Unified tokenizer — engine-agnostic speaker representation combining acoustic (WavTokenizer) and semantic (HuBERT) tokens with linear projection to engine-specific embeddings
  • Synthesis detection — anti-spoofing ensemble (AASIST + RawNet2 + LCNN) with diffusion artifact analysis for deepfake detection
  • Cross-lingual identity — voice generation across 10+ languages while maintaining speaker identity consistency
  • Multi-GPU serving — async GPU dispatcher with round-robin and least-loaded strategies, per-worker queue management, and vLLM plugin integration
  • Portable .voxid archives — HMAC-signed archives with consent records for identity transfer and backup
  • AudioSeal watermarking — provenance tracking embedded in generated audio (optional, requires audioseal)
  • Scripted voice enrollment — guided recording with phonetically balanced prompts, real-time quality feedback, adaptive phoneme coverage tracking, and multi-sample fusion
  • Web enrollment UI — browser-based enrollment with real-time waveform visualization, quality meters, and session persistence
  • Voice drift detection — cosine similarity monitoring against enrollment baseline with re-enrollment recommendations
  • Re-enrollment health checks — age-based and drift-based triggers for enrollment refresh
  • Video pipeline integration — SceneManifest contract for Manim and Remotion with word-level timing
  • Prompt-as-cache architecture — engine-specific prompts are a derived cache; switching engines rebuilds the cache, not the enrollment
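
To make the three-tier routing idea concrete, here is a minimal sketch of a cascade that tries cheap tiers first and falls through when confidence is too low. It is not VoxID's router: the keyword list, the word-count heuristic standing in for the classifier, and the function names are all invented for illustration, and the acceptance rule (first tier at or above a threshold wins) is an assumption.

```python
from typing import Callable, Optional

# A tier returns (style, confidence), or None when it has no opinion.
Tier = Callable[[str], Optional[tuple[str, float]]]


def rule_tier(text: str) -> Optional[tuple[str, float]]:
    """Cheap keyword rules (illustrative keywords, not VoxID's rule set)."""
    technical_terms = ("latency", "status code", "gradient")
    if any(term in text.lower() for term in technical_terms):
        return "technical", 0.92
    return None


def classifier_tier(text: str) -> Optional[tuple[str, float]]:
    """Stand-in for a semantic classifier: long sentences read as narration."""
    if len(text.split()) > 12:
        return "narration", 0.85
    return None


def fallback_tier(text: str) -> tuple[str, float]:
    """Stand-in for the centroid fallback: always answers, low confidence."""
    return "conversational", 0.5


def route(text: str, threshold: float = 0.8) -> dict:
    """Try tiers in order and accept the first result at or above threshold."""
    tiers: list[tuple[str, Tier]] = [
        ("rule-based", rule_tier),
        ("classifier", classifier_tier),
    ]
    for name, tier in tiers:
        result = tier(text)
        if result is not None and result[1] >= threshold:
            return {"style": result[0], "confidence": result[1], "tier": name}
    style, confidence = fallback_tier(text)
    return {"style": style, "confidence": confidence, "tier": "fallback"}
```

Ordering the tiers from cheapest to most expensive means most inputs never reach the slower tiers, which is the point of the latency figures quoted in the feature list.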

Supported Engines

Engine Slug Streaming Emotion Control Languages
Qwen3-TTS qwen3-tts 10 (en, zh, ja, ko, de, fr, ru, pt, es, it)
Fish Speech fish-speech Yes 10 (en, zh, ja, ko, es, pt, ar, ru, fr, de)
CosyVoice2 cosyvoice2 Yes 9 (en, zh, ja, ko, de, fr, ru, pt, es)
IndexTTS-2 indextts2 Yes Yes (disentangled) 2 (en, zh)
Chatterbox chatterbox Yes Paralinguistic tags 22
Stub stub Yes 3 (en, zh, ja) — sine wave, no model needed

Engines are optional dependencies. Install only what you need:

uv add "voxid[qwen3-tts]"       # CUDA/MPS
uv add "voxid[qwen3-tts-mlx]"   # Apple Silicon via mlx-audio

Installation

Requires Python 3.12+.

# Core library (includes stub adapter for testing)
uv add voxid

# With Qwen3-TTS on Apple Silicon
uv add "voxid[qwen3-tts-mlx]"

# Development
git clone https://github.com/Mathews-Tom/VoxID.git
cd VoxID
uv sync --all-extras --group dev

Quickstart

Python Library

from voxid import VoxID

vox = VoxID()

# Create an identity
vox.create_identity(id="alice", name="Alice")

# Add a voice style with reference audio
vox.add_style(
    identity_id="alice",
    id="conversational",
    label="Conversational",
    description="Warm, relaxed, natural pacing",
    ref_audio="samples/alice_casual.wav",
    ref_text="This is how I normally speak in conversation.",
)

# Or enroll with guided prompts (creates session + generates prompts)
session = vox.enroll("alice", ["conversational", "technical"])

# Generate — style is auto-routed from text content
audio_path, sr = vox.generate(
    text="Let me walk you through how this works.",
    identity_id="alice",
)

# Dry-run routing
decision = vox.route(text="The p99 latency increased after the migration.", identity_id="alice")
# {'style': 'technical', 'confidence': 0.92, 'tier': 'rule-based', 'scores': {...}}
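
One way to use a dry-run decision is to fall back to a known style when the router is unsure. The helper below is hypothetical and works only on the dictionary shape shown in the comment above; the 0.8 cutoff simply mirrors the router_confidence_threshold value from the Configuration section.

```python
def pick_style(decision: dict, default: str = "conversational",
               min_confidence: float = 0.8) -> str:
    """Return the routed style, or `default` when confidence is too low."""
    if decision.get("confidence", 0.0) >= min_confidence:
        return decision["style"]
    return default


# Same shape as the routing decision shown in the quickstart comment.
decision = {"style": "technical", "confidence": 0.92, "tier": "rule-based"}
chosen = pick_style(decision)
```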

CLI

# Create identity and add a style
voxid identity create alice --name "Alice"
voxid style add alice conversational \
    --audio samples/alice_casual.wav \
    --transcript "This is how I normally speak." \
    --description "Warm, relaxed, natural pacing"

# Enroll with guided recording (interactive)
voxid enroll alice --styles conversational,technical

# Enroll from pre-recorded audio (non-interactive)
voxid enroll alice --styles conversational --import-audio ./recordings/

# Generate audio
voxid generate "Hello, welcome to the demo." --identity alice

# Generate with explicit style
voxid generate "The API returns a 429 status code." --identity alice --style technical

# Long-form segment generation
voxid generate --file script.txt --identity alice --segments

# Generate from a scene manifest
voxid generate --manifest scenes.json --identity alice

# Check routing decision without generating
voxid route "Breaking news from the lab." --identity alice

# Export/import identities
voxid export alice alice_backup.voxid --key my-signing-key
voxid import alice_backup.voxid --key my-signing-key

# Start the REST API server
voxid serve --port 8765

# Start with multi-GPU dispatch
voxid serve --port 8765 --config serving.toml

# Enroll with cross-lingual support
voxid enroll alice --styles conversational --language zh
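
The --key option on export and import corresponds to the HMAC signing mentioned in the feature list. The general technique can be sketched with the standard library as follows. The hash function (SHA-256), the key encoding, and the idea of signing the raw archive bytes are assumptions for this example; the actual .voxid layout is not described here.

```python
import hashlib
import hmac


def sign(payload: bytes, key: str) -> str:
    """Return a hex HMAC-SHA256 tag for the payload."""
    return hmac.new(key.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: str, tag: str) -> bool:
    """Constant-time check that the tag matches the payload and key."""
    return hmac.compare_digest(sign(payload, key), tag)


archive = b"example archive bytes"
tag = sign(archive, "my-signing-key")
```

Verification fails if either the bytes or the key differ, which is why the same key must be passed to both export and import.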

REST API

# Start the server
voxid serve

# Create identity
curl -X POST http://localhost:8765/api/identities \
  -H "Content-Type: application/json" \
  -d '{"id": "alice", "name": "Alice"}'

# Generate audio
curl -X POST http://localhost:8765/api/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world.", "identity_id": "alice"}'

# Route without generating
curl -X POST http://localhost:8765/api/route \
  -H "Content-Type: application/json" \
  -d '{"text": "The gradient exploded during training.", "identity_id": "alice"}'

# Create enrollment session
curl -X POST http://localhost:8765/api/enroll/sessions \
  -H "Content-Type: application/json" \
  -d '{"identity_id": "alice", "styles": ["conversational"], "prompts_per_style": 5}'

# Upload audio sample
curl -X POST http://localhost:8765/api/enroll/sessions/{id}/samples \
  -F "file=@recording.wav"

# Multi-GPU serving health
curl http://localhost:8765/api/v1/serving/health
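
The same POST requests can be built from Python with only the standard library. This sketch constructs the request without sending it; the helper name build_request is made up for the example, the response format is not documented here, and no authentication header is set because its name is not specified in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for a VoxID endpoint (does not send it)."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "/api/route",
    {"text": "The gradient exploded during training.", "identity_id": "alice"},
)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```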

Set VOXID_API_KEY to enable API key authentication. Set VOXID_RATE_LIMIT and VOXID_RATE_WINDOW to configure rate limiting on generation endpoints.

Docker

docker build -t voxid .
docker run -p 8765:8765 -v ~/.voxid:/data/voxid voxid

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                        Consumer Layer                            │
│   Python Library  │  REST API  │  CLI  │  Web UI  │  VoiceBox   │
└────────┬──────────┴─────┬──────┴───┬───┴─────┬────┴──────┬──────┘
         │                │          │         │           │
┌────────▼────────────────▼──────────▼─────────▼───────────▼──────┐
│                         VoxID Core                              │
│  ┌──────────────┐ ┌─────────────┐ ┌──────────────────────────┐  │
│  │  Identity    │ │   Style     │ │   Generation Dispatcher  │  │
│  │  Registry    │ │   Router    │ │   + Context Conditioner  │  │
│  └──────┬───────┘ └──────┬──────┘ └────────┬─────────────────┘  │
│         │           3-tier│                 │                    │
│  ┌──────▼──────────┐  ┌──▼──────────┐  ┌───▼─────────────────┐  │
│  │   Enrollment    │  │  Unified    │  │  Voice Prompt Store │  │
│  │   Pipeline      │  │  Tokenizer  │  │  (TOML+SafeTensors) │  │
│  └──────┬──────────┘  └─────────────┘  └───────────┬─────────┘  │
│         │                                          │            │
│  ┌──────▼──────────────────────────────────────────▼─────────┐  │
│  │   Security: Spoofing Detection │ Consent │ Drift │ Seal   │  │
│  └───────────────────────────────────────────────────────────┘  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│                  GPU Dispatcher / Engine Adapters                │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Multi-GPU Serving (vLLM): round-robin / least-loaded   │   │
│  └────┬────────────┬────────────┬────────────┬──────────────┘   │
│  Qwen3-TTS │ Fish Speech │ CosyVoice2 │ IndexTTS-2 │ Chatterbox│
└─────────────────────────────────────────────────────────────────┘
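
The two dispatch strategies named in the diagram can be illustrated with a small sketch. The Worker type and the use of queue length as the load measure are assumptions; VoxID's dispatcher is async and may measure load differently.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    queued: int = 0  # number of requests waiting on this worker


class RoundRobin:
    """Hand out workers in a fixed rotating order, ignoring load."""

    def __init__(self, workers: list[Worker]) -> None:
        self._cycle = itertools.cycle(workers)

    def pick(self) -> Worker:
        return next(self._cycle)


class LeastLoaded:
    """Pick the worker with the shortest queue (first one wins ties)."""

    def __init__(self, workers: list[Worker]) -> None:
        self._workers = workers

    def pick(self) -> Worker:
        return min(self._workers, key=lambda w: w.queued)


workers = [Worker("gpu0", queued=3), Worker("gpu1", queued=1)]
round_robin = RoundRobin(workers)
least_loaded = LeastLoaded(workers)
```

Round-robin is predictable and needs no load information; least-loaded adapts better when request durations vary widely.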

Storage layout:

~/.voxid/
├── config.toml
├── serving.toml                           # multi-GPU dispatch config (optional)
├── identities/
│   └── alice/
│       ├── identity.toml
│       ├── consent.json
│       ├── consent_audio.wav              # recorded consent (enrollment)
│       └── styles/
│           └── conversational/
│               ├── style.toml
│               ├── ref_audio.wav          # source of truth
│               ├── ref_text.txt           # source of truth
│               ├── tokenized.safetensors  # unified speaker tokens (optional)
│               └── prompts/               # derived cache
│                   ├── qwen3-tts.safetensors
│                   └── fish-speech.safetensors
├── enrollment_sessions/                   # resumable enrollment state
│   └── {session_id}.json
├── projections/                           # engine projector weights
│   └── {engine}.safetensors
├── cache/
│   └── router/
│       └── router_cache.db
└── output/
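
Because prompts are a derived cache, their location follows directly from the layout above. The helpers below are a sketch, not VoxID functions: the names are invented, and the staleness rule (rebuild when the prompt file is missing or older than ref_audio.wav) is an assumption about how such a cache could be invalidated.

```python
from pathlib import Path


def style_dir(store: Path, identity_id: str, style_id: str) -> Path:
    """Directory holding a style's reference audio, text, and prompt cache."""
    return store / "identities" / identity_id / "styles" / style_id


def prompt_cache_path(store: Path, identity_id: str, style_id: str,
                      engine: str) -> Path:
    """Location of the derived, engine-specific prompt file."""
    prompts = style_dir(store, identity_id, style_id) / "prompts"
    return prompts / f"{engine}.safetensors"


def needs_rebuild(store: Path, identity_id: str, style_id: str,
                  engine: str) -> bool:
    """True when the prompt is missing or older than the reference audio."""
    cache = prompt_cache_path(store, identity_id, style_id, engine)
    if not cache.exists():
        return True
    ref = style_dir(store, identity_id, style_id) / "ref_audio.wav"
    return ref.exists() and ref.stat().st_mtime > cache.stat().st_mtime


store = Path.home() / ".voxid"
path = prompt_cache_path(store, "alice", "conversational", "qwen3-tts")
```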

Configuration

VoxID reads ~/.voxid/config.toml:

store_path = "~/.voxid"
default_engine = "qwen3-tts"
router_confidence_threshold = 0.8
cache_ttl_seconds = 3600
max_embedding_versions = 3

Environment Variables

Variable            Description                                        Default
VOXID_API_KEY       API key for REST authentication                    (unset = open access)
VOXID_RATE_LIMIT    Max requests per window on /generate* endpoints    60
VOXID_RATE_WINDOW   Rate limit window in seconds                       60
VOXID_STORE_PATH    Override store path (used by Docker)
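
To show how a limit and a window interact, here is a minimal fixed-window limiter driven by the two rate-limit variables. It is a sketch of one common technique, not VoxID's implementation: the class name is invented, the algorithm VoxID actually uses is not stated here, and this version tracks a single global counter in one process.

```python
import os
import time


class FixedWindowLimiter:
    """Allow at most `limit` calls per `window` seconds (single process)."""

    def __init__(self, limit: int, window: float, clock=time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock
        self._window_start = clock()
        self._count = 0

    def allow(self) -> bool:
        now = self._clock()
        if now - self._window_start >= self.window:
            # A new window has begun: reset the counter.
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False


limiter = FixedWindowLimiter(
    limit=int(os.environ.get("VOXID_RATE_LIMIT", "60")),
    window=float(os.environ.get("VOXID_RATE_WINDOW", "60")),
)
```

With the defaults, the 61st request inside a 60-second window would be rejected, and the counter resets once the window elapses.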

Documentation

Document          Description
Usage Guide       CLI, Python library, REST API, segments, manifests, video integration
Developer Guide   Setup, project structure, testing, writing adapters, contributing
System Design     Architecture, data model, router algorithms, security
Overview          Product overview, market analysis, technology landscape

License

Apache-2.0 — see LICENSE for details.
