LocalKin Service Audio
Local Voice AI Platform - Speech-to-Text and Text-to-Speech with Chinese language support, voice cloning, and Claude integration via MCP.
What's New in v2.0.10
- 12 new models (29 → 40 total): Whisper large-v3-turbo, Parakeet v3, Canary v2/Qwen, CosyVoice2, Orpheus TTS, Qwen3-TTS, Dia
- Removed: ~2,500 lines of legacy v1.x code — fully migrated to the v2.0 `ModelRegistry` and `AudioEngine`
- Fixed: build failure (#1), bare `except:` clauses, version mismatch, HeartMuLa hardcoded path
See CHANGELOG.md for full history.
Features
- Multiple STT Engines: Whisper, faster-whisper, whisper.cpp, SenseVoice, Paraformer
- Multiple TTS Engines: Kokoro, CosyVoice, ChatTTS, F5-TTS, native OS
- Music Generation: HeartMuLa (multilingual, tag-based), MusicGen
- Chinese Language Support: Optimized models for Mandarin, Cantonese, and mixed Chinese-English
- Voice Cloning: Zero-shot voice cloning with F5-TTS and CosyVoice
- MCP Integration: Use with Claude Code and Claude Desktop
- WebSocket Streaming: Real-time transcription and synthesis
- REST API: FastAPI-based server with OpenAPI docs
Quick Start
# Install (uv recommended)
uv pip install localkin-service-audio
# Get model recommendations for your hardware
kin audio recommend
# View configuration
kin audio config
# Transcribe audio
kin audio transcribe audio.wav
# Text-to-speech
kin audio tts "Hello world"
# Generate music (with Chinese support!)
kin audio music generate "在月光下弹钢琴" # Chinese lyrics
kin audio music generate "happy wedding" --tags "piano,romantic,wedding" --model heartmula:3b
# Real-time listening (microphone)
kin audio listen
# Voice AI conversation
kin audio listen --llm ollama --tts --stream
# List available models
kin audio models
# Start API server
kin audio serve --port 8000
# Start web interface
kin web
Installation
Using uv (recommended — 10x faster)
This project has heavy ML dependencies (~4GB: PyTorch, Whisper, transformers). uv resolves and installs them 10-100x faster than pip.
# Install uv (one-time)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install
uv pip install localkin-service-audio
# Or from source
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio
uv sync
Using in a new terminal: the virtual environment must be activated in each new session:
# Option 1: Activate the venv
source .venv/bin/activate
kin audio models
# Option 2: Use uv run (no activation needed)
uv run kin audio models
To auto-activate, add to your ~/.zshrc or ~/.bashrc:
# Activate .venv automatically when entering a project directory
cd() { builtin cd "$@" && [ -f .venv/bin/activate ] && source .venv/bin/activate; }
Using pip
pip install localkin-service-audio
pip works but is significantly slower due to dependency resolution with large ML packages. Expect 10-30 minutes on first install.
Upgrading
# Upgrade to latest version
uv pip install --upgrade localkin-service-audio
# If upgrading from v2.0.3 or earlier, also upgrade torch (required for v2.0.4+)
uv pip install --upgrade torch torchaudio torchvision
Optional Dependencies
# Chinese language models
uv pip install localkin-service-audio[chinese]
# Voice cloning models
uv pip install localkin-service-audio[cloning]
# MCP server for Claude
uv pip install localkin-service-audio[mcp]
# All features
uv pip install localkin-service-audio[all-new]
Replace `uv pip` with `pip` if not using uv.
CLI Usage
Speech-to-Text
# Basic transcription (auto-selects best model)
kin audio transcribe audio.wav
# Specify model
kin audio transcribe audio.wav --model whisper-cpp:base
kin audio transcribe audio.wav --model faster-whisper:large-v3
kin audio transcribe audio.wav --model sensevoice:small # Chinese
# With language hint
kin audio transcribe audio.wav --language zh
# Output formats
kin audio transcribe audio.wav --format json
kin audio transcribe audio.wav --format srt --timestamps
Text-to-Speech
# Basic synthesis (uses Kokoro with af_heart voice)
kin audio tts "Hello world"
# List all available voices
kin audio tts "" --model kokoro --list-voices
# American English voices
kin audio tts "Hello world" --voice af_bella # Bella (Female)
kin audio tts "Hello world" --voice am_adam # Adam (Male)
kin audio tts "Hello world" --voice af_nova # Nova (Female)
# British English voices
kin audio tts "Good morning" --voice bf_emma # Emma (British Female)
kin audio tts "Good morning" --voice bm_george # George (British Male)
# Chinese (Mandarin) voices
kin audio tts "你好世界" --voice zf_xiaoxiao # Xiaoxiao (Chinese Female)
kin audio tts "今天天气真好" --voice zm_yunyang # Yunyang (Chinese Male)
# Japanese voices
kin audio tts "こんにちは" --voice jf_alpha # Alpha (Japanese Female)
kin audio tts "ありがとう" --voice jm_kumo # Kumo (Japanese Male)
# French, Spanish, Italian, Hindi, Portuguese
kin audio tts "Bonjour le monde" --voice ff_siwis # French
kin audio tts "Hola mundo" --voice ef_dora # Spanish
kin audio tts "Ciao mondo" --voice if_sara # Italian
kin audio tts "नमस्ते" --voice hf_alpha # Hindi
kin audio tts "Olá mundo" --voice pf_dora # Portuguese
# Adjust speech speed (0.5 = slow, 2.0 = fast)
kin audio tts "Hello world" --speed 0.8
kin audio tts "Hello world" --speed 1.5
# Save to file
kin audio tts "Hello world" --output speech.wav
# Save without auto-playing
kin audio tts "Hello world" --output speech.wav --no-play
# CosyVoice for Chinese (voice cloning capable)
kin audio tts "你好世界" --model cosyvoice:300m --voice 中文女
Music Generation
# MusicGen — text-to-music (small/medium/large)
kin audio music generate "calm piano melody"
kin audio music generate "upbeat electronic" --duration 20 --model musicgen:medium
kin audio music generate "ambient soundscape" -o ambient.wav --device mps
# HeartMuLa — multilingual with Chinese lyrics support
kin audio music generate "在月光下弹钢琴" --model heartmula:3b
kin audio music generate "happy wedding day" --tags "piano,romantic,wedding" --model heartmula:3b --duration 30
kin audio music generate "春天来了,鸟儿在唱歌" --tags "acoustic,happy,upbeat" -o spring.wav
# List music models and requirements
kin audio music models
kin audio music models --verbose
HeartMuLa style tags: piano, acoustic, electric, synthesizer, happy, sad, romantic, calm, upbeat, wedding, ambient, orchestral, rock, pop, jazz, folk, classical, cinematic
| Model | Sizes | VRAM | Languages | Duration |
|---|---|---|---|---|
| MusicGen | small (2GB), medium (4GB), large (16GB) | 2–16 GB | English | 5–30s |
| HeartMuLa | 3B (6GB), 7B (16GB) | 6–16 GB | en, zh, ja, ko, es | 5–240s |
HeartMuLa setup — auto-installs on first use, or pull in advance:
kin audio pull heartmula:3b
Real-time Listening
# Basic real-time transcription
kin audio listen
# With TTS echo
kin audio listen --tts --tts-model kokoro
# Voice AI with LLM (requires Ollama)
kin audio listen --llm ollama --tts --stream
# Custom models
kin audio listen --model sensevoice:small --language zh --tts --tts-model cosyvoice:300m
# Adjust silence detection
kin audio listen --silence-threshold 0.02 --silence-duration 2.0
Model Management
# List all models with availability status
kin audio models
# Filter by type, language, engine, or tag
kin audio models --type stt
kin audio models --type tts
kin audio models --language zh
kin audio models --engine kokoro
kin audio models --tag voice-cloning
kin audio models --search whisper
# Pull a model
kin audio pull whisper-cpp:base
kin audio pull heartmula:3b
# Remove a model
kin audio rm whisper-cpp:base
# Add a model from a template
kin audio add-model --template whisper_stt --name my-whisper
# Add a model from HuggingFace
kin audio add-model --repo openai/whisper-medium --name whisper-med --type stt
# List available model templates
kin audio list-templates
Model Recommendations
# Get hardware-aware model recommendations
kin audio recommend
# With detailed hardware info
kin audio recommend --verbose
The recommend command detects your hardware (GPU, RAM, CPU) and suggests optimal STT/TTS models for your system.
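The selection logic is internal to the CLI, but the idea can be sketched as a simple mapping from detected hardware to model names (the thresholds and pairings below are illustrative assumptions, not the actual rules `kin audio recommend` uses):

```python
# Illustrative sketch of hardware-aware model selection; the real heuristics
# in `kin audio recommend` are internal and may differ.
def recommend_models(ram_gb: float, has_gpu: bool) -> dict:
    if has_gpu and ram_gb >= 16:
        # Plenty of headroom: favor accuracy
        return {"stt": "faster-whisper:large-v3", "tts": "kokoro"}
    if ram_gb >= 8:
        # Mid-range machine: balanced CPU-friendly models
        return {"stt": "whisper-cpp:base", "tts": "kokoro"}
    # Constrained hardware: smallest footprint
    return {"stt": "whisper-cpp:tiny", "tts": "native"}

print(recommend_models(ram_gb=16, has_gpu=True))
```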
Configuration
# View configuration overview
kin audio config
# Show configuration file paths
kin audio config --path
# Show all registered models
kin audio config --models
# Initialize config directory with sample config
kin audio config --init
# Change settings
kin audio config set default_tts_model kokoro
kin audio config set default_stt_model faster-whisper:large-v3
kin audio config set api_port 9000
kin audio config set default_device cuda
Configuration files are stored in $LOCALKIN_HOME/ (default: ~/.localkin-service-audio/).
Set LOCALKIN_HOME to relocate all data (cache, config, models) to another disk:
export LOCALKIN_HOME="/path/to/large/disk/.localkin-service-audio"
System Status & Diagnostics
# Check system status (libraries, registry, cache)
kin audio status
# Show cache info
kin audio cache info
# Clear cache for a specific model
kin audio cache clear whisper-large
# Clear all cached models
kin audio cache clear
# Show running LocalKin Audio servers
kin audio ps
API Server
# Start REST API server
kin audio serve --port 8000
# Start web interface
kin web --port 5000
Supported Models
kin audio models shows all 40 models with real-time availability status:
- ✅ Ready — engine installed, usable now
- 📦 Not installed — strategy code exists; just needs a `pip install`
- 🔮 Planned — future implementation
STT Models (24)
| Model | Engine | Languages | Features | Status |
|---|---|---|---|---|
| `whisper:tiny/base/small/medium/large-v3` | OpenAI Whisper | Multilingual | Standard reference | Ready |
| `whisper:large-v3-turbo` | OpenAI Whisper | Multilingual | 6x faster than large-v3, 809M params | Ready |
| `faster-whisper:tiny/base/large-v3/turbo/distil-large-v3` | CTranslate2 | Multilingual | 4x faster, GPU | Ready |
| `faster-whisper:large-v3-turbo` | CTranslate2 | Multilingual | CTranslate2 turbo variant | Ready |
| `whisper-cpp:tiny/base/small/medium` | whisper.cpp | Multilingual | Fast CPU inference | Ready |
| `moonshine:tiny/base` | Moonshine | English | 5x real-time, ~20MB | Install needed |
| `sensevoice:small` | FunASR (Alibaba) | zh, en, ja, ko | 15x faster, emotion detection | Install needed |
| `paraformer:zh` | FunASR (Alibaba) | Chinese | Fast Chinese ASR | Install needed |
| `parakeet:0.6b` | NVIDIA NeMo | 25 languages | 10x faster than Whisper turbo | Install needed |
| `parakeet:1.1b` | NVIDIA NeMo | English | >2000x real-time | Install needed |
| `canary:1b-v2` | NVIDIA NeMo | 25 languages | Transcription + translation | Install needed |
| `canary-qwen:2.5b` | NVIDIA NeMo | English | #1 HuggingFace ASR leaderboard, STT + understanding | Install needed |
TTS Models (14)
| Model | Engine | Languages | Features | Status |
|---|---|---|---|---|
| `native` | pyttsx3 | System | No download needed | Ready |
| `kokoro` / `kokoro:82m` | Kokoro | en, es, fr, hi, it, ja, pt, zh | 54 voices, multilingual | Ready |
| `cosyvoice:300m` | CosyVoice (Alibaba) | zh, en, ja, ko, yue | Voice cloning, streaming | Install needed |
| `cosyvoice2:0.5b` | CosyVoice2 (Alibaba) | 9 langs + 18 Chinese dialects | 30-50% fewer errors than v1 | Install needed |
| `qwen3-tts:0.6b/1.7b` | Qwen3-TTS (Alibaba) | 10 langs (zh, en, ja, ko, de, fr...) | 97ms latency, 3s voice cloning, voice design | Install needed |
| `orpheus:150m/1b/3b` | Orpheus | English | Best emotional expressiveness, GGUF | Install needed |
| `dia:1.6b` | Dia | English | Multi-speaker dialogue, nonverbal sounds | Install needed |
| `chattts` | ChatTTS | zh, en | Conversational, emotion | Install needed |
| `f5-tts` | F5-TTS | en, zh | Zero-shot voice cloning | Install needed |
| `gpt-sovits` | GPT-SoVITS | zh, en, ja | Voice cloning with 5s audio | Planned |
| `parler-tts` | Parler | English | Text-described voice | Planned |
Music Models (2)
| Model | Engine | Languages | Features | Status |
|---|---|---|---|---|
| `musicgen:small/medium/large` | MusicGen (Meta) | English | Text-to-music, 5–30s | Install needed |
| `heartmula:3b/7b` | HeartMuLa | en, zh, ja, ko, es | Chinese lyrics, tag control, up to 240s | Install needed |
Kokoro Voice Reference
Kokoro supports 54 voices across 9 languages. Voice IDs follow the pattern {lang}{gender}_{name}:
| Prefix | Language | Example Voices |
|---|---|---|
| `af_` | American English (Female) | af_heart, af_bella, af_nova, af_sarah, af_sky |
| `am_` | American English (Male) | am_adam, am_michael, am_echo, am_puck |
| `bf_` | British English (Female) | bf_emma, bf_alice, bf_lily, bf_isabella |
| `bm_` | British English (Male) | bm_george, bm_lewis, bm_daniel, bm_fable |
| `zf_` | Chinese Mandarin (Female) | zf_xiaoxiao, zf_xiaobei, zf_xiaoni, zf_xiaoyi |
| `zm_` | Chinese Mandarin (Male) | zm_yunyang, zm_yunxi, zm_yunjian, zm_yunxia |
| `jf_` | Japanese (Female) | jf_alpha, jf_nezumi, jf_gongitsune, jf_tebukuro |
| `jm_` | Japanese (Male) | jm_kumo |
| `ff_` | French (Female) | ff_siwis |
| `ef_` | Spanish (Female) | ef_dora |
| `em_` | Spanish (Male) | em_alex |
| `hf_` | Hindi (Female) | hf_alpha, hf_beta |
| `hm_` | Hindi (Male) | hm_omega, hm_psi |
| `if_` | Italian (Female) | if_sara |
| `im_` | Italian (Male) | im_nicola |
| `pf_` | Portuguese (Female) | pf_dora |
| `pm_` | Portuguese (Male) | pm_alex |
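The `{lang}{gender}_{name}` convention above is regular enough to decode programmatically. A minimal sketch (the prefix-to-language map is transcribed from the table; `parse_voice_id` is a hypothetical helper, not part of the library):

```python
# Language codes taken from the Kokoro voice table above.
PREFIX_LANGS = {
    "a": "American English", "b": "British English", "z": "Chinese Mandarin",
    "j": "Japanese", "f": "French", "e": "Spanish", "h": "Hindi",
    "i": "Italian", "p": "Portuguese",
}

def parse_voice_id(voice_id: str) -> dict:
    """Split a Kokoro voice ID like 'zf_xiaoxiao' into language, gender, name."""
    prefix, _, name = voice_id.partition("_")
    lang_code, gender_code = prefix[0], prefix[1]
    return {
        "language": PREFIX_LANGS.get(lang_code, "unknown"),
        "gender": "Female" if gender_code == "f" else "Male",
        "name": name,
    }

print(parse_voice_id("zf_xiaoxiao"))
# → {'language': 'Chinese Mandarin', 'gender': 'Female', 'name': 'xiaoxiao'}
```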
Python API
from localkin_service_audio import AudioEngine, transcribe, synthesize
# Quick functions
result = transcribe("audio.wav", model="whisper-cpp:base")
print(result.text)
audio = synthesize("Hello world", model="kokoro")
audio.save("output.wav")
# Full engine control
engine = AudioEngine()
# Load and use STT
engine.load_stt("whisper-cpp:base")
result = engine.transcribe("audio.wav", language="en")
print(f"Text: {result.text}")
print(f"Language: {result.language}")
# Load and use TTS - English
engine.load_tts("kokoro")
audio = engine.synthesize("Hello world", voice="af_heart")
audio.save("english.wav")
# TTS - Chinese (auto-selects Chinese pipeline)
audio = engine.synthesize("你好世界", voice="zf_xiaoxiao")
audio.save("chinese.wav")
# TTS - Japanese
audio = engine.synthesize("こんにちは世界", voice="jf_alpha")
audio.save("japanese.wav")
# TTS - with speed control
audio = engine.synthesize("Hello", voice="am_adam", speed=0.8)
audio.save("slow.wav")
# List available voices
voices = engine.list_voices()
for v in voices:
print(f"{v.id}: {v.name} ({v.language}, {v.gender})")
# Voice cloning (with supported models)
engine.load_tts("f5-tts")
audio = engine.clone_voice(
reference_audio="reference.wav",
text="Text to speak in cloned voice"
)
MCP Integration
Use LocalKin Audio with Claude Code or Claude Desktop:
# Start MCP server
kin mcp
Add to Claude Desktop config (~/.config/claude/claude_desktop_config.json):
{
"mcpServers": {
"localkin-audio": {
"command": "kin",
"args": ["mcp"]
}
}
}
Available MCP tools:
- `transcribe_audio` - Transcribe audio files
- `synthesize_speech` - Generate speech from text
- `clone_voice` - Clone voice from reference audio
- `list_models` - List available models
- `list_voices` - List available voices
REST API
Start the server:
kin audio serve --port 8000
Endpoints
POST /transcribe - Transcribe audio
curl -X POST "http://localhost:8000/transcribe" \
-F "file=@audio.wav" \
-F "model=whisper-cpp:base" \
-F "language=en"
POST /synthesize - Synthesize speech
curl -X POST "http://localhost:8000/synthesize" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "model": "kokoro", "voice": "af_bella"}' \
--output speech.wav
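The same request can be made from Python with only the standard library. A sketch that mirrors the curl call above (it assumes the server started by `kin audio serve --port 8000` is running; the request shape is taken from the curl example, not from a documented client API):

```python
import json
import urllib.request

def synthesize_request(text: str, model: str = "kokoro",
                       voice: str = "af_bella") -> urllib.request.Request:
    """Build a POST /synthesize request matching the curl example above."""
    body = json.dumps({"text": text, "model": model, "voice": voice}).encode()
    return urllib.request.Request(
        "http://localhost:8000/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = synthesize_request("Hello world")
# With the server running, the response body is WAV audio:
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```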
GET /models - List models
curl "http://localhost:8000/models"
WebSocket /stream - Real-time transcription
const ws = new WebSocket("ws://localhost:8000/stream");
ws.send(audioChunk);
ws.onmessage = (e) => console.log(JSON.parse(e.data).text);
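An equivalent Python client can stream a file in chunks. This is a sketch under two assumptions: the third-party `websockets` package is installed (`uv pip install websockets`), and each server message is JSON with a `text` field, as the JavaScript example suggests.

```python
import asyncio
import json

def extract_text(message) -> str:
    """Pull the transcript out of one /stream message (shape inferred from the JS example)."""
    return json.loads(message).get("text", "")

async def stream_file(path: str, url: str = "ws://localhost:8000/stream") -> None:
    # Third-party dependency, assumed installed: uv pip install websockets
    import websockets
    async with websockets.connect(url) as ws:
        with open(path, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)           # send one audio chunk
                print(extract_text(await ws.recv()))  # print partial transcript

# asyncio.run(stream_file("audio.wav"))  # uncomment with the server running
```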
Configuration
Environment Variables
# Base directory for all data (cache, config, models)
export LOCALKIN_HOME="/Volumes/Data/.localkin-service-audio"
# Override individual directories
export LOCALKIN_CACHE_DIR="/tmp/my-cache"
export LOCALKIN_CONFIG_DIR="/path/to/config"
export LOCALKIN_MODELS_DIR="/path/to/models"
# Default engine settings
export LOCALKIN_DEFAULT_STT="faster-whisper:large-v3"
export LOCALKIN_DEFAULT_TTS="kokoro"
export LOCALKIN_DEVICE=cuda # or cpu, mps, auto
# API server
export LOCALKIN_API_HOST="127.0.0.1"
export LOCALKIN_API_PORT="8000"
Custom Models
Create $LOCALKIN_HOME/models.json (default: ~/.localkin-service-audio/models.json):
{
"models": {
"my-custom-model": {
"type": "stt",
"engine": "whisper",
"model_size": "base",
"languages": ["en", "zh"],
"description": "My custom model"
}
}
}
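Since the registry is plain JSON, custom entries can also be added from a script. A hypothetical helper (the file path and schema follow the example above; `register_model` is not a library API):

```python
import json
import tempfile
from pathlib import Path

def register_model(name: str, entry: dict, home: Path) -> dict:
    """Merge one entry into $LOCALKIN_HOME/models.json, creating it if absent."""
    path = home / "models.json"
    registry = json.loads(path.read_text()) if path.exists() else {"models": {}}
    registry["models"][name] = entry
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(registry, indent=2))
    return registry

# Demo against a temporary directory; point `home` at your real LOCALKIN_HOME instead.
demo_home = Path(tempfile.mkdtemp())
registry = register_model("my-custom-model", {
    "type": "stt", "engine": "whisper", "model_size": "base",
    "languages": ["en", "zh"], "description": "My custom model",
}, home=demo_home)
```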
Architecture
LocalKin Audio v2.0 uses a modular architecture:
- Strategy Pattern: Pluggable STT/TTS engines
- Facade Pattern: AudioEngine provides unified interface
- Registry Pattern: Centralized model configuration
- Singleton Pattern: Shared engine instance
localkin_service_audio/
├── core/
│ ├── audio_processing/
│ │ ├── engine.py # AudioEngine facade
│ │ ├── stt/ # STT strategies
│ │ │ ├── base.py
│ │ │ ├── whisper_strategy.py
│ │ │ ├── sensevoice_strategy.py
│ │ │ └── ...
│ │ └── tts/ # TTS strategies
│ │ ├── base.py
│ │ ├── kokoro_strategy.py
│ │ ├── cosyvoice_strategy.py
│ │ └── ...
│ ├── config/
│ │ └── model_registry.py # Model registry
│ └── types.py # Core dataclasses
├── cli/ # Click CLI
├── api/ # FastAPI server
├── mcp/ # MCP server
└── ui/ # Web interface
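The Strategy and Facade patterns above can be sketched in a few lines. This is an illustrative simplification, not the actual internals of `engine.py` (class and method names here are stand-ins):

```python
from abc import ABC, abstractmethod
from typing import Optional

class STTStrategy(ABC):
    """Common interface every pluggable STT engine implements (Strategy)."""
    @abstractmethod
    def transcribe(self, path: str) -> str: ...

class WhisperStrategy(STTStrategy):
    def transcribe(self, path: str) -> str:
        return f"[whisper transcript of {path}]"  # stand-in for real inference

class AudioEngineFacade:
    """Unified entry point that delegates to whichever strategy is loaded (Facade)."""
    def __init__(self) -> None:
        self._stt: Optional[STTStrategy] = None

    def load_stt(self, strategy: STTStrategy) -> None:
        self._stt = strategy

    def transcribe(self, path: str) -> str:
        if self._stt is None:
            raise RuntimeError("no STT strategy loaded")
        return self._stt.transcribe(path)

engine = AudioEngineFacade()
engine.load_stt(WhisperStrategy())
print(engine.transcribe("audio.wav"))  # → [whisper transcript of audio.wav]
```

Swapping engines then means registering a different strategy; callers only ever talk to the facade.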
Development
# Clone repository
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio
# Install with dev dependencies (uv recommended)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run linting
ruff check .
black --check .
Tip: With uv, you can skip activation and run commands directly:
uv run kin audio models
uv run pytest tests/
Troubleshooting
Model Loading Errors
# Check model is registered
kin audio models
# Pull the model
kin audio pull whisper-cpp:base
# Check system info
kin info --verbose
PyTorch Version
Requires torch >= 2.6.0. Older versions will fail to load models that only ship .bin weights (e.g. MusicGen medium/large) due to a torch.load security check (CVE-2025-32434).
# Check your version
python -c "import torch; print(torch.__version__)"
# Upgrade if needed (keep torchvision in sync)
pip install "torch>=2.6.0" "torchaudio>=2.6.0" "torchvision>=0.21"
numpy/pandas Binary Incompatibility
If you see `numpy.dtype size changed, may indicate binary incompatibility`, pandas or scikit-learn was compiled against a different numpy version:
# Fix: force-reinstall the affected packages
uv pip install --force-reinstall numpy pandas scikit-learn
# Or nuke and rebuild the venv
rm -rf .venv && uv venv && uv pip install localkin-service-audio
CUDA/GPU Issues
# Force CPU
kin audio transcribe audio.wav --device cpu
# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
HeartMuLa on Apple Silicon (MPS)
HeartMuLa 3B requires ~12-14GB. On a 16GB Mac, close memory-heavy apps before running. The codec runs on CPU automatically (shared unified memory, no performance impact).
# If you hit OOM, try shorter duration
kin audio music generate "prompt" --model heartmula:3b --duration 5
# Or force CPU (slower but more stable memory management)
kin audio music generate "prompt" --model heartmula:3b --device cpu
Chinese Model Dependencies
# Install FunASR for Chinese models
pip install funasr modelscope
# Then use Chinese models
kin audio transcribe audio.wav --model sensevoice:small
License
MIT License - see LICENSE file.