Voice AI Platform - Local STT/TTS with Chinese Language Support
LocalKin Service Audio
Local Voice AI Platform - Speech-to-Text and Text-to-Speech with Chinese language support, voice cloning, and Claude integration via MCP.
Features
- Multiple STT Engines: Whisper, faster-whisper, whisper.cpp, SenseVoice, Paraformer
- Multiple TTS Engines: Kokoro, CosyVoice, ChatTTS, F5-TTS, native OS
- Chinese Language Support: Optimized models for Mandarin, Cantonese, and mixed Chinese-English
- Voice Cloning: Zero-shot voice cloning with F5-TTS and CosyVoice
- MCP Integration: Use with Claude Code and Claude Desktop
- WebSocket Streaming: Real-time transcription and synthesis
- REST API: FastAPI-based server with OpenAPI docs
Quick Start
# Install (uv recommended)
uv pip install localkin-service-audio
# Get model recommendations for your hardware
kin audio recommend
# View configuration
kin audio config
# Transcribe audio
kin audio transcribe audio.wav
# Text-to-speech
kin audio tts "Hello world"
# Real-time listening (microphone)
kin audio listen
# Voice AI conversation
kin audio listen --llm ollama --tts --stream
# List available models
kin audio models
# Start API server
kin audio serve --port 8000
# Start web interface
kin web
Installation
Using uv (recommended)
uv is 10-100x faster than pip with better dependency resolution.
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a virtual environment and install
uv venv
uv pip install localkin-service-audio
# Or install from source (for development)
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio
uv sync
Using in a new terminal: The virtual environment needs to be activated each session:
# Option 1: Activate the venv
source .venv/bin/activate
kin audio models
# Option 2: Use uv run (no activation needed)
uv run kin audio models
To auto-activate, add to your ~/.zshrc or ~/.bashrc:
# Activate .venv automatically when entering a project directory
cd() { builtin cd "$@" && [ -f .venv/bin/activate ] && source .venv/bin/activate; }
Using pip
pip install localkin-service-audio
Optional Dependencies
# Chinese language models
uv pip install localkin-service-audio[chinese]
# Voice cloning models
uv pip install localkin-service-audio[cloning]
# MCP server for Claude
uv pip install localkin-service-audio[mcp]
# All features
uv pip install localkin-service-audio[all-new]
Replace uv pip with pip if not using uv.
CLI Usage
Speech-to-Text
# Basic transcription (auto-selects best model)
kin audio transcribe audio.wav
# Specify model
kin audio transcribe audio.wav --model whisper-cpp:base
kin audio transcribe audio.wav --model faster-whisper:large-v3
kin audio transcribe audio.wav --model sensevoice:small # Chinese
# With language hint
kin audio transcribe audio.wav --language zh
# Output formats
kin audio transcribe audio.wav --format json
kin audio transcribe audio.wav --format srt --timestamps
Text-to-Speech
# Basic synthesis (uses Kokoro with af_heart voice)
kin audio tts "Hello world"
# List all available voices
kin audio tts "" --model kokoro --list-voices
# American English voices
kin audio tts "Hello world" --voice af_bella # Bella (Female)
kin audio tts "Hello world" --voice am_adam # Adam (Male)
kin audio tts "Hello world" --voice af_nova # Nova (Female)
# British English voices
kin audio tts "Good morning" --voice bf_emma # Emma (British Female)
kin audio tts "Good morning" --voice bm_george # George (British Male)
# Chinese (Mandarin) voices
kin audio tts "你好世界" --voice zf_xiaoxiao # Xiaoxiao (Chinese Female)
kin audio tts "今天天气真好" --voice zm_yunyang # Yunyang (Chinese Male)
# Japanese voices
kin audio tts "こんにちは" --voice jf_alpha # Alpha (Japanese Female)
kin audio tts "ありがとう" --voice jm_kumo # Kumo (Japanese Male)
# French, Spanish, Italian, Hindi, Portuguese
kin audio tts "Bonjour le monde" --voice ff_siwis # French
kin audio tts "Hola mundo" --voice ef_dora # Spanish
kin audio tts "Ciao mondo" --voice if_sara # Italian
kin audio tts "नमस्ते" --voice hf_alpha # Hindi
kin audio tts "Olá mundo" --voice pf_dora # Portuguese
# Adjust speech speed (0.5 = slow, 2.0 = fast)
kin audio tts "Hello world" --speed 0.8
kin audio tts "Hello world" --speed 1.5
# Save to file
kin audio tts "Hello world" --output speech.wav
# Save without auto-playing
kin audio tts "Hello world" --output speech.wav --no-play
# CosyVoice for Chinese (voice cloning capable)
kin audio tts "你好世界" --model cosyvoice:300m --voice 中文女
Real-time Listening
# Basic real-time transcription
kin audio listen
# With TTS echo
kin audio listen --tts --tts-model kokoro
# Voice AI with LLM (requires Ollama)
kin audio listen --llm ollama --tts --stream
# Custom models
kin audio listen --model sensevoice:small --language zh --tts --tts-model cosyvoice:300m
# Adjust silence detection
kin audio listen --silence-threshold 0.02 --silence-duration 2.0
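The --silence-threshold value is easiest to reason about as a normalized RMS level of the incoming audio. A minimal sketch of that comparison (an illustration of the idea, not the tool's actual implementation):

```python
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit little-endian PCM frame, normalized to 0..1."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5 / 32768.0

def is_silent(frame: bytes, threshold: float = 0.02) -> bool:
    """True when the frame's level falls below the silence threshold."""
    return rms(frame) < threshold
```

A quiet room typically sits well under 0.02, so raising the threshold makes detection more aggressive, and --silence-duration controls how long the level must stay below it before the utterance is considered finished.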
Model Management
# List all models with availability status
kin audio models
# Filter by type, language, engine, or tag
kin audio models --type stt
kin audio models --type tts
kin audio models --language zh
kin audio models --engine kokoro
kin audio models --tag voice-cloning
kin audio models --search whisper
# Pull a model
kin audio pull whisper-cpp:base
# Remove a model
kin audio rm whisper-cpp:base
# Add a model from a template
kin audio add-model --template whisper_stt --name my-whisper
# Add a model from HuggingFace
kin audio add-model --repo openai/whisper-medium --name whisper-med --type stt
# List available model templates
kin audio list-templates
Model Recommendations
# Get hardware-aware model recommendations
kin audio recommend
# With detailed hardware info
kin audio recommend --verbose
The recommend command detects your hardware (GPU, RAM, CPU) and suggests optimal STT/TTS models for your system.
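The selection logic behind recommend can be pictured as a simple capability ladder over the detected hardware. A hypothetical sketch (the thresholds and model picks here are illustrative, not the tool's actual rules):

```python
def recommend_models(ram_gb: float, has_gpu: bool) -> dict:
    """Map detected hardware to a suggested STT/TTS pair."""
    if has_gpu and ram_gb >= 16:
        # Plenty of headroom: large GPU-accelerated STT
        return {"stt": "faster-whisper:large-v3", "tts": "kokoro"}
    if ram_gb >= 8:
        return {"stt": "faster-whisper:base", "tts": "kokoro"}
    # Low-memory machines: small CPU-friendly models
    return {"stt": "whisper-cpp:tiny", "tts": "native"}

print(recommend_models(32, True))
```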
Configuration
# View configuration overview
kin audio config
# Show configuration file paths
kin audio config --path
# Show all registered models
kin audio config --models
# Initialize config directory with sample config
kin audio config --init
# Change settings
kin audio config set default_tts_model kokoro
kin audio config set default_stt_model faster-whisper:large-v3
kin audio config set api_port 9000
kin audio config set default_device cuda
Configuration files are stored in $LOCALKIN_HOME/ (default: ~/.localkin-service-audio/).
Set LOCALKIN_HOME to relocate all data (cache, config, models) to another disk:
export LOCALKIN_HOME="/path/to/large/disk/.localkin-service-audio"
System Status & Diagnostics
# Check system status (libraries, registry, cache)
kin audio status
# Show cache info
kin audio cache info
# Clear cache for a specific model
kin audio cache clear whisper-large
# Clear all cached models
kin audio cache clear
# Show running LocalKin Audio servers
kin audio ps
API Server
# Start REST API server
kin audio serve --port 8000
# Start web interface
kin web --port 5000
Supported Models
kin audio models shows all 28 models with real-time availability status:
- ✅ Ready — engine installed, usable now
- 📦 Not installed — strategy code exists, just needs a pip install
- 🔮 Planned — future implementation
STT Models (20)
| Model | Engine | Languages | Features | Status |
|---|---|---|---|---|
| whisper:tiny/base/small/medium/large-v3 | OpenAI Whisper | Multilingual | Standard reference | Ready |
| faster-whisper:tiny/base/large-v3/turbo/distil-large-v3 | CTranslate2 | Multilingual | 4x faster, GPU | Ready |
| whisper-cpp:tiny/base/small/medium | whisper.cpp | Multilingual | Fast CPU inference | Ready |
| moonshine:tiny/base | Moonshine | English | 5x real-time, ~20MB | Install needed |
| sensevoice:small | FunASR | zh, en, ja, ko | 15x faster, emotion detection | Install needed |
| paraformer:zh | FunASR | Chinese | Fast Chinese ASR | Install needed |
| parakeet:1.1b | NVIDIA NeMo | English | >2000x real-time | Planned |
| canary:1b | NVIDIA NeMo | Multilingual | #1 HuggingFace leaderboard | Planned |
TTS Models (8)
| Model | Engine | Languages | Features | Status |
|---|---|---|---|---|
| native | pyttsx3 | System | No download needed | Ready |
| kokoro / kokoro:82m | Kokoro | en, es, fr, hi, it, ja, pt, zh | 54 voices, multilingual | Ready |
| cosyvoice:300m | CosyVoice | zh, en, ja, ko, yue | Voice cloning, streaming | Install needed |
| chattts | ChatTTS | zh, en | Conversational, emotion | Install needed |
| f5-tts | F5-TTS | en, zh | Zero-shot voice cloning | Install needed |
| gpt-sovits | GPT-SoVITS | zh, en, ja | Voice cloning with 5s audio | Planned |
| parler-tts | Parler | English | Text-described voice | Planned |
Kokoro Voice Reference
Kokoro supports 54 voices across 9 languages. Voice IDs follow the pattern {lang}{gender}_{name}:
| Prefix | Language | Example Voices |
|---|---|---|
| af_ | American English (Female) | af_heart, af_bella, af_nova, af_sarah, af_sky |
| am_ | American English (Male) | am_adam, am_michael, am_echo, am_puck |
| bf_ | British English (Female) | bf_emma, bf_alice, bf_lily, bf_isabella |
| bm_ | British English (Male) | bm_george, bm_lewis, bm_daniel, bm_fable |
| zf_ | Chinese Mandarin (Female) | zf_xiaoxiao, zf_xiaobei, zf_xiaoni, zf_xiaoyi |
| zm_ | Chinese Mandarin (Male) | zm_yunyang, zm_yunxi, zm_yunjian, zm_yunxia |
| jf_ | Japanese (Female) | jf_alpha, jf_nezumi, jf_gongitsune, jf_tebukuro |
| jm_ | Japanese (Male) | jm_kumo |
| ff_ | French (Female) | ff_siwis |
| ef_ | Spanish (Female) | ef_dora |
| em_ | Spanish (Male) | em_alex |
| hf_ | Hindi (Female) | hf_alpha, hf_beta |
| hm_ | Hindi (Male) | hm_omega, hm_psi |
| if_ | Italian (Female) | if_sara |
| im_ | Italian (Male) | im_nicola |
| pf_ | Portuguese (Female) | pf_dora |
| pm_ | Portuguese (Male) | pm_alex |
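The {lang}{gender}_{name} pattern can be parsed mechanically, which is handy when filtering voices in scripts. A small helper (a sketch; the prefix table mirrors the one above):

```python
# Language codes taken from the Kokoro voice-prefix table above
LANG_PREFIX = {
    "a": "American English", "b": "British English", "z": "Chinese Mandarin",
    "j": "Japanese", "f": "French", "e": "Spanish", "h": "Hindi",
    "i": "Italian", "p": "Portuguese",
}

def parse_voice_id(voice_id: str) -> dict:
    """Split a Kokoro voice ID like 'af_heart' into language, gender, and name."""
    prefix, _, name = voice_id.partition("_")
    lang_code, gender_code = prefix[0], prefix[1]
    return {
        "language": LANG_PREFIX[lang_code],
        "gender": "Female" if gender_code == "f" else "Male",
        "name": name,
    }
```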
Python API
from localkin_service_audio import AudioEngine, transcribe, synthesize
# Quick functions
result = transcribe("audio.wav", model="whisper-cpp:base")
print(result.text)
audio = synthesize("Hello world", model="kokoro")
audio.save("output.wav")
# Full engine control
engine = AudioEngine()
# Load and use STT
engine.load_stt("whisper-cpp:base")
result = engine.transcribe("audio.wav", language="en")
print(f"Text: {result.text}")
print(f"Language: {result.language}")
# Load and use TTS - English
engine.load_tts("kokoro")
audio = engine.synthesize("Hello world", voice="af_heart")
audio.save("english.wav")
# TTS - Chinese (auto-selects Chinese pipeline)
audio = engine.synthesize("你好世界", voice="zf_xiaoxiao")
audio.save("chinese.wav")
# TTS - Japanese
audio = engine.synthesize("こんにちは世界", voice="jf_alpha")
audio.save("japanese.wav")
# TTS - with speed control
audio = engine.synthesize("Hello", voice="am_adam", speed=0.8)
audio.save("slow.wav")
# List available voices
voices = engine.list_voices()
for v in voices:
    print(f"{v.id}: {v.name} ({v.language}, {v.gender})")
# Voice cloning (with supported models)
engine.load_tts("f5-tts")
audio = engine.clone_voice(
    reference_audio="reference.wav",
    text="Text to speak in cloned voice"
)
MCP Integration
Use LocalKin Audio with Claude Code or Claude Desktop:
# Start MCP server
kin mcp
Add to Claude Desktop config (~/.config/claude/claude_desktop_config.json):
{
  "mcpServers": {
    "localkin-audio": {
      "command": "kin",
      "args": ["mcp"]
    }
  }
}
Available MCP tools:
- transcribe_audio - Transcribe audio files
- synthesize_speech - Generate speech from text
- clone_voice - Clone voice from reference audio
- list_models - List available models
- list_voices - List available voices
REST API
Start the server:
kin audio serve --port 8000
Endpoints
POST /transcribe - Transcribe audio
curl -X POST "http://localhost:8000/transcribe" \
-F "file=@audio.wav" \
-F "model=whisper-cpp:base" \
-F "language=en"
POST /synthesize - Synthesize speech
curl -X POST "http://localhost:8000/synthesize" \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "model": "kokoro", "voice": "af_bella"}' \
--output speech.wav
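From Python, the same /synthesize call can be made with the standard library alone. A sketch assuming the request shape shown above:

```python
import json
import urllib.request

def build_synthesize_request(text: str, model: str = "kokoro",
                             voice: str = "af_bella",
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /synthesize request; sending it requires a running server."""
    payload = json.dumps({"text": text, "model": model, "voice": voice}).encode()
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running:
# with urllib.request.urlopen(build_synthesize_request("Hello world")) as resp:
#     open("speech.wav", "wb").write(resp.read())
```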
GET /models - List models
curl "http://localhost:8000/models"
WebSocket /stream - Real-time transcription
const ws = new WebSocket("ws://localhost:8000/stream");
ws.send(audioChunk);
ws.onmessage = (e) => console.log(JSON.parse(e.data).text);
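On the client side, raw PCM is typically sent in fixed-size chunks. A Python sketch (the chunk size and the third-party websockets package are assumptions; the JSON reply shape follows the JavaScript snippet above):

```python
import asyncio
import json

def chunk_audio(pcm: bytes, chunk_size: int = 3200) -> list:
    """Split raw PCM into fixed-size chunks (3200 bytes = 100 ms of 16 kHz/16-bit mono)."""
    return [pcm[i:i + chunk_size] for i in range(0, len(pcm), chunk_size)]

async def stream_transcribe(pcm: bytes, url: str = "ws://localhost:8000/stream"):
    import websockets  # third-party client library, assumed installed
    async with websockets.connect(url) as ws:
        for chunk in chunk_audio(pcm):
            await ws.send(chunk)
            reply = json.loads(await ws.recv())
            print(reply.get("text", ""))

# asyncio.run(stream_transcribe(open("audio.raw", "rb").read()))
```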
Configuration
Environment Variables
# Base directory for all data (cache, config, models)
export LOCALKIN_HOME="/Volumes/Data/.localkin-service-audio"
# Override individual directories
export LOCALKIN_CACHE_DIR="/tmp/my-cache"
export LOCALKIN_CONFIG_DIR="/path/to/config"
export LOCALKIN_MODELS_DIR="/path/to/models"
# Default engine settings
export LOCALKIN_DEFAULT_STT="faster-whisper:large-v3"
export LOCALKIN_DEFAULT_TTS="kokoro"
export LOCALKIN_DEVICE=cuda # or cpu, mps, auto
# API server
export LOCALKIN_API_HOST="127.0.0.1"
export LOCALKIN_API_PORT="8000"
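Resolution order is: the specific directory override first, then LOCALKIN_HOME, then the built-in default. A sketch of that lookup (the cache subdirectory name is an assumption):

```python
import os
from pathlib import Path

def localkin_home() -> Path:
    """Base directory: LOCALKIN_HOME if set, else ~/.localkin-service-audio."""
    return Path(os.environ.get("LOCALKIN_HOME",
                               str(Path.home() / ".localkin-service-audio")))

def cache_dir() -> Path:
    """LOCALKIN_CACHE_DIR wins when set; otherwise a subdirectory of the base."""
    override = os.environ.get("LOCALKIN_CACHE_DIR")
    return Path(override) if override else localkin_home() / "cache"
```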
Custom Models
Create $LOCALKIN_HOME/models.json (default: ~/.localkin-service-audio/models.json):
{
  "models": {
    "my-custom-model": {
      "type": "stt",
      "engine": "whisper",
      "model_size": "base",
      "languages": ["en", "zh"],
      "description": "My custom model"
    }
  }
}
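A quick way to sanity-check a hand-written models.json before registering it (a sketch; the required keys are inferred from the example above):

```python
import json

# Keys every model entry is assumed to need, based on the example above
REQUIRED_KEYS = {"type", "engine", "description"}

def validate_models_json(text: str) -> list:
    """Return a list of problems found; an empty list means the file looks well-formed."""
    problems = []
    data = json.loads(text)
    for name, spec in data.get("models", {}).items():
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        if spec.get("type") not in {"stt", "tts"}:
            problems.append(f"{name}: type must be 'stt' or 'tts'")
    return problems
```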
Architecture
LocalKin Audio v2.0 uses a modular architecture:
- Strategy Pattern: Pluggable STT/TTS engines
- Facade Pattern: AudioEngine provides unified interface
- Registry Pattern: Centralized model configuration
- Singleton Pattern: Shared engine instance
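The Strategy and Facade roles can be sketched in a few lines (hypothetical class bodies; the real engine.py is more involved):

```python
from abc import ABC, abstractmethod

class TTSStrategy(ABC):
    """Strategy interface: every TTS engine implements the same synthesize() contract."""
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class KokoroStrategy(TTSStrategy):
    def synthesize(self, text: str) -> bytes:
        return b"kokoro:" + text.encode()  # stand-in for real synthesis

class AudioEngine:
    """Facade: callers talk to one object; the loaded strategy does the work."""
    def __init__(self):
        self._tts = None

    def load_tts(self, strategy: TTSStrategy):
        self._tts = strategy

    def synthesize(self, text: str) -> bytes:
        return self._tts.synthesize(text)
```

Swapping Kokoro for CosyVoice then means loading a different strategy object; the facade's synthesize() call site never changes.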
localkin_service_audio/
├── core/
│   ├── audio_processing/
│   │   ├── engine.py              # AudioEngine facade
│   │   ├── stt/                   # STT strategies
│   │   │   ├── base.py
│   │   │   ├── whisper_strategy.py
│   │   │   ├── sensevoice_strategy.py
│   │   │   └── ...
│   │   └── tts/                   # TTS strategies
│   │       ├── base.py
│   │       ├── kokoro_strategy.py
│   │       ├── cosyvoice_strategy.py
│   │       └── ...
│   ├── config/
│   │   └── model_registry.py      # Model registry
│   └── types.py                   # Core dataclasses
├── cli/                           # Click CLI
├── api/                           # FastAPI server
├── mcp/                           # MCP server
└── ui/                            # Web interface
Development
# Clone repository
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio
# Install with dev dependencies (uv recommended)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run linting
ruff check .
black --check .
Tip: With uv, you can skip activation and run commands directly:
uv run kin audio models
uv run pytest tests/
Troubleshooting
Model Loading Errors
# Check model is registered
kin audio models
# Pull the model
kin audio pull whisper-cpp:base
# Check system info
kin info --verbose
CUDA/GPU Issues
# Force CPU
kin audio transcribe audio.wav --device cpu
# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
Chinese Model Dependencies
# Install FunASR for Chinese models
pip install funasr modelscope
# Then use Chinese models
kin audio transcribe audio.wav --model sensevoice:small
License
MIT License - see LICENSE file.
Acknowledgments