
Voice AI Platform - Local STT/TTS with Chinese Language Support

Reason this release was yanked:

not working: use 1.1.3

Project description

LocalKin Service Audio

Python 3.10+ · MIT License

Local Voice AI Platform - Speech-to-Text and Text-to-Speech with Chinese language support, voice cloning, and Claude integration via MCP.

Features

  • Multiple STT Engines: Whisper, faster-whisper, whisper.cpp, SenseVoice, Paraformer
  • Multiple TTS Engines: Kokoro, CosyVoice, ChatTTS, F5-TTS, native OS
  • Chinese Language Support: Optimized models for Mandarin, Cantonese, and mixed Chinese-English
  • Voice Cloning: Zero-shot voice cloning with F5-TTS and CosyVoice
  • MCP Integration: Use with Claude Code and Claude Desktop
  • WebSocket Streaming: Real-time transcription and synthesis
  • REST API: FastAPI-based server with OpenAPI docs

Quick Start

# Install (uv recommended)
uv pip install localkin-service-audio

# Get model recommendations for your hardware
kin audio recommend

# View configuration
kin audio config

# Transcribe audio
kin audio transcribe audio.wav

# Text-to-speech
kin audio tts "Hello world"

# Real-time listening (microphone)
kin audio listen

# Voice AI conversation
kin audio listen --llm ollama --tts --stream

# List available models
kin audio models

# Start API server
kin audio serve --port 8000

# Start web interface
kin web

Installation

Using uv (recommended)

uv is 10-100x faster than pip and provides better dependency resolution.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install
uv venv
uv pip install localkin-service-audio

# Or install from source (for development)
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio
uv sync

Using it in a new terminal: the virtual environment must be activated in each new session:

# Option 1: Activate the venv
source .venv/bin/activate
kin audio models

# Option 2: Use uv run (no activation needed)
uv run kin audio models

To auto-activate, add to your ~/.zshrc or ~/.bashrc:

# Activate .venv automatically when entering a project directory
cd() { builtin cd "$@" && [ -f .venv/bin/activate ] && source .venv/bin/activate; }

Using pip

pip install localkin-service-audio

Optional Dependencies

# Chinese language models
uv pip install localkin-service-audio[chinese]

# Voice cloning models
uv pip install localkin-service-audio[cloning]

# MCP server for Claude
uv pip install localkin-service-audio[mcp]

# All features
uv pip install localkin-service-audio[all-new]

Replace uv pip with pip if not using uv.

CLI Usage

Speech-to-Text

# Basic transcription (auto-selects best model)
kin audio transcribe audio.wav

# Specify model
kin audio transcribe audio.wav --model whisper-cpp:base
kin audio transcribe audio.wav --model faster-whisper:large-v3
kin audio transcribe audio.wav --model sensevoice:small  # Chinese

# With language hint
kin audio transcribe audio.wav --language zh

# Output formats
kin audio transcribe audio.wav --format json
kin audio transcribe audio.wav --format srt --timestamps

Text-to-Speech

# Basic synthesis (uses Kokoro with af_heart voice)
kin audio tts "Hello world"

# List all available voices
kin audio tts "" --model kokoro --list-voices

# American English voices
kin audio tts "Hello world" --voice af_bella       # Bella (Female)
kin audio tts "Hello world" --voice am_adam         # Adam (Male)
kin audio tts "Hello world" --voice af_nova         # Nova (Female)

# British English voices
kin audio tts "Good morning" --voice bf_emma        # Emma (British Female)
kin audio tts "Good morning" --voice bm_george      # George (British Male)

# Chinese (Mandarin) voices
kin audio tts "你好世界" --voice zf_xiaoxiao         # Xiaoxiao (Chinese Female)
kin audio tts "今天天气真好" --voice zm_yunyang       # Yunyang (Chinese Male)

# Japanese voices
kin audio tts "こんにちは" --voice jf_alpha           # Alpha (Japanese Female)
kin audio tts "ありがとう" --voice jm_kumo            # Kumo (Japanese Male)

# French, Spanish, Italian, Hindi, Portuguese
kin audio tts "Bonjour le monde" --voice ff_siwis   # French
kin audio tts "Hola mundo" --voice ef_dora           # Spanish
kin audio tts "Ciao mondo" --voice if_sara           # Italian
kin audio tts "नमस्ते" --voice hf_alpha              # Hindi
kin audio tts "Olá mundo" --voice pf_dora            # Portuguese

# Adjust speech speed (0.5 = slow, 2.0 = fast)
kin audio tts "Hello world" --speed 0.8
kin audio tts "Hello world" --speed 1.5

# Save to file
kin audio tts "Hello world" --output speech.wav

# Save without auto-playing
kin audio tts "Hello world" --output speech.wav --no-play

# CosyVoice for Chinese (voice cloning capable)
kin audio tts "你好世界" --model cosyvoice:300m --voice 中文女

Real-time Listening

# Basic real-time transcription
kin audio listen

# With TTS echo
kin audio listen --tts --tts-model kokoro

# Voice AI with LLM (requires Ollama)
kin audio listen --llm ollama --tts --stream

# Custom models
kin audio listen --model sensevoice:small --language zh --tts --tts-model cosyvoice:300m

# Adjust silence detection
kin audio listen --silence-threshold 0.02 --silence-duration 2.0

Model Management

# List all models with availability status
kin audio models

# Filter by type, language, engine, or tag
kin audio models --type stt
kin audio models --type tts
kin audio models --language zh
kin audio models --engine kokoro
kin audio models --tag voice-cloning
kin audio models --search whisper

# Pull a model
kin audio pull whisper-cpp:base

# Remove a model
kin audio rm whisper-cpp:base

# Add a model from a template
kin audio add-model --template whisper_stt --name my-whisper

# Add a model from HuggingFace
kin audio add-model --repo openai/whisper-medium --name whisper-med --type stt

# List available model templates
kin audio list-templates

Model Recommendations

# Get hardware-aware model recommendations
kin audio recommend

# With detailed hardware info
kin audio recommend --verbose

The recommend command detects your hardware (GPU, RAM, CPU) and suggests optimal STT/TTS models for your system.
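
As a rough, hypothetical illustration of the kind of check recommend performs (the thresholds and helper names below are invented, psutil is assumed to be installed, and the suggested model names are taken from the tables later in this page):

# Illustrative sketch only -- not the actual logic behind `kin audio recommend`.
import os
import psutil  # assumed available for RAM detection

def suggest_models() -> dict:
    ram_gb = psutil.virtual_memory().total / 1e9
    cores = os.cpu_count() or 1
    try:
        import torch
        has_gpu = torch.cuda.is_available()  # Apple Silicon (mps) would be checked similarly
    except ImportError:
        has_gpu = False

    if has_gpu and ram_gb >= 16:
        return {"stt": "faster-whisper:large-v3", "tts": "kokoro"}
    if ram_gb >= 8 and cores >= 4:
        return {"stt": "whisper-cpp:base", "tts": "kokoro"}
    return {"stt": "whisper-cpp:tiny", "tts": "native"}

print(suggest_models())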

Configuration

# View configuration overview
kin audio config

# Show configuration file paths
kin audio config --path

# Show all registered models
kin audio config --models

# Initialize config directory with sample config
kin audio config --init

# Change settings
kin audio config set default_tts_model kokoro
kin audio config set default_stt_model faster-whisper:large-v3
kin audio config set api_port 9000
kin audio config set default_device cuda

Configuration files are stored in $LOCALKIN_HOME/ (default: ~/.localkin-service-audio/).

Set LOCALKIN_HOME to relocate all data (cache, config, models) to another disk:

export LOCALKIN_HOME="/path/to/large/disk/.localkin-service-audio"

System Status & Diagnostics

# Check system status (libraries, registry, cache)
kin audio status

# Show cache info
kin audio cache info

# Clear cache for a specific model
kin audio cache clear whisper-large

# Clear all cached models
kin audio cache clear

# Show running LocalKin Audio servers
kin audio ps

API Server

# Start REST API server
kin audio serve --port 8000

# Start web interface
kin web --port 5000

Supported Models

kin audio models shows all 28 models with real-time availability status:

  • ✅ Ready — engine installed, usable now
  • 📦 Not installed — strategy code exists, just needs pip install
  • 🔮 Planned — future implementation

STT Models (20)

Model                                                    | Engine         | Languages      | Features                      | Status
whisper:tiny/base/small/medium/large-v3                  | OpenAI Whisper | Multilingual   | Standard reference            | Ready
faster-whisper:tiny/base/large-v3/turbo/distil-large-v3  | CTranslate2    | Multilingual   | 4x faster, GPU                | Ready
whisper-cpp:tiny/base/small/medium                       | whisper.cpp    | Multilingual   | Fast CPU inference            | Ready
moonshine:tiny/base                                      | Moonshine      | English        | 5x real-time, ~20MB           | Install needed
sensevoice:small                                         | FunASR         | zh, en, ja, ko | 15x faster, emotion detection | Install needed
paraformer:zh                                            | FunASR         | Chinese        | Fast Chinese ASR              | Install needed
parakeet:1.1b                                            | NVIDIA NeMo    | English        | >2000x real-time              | Planned
canary:1b                                                | NVIDIA NeMo    | Multilingual   | #1 HuggingFace leaderboard    | Planned

TTS Models (8)

Model               | Engine     | Languages                      | Features                     | Status
native              | pyttsx3    | System                         | No download needed           | Ready
kokoro / kokoro:82m | Kokoro     | en, es, fr, hi, it, ja, pt, zh | 54 voices, multilingual      | Ready
cosyvoice:300m      | CosyVoice  | zh, en, ja, ko, yue            | Voice cloning, streaming     | Install needed
chattts             | ChatTTS    | zh, en                         | Conversational, emotion      | Install needed
f5-tts              | F5-TTS     | en, zh                         | Zero-shot voice cloning      | Install needed
gpt-sovits          | GPT-SoVITS | zh, en, ja                     | Voice cloning with 5s audio  | Planned
parler-tts          | Parler     | English                        | Text-described voice         | Planned

Kokoro Voice Reference

Kokoro supports 54 voices across 9 languages. Voice IDs follow the pattern {lang}{gender}_{name} (a short parsing sketch follows the table):

Prefix | Language                   | Example Voices
af_    | American English (Female)  | af_heart, af_bella, af_nova, af_sarah, af_sky
am_    | American English (Male)    | am_adam, am_michael, am_echo, am_puck
bf_    | British English (Female)   | bf_emma, bf_alice, bf_lily, bf_isabella
bm_    | British English (Male)     | bm_george, bm_lewis, bm_daniel, bm_fable
zf_    | Chinese Mandarin (Female)  | zf_xiaoxiao, zf_xiaobei, zf_xiaoni, zf_xiaoyi
zm_    | Chinese Mandarin (Male)    | zm_yunyang, zm_yunxi, zm_yunjian, zm_yunxia
jf_    | Japanese (Female)          | jf_alpha, jf_nezumi, jf_gongitsune, jf_tebukuro
jm_    | Japanese (Male)            | jm_kumo
ff_    | French (Female)            | ff_siwis
ef_    | Spanish (Female)           | ef_dora
em_    | Spanish (Male)             | em_alex
hf_    | Hindi (Female)             | hf_alpha, hf_beta
hm_    | Hindi (Male)               | hm_omega, hm_psi
if_    | Italian (Female)           | if_sara
im_    | Italian (Male)             | im_nicola
pf_    | Portuguese (Female)        | pf_dora
pm_    | Portuguese (Male)          | pm_alex
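
The prefix convention can also be applied programmatically. A small, hypothetical helper (not part of the package API) that decodes a voice ID according to the table above:

# Decode a Kokoro voice ID of the form {lang}{gender}_{name} (illustrative only).
def parse_voice_id(voice_id: str) -> dict:
    prefix, _, name = voice_id.partition("_")
    lang_code, gender_code = prefix[:-1], prefix[-1]
    languages = {
        "a": "American English", "b": "British English", "z": "Chinese Mandarin",
        "j": "Japanese", "f": "French", "e": "Spanish", "h": "Hindi",
        "i": "Italian", "p": "Portuguese",
    }
    genders = {"f": "Female", "m": "Male"}
    return {
        "language": languages.get(lang_code, lang_code),
        "gender": genders.get(gender_code, gender_code),
        "name": name,
    }

print(parse_voice_id("zf_xiaoxiao"))
# {'language': 'Chinese Mandarin', 'gender': 'Female', 'name': 'xiaoxiao'}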

Python API

from localkin_service_audio import AudioEngine, transcribe, synthesize

# Quick functions
result = transcribe("audio.wav", model="whisper-cpp:base")
print(result.text)

audio = synthesize("Hello world", model="kokoro")
audio.save("output.wav")

# Full engine control
engine = AudioEngine()

# Load and use STT
engine.load_stt("whisper-cpp:base")
result = engine.transcribe("audio.wav", language="en")
print(f"Text: {result.text}")
print(f"Language: {result.language}")

# Load and use TTS - English
engine.load_tts("kokoro")
audio = engine.synthesize("Hello world", voice="af_heart")
audio.save("english.wav")

# TTS - Chinese (auto-selects Chinese pipeline)
audio = engine.synthesize("你好世界", voice="zf_xiaoxiao")
audio.save("chinese.wav")

# TTS - Japanese
audio = engine.synthesize("こんにちは世界", voice="jf_alpha")
audio.save("japanese.wav")

# TTS - with speed control
audio = engine.synthesize("Hello", voice="am_adam", speed=0.8)
audio.save("slow.wav")

# List available voices
voices = engine.list_voices()
for v in voices:
    print(f"{v.id}: {v.name} ({v.language}, {v.gender})")

# Voice cloning (with supported models)
engine.load_tts("f5-tts")
audio = engine.clone_voice(
    reference_audio="reference.wav",
    text="Text to speak in cloned voice"
)

MCP Integration

Use LocalKin Audio with Claude Code or Claude Desktop:

# Start MCP server
kin mcp

Add to Claude Desktop config (~/.config/claude/claude_desktop_config.json):

{
  "mcpServers": {
    "localkin-audio": {
      "command": "kin",
      "args": ["mcp"]
    }
  }
}

Available MCP tools:

  • transcribe_audio - Transcribe audio files
  • synthesize_speech - Generate speech from text
  • clone_voice - Clone voice from reference audio
  • list_models - List available models
  • list_voices - List available voices

REST API

Start the server:

kin audio serve --port 8000

Endpoints

POST /transcribe - Transcribe audio

curl -X POST "http://localhost:8000/transcribe" \
  -F "file=@audio.wav" \
  -F "model=whisper-cpp:base" \
  -F "language=en"

POST /synthesize - Synthesize speech

curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "model": "kokoro", "voice": "af_bella"}' \
  --output speech.wav

GET /models - List models

curl "http://localhost:8000/models"

WebSocket /stream - Real-time transcription

const ws = new WebSocket("ws://localhost:8000/stream");
ws.send(audioChunk);
ws.onmessage = (e) => console.log(JSON.parse(e.data).text);
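
A rough Python counterpart using the third-party websockets package (the chunk size and message framing below are assumptions; adapt them to the server's actual protocol):

# Illustrative streaming client; send raw audio bytes, print transcription JSON.
import asyncio
import json
import websockets

async def stream_file(path: str) -> None:
    async with websockets.connect("ws://localhost:8000/stream") as ws:

        async def sender() -> None:
            with open(path, "rb") as f:
                while chunk := f.read(4096):
                    await ws.send(chunk)  # audio chunk, as in the JS example

        async def receiver() -> None:
            async for message in ws:  # runs until the server closes the connection
                print(json.loads(message)["text"])

        await asyncio.gather(sender(), receiver())

asyncio.run(stream_file("audio.wav"))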

Configuration

Environment Variables

# Base directory for all data (cache, config, models)
export LOCALKIN_HOME="/Volumes/Data/.localkin-service-audio"

# Override individual directories
export LOCALKIN_CACHE_DIR="/tmp/my-cache"
export LOCALKIN_CONFIG_DIR="/path/to/config"
export LOCALKIN_MODELS_DIR="/path/to/models"

# Default engine settings
export LOCALKIN_DEFAULT_STT="faster-whisper:large-v3"
export LOCALKIN_DEFAULT_TTS="kokoro"
export LOCALKIN_DEVICE=cuda  # or cpu, mps, auto

# API server
export LOCALKIN_API_HOST="127.0.0.1"
export LOCALKIN_API_PORT="8000"

Custom Models

Create $LOCALKIN_HOME/models.json (default: ~/.localkin-service-audio/models.json):

{
  "models": {
    "my-custom-model": {
      "type": "stt",
      "engine": "whisper",
      "model_size": "base",
      "languages": ["en", "zh"],
      "description": "My custom model"
    }
  }
}
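
To sanity-check a custom entry before running the CLI, a small throwaway script (illustrative only; it assumes the default registry path) can load and print the registry:

# Quick sanity check for a custom models.json -- not part of the CLI.
import json
from pathlib import Path

path = Path.home() / ".localkin-service-audio" / "models.json"
registry = json.loads(path.read_text())

for name, spec in registry["models"].items():
    print(f"{name}: type={spec['type']}, engine={spec['engine']}, langs={spec.get('languages', [])}")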

Architecture

LocalKin Audio v2.0 uses a modular architecture:

  • Strategy Pattern: Pluggable STT/TTS engines
  • Facade Pattern: AudioEngine provides unified interface
  • Registry Pattern: Centralized model configuration
  • Singleton Pattern: Shared engine instance

Directory layout (a minimal code sketch of these patterns follows the tree below):

localkin_service_audio/
├── core/
│   ├── audio_processing/
│   │   ├── engine.py          # AudioEngine facade
│   │   ├── stt/               # STT strategies
│   │   │   ├── base.py
│   │   │   ├── whisper_strategy.py
│   │   │   ├── sensevoice_strategy.py
│   │   │   └── ...
│   │   └── tts/               # TTS strategies
│   │       ├── base.py
│   │       ├── kokoro_strategy.py
│   │       ├── cosyvoice_strategy.py
│   │       └── ...
│   ├── config/
│   │   └── model_registry.py  # Model registry
│   └── types.py               # Core dataclasses
├── cli/                       # Click CLI
├── api/                       # FastAPI server
├── mcp/                       # MCP server
└── ui/                        # Web interface
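
As a minimal, hypothetical sketch of how these patterns fit together (class and method names here are illustrative, not the package's real internals):

# Illustrative sketch of the Strategy + Facade + Registry split described above.
from abc import ABC, abstractmethod


class STTStrategy(ABC):
    """Strategy: a pluggable speech-to-text engine."""

    @abstractmethod
    def transcribe(self, audio_path: str, language: str | None = None) -> str:
        ...


class WhisperCppStrategy(STTStrategy):
    def transcribe(self, audio_path: str, language: str | None = None) -> str:
        # A real strategy would invoke whisper.cpp here.
        return "...transcribed text..."


class EngineFacade:
    """Facade: one entry point that dispatches to whichever strategy is loaded."""

    # Registry: model name -> strategy class (the real registry is config-driven).
    _registry = {"whisper-cpp:base": WhisperCppStrategy}

    def __init__(self) -> None:
        self._stt: STTStrategy | None = None

    def load_stt(self, name: str) -> None:
        self._stt = self._registry[name]()

    def transcribe(self, audio_path: str, language: str | None = None) -> str:
        if self._stt is None:
            raise RuntimeError("call load_stt() first")
        return self._stt.transcribe(audio_path, language)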

Development

# Clone repository
git clone https://github.com/LocalKinAI/localkin-service-audio.git
cd localkin-service-audio

# Install with dev dependencies (uv recommended)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check .
black --check .

Tip: With uv, you can skip activation and run commands directly:

uv run kin audio models
uv run pytest tests/

Troubleshooting

Model Loading Errors

# Check model is registered
kin audio models

# Pull the model
kin audio pull whisper-cpp:base

# Check system info
kin info --verbose

CUDA/GPU Issues

# Force CPU
kin audio transcribe audio.wav --device cpu

# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"

Chinese Model Dependencies

# Install FunASR for Chinese models
pip install funasr modelscope

# Then use Chinese models
kin audio transcribe audio.wav --model sensevoice:small

License

MIT License - see LICENSE file.

Acknowledgments



Download files


Source Distribution

localkin_service_audio-2.0.3.tar.gz (2.3 MB)

Built Distribution

localkin_service_audio-2.0.3-py3-none-any.whl (2.1 MB)

File details

Details for the file localkin_service_audio-2.0.3.tar.gz.

File metadata

  • Download URL: localkin_service_audio-2.0.3.tar.gz
  • Upload date:
  • Size: 2.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for localkin_service_audio-2.0.3.tar.gz
Algorithm   | Hash digest
SHA256      | c9140764a105340717e0ddaac79162b8a806fa0cfa7df8928d5c696d4d0917a5
MD5         | 5873d1a7d48c1d529cc1eb4a2644e83d
BLAKE2b-256 | 9a226aafc888df0a0f056f4fc79eda1ef6f652e90fa0665a652bf49f71e37d19


File details

Details for the file localkin_service_audio-2.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for localkin_service_audio-2.0.3-py3-none-any.whl
Algorithm   | Hash digest
SHA256      | 15799a16a3a69947dc2097fc732205ac3685f021879bdd99c60ea8ee5c0ab2c7
MD5         | 319fc23d46324cf4b889e3eb5dfc894f
BLAKE2b-256 | 8dc37c442592dd3dcbd44c760333f9d52b37231a062e13a9cf286b3f3d06368f

