OpenAI-compatible inference server: Llama 3.1 8B + Whisper + Kokoro TTS exposed via ngrok
llm-host
OpenAI-compatible inference server that runs an LLM (via vLLM), Whisper transcription/translation, and Kokoro TTS on a GPU and exposes them all at a single URL — optionally via ngrok for a public endpoint.
Default model: Qwen/Qwen3.5-2B + whisper-small — tuned to run on a T4 GPU (15 GB VRAM). Swap to larger models via --model / --whisper-model or env vars.
Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.
Install
pip install llm-host
# vLLM must be installed separately (GPU/CUDA-specific build)
pip install "vllm>=0.6.0"
# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng # Debian/Ubuntu/Colab
Quickstart
With ngrok (public URL):
llm-host \
--ngrok-token YOUR_NGROK_TOKEN \
--hf-token YOUR_HF_TOKEN
Without ngrok (localhost / LAN only):
llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001 and http://<server-ip>:5001
Or with environment variables:
NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host
Without --ngrok-token the server binds to 0.0.0.0 and prints both the
localhost and network IP URLs. Pass --ngrok-token to get a public ngrok URL.
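Loading the models may take some time after launch, so a client might want to wait for the gateway before sending requests. The sketch below polls the `/health` endpoint from the Endpoints table. It assumes that an HTTP 200 response means the service is ready, which the README does not state, and the helper names `http_ok` and `wait_until_healthy` are illustrative and not part of llm-host.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(
    base_url: str,
    attempts: int = 60,
    delay: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/health until it succeeds or `attempts` run out."""
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

For a local run this could be called as `wait_until_healthy("http://localhost:5001")`; the `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.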
Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Dashboard UI |
| `GET` | `/health` | Service status |
| `GET` | `/v1/models` | List models |
| `POST` | `/v1/chat/completions` | LLM chat (streaming supported) |
| `POST` | `/v1/audio/transcriptions` | Whisper STT (keep source language) |
| `POST` | `/v1/audio/translations` | Whisper STT → English |
| `POST` | `/v1/audio/speech` | Kokoro TTS |
All API endpoints are OpenAI-compatible — drop in the ngrok URL as base_url with any OpenAI SDK.
from openai import OpenAI

client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",
)

# Chat (model name = --served-model-name, default is last part of --model)
resp = client.chat.completions.create(
    model="Qwen3.5-2B",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Transcription
with open("audio.wav", "rb") as f:
    text = client.audio.transcriptions.create(model="whisper-1", file=f)

# TTS
client.audio.speech.create(
    model="tts-1", input="Hello!", voice="nova"
).stream_to_file("out.mp3")
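The endpoint table notes that chat streaming is supported. As a rough sketch using only the standard library, the code below posts a request with `"stream": true` and reads the reply as OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). It assumes the gateway relays that event format unchanged; `iter_sse_content` and `stream_chat` are illustrative names, not part of llm-host.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(base_url: str, model: str, prompt: str) -> Iterator[str]:
    """POST a streaming chat request and yield the reply piece by piece."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",  # placeholder key, as in the SDK example
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # An HTTPResponse iterates line by line, which matches the SSE framing.
        yield from iter_sse_content(resp)
```

Against a local server this might be used as `for piece in stream_chat("http://localhost:5001/v1", "Qwen3.5-2B", "Hello!"): print(piece, end="", flush=True)`. The OpenAI SDK offers the same behaviour through `stream=True` on `client.chat.completions.create`.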
Model configuration
LLM (reasoning / chat)
Set via --model or MODEL= env var. Default is Qwen/Qwen3.5-2B — runs on T4, no HuggingFace token required.
# Default — T4-friendly, no HF token needed
llm-host
# Larger Qwen3 variants (A100 recommended)
MODEL=Qwen/Qwen3-8B llm-host
MODEL=Qwen/Qwen3-14B llm-host
MODEL=Qwen/Qwen3-32B llm-host
# Llama 3.1 (gated — requires HF token + accepted licence)
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_xxx llm-host
# AWQ-quantized Llama (lower VRAM, still needs A100 for large-v3 Whisper)
MODEL=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 QUANTIZATION=awq llm-host
The model is served under its last path component by default (e.g. Qwen3.5-2B).
Override with --served-model-name / SERVED_MODEL_NAME=.
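The naming rule above can be pictured with a small sketch. This is not llm-host's actual code, only an illustration of "last path component unless overridden"; the function name `served_model_name` is hypothetical.

```python
from typing import Optional


def served_model_name(model_id: str, override: Optional[str] = None) -> str:
    """Return the name clients should pass as `model` in API calls."""
    if override:
        return override  # corresponds to --served-model-name / SERVED_MODEL_NAME
    # Otherwise use the last path component of the HuggingFace model ID.
    return model_id.rstrip("/").split("/")[-1]
```

For example, `served_model_name("meta-llama/Llama-3.1-8B-Instruct")` gives `"Llama-3.1-8B-Instruct"`, which is the string a client would then send in the `model` field.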
Whisper (speech-to-text)
Set via --whisper-model or WHISPER_MODEL= env var. Default: small.
| Model | VRAM | Speed | Accuracy |
|---|---|---|---|
| `tiny` | ~1 GB | fastest | lowest |
| `base` | ~1 GB | fast | low |
| `small` | ~2 GB | fast | good |
| `medium` | ~5 GB | moderate | better |
| `large-v2` | ~10 GB | slow | high |
| `large-v3` | ~10 GB | slow | highest |
| `large-v3-turbo` | ~6 GB | fast | high |
WHISPER_MODEL=small llm-host # default, T4-friendly (~2 GB VRAM)
WHISPER_MODEL=medium llm-host # better accuracy, ~5 GB VRAM
WHISPER_MODEL=large-v3 llm-host # highest accuracy, ~10 GB VRAM (A100)
WHISPER_MODEL=large-v3-turbo llm-host # good balance on A100
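A back-of-the-envelope way to choose a Whisper size is to look at what vLLM leaves free. With the default `--gpu-memory-utilization` of 0.82 on a 15 GB T4, roughly 15 × (1 − 0.82) ≈ 2.7 GB remains, which fits `small` (~2 GB) but not `medium` (~5 GB). The sketch below automates that arithmetic under loud assumptions: the VRAM figures are the approximate values from the table, the memory outside vLLM's share is treated as fully available to Whisper, and Kokoro and CUDA overhead are ignored. `pick_whisper_model` is a hypothetical helper, not an llm-host function.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above.
WHISPER_VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2, "medium": 5,
    "large-v2": 10, "large-v3": 10, "large-v3-turbo": 6,
}

# Most accurate first; turbo is listed before large-v2 because the table
# rates both "high" and turbo needs less memory.
PREFERENCE = [
    "large-v3", "large-v3-turbo", "large-v2",
    "medium", "small", "base", "tiny",
]


def pick_whisper_model(
    total_vram_gb: float, gpu_memory_utilization: float = 0.82
) -> Optional[str]:
    """Return the most accurate Whisper model that fits beside vLLM, if any."""
    leftover = total_vram_gb * (1.0 - gpu_memory_utilization)
    for name in PREFERENCE:
        if WHISPER_VRAM_GB[name] <= leftover:
            return name
    return None  # nothing fits; lower the vLLM fraction or use a bigger GPU
```

Under these assumptions a 40 GB card at the default fraction leaves about 7.2 GB, so the helper suggests `large-v3-turbo`. Treat the result as a starting point and confirm real usage with `nvidia-smi`.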
CLI options
llm-host --help
--ngrok-token ngrok authtoken (optional; omit for localhost/LAN only)
--hf-token HuggingFace token (needed only for gated models)
--model HuggingFace model ID (default: Qwen/Qwen3.5-2B)
--served-model-name name used in API calls (default: last part of --model)
--quantization awq | bitsandbytes | none (default: none)
--whisper-model tiny | base | small | medium | large-v1 | large-v2 | large-v3 | large-v3-turbo
(default: small)
--tts-voice alloy | echo | fable | onyx | nova | shimmer (default: alloy)
--vllm-port internal vLLM port (default: 8000)
--gateway-port public gateway port (default: 5001)
--gpu-memory-utilization vLLM GPU memory fraction (default: 0.82)
--max-model-len context length (default: 8192)
--no-vllm skip starting vLLM (use existing instance)
All flags can also be set via UPPER_SNAKE_CASE environment variables:
MODEL=Qwen/Qwen3-14B \
WHISPER_MODEL=large-v3-turbo \
NGROK_TOKEN=xxx \
llm-host
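The flag-to-variable convention can be sketched as follows. The name mapping follows the UPPER_SNAKE_CASE rule stated above; the precedence shown (explicit flag, then environment, then default) is an assumption, since the README does not say which wins when both are set. `flag_to_env` and `resolve_option` are hypothetical helpers.

```python
import os
from typing import Mapping, Optional


def flag_to_env(flag: str) -> str:
    """Map a CLI flag such as --whisper-model to WHISPER_MODEL."""
    return flag.lstrip("-").replace("-", "_").upper()


def resolve_option(
    flag: str,
    cli_value: Optional[str] = None,
    default: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick a value for `flag`: CLI first, then env var, then default (assumed order)."""
    if cli_value is not None:
        return cli_value
    return environ.get(flag_to_env(flag), default)
```

For instance, `flag_to_env("--gpu-memory-utilization")` yields `"GPU_MEMORY_UTILIZATION"`, the variable name one would export before running `llm-host`.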
TTS voices
| Voice | Character | Kokoro name |
|---|---|---|
| `alloy` | Neutral female | `af_heart` |
| `echo` | Male | `am_echo` |
| `fable` | British female | `bf_emma` |
| `onyx` | Deep male | `am_adam` |
| `nova` | Energetic female | `af_nova` |
| `shimmer` | Soft female | `af_bella` |
Raw Kokoro voice names (e.g. af_sky) are also accepted directly.
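The voice handling described above amounts to an alias lookup with pass-through. A minimal sketch, using the mapping from the table and a hypothetical `resolve_voice` helper (the server's real lookup may differ, for example by validating unknown names):

```python
# OpenAI voice alias -> Kokoro voice name, as listed in the table above.
OPENAI_TO_KOKORO = {
    "alloy": "af_heart",
    "echo": "am_echo",
    "fable": "bf_emma",
    "onyx": "am_adam",
    "nova": "af_nova",
    "shimmer": "af_bella",
}


def resolve_voice(name: str) -> str:
    """Translate an OpenAI alias to its Kokoro voice; pass raw Kokoro names through."""
    return OPENAI_TO_KOKORO.get(name, name)
```

So a request with `voice="nova"` would be synthesized with `af_nova`, while `voice="af_sky"` is used unchanged.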
License
MIT