OpenAI-compatible inference server: LLM (via vLLM) + Whisper + Kokoro TTS, optionally exposed via ngrok

llm-host

OpenAI-compatible inference server that runs an LLM (via vLLM), Whisper transcription/translation, and Kokoro TTS on a GPU and exposes them all at a single URL — optionally via ngrok for a public endpoint.

Default model: Qwen/Qwen3.5-2B + whisper-small — tuned to run on a T4 GPU (15 GB VRAM). Swap to larger models via --model / --whisper-model or env vars.

Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.


Install

pip install llm-host

# vLLM must be installed separately (GPU/CUDA-specific build)
pip install "vllm>=0.6.0"

# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng   # Debian/Ubuntu/Colab

Quickstart

With ngrok (public URL):

llm-host \
  --ngrok-token YOUR_NGROK_TOKEN \
  --hf-token    YOUR_HF_TOKEN

Without ngrok (localhost / LAN only):

llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001  and  http://<server-ip>:5001

Or with environment variables:

NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host

Without --ngrok-token the server binds to 0.0.0.0 and prints both the localhost and network IP URLs. Pass --ngrok-token to get a public ngrok URL.
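To confirm the server is reachable after startup, you can call the /health and /v1/models endpoints. The sketch below uses only the Python standard library. The helper names (endpoint_url, get_json) are illustrative and not part of llm-host, and the code assumes both endpoints return JSON, which this README does not state explicitly for /health.

```python
import json
import urllib.request


def endpoint_url(base: str, path: str) -> str:
    """Join a base URL and an endpoint path without doubling slashes."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def get_json(base: str, path: str, timeout: float = 10.0):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(endpoint_url(base, path), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_server(base: str = "http://localhost:5001") -> None:
    """Print the service status and the list of served models."""
    print(get_json(base, "/health"))
    print(get_json(base, "/v1/models"))
```

Call check_server() for a local server, or pass the ngrok URL printed at startup to check the public endpoint.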


Endpoints

Method  Path                       Description
GET     /                          Dashboard UI
GET     /health                    Service status
GET     /v1/models                 List models
POST    /v1/chat/completions       LLM chat (streaming supported)
POST    /v1/audio/transcriptions   Whisper STT (keep source language)
POST    /v1/audio/translations     Whisper STT → English
POST    /v1/audio/speech           Kokoro TTS

All API endpoints are OpenAI-compatible: pass the server URL plus /v1 (the ngrok URL, or http://localhost:5001/v1 when running without ngrok) as base_url to any OpenAI SDK.

from openai import OpenAI

client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",
)

# Chat  (model name = --served-model-name; defaults to the last part of --model)
resp = client.chat.completions.create(
    model="Qwen3.5-2B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

# Transcription
with open("audio.wav", "rb") as f:
    text = client.audio.transcriptions.create(model="whisper-1", file=f)

# TTS
client.audio.speech.create(
    model="tts-1", input="Hello!", voice="nova"
).stream_to_file("out.mp3")
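The endpoint table notes that chat supports streaming. A minimal sketch of consuming a streamed reply with the OpenAI Python SDK follows; it assumes the gateway relays standard chat-completion chunks, as vLLM's OpenAI-compatible server does. The helper names iter_text and stream_chat are illustrative only.

```python
def iter_text(chunks):
    """Yield the non-empty text deltas from a chat-completions stream."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


def stream_chat(base_url: str, model: str, prompt: str) -> str:
    """Print a streamed reply as it arrives and return the full text."""
    from openai import OpenAI  # imported lazily so iter_text works without the SDK

    client = OpenAI(base_url=base_url, api_key="dummy")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for text in iter_text(stream):
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts)
```

With the default configuration this would be called as stream_chat("http://localhost:5001/v1", "Qwen3.5-2B", "Hello!"); substitute your own served model name if you changed --model or --served-model-name.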

Model configuration

LLM (reasoning / chat)

Set via --model or MODEL= env var. Default is Qwen/Qwen3.5-2B — runs on T4, no HuggingFace token required.

# Default — T4-friendly, no HF token needed
llm-host

# Larger Qwen3 variants (A100 recommended)
MODEL=Qwen/Qwen3-8B  llm-host
MODEL=Qwen/Qwen3-14B llm-host
MODEL=Qwen/Qwen3-32B llm-host

# Llama 3.1 (gated — requires HF token + accepted licence)
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_xxx llm-host

# AWQ-quantized Llama (lower VRAM, still needs A100 for large-v3 Whisper)
MODEL=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 QUANTIZATION=awq llm-host

The model is served under its last path component by default (e.g. Qwen3.5-2B). Override with --served-model-name / SERVED_MODEL_NAME=.
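The naming rule above can be sketched as a small function. This is an illustration of the documented behaviour, not llm-host's actual code; the function name served_model_name is hypothetical.

```python
from typing import Optional


def served_model_name(model_id: str, override: Optional[str] = None) -> str:
    """Return the name clients should pass as `model` in API calls.

    An explicit override (--served-model-name) wins; otherwise the last
    path component of the HuggingFace model ID is used.
    """
    if override:
        return override
    return model_id.rstrip("/").rsplit("/", 1)[-1]
```

For example, served_model_name("meta-llama/Llama-3.1-8B-Instruct") gives the string to use as model= in the client when no override is set.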

Whisper (speech-to-text)

Set via --whisper-model or WHISPER_MODEL= env var. Default: small.

Model            VRAM     Speed      Accuracy
tiny             ~1 GB    fastest    lowest
base             ~1 GB    fast       low
small            ~2 GB    fast       good
medium           ~5 GB    moderate   better
large-v2         ~10 GB   slow       high
large-v3         ~10 GB   slow       highest
large-v3-turbo   ~6 GB    fast       high

WHISPER_MODEL=small       llm-host   # default, T4-friendly (~2 GB VRAM)
WHISPER_MODEL=medium      llm-host   # better accuracy, ~5 GB VRAM
WHISPER_MODEL=large-v3    llm-host   # highest accuracy, ~10 GB VRAM (A100)
WHISPER_MODEL=large-v3-turbo llm-host  # good balance on A100 (~6 GB VRAM)
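If you are unsure which Whisper model fits alongside your LLM, the table above can be turned into a simple lookup. The sketch below picks the most accurate model whose approximate VRAM need fits a given budget. The VRAM figures are the rough values from the table, the accuracy ordering (in particular large-v2 ahead of large-v3-turbo, both rated "high") is a judgment call, and pick_whisper_model is a hypothetical helper rather than part of llm-host.

```python
from typing import Optional

# (model name, approximate VRAM in GB), ordered from most to least accurate.
WHISPER_BY_ACCURACY = [
    ("large-v3", 10),
    ("large-v2", 10),
    ("large-v3-turbo", 6),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_whisper_model(vram_budget_gb: float) -> Optional[str]:
    """Return the most accurate model that fits the budget, or None."""
    for name, needed_gb in WHISPER_BY_ACCURACY:
        if needed_gb <= vram_budget_gb:
            return name
    return None
```

The budget here is the VRAM left for Whisper after the LLM's share (see --gpu-memory-utilization), so on a T4 running the default LLM it will usually be only a few GB.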

CLI options

llm-host --help

  --ngrok-token              ngrok authtoken (optional; omit for localhost/LAN only)
  --hf-token                 HuggingFace token (needed only for gated models)
  --model                    HuggingFace model ID (default: Qwen/Qwen3.5-2B)
  --served-model-name        name used in API calls (default: last part of --model)
  --quantization             awq | bitsandbytes | none  (default: none)
  --whisper-model            tiny | base | small | medium | large-v1 | large-v2 | large-v3 | large-v3-turbo
                             (default: small)
  --tts-voice                alloy | echo | fable | onyx | nova | shimmer  (default: alloy)
  --vllm-port                internal vLLM port (default: 8000)
  --gateway-port             public gateway port (default: 5001)
  --gpu-memory-utilization   vLLM GPU memory fraction (default: 0.82)
  --max-model-len            context length (default: 8192)
  --no-vllm                  skip starting vLLM (use existing instance)

All flags can also be set via UPPER_SNAKE_CASE environment variables:

MODEL=Qwen/Qwen3-14B \
WHISPER_MODEL=large-v3-turbo \
NGROK_TOKEN=xxx \
llm-host
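The flag-to-variable mapping can be sketched as follows, for example when launching llm-host from a Python script or a Colab cell. The README confirms MODEL, WHISPER_MODEL, NGROK_TOKEN, HF_TOKEN, QUANTIZATION and SERVED_MODEL_NAME; names derived for other flags follow the stated UPPER_SNAKE_CASE rule and are an assumption, as is how boolean flags such as --no-vllm are expressed. The helpers flag_to_env and build_env are hypothetical.

```python
import os


def flag_to_env(flag: str) -> str:
    """Convert a CLI flag like '--whisper-model' to 'WHISPER_MODEL'."""
    return flag.lstrip("-").replace("-", "_").upper()


def build_env(options: dict) -> dict:
    """Return a copy of the current environment with the given options set.

    Keys may be written as flags ('--model') or bare names ('model').
    """
    env = dict(os.environ)
    for flag, value in options.items():
        env[flag_to_env(flag)] = str(value)
    return env
```

The resulting dictionary can be passed to subprocess.run(["llm-host"], env=build_env({"--model": "Qwen/Qwen3-14B", "--whisper-model": "large-v3-turbo"})).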

TTS voices

Voice     Character          Kokoro name
alloy     Neutral female     af_heart
echo      Male               am_echo
fable     British female     bf_emma
onyx      Deep male          am_adam
nova      Energetic female   af_nova
shimmer   Soft female        af_bella

Raw Kokoro voice names (e.g. af_sky) are also accepted directly.
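The voice handling described above amounts to an alias lookup with pass-through for raw names. A minimal sketch, using the mapping from the table, is shown below. The function name resolve_voice is illustrative, and whether llm-host validates unknown names or normalises case is not documented here, so this version does neither.

```python
# OpenAI-style voice aliases mapped to Kokoro voice names (from the table above).
VOICE_MAP = {
    "alloy": "af_heart",
    "echo": "am_echo",
    "fable": "bf_emma",
    "onyx": "am_adam",
    "nova": "af_nova",
    "shimmer": "af_bella",
}


def resolve_voice(name: str) -> str:
    """Map an OpenAI voice alias to its Kokoro name; pass raw names through."""
    return VOICE_MAP.get(name, name)
```

With this mapping, a request with voice="nova" and one with voice="af_nova" select the same Kokoro voice.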


License

MIT
