Skip to main content

OpenAI-compatible inference server: Llama 3.1 8B + Whisper + Kokoro TTS exposed via ngrok

Project description

llm-host

OpenAI-compatible inference server that runs Llama 3.1 8B (via vLLM), Whisper large-v3 transcription, and Kokoro TTS on a GPU and exposes them all at a single public URL via ngrok.

Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.


Install

pip install llm-host

# vLLM must be installed separately (GPU/CUDA-specific build)
pip install "vllm>=0.6.0"

# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng   # Debian/Ubuntu/Colab

Quickstart

With ngrok (public URL):

llm-host \
  --ngrok-token YOUR_NGROK_TOKEN \
  --hf-token    YOUR_HF_TOKEN

Without ngrok (localhost / LAN only):

llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001  and  http://<server-ip>:5001

Or with environment variables:

NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host

Without --ngrok-token the server binds to 0.0.0.0 and prints both the localhost and network IP URLs. Pass --ngrok-token to get a public ngrok URL.


Endpoints

Method Path Description
GET / Dashboard UI
GET /health Service status
GET /v1/models List models
POST /v1/chat/completions Llama 3.1 8B (streaming supported)
POST /v1/audio/transcriptions Whisper large-v3
POST /v1/audio/speech Kokoro TTS

All API endpoints are OpenAI-compatible — drop in the ngrok URL as base_url with any OpenAI SDK.

from openai import OpenAI

client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",
)

# Chat
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Transcription
with open("audio.wav", "rb") as f:
    text = client.audio.transcriptions.create(model="whisper-1", file=f)

# TTS
client.audio.speech.create(
    model="tts-1", input="Hello!", voice="nova"
).stream_to_file("out.mp3")

CLI options

llm-host --help

  --ngrok-token              ngrok authtoken (optional; omit for localhost/LAN only)
  --hf-token                 HuggingFace token (for gated models)
  --model                    HuggingFace model ID (default: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
  --quantization             awq | bitsandbytes | none  (default: awq)
  --whisper-model            tiny | base | small | medium | large-v3  (default: large-v3)
  --tts-voice                alloy | echo | fable | onyx | nova | shimmer  (default: alloy)
  --vllm-port                internal vLLM port (default: 8000)
  --gateway-port             public gateway port (default: 5001)
  --gpu-memory-utilization   vLLM GPU memory fraction (default: 0.82)
  --max-model-len            context length (default: 8192)
  --no-vllm                  skip starting vLLM (use existing instance)

All flags can also be set via UPPER_SNAKE_CASE environment variables.


TTS voices

Voice Character Kokoro name
alloy Neutral female af_heart
echo Male am_echo
fable British female bf_emma
onyx Deep male am_adam
nova Energetic female af_nova
shimmer Soft female af_bella

Raw Kokoro voice names (e.g. af_sky) are also accepted directly.


License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_host-0.1.0.tar.gz (15.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_host-0.1.0-py3-none-any.whl (15.6 kB view details)

Uploaded Python 3

File details

Details for the file llm_host-0.1.0.tar.gz.

File metadata

  • Download URL: llm_host-0.1.0.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for llm_host-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6b9f86d3295a38aef81b479bfaa06340cae2084161ebe4f187324ee30034779a
MD5 929ec2b14e7926486aa666a71a304bf4
BLAKE2b-256 eb3997186b415250ac83cf35b32410d9641a796b17af00163ba902e038a41cb9

See more details on using hashes here.

File details

Details for the file llm_host-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: llm_host-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for llm_host-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5a29fc9c193ee5b23ca8423d37bf625dcb5b31b6e727f3927520c830cd3c20cd
MD5 a9add26f1919aa5f4ac35ef33c15721b
BLAKE2b-256 02dbad43489af3e713891caa4e39dcdabb894eeb70b82040cd1d7ddf9966a912

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page