OpenAI-compatible inference server: Llama 3.1 8B + Whisper + Kokoro TTS exposed via ngrok

llm-host

OpenAI-compatible inference server that runs Llama 3.1 8B (via vLLM), Whisper large-v3 transcription, and Kokoro TTS on a GPU and exposes them all at a single public URL via ngrok.

Designed for Google Colab (T4 / L4 / A100) but works on any GPU machine with CUDA.

Install

pip install llm-host

# vLLM must be installed separately (GPU/CUDA-specific build)
pip install "vllm>=0.6.0"

# Kokoro TTS requires espeak-ng for phonemization
apt-get install -y espeak-ng   # Debian/Ubuntu/Colab
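
Before the first launch it can help to confirm that both separately installed prerequisites are present. The following is a minimal sketch, not part of llm-host: the helper name `missing_prerequisites` is hypothetical, and it assumes the vLLM package is importable as `vllm` and that the phonemizer binary is on PATH as `espeak-ng`.

```python
import importlib.util
import shutil


def missing_prerequisites(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return a list of problems; an empty list means both checks passed.

    `which` and `find_spec` are injectable so the check can be exercised
    without touching the real environment.
    """
    problems = []
    # vLLM is installed separately from llm-host (GPU/CUDA-specific build).
    if find_spec("vllm") is None:
        problems.append('vllm is not importable: pip install "vllm>=0.6.0"')
    # Kokoro TTS needs the espeak-ng binary for phonemization.
    if which("espeak-ng") is None:
        problems.append("espeak-ng not found on PATH: apt-get install -y espeak-ng")
    return problems


for problem in missing_prerequisites():
    print("missing:", problem)
```

This only checks that the pieces can be found; it does not verify that the installed vLLM build matches the local CUDA version.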

Quickstart

With ngrok (public URL):

llm-host \
  --ngrok-token YOUR_NGROK_TOKEN \
  --hf-token    YOUR_HF_TOKEN

Without ngrok (localhost / LAN only):

llm-host --hf-token YOUR_HF_TOKEN
# accessible at http://localhost:5001  and  http://<server-ip>:5001

Or with environment variables:

NGROK_TOKEN=xxx HF_TOKEN=xxx llm-host

Without --ngrok-token, the server binds to 0.0.0.0 (all network interfaces) and prints both the localhost and network-IP URLs. Pass --ngrok-token to get a public ngrok URL.

Endpoints

Method	Path	Description
`GET`	`/`	Dashboard UI
`GET`	`/health`	Service status
`GET`	`/v1/models`	List models
`POST`	`/v1/chat/completions`	Llama 3.1 8B (streaming supported)
`POST`	`/v1/audio/transcriptions`	Whisper large-v3
`POST`	`/v1/audio/speech`	Kokoro TTS
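
Model loading can take a while, so a client may want to poll /health before sending requests. The sketch below uses only the standard library. It assumes that /health answers with HTTP 200 once the service is ready, which the table above does not state explicitly, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def health_url(base_url):
    """Join a gateway base URL and /health without doubling slashes."""
    return base_url.rstrip("/") + "/health"


def wait_until_healthy(base_url, timeout=300.0, interval=5.0, fetch=None):
    """Poll GET /health until it returns HTTP 200 or `timeout` seconds pass.

    `fetch` takes a URL and returns an HTTP status code; it defaults to a
    urllib-based implementation and can be replaced for testing.
    """
    def default_fetch(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(health_url(base_url)) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (or returned an error status)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example (assumes the default gateway port from the Quickstart):
# wait_until_healthy("http://localhost:5001")
```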

All API endpoints are OpenAI-compatible, so any OpenAI SDK works once base_url is set to the ngrok URL plus /v1:

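from openai import OpenAI

# Point the SDK at the gateway; the /v1 suffix is part of the base URL.
client = OpenAI(
    base_url="https://<ngrok-id>.ngrok-free.app/v1",
    api_key="dummy",  # placeholder; the SDK requires some value
)

# Chat
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

# Transcription (the SDK returns an object; the text is in .text)
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)

# TTS: stream the audio response straight to a file
with client.audio.speech.with_streaming_response.create(
    model="tts-1", input="Hello!", voice="nova"
) as response:
    response.stream_to_file("out.mp3")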
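
The endpoint table notes that chat completions support streaming. As a rough sketch of what consuming such a stream looks like without the SDK, the code below posts a request with "stream": true and parses the response. It assumes the server emits the same server-sent event framing as the OpenAI API (`data: {...}` lines ending with `data: [DONE]`), which follows from the compatibility claim but is not spelled out here. The function names are hypothetical, and the Authorization header simply mirrors the dummy key used above.

```python
import json
import urllib.request


def iter_delta_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments carry no payload
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(base_url, prompt, model="llama-3.1-8b-instruct"):
    """POST a streaming chat request and print tokens as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,  # supplying data makes this a POST
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for text in iter_delta_text(resp):  # the response iterates line by line
            print(text, end="", flush=True)
    print()


# Example: stream_chat("https://<ngrok-id>.ngrok-free.app/v1", "Hello!")
```

With the OpenAI SDK the equivalent is to pass stream=True to client.chat.completions.create and iterate over the returned chunks.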

CLI options

llm-host --help

  --ngrok-token              ngrok authtoken (optional; omit for localhost/LAN only)
  --hf-token                 HuggingFace token (for gated models)
  --model                    HuggingFace model ID (default: hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4)
  --quantization             awq | bitsandbytes | none  (default: awq)
  --whisper-model            tiny | base | small | medium | large-v3  (default: large-v3)
  --tts-voice                alloy | echo | fable | onyx | nova | shimmer  (default: alloy)
  --vllm-port                internal vLLM port (default: 8000)
  --gateway-port             public gateway port (default: 5001)
  --gpu-memory-utilization   vLLM GPU memory fraction (default: 0.82)
  --max-model-len            context length (default: 8192)
  --no-vllm                  skip starting vLLM (use existing instance)

All flags can also be set via UPPER_SNAKE_CASE environment variables, for example NGROK_TOKEN for --ngrok-token and HF_TOKEN for --hf-token.
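
When launching llm-host from a script or notebook, the flag-to-variable naming can be applied mechanically. This sketch assumes the mapping is a plain conversion with no prefix (strip the leading dashes, replace hyphens with underscores, uppercase), which matches the NGROK_TOKEN and HF_TOKEN examples in the Quickstart. The helper names are hypothetical, and how a valueless flag such as --no-vllm is expressed as a variable is not documented here, so it is left out.

```python
import os


def flag_to_env(flag):
    """Convert a CLI flag such as '--hf-token' to 'HF_TOKEN'."""
    return flag.lstrip("-").replace("-", "_").upper()


def env_for_flags(options, base_env=None):
    """Return a copy of the environment with one variable per flag.

    `options` maps flag names to values; values are stringified because
    environment variables are always strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    for flag, value in options.items():
        env[flag_to_env(flag)] = str(value)
    return env


# Example launch (requires llm-host to be installed):
# import subprocess
# subprocess.run(["llm-host"], env=env_for_flags({
#     "--hf-token": "YOUR_HF_TOKEN",
#     "--max-model-len": 8192,
# }))
```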

TTS voices

Voice	Character	Kokoro name
`alloy`	Neutral female	af_heart
`echo`	Male	am_echo
`fable`	British female	bf_emma
`onyx`	Deep male	am_adam
`nova`	Energetic female	af_nova
`shimmer`	Soft female	af_bella

Raw Kokoro voice names (e.g. af_sky) are also accepted directly.
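
The alias behaviour described above can be pictured as a dictionary lookup with a pass-through for unknown names. This is an illustrative sketch built from the table, not the package's actual code, and `resolve_voice` is a hypothetical name.

```python
# OpenAI-style voice alias -> Kokoro voice name, copied from the table above.
VOICE_ALIASES = {
    "alloy": "af_heart",
    "echo": "am_echo",
    "fable": "bf_emma",
    "onyx": "am_adam",
    "nova": "af_nova",
    "shimmer": "af_bella",
}


def resolve_voice(name):
    """Map an alias to its Kokoro name; pass raw Kokoro names through unchanged."""
    return VOICE_ALIASES.get(name, name)


print(resolve_voice("nova"))    # alias from the table
print(resolve_voice("af_sky"))  # raw Kokoro name, returned as-is
```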

License

MIT

Release history

0.2.6

May 18, 2026

0.2.5

May 18, 2026

0.2.4

May 18, 2026

0.2.3

May 18, 2026

0.2.2

May 18, 2026

0.2.1

May 15, 2026

0.2.0

May 15, 2026

0.1.0 (this version)

May 13, 2026

Download files

Source Distribution

llm_host-0.1.0.tar.gz (15.9 kB view details)

Uploaded May 13, 2026 Source

Built Distribution

llm_host-0.1.0-py3-none-any.whl (15.6 kB view details)

Uploaded May 13, 2026 Python 3

File details

Details for the file llm_host-0.1.0.tar.gz.

File metadata

Download URL: llm_host-0.1.0.tar.gz
Upload date: May 13, 2026
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for llm_host-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6b9f86d3295a38aef81b479bfaa06340cae2084161ebe4f187324ee30034779a`
MD5	`929ec2b14e7926486aa666a71a304bf4`
BLAKE2b-256	`eb3997186b415250ac83cf35b32410d9641a796b17af00163ba902e038a41cb9`
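
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the function names are hypothetical, and the file is read in chunks so large downloads do not need to fit in memory.

```python
import hashlib

# SHA256 digest listed above for llm_host-0.1.0.tar.gz
EXPECTED_SHA256 = "6b9f86d3295a38aef81b479bfaa06340cae2084161ebe4f187324ee30034779a"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals `expected` (case-insensitive)."""
    return sha256_of(path) == expected.lower()


# Example: matches("llm_host-0.1.0.tar.gz")
```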

File details

Details for the file llm_host-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_host-0.1.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 15.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for llm_host-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5a29fc9c193ee5b23ca8423d37bf625dcb5b31b6e727f3927520c830cd3c20cd`
MD5	`a9add26f1919aa5f4ac35ef33c15721b`
BLAKE2b-256	`02dbad43489af3e713891caa4e39dcdabb894eeb70b82040cd1d7ddf9966a912`
