
Run any Hugging Face GGUF model on your own GPU — chat/vision, text-to-speech (OuteTTS), and image generation (SDXL/Flux/Z-Image/Qwen-Image via stable-diffusion.cpp), all behind one OpenAI-compatible endpoint. Pulls official upstream binaries, no compiling. Type `inferhost` and you're done.


🛰️ inferhost

Your own private, multi-modal AI server — one command, any GPU, no compiling.


Chat · Vision · Speech · Image generation — all behind one OpenAI-compatible endpoint.


inferhost turns any GPU box into a private AI server. It wraps llama.cpp and stable-diffusion.cpp behind a single OpenAI-compatible endpoint — pulls the official upstream binaries for you (nothing to compile), auto-fetches the right model files when you paste a Hugging Face link, and hot-swaps models in and out of VRAM so one card can serve a big LLM and image generation. You only ever touch a keyboard-driven dashboard (and an optional .env).

⚡ Quick start

uv tool install inferhost      # or:  pipx install inferhost  /  pip install inferhost
inferhost                      # opens the dashboard — press 'a' to add a model

That's the whole setup. First launch fetches the runtime binaries automatically. To add a model, press a and paste a Hugging Face repo — inferhost lists the files, downloads what's needed, and serves it. Then call it like OpenAI:

  • 🗣️ Chat / LLM: paste Qwen/Qwen2.5-7B-Instruct-GGUF
  • 🔊 Text-to-speech: paste OuteAI/OuteTTS-0.2-500M-GGUF
  • 🎨 Image generation: paste OlegSkutte/sdxl-turbo-GGUF

# 🗣️  Chat  →  /v1/chat/completions
curl http://localhost:9001/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"<name-from-dashboard>","messages":[{"role":"user","content":"Hello!"}]}'

# 🔊  Speech  →  /v1/audio/speech   (returns WAV)
curl http://localhost:9001/v1/audio/speech -H 'Content-Type: application/json' \
  -d '{"model":"<name>","input":"Hello from inferhost.","voice":"default"}' --output speech.wav

# 🎨  Image  →  /v1/images/generations   (returns base64 PNG)
curl http://localhost:9001/v1/images/generations -H 'Content-Type: application/json' \
  -d '{"model":"<name>","prompt":"a red apple on a table","size":"512x512"}' \
  | jq -r '.data[0].b64_json' | base64 -d > out.png
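
The jq and base64 pipeline above can also be done from Python. This is a minimal sketch, assuming the response follows the OpenAI images shape implied by the curl example (a `data` list whose first item has a `b64_json` field). The helper names `extract_png` and `save_png` are illustrative and not part of inferhost.

```python
import base64
import json

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png(response_body: str) -> bytes:
    """Decode the first image from an images/generations JSON response.

    Assumes the OpenAI-style shape {"data": [{"b64_json": "..."}]}.
    """
    payload = json.loads(response_body)
    png_bytes = base64.b64decode(payload["data"][0]["b64_json"])
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with a PNG signature")
    return png_bytes


def save_png(response_body: str, path: str) -> None:
    """Write the decoded image to disk, like `base64 -d > out.png`."""
    with open(path, "wb") as fh:
        fh.write(extract_png(response_body))
```

Checking the signature before writing catches the case where the server returned an error body or a different format instead of an image.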

Everything lives on http://localhost:9001/v1 — point any OpenAI client (Python SDK, Open WebUI, your app) at it. The model name is whatever shows in the dashboard.
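
For a client with no third-party dependencies, the chat call can be sketched with the Python standard library alone. The base URL and request body mirror the curl example above. The function names are illustrative, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`), which this README does not spell out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9001/v1"  # endpoint given in the Quick start


def build_chat_request(model: str, user_message: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    body = json.dumps({
        "model": model,  # the name shown in the dashboard
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, user_message: str,
         base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; needs a running server.
    """
    request = build_chat_request(model, user_message, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running and a model added, `chat("<name-from-dashboard>", "Hello!")` should return the reply text. The generous timeout is a guess that allows for a model being loaded on the first request.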

✨ Why inferhost

  • One endpoint, every modality — chat, vision, speech, and images on the same OpenAI-compatible :9001. No per-model servers to wire up.
  • Nothing to compile — official llama-server / sd-server binaries are pulled from upstream for your hardware (NVIDIA Vulkan, ROCm, SYCL, CPU, Apple Metal).
  • Paste a link, it figures out the rest — picks the best quant for your VRAM, and for multi-file models (Flux, Z-Image, Qwen-Image) auto-downloads the right VAE + text encoders from known-good repos.
  • One GPU, many models — llama-swap lazy-loads and hot-swaps models in/out of VRAM on demand, so a 24 GB card serves a 27B LLM and Flux image generation.
  • TUI or headless — drive everything from a keyboard dashboard, or run inferhost start/stop/status on a server with no terminal.
  • Tuned by default — q8_0 KV-cache compression, stacked MTP + ngram speculative decoding, and honest context windows, all overridable from a .env.
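
To make the "picks the best quant for your VRAM" idea concrete, here is one plausible heuristic: choose the largest file that still fits after reserving some headroom. This is a hypothetical sketch, not inferhost's actual selection logic. The `QuantFile` type, the `headroom_gb` default, and the file sizes in the usage note are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantFile:
    name: str       # GGUF file name as listed in a repo
    size_gb: float  # file size in gigabytes


def pick_quant(files: list[QuantFile], vram_gb: float,
               headroom_gb: float = 2.0) -> Optional[QuantFile]:
    """Return the largest file that fits in VRAM minus headroom.

    headroom_gb is an assumed reserve for KV cache and runtime
    overhead; a real picker would likely estimate it from the
    context size rather than use a constant.
    """
    budget = vram_gb - headroom_gb
    fitting = [f for f in files if f.size_gb <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda f: f.size_gb)
```

For example, given made-up sizes of 8.1 GB (Q8_0), 5.4 GB (Q5_K_M), and 4.7 GB (Q4_K_M) on an 8 GB card, the budget is 6 GB and the Q5_K_M file is chosen. Larger files usually mean higher-precision weights, which is why this sketch prefers the biggest one that fits.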

🧩 Supported models

Modality | Models | How
Chat / Vision | any GGUF LLM (Qwen, Llama, Gemma, DeepSeek…), vision via mmproj | paste repo → pick quant
Speech (TTS) | OuteTTS | paste repo (vocoder auto-detected)
Image: single-file | SD 1.5, SDXL (incl. Turbo) | paste repo → pick file
Image: Flux.1 | schnell / dev | auto-fetches VAE + CLIP-L + T5XXL
Image: Flux.2 Klein | incl. Bonsai-Image (1-bit) | auto-fetches VAE + Qwen3-4B
Image: Z-Image | Z-Image-Turbo | auto-fetches VAE + Qwen3-4B
Image: Qwen-Image | Qwen-Image / Qwen-Image-Edit | auto-fetches VAE + Qwen2.5-VL + mmproj

All image families above were verified end-to-end on a Vulkan GPU (approximate times: SDXL-Turbo ~2 s, Flux-schnell ~4 s, Bonsai ~2 s, Z-Image-Turbo ~11 s; Qwen-Image-Edit was tested via /v1/images/edits).

📚 Documentation

Full guides live in the online docs (and in the repository's docs/ folder):

  • Installation — install, upgrade, uninstall, requirements
  • Usage — the dashboard, keyboard keys, and chat / TTS / image / Flux / Z-Image / Qwen-Image walkthroughs
  • Configuration — every .env variable, KV-cache quant, custom binaries
  • Troubleshooting — ports, tmux mouse, common errors

🏗️ Architecture

Your app ──HTTP──▶  LiteLLM gateway        llama-swap (loopback)       llama-server  (chat/vision)
                    :9001 (public)   ──▶    127.0.0.1:9090      ──┬──▶  sd-server     (images)
                                                                  └──▶  inferhost-tts (speech)
  • llama.cpp (llama-server) runs chat/vision inference — official upstream binary, backend auto-detected.
  • llama-swap fronts the model backends and lazy-loads / hot-swaps them on demand (loopback only). Image models (sd-server) ride here too, so they swap VRAM with LLMs.
  • inferhost-tts wraps llama.cpp's llama-tts behind /v1/audio/speech (started only when a TTS model is registered).
  • LiteLLM is the single always-on public gateway on :9001, routing each request to the right backend.

The extra engines (llama-tts, sd-server) are fetched automatically the first time you add a model that needs them.
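
The lazy-load and hot-swap behaviour described above can be pictured with a toy model. This sketch assumes only one model is resident in VRAM at a time, which is a simplification. It is not llama-swap's implementation, and the class and method names are invented for illustration.

```python
from typing import Optional


class SingleSlotSwapper:
    """Toy model of lazy loading with one model resident at a time."""

    def __init__(self) -> None:
        self.loaded: Optional[str] = None  # model currently "in VRAM"
        self.events: list[str] = []        # load/unload log

    def handle(self, model: str) -> str:
        """Serve a request, swapping models only when needed."""
        if self.loaded != model:
            if self.loaded is not None:
                self.events.append(f"unload {self.loaded}")
            self.events.append(f"load {model}")
            self.loaded = model
        return f"{model} served request"
```

Two chat requests followed by an image request log `load llm`, `unload llm`, `load flux`. Nothing loads until the first request arrives, repeated requests to the same model cost nothing extra, and switching modality pays one unload and one load.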

🛠️ Development

The repo ships a run.sh wrapper for source-tree work (end users never need it — they only type inferhost):

git clone git@github.com:amirrouh/inferhost.git && cd inferhost
./run.sh install     # venv + editable install
./run.sh start       # launch the TUI (downloads binaries on first run)
./run.sh status      # headless status
./run.sh stop        # stop daemons
./run.sh test        # pytest

Run ./run.sh help for the full list.

License

Apache 2.0
