Run any Hugging Face GGUF model on your own GPU — chat/vision, text-to-speech (OuteTTS), and image generation (SDXL/Flux/Z-Image/Qwen-Image via stable-diffusion.cpp), all behind one OpenAI-compatible endpoint. Pulls official upstream binaries, no compiling. Type `inferhost` and you're done.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

frankflorida

These details have not been verified by PyPI

Project description

🛰️ inferhost

Your own private, multi-modal AI server — one command, any GPU, no compiling.

Chat · Vision · Speech · Image generation — all behind one OpenAI-compatible endpoint.

inferhost turns any GPU box into a private AI server. It wraps llama.cpp and stable-diffusion.cpp behind a single OpenAI-compatible endpoint — pulls the official upstream binaries for you (nothing to compile), auto-fetches the right model files when you paste a Hugging Face link, and hot-swaps models in and out of VRAM so one card can serve a big LLM and image generation. You only ever touch a keyboard-driven dashboard (and an optional .env).

⚡ Quick start

uv tool install inferhost      # or:  pipx install inferhost  /  pip install inferhost
inferhost                      # opens the dashboard — press 'a' to add a model

That's the whole setup. First launch fetches the runtime binaries automatically. To add a model, press a and paste a Hugging Face repo — inferhost lists the files, downloads what's needed, and serves it. Then call it like OpenAI:

🗣️ Chat / LLM	🔊 Text-to-speech	🎨 Image generation
paste `Qwen/Qwen2.5-7B-Instruct-GGUF`	paste `OuteAI/OuteTTS-0.2-500M-GGUF`	paste `OlegSkutte/sdxl-turbo-GGUF`

# 🗣️  Chat  →  /v1/chat/completions
curl http://localhost:9001/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model":"<name-from-dashboard>","messages":[{"role":"user","content":"Hello!"}]}'

# 🔊  Speech  →  /v1/audio/speech   (returns WAV)
curl http://localhost:9001/v1/audio/speech -H 'Content-Type: application/json' \
  -d '{"model":"<name>","input":"Hello from inferhost.","voice":"default"}' --output speech.wav

# 🎨  Image  →  /v1/images/generations   (returns base64 PNG)
curl http://localhost:9001/v1/images/generations -H 'Content-Type: application/json' \
  -d '{"model":"<name>","prompt":"a red apple on a table","size":"512x512"}' \
  | jq -r '.data[0].b64_json' | base64 -d > out.png
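
The `jq` and `base64` pipeline above can also be done in Python. This is a minimal sketch, assuming the response carries the image at `data[0].b64_json` as the curl example implies; `decode_first_image` and `looks_like_png` are hypothetical helper names, not part of inferhost.

```python
import base64
import json

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte signature that starts every PNG file


def decode_first_image(response_text: str) -> bytes:
    """Return the raw image bytes from an images/generations JSON response."""
    payload = json.loads(response_text)
    # Same field the curl example extracts with jq: .data[0].b64_json
    return base64.b64decode(payload["data"][0]["b64_json"])


def looks_like_png(image_bytes: bytes) -> bool:
    """Cheap sanity check before writing the bytes to disk."""
    return image_bytes.startswith(PNG_MAGIC)
```

Writing the result of `decode_first_image(body)` to a file opened in `"wb"` mode gives the same `out.png` as the shell pipeline.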

Everything lives on http://localhost:9001/v1 — point any OpenAI client (Python SDK, Open WebUI, your app) at it. The model name is whatever shows in the dashboard.
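
For a client with no extra dependencies, the chat call can be sketched with the Python standard library. It assumes the server is running on the default `http://localhost:9001/v1` and returns the usual OpenAI chat-completion shape; `build_chat_request` and `send_chat` are illustrative names, not inferhost APIs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9001/v1"  # endpoint from the quick start


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        reply = json.load(resp)
    # Assumed response shape: the standard OpenAI chat-completion object.
    return reply["choices"][0]["message"]["content"]
```

Call `send_chat("<name-from-dashboard>", "Hello!")` once a model is registered; the model name is the one the dashboard shows.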

✨ Why inferhost

One endpoint, every modality — chat, vision, speech, and images on the same OpenAI-compatible :9001. No per-model servers to wire up.
Nothing to compile — official llama-server / sd-server binaries are pulled from upstream for your hardware (NVIDIA Vulkan, ROCm, SYCL, CPU, Apple Metal).
Paste a link, it figures out the rest — picks the best quant for your VRAM, and for multi-file models (Flux, Z-Image, Qwen-Image) auto-downloads the right VAE + text encoders from known-good repos.
One GPU, many models — llama-swap lazy-loads and hot-swaps models in/out of VRAM on demand, so a 24 GB card serves a 27B LLM and Flux image generation.
TUI or headless — drive everything from a keyboard dashboard, or run inferhost start/stop/status on a server with no terminal.
Tuned by default — q8_0 KV-cache compression, stacked MTP + ngram speculative decoding, and honest context windows, all overridable from a .env.
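
To make "picks the best quant for your VRAM" concrete, here is one plausible heuristic, sketched under stated assumptions: choose the largest file that still leaves some headroom. This is not inferhost's actual selection logic, which the README does not describe; the function name, the 2 GB headroom, and the idea of ranking by file size are all illustrative.

```python
from typing import Optional


def pick_quant(files: dict[str, float], vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Pick the largest GGUF file that fits in VRAM with some headroom.

    `files` maps a file name to its size in GB. Returns None if nothing fits.
    The headroom is a rough allowance for KV cache and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in files.items() if size <= budget}
    if not fitting:
        return None
    # Larger file is used as a stand-in for higher-quality quantization.
    return max(fitting, key=fitting.get)
```

With hypothetical sizes such as `{"q4": 4.7, "q8": 8.1, "f16": 15.2}`, a 12 GB card would get `q8` and a 24 GB card would get `f16`.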

🧩 Supported models

Modality	Models	How
Chat / Vision	any GGUF LLM (Qwen, Llama, Gemma, DeepSeek…), vision via `mmproj`	paste repo → pick quant
Speech (TTS)	OuteTTS	paste repo (vocoder auto-detected)
Image — single-file	SD 1.5, SDXL (incl. Turbo)	paste repo → pick file
Image — Flux.1	schnell / dev	auto-fetches VAE + CLIP-L + T5XXL
Image — Flux.2 Klein	incl. Bonsai-Image (1-bit)	auto-fetches VAE + Qwen3-4B
Image — Z-Image	Z-Image-Turbo	auto-fetches VAE + Qwen3-4B
Image — Qwen-Image	Qwen-Image / Qwen-Image-Edit	auto-fetches VAE + Qwen2.5-VL + mmproj

All image families above were verified end-to-end on a Vulkan GPU (SDXL-Turbo ~2 s, Flux-schnell ~4 s, Bonsai ~2 s, Z-Image-Turbo ~11 s, Qwen-Image-Edit via /v1/images/edits).
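
The "auto-fetches" column of the table can be read as a lookup from model family to companion components. The sketch below only restates the table; the dictionary keys and the function name are hypothetical labels, and the real repos and file names that inferhost downloads are not shown here.

```python
# Companion components per image family, copied from the table above.
COMPANIONS = {
    "flux.1": ["VAE", "CLIP-L", "T5XXL"],
    "flux.2-klein": ["VAE", "Qwen3-4B"],
    "z-image": ["VAE", "Qwen3-4B"],
    "qwen-image": ["VAE", "Qwen2.5-VL", "mmproj"],
}


def companion_files(family: str) -> list[str]:
    """Return the extra components a family needs; single-file families need none."""
    return list(COMPANIONS.get(family.lower(), []))
```

Single-file families such as SD 1.5 and SDXL fall through to an empty list, which matches the "paste repo → pick file" flow.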

📚 Documentation

Full guides live in docs (and in the docs/ folder):

Installation — install, upgrade, uninstall, requirements
Usage — the dashboard, keyboard keys, and chat / TTS / image / Flux / Z-Image / Qwen-Image walkthroughs
Configuration — every .env variable, KV-cache quant, custom binaries
Troubleshooting — ports, tmux mouse, common errors

🏗️ Architecture

Your app ──HTTP──▶  LiteLLM gateway  ──▶  llama-swap (loopback)  ──┬──▶  llama-server  (chat/vision)
                    :9001 (public)        127.0.0.1:9090           ├──▶  sd-server     (images)
                                                                   └──▶  inferhost-tts (speech)

llama.cpp (llama-server) runs chat/vision inference — official upstream binary, backend auto-detected.
llama-swap fronts the model backends and lazy-loads / hot-swaps them on demand (loopback only). Image models (sd-server) ride here too, so they swap VRAM with LLMs.
inferhost-tts wraps llama.cpp's llama-tts behind /v1/audio/speech (started only when a TTS model is registered).
LiteLLM is the single always-on public gateway on :9001, routing each request to the right backend.

The extra engines (llama-tts, sd-server) are fetched automatically the first time you add a model that needs them.
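
The lazy-load and hot-swap behavior described above can be pictured with a toy simulation. It is a deliberately simplified sketch that assumes only one model is resident in VRAM at a time; it does not reproduce llama-swap's real implementation or configuration, and `SwapSimulator` is a made-up name.

```python
class SwapSimulator:
    """Toy model of on-demand loading with a single resident model."""

    def __init__(self) -> None:
        self.resident = None  # name of the model currently in VRAM, if any
        self.events = []      # log of what happened, in order

    def request(self, model: str) -> None:
        """Handle a request for `model`, swapping only when needed."""
        if self.resident == model:
            self.events.append(f"reuse {model}")
            return
        if self.resident is not None:
            # Free VRAM before loading a different model.
            self.events.append(f"unload {self.resident}")
        self.events.append(f"load {model}")
        self.resident = model
```

Nothing is loaded until the first request arrives, and back-to-back requests for the same model pay no swap cost.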

🛠️ Development

The repo ships a run.sh wrapper for source-tree work (end users never need it — they only type inferhost):

git clone git@github.com:amirrouh/inferhost.git && cd inferhost
./run.sh install     # venv + editable install
./run.sh start       # launch the TUI (downloads binaries on first run)
./run.sh status      # headless status
./run.sh stop        # stop daemons
./run.sh test        # pytest

Run ./run.sh help for the full list.

License

Apache 2.0

Release history

0.7.10

Jun 1, 2026

0.7.9

Jun 1, 2026

0.7.8

Jun 1, 2026

0.7.7

Jun 1, 2026

0.7.6

Jun 1, 2026

This version

0.7.5

Jun 1, 2026

0.7.4

Jun 1, 2026

0.7.2

May 31, 2026

0.7.1

May 31, 2026

0.7.0

May 31, 2026

0.6.9

May 31, 2026

0.6.8

May 31, 2026

0.6.7

May 26, 2026

0.6.6

May 26, 2026

0.6.5

May 26, 2026

0.6.4

May 26, 2026

0.6.3

May 26, 2026

0.6.2

May 26, 2026

0.6.1

May 26, 2026

0.6.0

May 26, 2026

0.5.27

May 25, 2026

0.5.26

May 25, 2026

0.5.25

May 24, 2026

0.5.24

May 24, 2026

0.5.23

May 24, 2026

0.5.22

May 24, 2026

0.5.21

May 24, 2026

0.5.20

May 24, 2026

0.5.19

May 24, 2026

0.5.18

May 24, 2026

0.5.17

May 24, 2026

0.5.16

May 24, 2026

0.5.15

May 24, 2026

0.5.14

May 23, 2026

0.5.13

May 23, 2026

0.5.12

May 23, 2026

0.5.11

May 23, 2026

0.5.10

May 23, 2026

0.5.9

May 23, 2026

0.5.8

May 23, 2026

0.5.7

May 23, 2026

0.5.6

May 23, 2026

0.5.5

May 23, 2026

0.5.4

May 23, 2026

0.5.3

May 23, 2026

0.5.2

May 23, 2026

0.5.1

May 23, 2026

0.5.0

May 23, 2026

0.4.13

May 23, 2026

0.4.12

May 22, 2026

0.4.11

May 21, 2026

0.4.10

May 21, 2026

0.4.9

May 21, 2026

0.4.8

May 21, 2026

0.4.7

May 21, 2026

0.4.6

May 21, 2026

0.4.5

May 21, 2026

0.4.4

May 21, 2026

0.4.3

May 21, 2026

0.4.2

May 21, 2026

0.4.1

May 21, 2026

0.4.0

May 21, 2026

0.2.1

May 20, 2026

0.2.0

May 20, 2026

0.1.0

May 20, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inferhost-0.7.5.tar.gz (746.0 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

inferhost-0.7.5-py3-none-any.whl (98.0 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file inferhost-0.7.5.tar.gz.

File metadata

Download URL: inferhost-0.7.5.tar.gz
Upload date: Jun 1, 2026
Size: 746.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for inferhost-0.7.5.tar.gz
Algorithm	Hash digest
SHA256	`7d46fb7ac50a6f9c22ae2f7899a3a1e2e02e9180a3a2c9a02fa7c3bd60bff524`
MD5	`bd72e8a77ec4d8097f5b2c2ec8a83c76`
BLAKE2b-256	`ceac32e96b51ee1be37fe6af996f756578e1a71ac8bc10e328d38e66a9b2d2c0`
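
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch; the function names and the local file path you pass in are your own choices, not part of inferhost.

```python
import hashlib

# SHA256 digest of inferhost-0.7.5.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = "7d46fb7ac50a6f9c22ae2f7899a3a1e2e02e9180a3a2c9a02fa7c3bd60bff524"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify_download("inferhost-0.7.5.tar.gz", EXPECTED_SDIST_SHA256)` should return True for an intact copy of the source distribution.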

Provenance

The following attestation bundles were made for inferhost-0.7.5.tar.gz:

Publisher: publish.yml on amirrouh/inferhost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inferhost-0.7.5.tar.gz
- Subject digest: 7d46fb7ac50a6f9c22ae2f7899a3a1e2e02e9180a3a2c9a02fa7c3bd60bff524
- Sigstore transparency entry: 1688086613
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: amirrouh/inferhost@5a6424281f6615b61832bcd6768a00f160cbbf10
- Branch / Tag: refs/tags/v0.7.5
- Owner: https://github.com/amirrouh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a6424281f6615b61832bcd6768a00f160cbbf10
- Trigger Event: push

File details

Details for the file inferhost-0.7.5-py3-none-any.whl.

File metadata

Download URL: inferhost-0.7.5-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 98.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for inferhost-0.7.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16d2c8d19c6d3b0e10c3dd1622936d2e1a2e5fa5c1707be3937cc9be32f592a6`
MD5	`7842dcbfa72549103c740b47b80b8e0e`
BLAKE2b-256	`b8de61e750b81724a49875199b16930311ebc98802a52f01a883f89d3dc88fc5`

Provenance

The following attestation bundles were made for inferhost-0.7.5-py3-none-any.whl:

Publisher: publish.yml on amirrouh/inferhost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inferhost-0.7.5-py3-none-any.whl
- Subject digest: 16d2c8d19c6d3b0e10c3dd1622936d2e1a2e5fa5c1707be3937cc9be32f592a6
- Sigstore transparency entry: 1688086627
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: amirrouh/inferhost@5a6424281f6615b61832bcd6768a00f160cbbf10
- Branch / Tag: refs/tags/v0.7.5
- Owner: https://github.com/amirrouh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a6424281f6615b61832bcd6768a00f160cbbf10
- Trigger Event: push
