Run any Hugging Face GGUF model on your own hardware in one command. No configs or YAML to write.

inferhost

Run any Hugging Face GGUF model on your own machine in a single command. inferhost is a small Python framework that wraps llama.cpp, llama-swap, and (optionally) LiteLLM behind one CLI and a Textual TUI. Point it at a Hugging Face repository and it returns an OpenAI-compatible endpoint.

pip install inferhost
inferhost install
inferhost serve Qwen/Qwen2.5-7B-Instruct-GGUF
# OpenAI-compatible endpoint: http://localhost:9090/v1

Features

One-command serving of any GGUF model published on Hugging Face.
Automatic quantization selection based on available VRAM (Q6 → Q5 → Q4 → IQ4 fallback).
OpenAI-compatible API out of the box; works with the official SDKs and any compatible client.
Multi-model support via llama-swap, which lazy-loads model backends on demand.
Textual TUI for adding, starting, stopping, and tailing logs of registered models.
Auto-detected hardware: NVIDIA via Vulkan, AMD via ROCm, Intel via SYCL/OpenVINO, or CPU.
All defaults overridable through environment variables or a .env file.
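
The quantization fallback listed above can be illustrated with a short sketch. This is not inferhost's actual selection code: the `GgufFile` type, the `pick_quant` function, the prefix matching, and the 10% VRAM headroom are all assumptions made for the example. Only the preference order (Q6, Q5, Q4, IQ4) comes from the feature list.

```python
from dataclasses import dataclass

# Preference order from the feature list: Q6 -> Q5 -> Q4 -> IQ4.
# Matching by prefix is an assumption made for this sketch.
PREFERENCE = ("Q6", "Q5", "Q4", "IQ4")


@dataclass(frozen=True)
class GgufFile:
    """Hypothetical record for one GGUF file in a repository."""
    name: str        # e.g. "model-Q5_K_M.gguf"
    quant: str       # e.g. "Q5_K_M"
    size_bytes: int


def pick_quant(files, vram_bytes, headroom=0.9):
    """Return the most-preferred file that fits in VRAM, or None.

    `headroom` reserves part of VRAM for the KV cache and buffers;
    the 0.9 default is an illustrative guess, not a documented value.
    """
    budget = vram_bytes * headroom
    for prefix in PREFERENCE:
        candidates = [
            f for f in files
            if f.quant.upper().startswith(prefix) and f.size_bytes <= budget
        ]
        if candidates:
            # Within one quant family, prefer the largest file that fits.
            return max(candidates, key=lambda f: f.size_bytes)
    return None
```

With an 8 GiB card and files of 6, 5, 4, and 3 GiB, this sketch would return the Q6 file; on a 4 GiB card it would fall through to IQ4.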

Installation

Requirements: Python 3.11+, Linux or macOS. NVIDIA, AMD, Intel, or Apple Silicon GPUs are auto-detected; CPU-only is supported.

# Recommended
uv tool install inferhost

# Or with pip
pip install inferhost

# With the LiteLLM gateway (unified endpoint + routing + aliases)
pip install 'inferhost[gateway]'

Then download the runtime binaries (llama-server from llama.cpp, plus llama-swap):

inferhost install
inferhost doctor

inferhost doctor prints a summary of the detected hardware, installed binaries, and configured paths.

Usage

One-command serve

inferhost serve bartowski/SmolLM2-360M-Instruct-GGUF

inferhost will:

1. List the GGUF files in the repository.
2. Pick the highest-quality quant that fits in your VRAM.
3. Download it via the standard Hugging Face cache.
4. Register the model and render the llama-swap configuration.
5. Start llama-swap and expose the endpoint.
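
The first step above (listing GGUF files and reading their quant labels) might look roughly like the sketch below. The regular expression and the `gguf_quants` helper are illustrative assumptions, not inferhost's parser. In practice the list of file names could come from `huggingface_hub.list_repo_files(repo_id)`; here it is passed in directly so the example runs offline.

```python
import re

# Matches common GGUF quant labels such as Q4_K_M, Q6_K, IQ4_XS, Q8_0, F16.
# The pattern is an assumption for this sketch and may miss unusual names.
QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])(IQ\d+_[A-Z0-9_]+|Q\d+_[A-Z0-9_]+|BF16|F16|F32)(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def gguf_quants(filenames):
    """Map each .gguf file name to its upper-cased quant label.

    Files that are not GGUF, or whose name carries no recognizable
    label, are skipped.
    """
    found = {}
    for name in filenames:
        if not name.lower().endswith(".gguf"):
            continue
        match = QUANT_RE.search(name)
        if match:
            found[name] = match.group(1).upper()
    return found
```

The resulting mapping is the kind of input a VRAM-based selection step would need, together with each file's size.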

Test the endpoint with curl or any OpenAI-compatible client:

curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smollm2-360m-instruct-f16",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
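
The same request can be sent from Python using only the standard library. The sketch below assumes the default endpoint and the model name from the curl example; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090/v1"  # default llama-swap endpoint from this README


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, base_url=BASE_URL, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("smollm2-360m-instruct-f16", "Hello"))` should print the model's reply. The first call may take longer because the model is loaded on demand.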

Manage multiple models

inferhost add meta-llama/Llama-3.2-3B-Instruct-GGUF
inferhost add bartowski/gemma-2-9b-it-GGUF --quant Q5_K_M
inferhost ls
inferhost start

llama-swap loads each model on the first request and unloads it after a configurable idle period.
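
As a rough illustration of what a generated llama-swap configuration could contain, the sketch below renders a minimal YAML document with one entry per model. The `models`, `cmd`, and `ttl` keys and the `${PORT}` macro follow llama-swap's configuration layout as I understand it; the file inferhost actually writes may differ, and `render_swap_config` and the 300-second default are assumptions for this example.

```python
import json


def render_swap_config(models, idle_ttl_seconds=300):
    """Render a minimal llama-swap style YAML document as text.

    `models` maps a model name to the path of its GGUF file.
    `idle_ttl_seconds` is the idle period after which a model is unloaded;
    300 is an illustrative default, not a documented one.
    """
    lines = ["models:"]
    for name, gguf_path in sorted(models.items()):
        # llama-swap substitutes ${PORT} with the port it assigns.
        # Paths containing spaces would need extra quoting, omitted here.
        cmd = f"llama-server --port ${{PORT}} -m {gguf_path}"
        # json.dumps yields a double-quoted string, which is also valid YAML.
        lines.append(f"  {json.dumps(name)}:")
        lines.append(f"    cmd: {json.dumps(cmd)}")
        lines.append(f"    ttl: {idle_ttl_seconds}")
    return "\n".join(lines) + "\n"
```

Calling `render_swap_config({"smollm2": "/models/smollm2.gguf"}, idle_ttl_seconds=120)` produces a three-line entry under `models:` with the command and the idle timeout.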

TUI

inferhost tui

inferhost                              llama-swap http://localhost:9090/v1

Models                       Details
---------------------------  -------------------------------------------------
qwen2.5-7b-instruct          name:  qwen2.5-7b-instruct-q4-k-m
llama-3.2-3b-instruct        repo:  Qwen/Qwen2.5-7B-Instruct-GGUF
gemma-2-9b-it                quant: Q4_K_M    size: 4.4 GiB    ctx: 8192
                             port:  9091

                             Logs
                             llm_load_tensors: offloaded 33/33 layers to GPU
                             ...

a=add  s=start  x=stop  r=restart  d=remove  q=quit

The add-model modal accepts any Hugging Face repository ID, fetches available GGUF files, and highlights the recommended quantization for your hardware.

Command reference

Command	Description
`inferhost install`	First-time setup: download llama-server and llama-swap.
`inferhost serve <repo>`	Add a model and start serving it.
`inferhost add <repo>`	Add a model without starting llama-swap.
`inferhost start`	Start llama-swap.
`inferhost stop [--all]`	Stop llama-swap (and the gateway with `--all`).
`inferhost restart`	Restart llama-swap with the current configuration.
`inferhost ls`	List registered models.
`inferhost rm <name>`	Remove a model from the registry.
`inferhost logs <name> [-f]`	Show or follow logs.
`inferhost status`	Show daemon status.
`inferhost doctor`	Environment check.
`inferhost gateway start|stop`	Manage the LiteLLM gateway (optional).
`inferhost tui`	Launch the dashboard.

Configuration

Every setting can be overridden through environment variables or a .env file in the working directory. Copy .env.example to .env as a starting point; it contains the full list.

Variable	Default	Purpose
`INFERHOST_SWAP_PORT`	`9090`	llama-swap listen port (user-facing OpenAI endpoint).
`INFERHOST_GATEWAY_PORT`	`9001`	LiteLLM gateway port when enabled.
`INFERHOST_DATA_DIR`	`~/.local/share/inferhost`	Binaries, logs, and PID files.
`INFERHOST_CONFIG_DIR`	`~/.config/inferhost`	Model registry and generated YAML.
`INFERHOST_HF_CACHE`	`~/.cache/huggingface`	Hugging Face model cache.
`INFERHOST_GPU_LAYERS`	`99`	`-ngl` value passed to llama-server.
`INFERHOST_DEFAULT_CTX`	`8192`	Default context length for new models.
`INFERHOST_FLASH_ATTENTION`	`on`	`-fa` flag for llama-server.
`INFERHOST_LLAMACPP_BACKEND`	`auto`	Force a backend: `vulkan`, `cuda`, `rocm`, `sycl`, `openvino`, or `cpu`.
`INFERHOST_LLAMACPP_VERSION`	`latest`	Pin a specific llama.cpp release tag.
`INFERHOST_LLAMASWAP_VERSION`	`latest`	Pin a specific llama-swap release tag.
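
The override behavior can be sketched as a lookup that falls back to the defaults in the table above. This is a simplified illustration, not inferhost's settings loader: `resolve_settings` is a hypothetical name, the type conversions are assumptions, and .env file parsing is left out.

```python
import os
from pathlib import Path

# Defaults copied from the configuration table.
DEFAULTS = {
    "INFERHOST_SWAP_PORT": "9090",
    "INFERHOST_GATEWAY_PORT": "9001",
    "INFERHOST_DATA_DIR": "~/.local/share/inferhost",
    "INFERHOST_CONFIG_DIR": "~/.config/inferhost",
    "INFERHOST_HF_CACHE": "~/.cache/huggingface",
    "INFERHOST_GPU_LAYERS": "99",
    "INFERHOST_DEFAULT_CTX": "8192",
    "INFERHOST_FLASH_ATTENTION": "on",
    "INFERHOST_LLAMACPP_BACKEND": "auto",
    "INFERHOST_LLAMACPP_VERSION": "latest",
    "INFERHOST_LLAMASWAP_VERSION": "latest",
}

# Which variables to convert; chosen for this sketch.
INT_KEYS = {
    "INFERHOST_SWAP_PORT",
    "INFERHOST_GATEWAY_PORT",
    "INFERHOST_GPU_LAYERS",
    "INFERHOST_DEFAULT_CTX",
}
PATH_KEYS = {"INFERHOST_DATA_DIR", "INFERHOST_CONFIG_DIR", "INFERHOST_HF_CACHE"}


def resolve_settings(env=None):
    """Return every setting, taking overrides from `env` (default: os.environ)."""
    env = os.environ if env is None else env
    settings = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key, default)
        if key in INT_KEYS:
            settings[key] = int(raw)
        elif key in PATH_KEYS:
            settings[key] = Path(raw).expanduser()
        else:
            settings[key] = raw
    return settings
```

For example, `resolve_settings({"INFERHOST_SWAP_PORT": "8080"})` changes only the llama-swap port and leaves every other value at its default, which mirrors running `INFERHOST_SWAP_PORT=8080 inferhost start` in a shell.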

Architecture

   Client                inferhost                       Inference
   ------                ---------                       ---------
   Your app  --HTTP-->   llama-swap        spawns/kills  llama-server
                         :9090                           (llama.cpp)
                            ^
                            |
                  (optional) LiteLLM
                         :9001

llama.cpp runs the inference (using a prebuilt Vulkan, CUDA, ROCm, SYCL, OpenVINO, or CPU binary, whichever fits the host).
llama-swap sits in front of multiple llama-server instances and lazy-loads them on demand.
LiteLLM (optional) provides a unified gateway with friendly aliases, routing, rate limits, and fallbacks across local and hosted providers.
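
The lazy-loading behavior described above can be modeled with a small toy class. It tracks only when each model was last used and does not start any processes; the real llama-swap also manages ports, process lifecycles, and its own swapping policy. `LazyPool` and its methods are hypothetical names for this sketch.

```python
class LazyPool:
    """Toy model of on-demand loading with idle eviction (no real processes)."""

    def __init__(self, idle_ttl):
        self.idle_ttl = idle_ttl   # seconds a model may sit idle
        self.last_used = {}        # model name -> time of last request

    def request(self, model, now):
        """Record a request; return True if it had to load the model."""
        loaded_now = model not in self.last_used
        self.last_used[model] = now
        return loaded_now

    def reap(self, now):
        """Unload models idle for longer than idle_ttl; return their names."""
        stale = sorted(
            name for name, t in self.last_used.items()
            if now - t > self.idle_ttl
        )
        for name in stale:
            del self.last_used[name]
        return stale
```

Time is passed in explicitly so the behavior is easy to follow: the first request for a model loads it, later requests reuse it, and a model that stays idle past the timeout is unloaded and loaded again on its next request.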

Development

git clone git@github.com:amirrouh/inferhost.git
cd inferhost
./run.sh install            # creates venv, installs in editable mode, fetches binaries
./run.sh test               # runs pytest
./run.sh doctor             # confirms environment
./run.sh serve bartowski/SmolLM2-360M-Instruct-GGUF

run.sh is a thin wrapper around the inferhost CLI that activates the project venv and forwards arguments. Run ./run.sh help for the full list of dev shortcuts.

License

Apache 2.0

Release history

Version	Released
0.4.13	May 23, 2026
0.4.12	May 22, 2026
0.4.11	May 21, 2026
0.4.10	May 21, 2026
0.4.9	May 21, 2026
0.4.8	May 21, 2026
0.4.7	May 21, 2026
0.4.6	May 21, 2026
0.4.5	May 21, 2026
0.4.4	May 21, 2026
0.4.3	May 21, 2026
0.4.2	May 21, 2026
0.4.1	May 21, 2026
0.4.0	May 21, 2026
0.2.1	May 20, 2026
0.2.0	May 20, 2026
0.1.0 (this version)	May 20, 2026

Download files

Source Distribution

inferhost-0.1.0.tar.gz (27.6 kB)

Uploaded May 20, 2026 Source

Built Distribution

inferhost-0.1.0-py3-none-any.whl (34.4 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file inferhost-0.1.0.tar.gz.

File metadata

Download URL: inferhost-0.1.0.tar.gz
Upload date: May 20, 2026
Size: 27.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.12 (uv publish, Ubuntu 24.04 noble)

File hashes

Hashes for inferhost-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`60fae3f8154330bcf9bb32eddb565d567e5ab610a5ffe08d48d489f144865202`
MD5	`6bef515520169e4be82f853d0fa641d7`
BLAKE2b-256	`07c636f8fd1c084c49251f442b59cd8be50960df7e29981b14693ad91c53a72a`

File details

Details for the file inferhost-0.1.0-py3-none-any.whl.

File metadata

Download URL: inferhost-0.1.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 34.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.12 (uv publish, Ubuntu 24.04 noble)

File hashes

Hashes for inferhost-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b74a92e047bc8aa29a8c6438516540720d9ca1c333e778bfbab85357495b4b2e`
MD5	`ab523e57ac7f400e4d49f602fed8e164`
BLAKE2b-256	`d3289fd7dbe33303fefa0c156eb372d2d9a97e1bfbb0bd403b3f430383bf806f`
