Run any Hugging Face model on your own GPU in one command. No configs, no YAML.
inferhost
Run any Hugging Face GGUF model on your own machine in a single command. inferhost is a small Python framework that wraps llama.cpp, llama-swap, and (optionally) LiteLLM behind one CLI and a Textual TUI. Point it at a Hugging Face repository and it returns an OpenAI-compatible endpoint.
pip install inferhost
inferhost install
inferhost serve Qwen/Qwen2.5-7B-Instruct-GGUF
# OpenAI-compatible endpoint: http://localhost:9090/v1
Features
- One-command serving of any GGUF model published on Hugging Face.
- Automatic quantization selection based on available VRAM (Q6 → Q5 → Q4 → IQ4 fallback).
- OpenAI-compatible API out of the box; works with the official SDKs and any compatible client.
- Multi-model support via llama-swap, which lazy-loads model backends on demand.
- Textual TUI for adding, starting, stopping, and tailing logs of registered models.
- Auto-detected hardware: NVIDIA via Vulkan, AMD via ROCm, Intel via SYCL/OpenVINO, or CPU.
- All defaults overridable through environment variables or a .env file.
Installation
Requirements: Python 3.11+, Linux or macOS. NVIDIA, AMD, Intel, or Apple Silicon GPUs are auto-detected; CPU-only is supported.
# Recommended
uv tool install inferhost
# Or with pip
pip install inferhost
# With the LiteLLM gateway (unified endpoint + routing + aliases)
pip install 'inferhost[gateway]'
Then download the runtime binaries (llama-server from llama.cpp, plus llama-swap):
inferhost install
inferhost doctor
doctor prints a summary of detected hardware, installed binaries, and configured paths.
Usage
One-command serve
inferhost serve bartowski/SmolLM2-360M-Instruct-GGUF
inferhost will:
- List the GGUF files in the repository.
- Pick the highest-quality quant that fits in your VRAM.
- Download it via the standard Hugging Face cache.
- Register the model and render the llama-swap configuration.
- Start llama-swap and expose the endpoint.
Test the endpoint with any OpenAI client:
curl http://localhost:9090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "smollm2-360m-instruct-f16",
"messages": [{"role": "user", "content": "Hello"}]
}'
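The same request can be sent from Python using only the standard library. This is a sketch under a few assumptions: the server is listening on the default port 9090, the model name matches the curl example, and the response follows the usual OpenAI chat-completion shape. The helper names `build_chat_request` and `chat` are illustrative and not part of inferhost.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090/v1"  # default llama-swap port in this README


def build_chat_request(model: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request) as resp:
        payload = json.load(resp)
    # Standard OpenAI chat-completion response shape.
    return payload["choices"][0]["message"]["content"]
```

With the server running, `chat("smollm2-360m-instruct-f16", "Hello")` should return the model's reply as a string.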
Manage multiple models
inferhost add meta-llama/Llama-3.2-3B-Instruct-GGUF
inferhost add bartowski/gemma-2-9b-it-GGUF --quant Q5_K_M
inferhost ls
inferhost start
llama-swap loads each model on the first request and unloads it after a configurable idle period.
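As a mental model of that behaviour, the toy class below loads a model on its first request and drops it once it has been idle for longer than a set period. It is purely illustrative: `LazyModelPool` is a hypothetical name, and nothing here reflects how llama-swap is actually implemented or configured.

```python
import time
from typing import Callable


class LazyModelPool:
    """Toy model of load-on-first-request, unload-after-idle behaviour."""

    def __init__(self, idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_seconds = idle_seconds
        self.clock = clock
        # Maps each currently loaded model to the time of its last request.
        self.last_used: dict[str, float] = {}

    def evict_idle(self) -> list[str]:
        """Unload every model idle for longer than the configured period."""
        now = self.clock()
        idle = [name for name, last in self.last_used.items()
                if now - last > self.idle_seconds]
        for name in idle:
            del self.last_used[name]
        return idle

    def request(self, model: str) -> bool:
        """Record a request; return True if the model had to be loaded."""
        self.evict_idle()
        loaded_now = model not in self.last_used
        self.last_used[model] = self.clock()
        return loaded_now
```

The practical consequence is that the first request after an idle period pays the model load time again, while requests in quick succession do not.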
TUI
inferhost tui
inferhost llama-swap http://localhost:9090/v1
Models Details
--------------------------- -------------------------------------------------
qwen2.5-7b-instruct name: qwen2.5-7b-instruct-q4-k-m
llama-3.2-3b-instruct repo: Qwen/Qwen2.5-7B-Instruct-GGUF
gemma-2-9b-it quant: Q4_K_M size: 4.4 GiB ctx: 8192
port: 9091
Logs
llm_load_tensors: offloaded 33/33 layers to GPU
...
a=add s=start x=stop r=restart d=remove q=quit
The add-model modal accepts any Hugging Face repository ID, fetches available GGUF files, and highlights the recommended quantization for your hardware.
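The name, repo, and quant fields in the dashboard mock-up, together with the model name used in the curl example, suggest how registered names relate to the repository ID. The helper below reproduces that pattern. It is inferred from those two examples only; `model_name` is a hypothetical function and the real naming logic may differ, so `inferhost ls` remains the reliable source for registered names.

```python
def model_name(repo_id: str, quant: str) -> str:
    """Guess the registered model name for a repository and quant.

    Pattern inferred from this README's examples: drop the owner and a
    trailing "-GGUF", append the quant with "_" replaced by "-", lowercase.
    """
    base = repo_id.split("/")[-1]
    if base.upper().endswith("-GGUF"):
        base = base[: -len("-GGUF")]
    return f"{base}-{quant.replace('_', '-')}".lower()
```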
Command reference
| Command | Description |
|---|---|
| inferhost install | First-time setup: download llama-server and llama-swap. |
| inferhost serve <repo> | Add a model and start serving it. |
| inferhost add <repo> | Add a model without starting llama-swap. |
| inferhost start | Start llama-swap. |
| inferhost stop [--all] | Stop llama-swap (and the gateway with --all). |
| inferhost restart | Restart llama-swap with the current configuration. |
| inferhost ls | List registered models. |
| inferhost rm <name> | Remove a model from the registry. |
| inferhost logs <name> [-f] | Show or follow logs. |
| inferhost status | Show daemon status. |
| inferhost doctor | Environment check. |
| inferhost gateway start\|stop | Manage the LiteLLM gateway (optional). |
| inferhost tui | Launch the dashboard. |
Configuration
Every setting is overridable through environment variables or a .env file in the working directory. Copy .env.example for the full list.
| Variable | Default | Purpose |
|---|---|---|
| INFERHOST_SWAP_PORT | 9090 | llama-swap listen port (user-facing OpenAI endpoint). |
| INFERHOST_GATEWAY_PORT | 9001 | LiteLLM gateway port when enabled. |
| INFERHOST_DATA_DIR | ~/.local/share/inferhost | Binaries, logs, and PID files. |
| INFERHOST_CONFIG_DIR | ~/.config/inferhost | Model registry and generated YAML. |
| INFERHOST_HF_CACHE | ~/.cache/huggingface | Hugging Face model cache. |
| INFERHOST_GPU_LAYERS | 99 | -ngl value passed to llama-server. |
| INFERHOST_DEFAULT_CTX | 8192 | Default context length for new models. |
| INFERHOST_FLASH_ATTENTION | on | -fa flag for llama-server. |
| INFERHOST_LLAMACPP_BACKEND | auto | Force a backend: vulkan, cuda, rocm, sycl, openvino, or cpu. |
| INFERHOST_LLAMACPP_VERSION | latest | Pin a specific llama.cpp release tag. |
| INFERHOST_LLAMASWAP_VERSION | latest | Pin a specific llama-swap release tag. |
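To illustrate how defaults, a .env file, and environment variables might layer, here is a small resolver covering a few of the variables above. It is a sketch, not inferhost's loader: the function names are hypothetical, the .env parser handles only simple KEY=VALUE lines, and the precedence (environment over .env over defaults) is an assumption that this README does not confirm.

```python
from collections.abc import Mapping

# Defaults copied from the configuration table in this README (subset).
DEFAULTS = {
    "INFERHOST_SWAP_PORT": "9090",
    "INFERHOST_GATEWAY_PORT": "9001",
    "INFERHOST_GPU_LAYERS": "99",
    "INFERHOST_DEFAULT_CTX": "8192",
    "INFERHOST_FLASH_ATTENTION": "on",
    "INFERHOST_LLAMACPP_BACKEND": "auto",
}
INT_KEYS = {
    "INFERHOST_SWAP_PORT",
    "INFERHOST_GATEWAY_PORT",
    "INFERHOST_GPU_LAYERS",
    "INFERHOST_DEFAULT_CTX",
}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(env: Mapping[str, str],
                     dotenv_text: str = "") -> dict[str, object]:
    """Layer defaults < .env < environment (assumed precedence)."""
    merged = dict(DEFAULTS)
    for source in (parse_dotenv(dotenv_text), env):
        for key in DEFAULTS:
            if key in source:
                merged[key] = source[key]
    # Numeric settings are converted to int; the rest stay as strings.
    return {k: int(v) if k in INT_KEYS else v for k, v in merged.items()}
```

Passing `os.environ` as `env` and the contents of a local .env file as `dotenv_text` gives the effective values for these keys under the assumed precedence.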
Architecture
Client inferhost Inference
------ --------- ---------
Your app --HTTP--> llama-swap spawns/kills llama-server
:9090 (llama.cpp)
^
|
(optional) LiteLLM
:9001
- llama.cpp runs the inference (using a prebuilt Vulkan, CUDA, ROCm, SYCL, OpenVINO, or CPU binary, whichever fits the host).
- llama-swap sits in front of multiple llama-server instances and lazy-loads them on demand.
- LiteLLM (optional) provides a unified gateway with friendly aliases, routing, rate limits, and fallbacks across local and hosted providers.
Development
git clone git@github.com:amirrouh/inferhost.git
cd inferhost
./run.sh install # creates venv, installs in editable mode, fetches binaries
./run.sh test # runs pytest
./run.sh doctor # confirms environment
./run.sh serve bartowski/SmolLM2-360M-Instruct-GGUF
run.sh is a thin wrapper around the inferhost CLI that activates the project venv and forwards arguments. Run ./run.sh help for the full list of dev shortcuts.
License
Apache 2.0