Run any Hugging Face GGUF model on your own GPU — TUI only. Type `inferhost` and you're done.

These details have been verified by PyPI

Maintainers

frankflorida

These details have not been verified by PyPI

Project description

inferhost

📖 Full documentation: https://amirrouh.github.io/inferhost/

Run any Hugging Face GGUF model on your own machine — TUI only. inferhost is a small Python framework that wraps llama.cpp, llama-swap, and (optionally) LiteLLM behind a single Textual TUI. Point it at a Hugging Face repository and it returns an OpenAI-compatible endpoint.

[Screenshot: inferhost TUI dashboard]

pip install inferhost
inferhost

That's it. The first launch downloads the runtime binaries (llama-server + llama-swap) for you with a progress bar; then the dashboard opens and you can add, start, stop, and inspect models from the keyboard.

What it does

One-key serving of any GGUF model published on Hugging Face.
Automatic quantization selection based on available VRAM (Q6 → Q5 → Q4 → IQ4 fallback).
OpenAI-compatible API out of the box; works with the official SDKs and any compatible client.
Multi-model support via llama-swap, which lazy-loads model backends on demand.
Auto-detected hardware: NVIDIA via Vulkan, AMD via ROCm, Intel via SYCL/OpenVINO, or CPU.
Live download progress for both runtime binaries and Hugging Face model files.
All defaults are overridable through environment variables or a `.env` file.

Installation

Requirements: Python 3.11+, Linux or macOS. NVIDIA, AMD, Intel, or Apple Silicon GPUs are auto-detected; CPU-only is supported.

# Recommended
uv tool install inferhost

# Or with pip
pip install inferhost

# With the LiteLLM gateway (unified endpoint + routing + aliases)
pip install 'inferhost[gateway]'

Usage

There is exactly one command:

inferhost

This opens the TUI. On first launch it downloads llama-server and llama-swap with a progress bar. Afterward you land on the dashboard.

Keys

Key	Action
`a`	Add a Hugging Face model (downloads the GGUF with a progress bar)
`s`	Start llama-swap
`x`	Stop llama-swap
`r`	Restart llama-swap
`d` / `Delete`	Remove the highlighted model from the registry
`R`	Refresh
`q`	Quit

Adding a model

Press `a`, type a Hugging Face repo id (e.g. `Qwen/Qwen2.5-7B-Instruct-GGUF`), and press Enter. The TUI lists the available GGUF files, marks the recommended quant for your hardware, and shows a live progress bar while it downloads. Once the download finishes, the model is registered with llama-swap and ready to serve.

Endpoint

The dashboard shows the current OpenAI-compatible endpoint, e.g. `http://localhost:9090/v1`. Use the value from the dashboard's model name column as the `model` field in any OpenAI-compatible client:

curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct-q4-k-m",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
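
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming llama-swap is listening on the default port 9090 and that the model name matches the one shown in the dashboard. The helper names `build_chat_request` and `chat` are illustrative and not part of inferhost.

```python
import json
import urllib.request

# Default endpoint when INFERHOST_SWAP_PORT is left at 9090.
BASE_URL = "http://localhost:9090/v1"


def build_chat_request(
    model: str, prompt: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(
    model: str, prompt: str, base_url: str = BASE_URL, timeout: float = 120.0
) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    # The first call to a model can be slow while its backend is loaded on demand.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With llama-swap running, `chat("qwen2.5-7b-instruct-q4-k-m", "Hello")` should return the reply as a string. The official OpenAI SDK should work the same way if its base URL is pointed at the endpoint shown in the dashboard.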

Configuration

Every setting can be overridden through environment variables or a `.env` file in the working directory. `.env.example` contains the full list and can be copied as a starting point.

Variable	Default	Purpose
`INFERHOST_SWAP_PORT`	`9090`	llama-swap listen port (user-facing OpenAI endpoint).
`INFERHOST_GATEWAY_PORT`	`9001`	LiteLLM gateway port when enabled.
`INFERHOST_DATA_DIR`	`~/.local/share/inferhost`	Binaries, logs, and PID files.
`INFERHOST_CONFIG_DIR`	`~/.config/inferhost`	Model registry and generated YAML.
`INFERHOST_HF_CACHE`	`~/.cache/huggingface`	Hugging Face model cache.
`INFERHOST_GPU_LAYERS`	`99`	`-ngl` value passed to llama-server.
`INFERHOST_DEFAULT_CTX`	`8192`	Default context length for new models.
`INFERHOST_FLASH_ATTENTION`	`on`	`-fa` flag for llama-server.
`INFERHOST_LLAMACPP_BACKEND`	auto	Force a backend: `vulkan`, `cuda`, `rocm`, `sycl`, `openvino`, or `cpu`.
`INFERHOST_LLAMACPP_VERSION`	`latest`	Pin a specific llama.cpp release tag.
`INFERHOST_LLAMASWAP_VERSION`	`latest`	Pin a specific llama-swap release tag.

Architecture

   Client                inferhost                       Inference
   ------                ---------                       ---------
   Your app  --HTTP-->   llama-swap        spawns/kills  llama-server
                         :9090                           (llama.cpp)
                            ^
                            |
                  (optional) LiteLLM
                         :9001

llama.cpp runs the inference (using a prebuilt Vulkan, CUDA, ROCm, SYCL, OpenVINO, or CPU binary, whichever fits the host).
llama-swap sits in front of multiple llama-server instances and lazy-loads them on demand.
LiteLLM (optional) provides a unified gateway with friendly aliases, routing, rate limits, and fallbacks across local and hosted providers.
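
The "spawns/kills" arrow in the diagram can be modeled with a toy example. This is a sketch of the general lazy-loading idea only, not llama-swap's implementation: the `LazyBackends` class, its `max_loaded` limit, and the eviction policy are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable


class LazyBackends:
    """Toy model of on-demand backends: start on first use, evict the oldest."""

    def __init__(
        self,
        start: Callable[[str], object],   # e.g. would spawn a llama-server process
        stop: Callable[[object], None],   # e.g. would terminate that process
        max_loaded: int = 1,              # assumed limit on concurrently loaded models
    ) -> None:
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._start = start
        self._stop = stop
        self._max_loaded = max_loaded
        self._loaded: OrderedDict[str, object] = OrderedDict()

    def get(self, model: str) -> object:
        """Return a handle for `model`, loading it first if necessary."""
        if model in self._loaded:
            self._loaded.move_to_end(model)  # mark as most recently used
            return self._loaded[model]
        while len(self._loaded) >= self._max_loaded:
            _, oldest = self._loaded.popitem(last=False)
            self._stop(oldest)
        handle = self._start(model)
        self._loaded[model] = handle
        return handle


# Record what would happen instead of spawning real processes.
events: list[tuple[str, str]] = []
pool = LazyBackends(
    start=lambda name: events.append(("start", name)) or name,
    stop=lambda handle: events.append(("stop", str(handle))),
)
pool.get("model-a")  # first request: backend is started
pool.get("model-a")  # second request: already loaded, nothing happens
pool.get("model-b")  # different model: model-a is stopped, model-b started
```

After these three calls, `events` holds a start for `model-a`, then a stop for `model-a`, then a start for `model-b`.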

Development

The repo ships a run.sh wrapper for source-tree work:

git clone git@github.com:amirrouh/inferhost.git
cd inferhost
./run.sh install     # creates venv, installs in editable mode
./run.sh start       # launches the TUI (downloads binaries on first run)
./run.sh status      # headless status print
./run.sh stop        # stop daemons
./run.sh test        # run pytest

Run ./run.sh help for the full list. End users do not need run.sh — they only ever type inferhost.

License

Apache 2.0

Release history

Version	Released
`0.4.13`	May 23, 2026
`0.4.12`	May 22, 2026
`0.4.11`	May 21, 2026
`0.4.10`	May 21, 2026
`0.4.9`	May 21, 2026
`0.4.8`	May 21, 2026
`0.4.7`	May 21, 2026
`0.4.6`	May 21, 2026
`0.4.5`	May 21, 2026
`0.4.4`	May 21, 2026
`0.4.3`	May 21, 2026
`0.4.2`	May 21, 2026
`0.4.1`	May 21, 2026
`0.4.0`	May 21, 2026
`0.2.1` (this version)	May 20, 2026
`0.2.0`	May 20, 2026
`0.1.0`	May 20, 2026

Download files

Source Distribution

inferhost-0.2.1.tar.gz (478.1 kB)

Uploaded May 20, 2026 Source

Built Distribution

inferhost-0.2.1-py3-none-any.whl (34.6 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file inferhost-0.2.1.tar.gz.

File metadata

Download URL: inferhost-0.2.1.tar.gz
Upload date: May 20, 2026
Size: 478.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for inferhost-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`683dfe0051abb7c886f1036718ca4f45be16d69493f5f347d1be516a6da2a4b4`
MD5	`ff0cf43eb64f35785a792e26cb669f36`
BLAKE2b-256	`f36b908b79171bfb788e03b0c6408189415e71a037a3542311a9d8504b027963`
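
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch. The helper names `sha256_of` and `verify` are illustrative, and the local file path is an assumption about where the sdist was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest for inferhost-0.2.1.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = (
    "683dfe0051abb7c886f1036718ca4f45be16d69493f5f347d1be516a6da2a4b4"
)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify(Path("inferhost-0.2.1.tar.gz"), EXPECTED_SDIST_SHA256)` should return `True` for an intact download in the current directory. pip can run the same check itself through its hash-checking mode.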

Provenance

The following attestation bundles were made for inferhost-0.2.1.tar.gz:

Publisher: publish.yml on amirrouh/inferhost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inferhost-0.2.1.tar.gz
- Subject digest: 683dfe0051abb7c886f1036718ca4f45be16d69493f5f347d1be516a6da2a4b4
- Sigstore transparency entry: 1587977841
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: amirrouh/inferhost@56a330a7ff2a556d1ef4de4f941573899a04baee
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/amirrouh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@56a330a7ff2a556d1ef4de4f941573899a04baee
- Trigger Event: push

File details

Details for the file inferhost-0.2.1-py3-none-any.whl.

File metadata

Download URL: inferhost-0.2.1-py3-none-any.whl
Upload date: May 20, 2026
Size: 34.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for inferhost-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1e36ddfcb50b286366ef39cc366608a805a00ed9191ff58256c113cb43c37a48`
MD5	`0299352b11e37a4b9b56447cc33d08d0`
BLAKE2b-256	`4d2c0683ecf042ebd8f738a919f96f1adcf42cfc91c2ea1b675aac7701ee07ad`

Provenance

The following attestation bundles were made for inferhost-0.2.1-py3-none-any.whl:

Publisher: publish.yml on amirrouh/inferhost

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: inferhost-0.2.1-py3-none-any.whl
- Subject digest: 1e36ddfcb50b286366ef39cc366608a805a00ed9191ff58256c113cb43c37a48
- Sigstore transparency entry: 1587977872
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: amirrouh/inferhost@56a330a7ff2a556d1ef4de4f941573899a04baee
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/amirrouh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@56a330a7ff2a556d1ef4de4f941573899a04baee
- Trigger Event: push

inferhost 0.2.1
