
openhost

Run local LLMs from Python. LangChain-compatible. No desktop app required.

openhost is a thin Python SDK that manages llama.cpp and mlx-lm servers as subprocesses, handles model downloads from HuggingFace, and plugs into LangChain like any other provider.

Install

pip install openhost

That one command installs the bundled llama.cpp backend (llama-cpp-python) and, on Apple Silicon, mlx-lm for native MLX inference. You can start running models immediately, with no extra setup, on:

  • macOS (Apple Silicon) — Metal GPU acceleration out of the box
  • Linux (CPU) — CPU baseline works
  • Windows (CPU) — CPU baseline works

GPU acceleration (NVIDIA / AMD)

pip can't pick the right CUDA/ROCm wheel for you at install time. After pip install openhost, run one additional line for your toolkit:

# NVIDIA CUDA 12.4
pip install --upgrade --force-reinstall llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

# AMD ROCm 5.7
pip install --upgrade --force-reinstall llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/rocm5.7

Once that wheel is installed, openhost auto-detects the GPU and tunes -ngl (number of layers offloaded) based on available VRAM.
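How the -ngl value is tuned isn't documented; a plausible sketch of a VRAM-based heuristic (the function name, per-layer size, and 1 GiB reserve are all assumptions, not openhost's actual code):

```python
def pick_ngl(vram_bytes: int, layer_count: int, bytes_per_layer: int,
             reserve_bytes: int = 1 << 30) -> int:
    """Offload as many layers as fit in VRAM, keeping a reserve
    (e.g. for the KV cache). Hypothetical heuristic, not openhost's code."""
    usable = max(0, vram_bytes - reserve_bytes)
    return min(layer_count, usable // bytes_per_layer)
```

With 200 MiB layers, an 8 GiB card fits all 32 layers of a typical 8B Q4 model, while a smaller card gets a partial offload.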

Optional extras

pip install 'openhost[whisper-mlx]'     # Apple Neural Engine whisper
pip install 'openhost[whisper-faster]'  # CUDA or CPU whisper (faster-whisper)

Power-user: use your own llama.cpp

If you already have an external llama-server binary on PATH, openhost prefers it over the bundled Python backend: faster startup and a more current llama.cpp build. No action needed; it is detected automatically.

Usage

Quickest path: chat

import openhost

llm = openhost.make_chat("qwen3.6-35b-mlx-turbo", streaming=True)
for chunk in llm.stream("Write a haiku about subprocess management."):
    print(chunk.content, end="", flush=True)

That one call auto-downloads the model on first run, starts the server, picks a free port, and returns a fully wired ChatOpenAI. No ports, no YAML, no gateway.
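The free-port pick needs no configuration: binding to port 0 asks the kernel for an unused port. A minimal sketch (openhost's internals may differ):

```python
import socket

def pick_free_port() -> int:
    """Let the OS choose an unused localhost port (port 0 = kernel's pick)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

There is a small race between closing this probe socket and the server binding the port, which is why runners typically retry on bind failure.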

Model management

openhost.list_presets()                         # all built-in presets
openhost.pull("qwen3.5-35b-uncensored")         # just download
openhost.run("qwen3.5-35b-uncensored")          # start (auto-pulls if needed)
openhost.running()                              # list active runners
openhost.stop("qwen3.5-35b-uncensored")
openhost.stop_all()                             # kill everything

Any HuggingFace model — auto-detect from the repo id

If the model isn't in the built-in presets, pass a Hugging Face repo id. openhost inspects the repo, picks the right backend (GGUF → llama.cpp, safetensors → MLX), picks a quant, and registers the model on the fly.

# Llama 3.1 8B Q4_K_M (default quant pick) — downloads + runs in one call
llm = openhost.make_chat("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF")

# Pick a specific quant
llm = openhost.make_chat("bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q5_K_M")

# MLX model on Apple Silicon
llm = openhost.make_chat("mlx-community/Qwen2.5-7B-Instruct-4bit")

# More control
from openhost import from_hf
preset = from_hf(
    "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # explicit file
    context_length=8192,
)

Register your own model:

from openhost import ModelPreset, register_preset

register_preset(ModelPreset(
    id="llama-3.1-8b-instruct-q6",
    display_name="Llama 3.1 8B Instruct (Q6_K)",
    backend="llama.cpp",
    hf_repo="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    primary_file="Meta-Llama-3.1-8B-Instruct-Q6_K.gguf",
    command_template=(
        "llama-server", "-m", "{path}/{primary_file}",
        "-c", "{context_length}", "--host", "127.0.0.1", "--port", "{port}",
        "--jinja", "-ngl", "99", "-fa", "on",
    ),
    context_length=8192,
))
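Rendering a command_template into argv is presumably plain str.format substitution over each part; a sketch with illustrative field values:

```python
# A template like the one registered above, with {placeholders}
template = (
    "llama-server", "-m", "{path}/{primary_file}",
    "-c", "{context_length}", "--port", "{port}",
)
fields = {
    "path": "/models/llama-3.1-8b",
    "primary_file": "Meta-Llama-3.1-8B-Instruct-Q6_K.gguf",
    "context_length": 8192,
    "port": 49152,
}
# Substitute every field into every part; parts without braces pass through
argv = [part.format(**fields) for part in template]
```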

Web search (LangChain tool)

from openhost import OpenHostSearchTool

tool = OpenHostSearchTool()  # keyless DuckDuckGo by default
print(tool.invoke("macOS 26 release date"))

# Use a different provider
from openhost.search import TavilyProvider
tool = OpenHostSearchTool(provider=TavilyProvider("tvly-..."))

# Plug into a LangGraph agent
from langgraph.prebuilt import create_react_agent
agent = create_react_agent(llm, tools=[OpenHostSearchTool()])

Transcription

import openhost

# Auto-picks mlx-whisper on Apple Silicon, faster-whisper elsewhere
result = openhost.transcribe("meeting.mp3")
print(result.text)

# As a LangChain document loader (verbose = per-segment Documents)
from openhost import OpenHostWhisper
docs = OpenHostWhisper("meeting.mp3", verbose=True).load()
for doc in docs:
    print(f"[{doc.metadata['start']:.1f}s] {doc.page_content}")

CLI

openhost list                            # show presets
openhost pull qwen3.5-35b-uncensored     # download
openhost run qwen3.5-35b-uncensored      # foreground until Ctrl-C

Built-in presets

id                       backend    size
qwen3.6-35b-mlx-turbo    mlx-lm     ~20 GB
qwen3.5-35b-uncensored   llama.cpp  ~30 GB
qwen3-8b-gguf            llama.cpp  ~5 GB

How it works

  • No HTTP gateway. make_chat() returns a ChatOpenAI pointed straight at the model's own OpenAI-compatible endpoint. Zero proxy overhead.
  • Automatic port allocation. Each runner picks a free localhost port. Users never touch ports.
  • Process-scoped lifecycle. When your Python process exits, all runners it started get cleaned up (SIGTERM on the process group, SIGKILL fallback).
  • Platform support. macOS, Linux, and Windows (CPU). MLX is Apple Silicon only; llama.cpp is cross-platform.
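The process-scoped lifecycle can be sketched with a process group plus an atexit hook (a POSIX-only sketch; openhost's internals may differ):

```python
import atexit
import os
import signal
import subprocess

def start_runner(cmd: list[str]) -> subprocess.Popen:
    """Start a server in its own process group and register cleanup so it
    dies with the Python process: SIGTERM first, SIGKILL as a fallback."""
    proc = subprocess.Popen(cmd, start_new_session=True)

    def _cleanup() -> None:
        if proc.poll() is not None:
            return  # already exited
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)      # polite shutdown
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(pgid, signal.SIGKILL)  # hard fallback

    atexit.register(_cleanup)
    return proc
```

start_new_session puts the server (and any children it spawns) in its own process group, so killpg reaches the whole tree, not just the direct child.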

License

MIT
