Local native and browser-backed AI inference bridge
xlocllm
xlocllm is a Python SDK for local AI inference. The default runtime is
native: Python starts a local supervisor, exposes an OpenAI-compatible
loopback API, and runs local engines such as llama.cpp/GGUF for LLMs and ONNX
Runtime for embeddings, rerankers, vision, audio, and other task models.
The browser/WebGPU runtime remains available with mode="web" and keeps the
old browser-backed behavior through MLC WebLLM and Transformers.js.
The goal is simple:
pip install xlocllm
Then:
import xlocllm
llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b")
runtime = xlocllm.runtime([llm])
runtime.run()
print(runtime.url) # http://127.0.0.1:1146/v1
print(runtime.chat("Say hello", temperature=0))
What It Does
- Starts a local FastAPI bridge on 127.0.0.1.
- Uses native mode by default; use xlocllm.mode = "web" or runtime(..., mode="web") for the browser runtime.
- Opens a small dashboard window for status and controls. Native mode uses a non-browser desktop monitor; web mode keeps the paired browser runtime window.
- Runs model engines locally in native mode, or inside the paired browser in web mode.
- Provides OpenAI-compatible /v1 endpoints for local clients.
- Supports LLMs, embeddings, rerankers, translation, TTS, vision, ASR, and more through a shared catalog.
- Provides local RAG with vector storage, embeddings, optional reranking, automatic LLM retrieval, and runtime.chatui().
- Keeps Python-side objects for models, units, runtimes, and bridges.
Install
pip install xlocllm
The package install stays light. In native mode, managed engine dependencies and
model artifacts are downloaded into the xlocllm cache on the first
runtime.run().
Optional OpenAI client helper:
pip install "xlocllm[openai]"
Development install from this repository:
python -m pip install -e .\python\xlocllm[dev,openai]
Quick Start
import xlocllm
runtime = xlocllm.runtime(
    [
        xlocllm.unit("LLM", "Qwen-3.5-0.8b"),
        xlocllm.unit("embedding", "multilingual-e5-small"),
    ]
)
runtime.install()
runtime.run()
print(runtime.status())
OpenAI-Compatible Usage
import xlocllm
from openai import OpenAI
llm = xlocllm.unit(type="LLM", model="Qwen-3.5-0.8b-fp32")
client = OpenAI(base_url="http://127.0.0.1:1146/v1", api_key="xlocllm")
with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    response = client.chat.completions.create(
        model="Qwen-3.5-0.8b-fp32",
        messages=[{"role": "user", "content": "What is lidar?"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)
With the optional helper:
client = runtime.client()
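The same endpoint can in principle be reached without the openai package. The sketch below is not part of xlocllm: build_chat_request and send_chat are hypothetical helper names, and it assumes the bridge serves the standard OpenAI chat route at /v1/chat/completions and accepts a placeholder API key, as the client example above suggests.

```python
import json
import urllib.request

# Default loopback URL shown earlier in this README.
BASE_URL = "http://127.0.0.1:1146/v1"


def build_chat_request(prompt, model, base_url=BASE_URL, max_tokens=64):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: any placeholder key is accepted by the local bridge.
            "Authorization": "Bearer xlocllm",
        },
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the reply text (needs a running runtime)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("What is lidar?", "Qwen-3.5-0.8b-fp32")
print(req.full_url)
```

Calling send_chat(req) only makes sense while a runtime is running on that port; building the request has no side effects.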
Local RAG
import xlocllm
emb = xlocllm.unit("embedding", "multilingual-e5-small")
rag = xlocllm.rag(emb=emb, name="kb")
llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", rag=rag)
with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    rag.add(["xlocllm keeps vectors in the active runtime storage."], ids=["storage"])
    print(runtime.chat("Where does xlocllm keep vectors?"))
    runtime.chatui(session="kb-demo")
Native mode uses local persistent storage. Browser mode uses IndexedDB in the paired browser runtime.
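For readers new to RAG, the retrieval step can be pictured as a nearest-neighbour search over stored embedding vectors. The sketch below is a generic illustration with toy vectors, not xlocllm's implementation; cosine, retrieve, and the store layout are hypothetical names chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, top_k=2):
    """Return the ids of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_vec, item["vector"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Toy 3-dimensional "embeddings"; a real embedding unit would produce these.
store = [
    {"id": "storage", "vector": [0.9, 0.1, 0.0]},
    {"id": "install", "vector": [0.0, 1.0, 0.0]},
    {"id": "license", "vector": [0.0, 0.0, 1.0]},
]

print(retrieve([1.0, 0.2, 0.0], store, top_k=1))  # ['storage']
```

The retrieved texts would then be placed into the LLM prompt; an optional reranker could reorder them before that step.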
Core API
model = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
models = xlocllm.models(unit="LLM", max_vram_mb=1500)
native_models = xlocllm.models(mode="native")
cpu_models = xlocllm.models(mode="web", webgpu=False)
rag_embeddings = xlocllm.models(unit="embedding", mode="native", use_case="rag")
vlms = xlocllm.models(unit="vlm", mode="native", modality="image")
unit = xlocllm.unit("LLM", "Qwen-3.5-0.8b", quant="q4", reasoning=None)
store = xlocllm.vectorstorage("kb")
rag = xlocllm.rag(emb=xlocllm.unit("embedding", "multilingual-e5-small"), store=store)
runtime = xlocllm.runtime([unit], port=1146)
bridge = xlocllm.Bridge(port=1146)
print(runtime.url)
print(bridge.url)
print(xlocllm.bridges())
print(xlocllm.runtimes())
print(xlocllm.status())
print(xlocllm.benchmark())
print(xlocllm.benchmark("LLM"))
benchmark() checks CPU/RAM/disk, native engine availability, GPU/NPU signals,
and Hugging Face latency. In mode="web" it can temporarily open a paired mini
browser to detect real WebGPU/WebNN/NPU support. When called with a unit type,
it returns "fast" and "quality" recommendations.
Reasoning-capable LLMs can be configured at creation and changed later on a running runtime:
llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", reasoning=False)
runtime.set_reasoning(llm.id, True)
For native GGUF LLMs, quant= accepts values such as q2, q4, q8,
fp16, and fp32. If omitted, xlocllm requests q4 and falls back to the
next available quantization without changing the model name.
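The fallback rule can be pictured with a small selection function. This is a sketch under stated assumptions, not xlocllm's code: pick_quant is a hypothetical name, and the order in which alternatives are tried (higher precision first, then lower) is an assumption, since the text above only says "next available".

```python
# Assumed ordering from lowest to highest precision (illustrative only).
QUANT_ORDER = ["q2", "q4", "q8", "fp16", "fp32"]


def pick_quant(available, requested="q4"):
    """Return the requested quantization, or the nearest available one."""
    if requested in available:
        return requested
    start = QUANT_ORDER.index(requested)
    # Assumption: try higher precision first, then step down to lower precision.
    candidates = QUANT_ORDER[start + 1:] + QUANT_ORDER[:start][::-1]
    for quant in candidates:
        if quant in available:
            return quant
    raise ValueError(f"no quantization available from {sorted(available)}")


print(pick_quant({"q8", "fp16"}))  # q8
```

Whatever quantization is chosen, the model name passed to unit() stays the same.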
Custom ONNX models can run as service units:
reg = xlocllm.unit(
    "regression",
    "local-sklearn-regression",
    options={"model_path": "model.onnx", "input_name": "float_input"},
)
with xlocllm.runtime([reg]) as runtime:
    runtime.run()
    print(reg.predict([[1.0, 2.0, 3.0]]))
The native catalog now contains 300 curated public Hugging Face model entries across
LLM, embeddings, rerankers, VLM, speech, vision, OCR, translation, and text
tasks. New filters include subtype=, modality=, use_case=, license=,
and min_context=.
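Conceptually, these filters narrow a list of catalog entries by field. The sketch below shows one way such keyword filters could combine; filter_catalog, the entry fields, and the sample entries are all hypothetical and do not reflect the real catalog schema.

```python
def filter_catalog(entries, min_context=None, **equals):
    """Keep entries whose fields match every given filter."""
    result = []
    for entry in entries:
        # Every keyword filter must match the entry's field exactly.
        if any(entry.get(key) != value for key, value in equals.items()):
            continue
        # min_context is a lower bound rather than an equality test.
        if min_context is not None and entry.get("context", 0) < min_context:
            continue
        result.append(entry)
    return result


# Hypothetical entries; names, fields, and values are made up for the example.
catalog = [
    {"name": "tiny-embed", "unit": "embedding", "use_case": "rag", "license": "mit", "context": 512},
    {"name": "long-embed", "unit": "embedding", "use_case": "rag", "license": "apache-2.0", "context": 8192},
    {"name": "tiny-vlm", "unit": "vlm", "modality": "image", "license": "mit", "context": 4096},
]

matches = filter_catalog(catalog, unit="embedding", use_case="rag", min_context=1024)
print([entry["name"] for entry in matches])  # ['long-embed']
```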
Mode helpers work as both decorators and context managers:
@xlocllm.webgpu
def run_webgpu():
    llm = xlocllm.unit("LLM", "SmolLM2-360M-Instruct-q4f16_1-MLC")
    ...

with xlocllm.web:
    clf = xlocllm.unit("text-classification", "Xenova/distilbert-base-uncased-finetuned-sst-2-english")
unit() also accepts catalog objects and custom local models:
info = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
llm = xlocllm.unit(info)
classifier = xlocllm.unit(sklearn_model, type="text-classification", name="clf", labels=["no", "yes"])
onnx_unit = xlocllm.unit("model.onnx", type="regression", name="reg")
Custom sklearn/torch models are exported to ONNX. Native mode runs them through ONNX Runtime; web mode serves the artifact to the browser and runs it through ONNX Runtime Web/WASM.
CLI:
xlocllm status
xlocllm benchmark
xlocllm benchmark LLM
xlocllm benchmark LLM --mode web
xlocllm models --unit LLM
xlocllm models --unit LLM --mode web --no-webgpu
xlocllm run --unit LLM --model "Qwen-3.5-0.8b"
xlocllm run --unit LLM --model "Qwen-3.5-0.8b" --mode web
xlocllm cache delete --mode native --unit LLM --model "Qwen-3.5-0.8b" --yes
xlocllm cache clear --mode native --yes
Documentation
- Repository: mgg789/xlocllm
- Python Unit wiki for AI/tools: Python-Unit
- Full English SDK docs: docs.md
- Full Russian SDK docs: docs_ru.md
- Ready-to-run Russian recipes: recipes_ru.md
- English model catalog: models.md
- Russian model catalog: models_ru.md
These URLs are also exposed from the Python package for agents and tooling:
import xlocllm
print(xlocllm.PROJECT_URLS)
print(xlocllm.DOCUMENTATION_URL)
Model Lookup
Look up a model by its exact modelId, label, or alias:
xlocllm.unit("LLM", "Qwen-3.5-0.8b")
xlocllm.unit("LLM", "Qwen3.5-0.8B-q4f16_1-MLC")
xlocllm.unit("embedding", "multilingual-e5-small")
Browse the complete catalog in models.md.
Local State
By default, xlocllm stores bridge metadata, native engine/model cache, vector stores, and browser profiles under:
- Windows: %LOCALAPPDATA%\xlocllm
- Linux/macOS: $XDG_STATE_HOME/xlocllm or ~/.local/state/xlocllm
Environment variables:
- XLOCLLM_HOME - override local state directory.
- XLOCLLM_WEB_URL - use a custom web runtime URL.
- XLOCLLM_LOG_LEVEL - uvicorn log level.
- XLOCLLM_NATIVE_DISABLE_INSTALL=1 - disable managed native dependency installation and fail with a diagnostic error instead.
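To locate the state directory from a script, the rules above can be approximated as follows. This is a sketch, not the package's own resolver: resolve_state_dir is a hypothetical name, and the fallback used when LOCALAPPDATA is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def resolve_state_dir(env=None, platform=None, home=None):
    """Approximate the documented state-directory rules."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    # XLOCLLM_HOME overrides the platform defaults.
    if env.get("XLOCLLM_HOME"):
        return Path(env["XLOCLLM_HOME"])
    if platform.startswith("win"):
        # Assumption: fall back to ~/AppData/Local if LOCALAPPDATA is unset.
        base = env.get("LOCALAPPDATA") or str(home / "AppData" / "Local")
        return Path(base) / "xlocllm"
    if env.get("XDG_STATE_HOME"):
        return Path(env["XDG_STATE_HOME"]) / "xlocllm"
    return home / ".local" / "state" / "xlocllm"


print(resolve_state_dir())
```

The env, platform, and home parameters exist only so the function can be exercised without touching the real environment.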
Development Checks
python -m pytest python/xlocllm/tests
python -m ruff check python/xlocllm/src python/xlocllm/tests
python -m mypy python/xlocllm/src
Build the Python package:
cd python\xlocllm
python -m build
Notes
The bridge binds to loopback only. In native mode, the dashboard window is only a monitor/control surface; it does not run model weights in a browser. In web mode, the browser window must remain open while browser-backed models are running. Without WebGPU, web mode exposes only the CPU/WASM-compatible Transformers.js subset and rejects heavier models before loading.
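Client code that wants to be sure prompts never leave the machine can check that a base URL points at loopback before using it. The helper below is a hypothetical standard-library sketch, not an xlocllm API.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url):
    """Return True if the URL's host is a loopback address or 'localhost'."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-loopback.
        return False


print(is_loopback_url("http://127.0.0.1:1146/v1"))  # True
```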
License
BSD-3-Clause. Redistributions must retain the copyright notice:
Copyright (c) 2026, mgg789 / Droidje AI.