Skip to main content

Local native and browser-backed AI inference bridge

Project description

xlocllm

xlocllm is a Python SDK for local AI inference. The default runtime is native: Python starts a local supervisor, exposes an OpenAI-compatible loopback API, and runs local engines such as llama.cpp/GGUF for LLMs and ONNX Runtime for embeddings, rerankers, vision, audio, and other task models.

The browser/WebGPU runtime remains available with mode="web" and keeps the old browser-backed behavior through MLC WebLLM and Transformers.js.

The goal is simple:

pip install xlocllm

Then:

import xlocllm

llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b")
runtime = xlocllm.runtime([llm])
runtime.run()

print(runtime.url)  # http://127.0.0.1:1146/v1
print(runtime.chat("Say hello", temperature=0))

What It Does

  • Starts a local FastAPI bridge on 127.0.0.1.
  • Uses native mode by default; use xlocllm.mode = "web" or runtime(..., mode="web") for the browser runtime.
  • Opens a small dashboard window for status and controls. Native mode uses a non-browser desktop monitor; web mode keeps the paired browser runtime window.
  • Runs model engines locally in native mode, or inside the paired browser in web mode.
  • Provides OpenAI-compatible /v1 endpoints for local clients.
  • Supports LLMs, embeddings, rerankers, translation, TTS, vision, ASR, and more through a shared catalog.
  • Provides local RAG with vector storage, embeddings, optional reranking, automatic LLM retrieval, and runtime.chatui().
  • Keeps Python-side objects for models, units, runtimes, and bridges.

Install

pip install xlocllm

The package install stays light. In native mode, managed engine dependencies and model artifacts are downloaded into the xlocllm cache on the first runtime.run().

Optional OpenAI client helper:

pip install "xlocllm[openai]"

Development install from this repository:

python -m pip install -e .\python\xlocllm[dev,openai]

Quick Start

import xlocllm

runtime = xlocllm.runtime(
    [
        xlocllm.unit("LLM", "Qwen-3.5-0.8b"),
        xlocllm.unit("embedding", "multilingual-e5-small"),
    ]
)

runtime.install()
runtime.run()

print(runtime.status())

OpenAI-Compatible Usage

import xlocllm
from openai import OpenAI

llm = xlocllm.unit(type="LLM", model="Qwen-3.5-0.8b-fp32")
client = OpenAI(base_url="http://127.0.0.1:1146/v1", api_key="xlocllm")

with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    response = client.chat.completions.create(
        model="Qwen-3.5-0.8b-fp32",
        messages=[{"role": "user", "content": "What is lidar?"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)

With the optional helper:

client = runtime.client()

Local RAG

import xlocllm

emb = xlocllm.unit("embedding", "multilingual-e5-small")
rag = xlocllm.rag(emb=emb, name="kb")
llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", rag=rag)

with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    rag.add(["xlocllm keeps vectors in the active runtime storage."], ids=["storage"])
    print(runtime.chat("Where does xlocllm keep vectors?"))
    runtime.chatui(session="kb-demo")

Native mode uses local persistent storage. Browser mode uses IndexedDB in the paired browser runtime.

Core API

model = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
models = xlocllm.models(unit="LLM", max_vram_mb=1500)
native_models = xlocllm.models(mode="native")
cpu_models = xlocllm.models(mode="web", webgpu=False)
rag_embeddings = xlocllm.models(unit="embedding", mode="native", use_case="rag")
vlms = xlocllm.models(unit="vlm", mode="native", modality="image")

unit = xlocllm.unit("LLM", "Qwen-3.5-0.8b", quant="q4", reasoning=None)
store = xlocllm.vectorstorage("kb")
rag = xlocllm.rag(emb=xlocllm.unit("embedding", "multilingual-e5-small"), store=store)
runtime = xlocllm.runtime([unit], port=1146)
bridge = xlocllm.Bridge(port=1146)

print(runtime.url)
print(bridge.url)
print(xlocllm.bridges())
print(xlocllm.runtimes())
print(xlocllm.status())
print(xlocllm.benchmark())
print(xlocllm.benchmark("LLM"))

benchmark() checks CPU/RAM/disk, native engine availability, GPU/NPU signals, and Hugging Face latency. In mode="web" it can temporarily open a paired mini browser to detect real WebGPU/WebNN/NPU support. With a unit type, it returns fast and quality recommendations.

Reasoning-capable LLMs can be configured at creation and updated hot:

llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", reasoning=False)
runtime.set_reasoning(llm.id, True)

For native GGUF LLMs, quant= accepts values such as q2, q4, q8, fp16, and fp32. If omitted, xlocllm requests q4 and falls back to the next available quantization without changing the model name.

Custom ONNX models can run as service units:

reg = xlocllm.unit(
    "regression",
    "local-sklearn-regression",
    options={"model_path": "model.onnx", "input_name": "float_input"},
)
with xlocllm.runtime([reg]) as runtime:
    runtime.run()
    print(reg.predict([[1.0, 2.0, 3.0]]))

The native catalog now contains 300 curated public Hugging Face model entries across LLM, embeddings, rerankers, VLM, speech, vision, OCR, translation, and text tasks. New filters include subtype=, modality=, use_case=, license=, and min_context=.

Mode decorators are available as both decorators and context managers:

@xlocllm.webgpu
def run_webgpu():
    llm = xlocllm.unit("LLM", "SmolLM2-360M-Instruct-q4f16_1-MLC")
    ...

with xlocllm.web:
    clf = xlocllm.unit("text-classification", "Xenova/distilbert-base-uncased-finetuned-sst-2-english")

unit() also accepts catalog objects and custom local models:

info = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
llm = xlocllm.unit(info)

classifier = xlocllm.unit(sklearn_model, type="text-classification", name="clf", labels=["no", "yes"])
onnx_unit = xlocllm.unit("model.onnx", type="regression", name="reg")

Custom sklearn/torch models are exported to ONNX. Native mode runs them through ONNX Runtime; web mode serves the artifact to the browser and runs it through ONNX Runtime Web/WASM.

CLI:

xlocllm status
xlocllm benchmark
xlocllm benchmark LLM
xlocllm benchmark LLM --mode web
xlocllm models --unit LLM
xlocllm models --unit LLM --mode web --no-webgpu
xlocllm run --unit LLM --model "Qwen-3.5-0.8b"
xlocllm run --unit LLM --model "Qwen-3.5-0.8b" --mode web
xlocllm cache delete --mode native --unit LLM --model "Qwen-3.5-0.8b" --yes
xlocllm cache clear --mode native --yes

Documentation

These URLs are also exposed from the Python package for agents and tooling:

import xlocllm

print(xlocllm.PROJECT_URLS)
print(xlocllm.DOCUMENTATION_URL)

Model Lookup

Use exact modelId, label, or aliases:

xlocllm.unit("LLM", "Qwen-3.5-0.8b")
xlocllm.unit("LLM", "Qwen3.5-0.8B-q4f16_1-MLC")
xlocllm.unit("embedding", "multilingual-e5-small")

Browse the complete catalog in models.md.

Local State

By default, xlocllm stores bridge metadata, native engine/model cache, vector stores, and browser profiles under:

  • Windows: %LOCALAPPDATA%\xlocllm
  • Linux/macOS: $XDG_STATE_HOME/xlocllm or ~/.local/state/xlocllm

Environment variables:

  • XLOCLLM_HOME - override local state directory.
  • XLOCLLM_WEB_URL - use a custom web runtime URL.
  • XLOCLLM_LOG_LEVEL - uvicorn log level.
  • XLOCLLM_NATIVE_DISABLE_INSTALL=1 - disable managed native dependency installation and fail with a diagnostic error instead.

Development Checks

python -m pytest python/xlocllm/tests
python -m ruff check python/xlocllm/src python/xlocllm/tests
python -m mypy python/xlocllm/src

Build the Python package:

cd python\xlocllm
python -m build

Notes

The bridge binds to loopback only. In native mode, the dashboard window is only a monitor/control surface; closing it does not run model weights in the browser. In web mode, the browser window must remain open while browser-backed models are running. Without WebGPU, web mode exposes only the CPU/WASM-compatible Transformers.js subset and rejects heavier models before loading.

License

BSD-3-Clause. Redistributions must retain the copyright notice: Copyright (c) 2026, mgg789 / Droidje AI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xlocllm-1.1.0.tar.gz (14.4 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

xlocllm-1.1.0-py3-none-any.whl (14.6 MB view details)

Uploaded Python 3

File details

Details for the file xlocllm-1.1.0.tar.gz.

File metadata

  • Download URL: xlocllm-1.1.0.tar.gz
  • Upload date:
  • Size: 14.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for xlocllm-1.1.0.tar.gz
Algorithm Hash digest
SHA256 42be1af134c6a4de07d161cfaa7173bd41c872e4cf7a35731b2b24a0e04d6beb
MD5 8198f7e331bf116dfdbbc9c7d573650a
BLAKE2b-256 1a627451716ca15de3e137ea46261086383d7d21a2b83144362a37134c25d179

See more details on using hashes here.

File details

Details for the file xlocllm-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: xlocllm-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 14.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for xlocllm-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a2e01e107d1ffed539454b535244bd5f710684a6fc6998c1ea0f50832aea18b9
MD5 1df92f759b2faee4723f3f3ba0a06394
BLAKE2b-256 f2b26530e2a1738425feeed1462559a34b56ce0cf31400f8e45831a4b780e5ea

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page