Local native and browser-backed AI inference bridge

xlocllm

xlocllm is a Python SDK for local AI inference. The default runtime is native: Python starts a local supervisor, exposes an OpenAI-compatible loopback API, and runs local engines such as llama.cpp/GGUF for LLMs and ONNX Runtime for embeddings, rerankers, vision, audio, and other task models.

The browser/WebGPU runtime remains available with mode="web" and keeps the old browser-backed behavior through MLC WebLLM and Transformers.js.

The goal is simple:

pip install xlocllm

Then:

import xlocllm

llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b")
runtime = xlocllm.runtime([llm])
runtime.run()

print(runtime.url)  # http://127.0.0.1:1146/v1
print(runtime.chat("Say hello", temperature=0))

What It Does

Starts a local FastAPI bridge on 127.0.0.1.
Uses native mode by default; set xlocllm.mode = "web" or pass mode="web" to xlocllm.runtime(...) for the browser runtime.
Opens a small dashboard window for status and controls. Native mode uses a non-browser desktop monitor; web mode keeps the paired browser runtime window.
Runs model engines locally in native mode, or inside the paired browser in web mode.
Provides OpenAI-compatible /v1 endpoints for local clients.
Supports LLMs, embeddings, rerankers, translation, TTS, vision, ASR, and more through a shared catalog.
Provides local RAG with vector storage, embeddings, optional reranking, automatic LLM retrieval, and runtime.chatui().
Keeps Python-side objects for models, units, runtimes, and bridges.

Install

pip install xlocllm

The package install stays light. In native mode, managed engine dependencies and model artifacts are downloaded into the xlocllm cache on the first runtime.run().

Optional OpenAI client helper:

pip install "xlocllm[openai]"

Development install from this repository:

python -m pip install -e "python/xlocllm[dev,openai]"

Quick Start

import xlocllm

runtime = xlocllm.runtime(
    [
        xlocllm.unit("LLM", "Qwen-3.5-0.8b"),
        xlocllm.unit("embedding", "multilingual-e5-small"),
    ]
)

runtime.install()
runtime.run()

print(runtime.status())

OpenAI-Compatible Usage

import xlocllm
from openai import OpenAI

llm = xlocllm.unit(type="LLM", model="Qwen-3.5-0.8b-fp32")
client = OpenAI(base_url="http://127.0.0.1:1146/v1", api_key="xlocllm")

with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    response = client.chat.completions.create(
        model="Qwen-3.5-0.8b-fp32",
        messages=[{"role": "user", "content": "What is lidar?"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)

With the optional helper:

client = runtime.client()
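
Because the bridge speaks the OpenAI wire format, the openai package is not strictly required. The sketch below builds the same chat request with the standard library only. It assumes the standard OpenAI path /chat/completions under runtime.url and reuses the placeholder key from the example above; whether the bridge checks the Authorization header at all is an assumption. build_chat_request and send_chat_request are hypothetical helper names, not part of xlocllm.

```python
import json
import urllib.request

# Loopback base URL, as printed by runtime.url in the examples above.
BASE_URL = "http://127.0.0.1:1146/v1"


def build_chat_request(prompt, model, base_url=BASE_URL, max_tokens=64):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: placeholder key, mirroring api_key="xlocllm" above.
            "Authorization": "Bearer xlocllm",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

send_chat_request only succeeds while a runtime is running on that port; build_chat_request can be inspected offline.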

Local RAG

import xlocllm

emb = xlocllm.unit("embedding", "multilingual-e5-small")
rag = xlocllm.rag(emb=emb, name="kb")
llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", rag=rag)

with xlocllm.runtime([llm]) as runtime:
    runtime.run()
    rag.add(["xlocllm keeps vectors in the active runtime storage."], ids=["storage"])
    print(runtime.chat("Where does xlocllm keep vectors?"))
    runtime.chatui(session="kb-demo")

Native mode uses local persistent storage. Browser mode uses IndexedDB in the paired browser runtime.
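
For readers new to vector retrieval, here is a minimal sketch of the idea behind a vector store: keep one vector per document and return the entries most similar to the query vector. It is a conceptual illustration only. TinyVectorStore is a hypothetical class, the three-dimensional vectors are hand-written stand-ins for real embeddings, and nothing here reflects how xlocllm.vectorstorage is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store; not the xlocllm.vectorstorage class."""

    def __init__(self):
        self._items = {}  # id -> (vector, text)

    def add(self, item_id, vector, text):
        self._items[item_id] = (list(vector), text)

    def search(self, query_vector, top_k=1):
        """Return up to top_k (id, score, text) tuples, best match first."""
        scored = [
            (item_id, cosine_similarity(query_vector, vector), text)
            for item_id, (vector, text) in self._items.items()
        ]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
# Hand-written vectors stand in for the output of an embedding model.
store.add("storage", [0.9, 0.1, 0.0], "xlocllm keeps vectors in the active runtime storage.")
store.add("install", [0.0, 0.2, 0.9], "pip install xlocllm")

best_id, best_score, best_text = store.search([1.0, 0.0, 0.0], top_k=1)[0]
print(best_id, best_text)
```

In a real pipeline the retrieved text is placed into the LLM prompt, and an optional reranker reorders the candidates before that step.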

Core API

model = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
models = xlocllm.models(unit="LLM", max_vram_mb=1500)
native_models = xlocllm.models(mode="native")
cpu_models = xlocllm.models(mode="web", webgpu=False)
rag_embeddings = xlocllm.models(unit="embedding", mode="native", use_case="rag")
vlms = xlocllm.models(unit="vlm", mode="native", modality="image")

unit = xlocllm.unit("LLM", "Qwen-3.5-0.8b", quant="q4", reasoning=None)
store = xlocllm.vectorstorage("kb")
rag = xlocllm.rag(emb=xlocllm.unit("embedding", "multilingual-e5-small"), store=store)
runtime = xlocllm.runtime([unit], port=1146)
bridge = xlocllm.Bridge(port=1146)

print(runtime.url)
print(bridge.url)
print(xlocllm.bridges())
print(xlocllm.runtimes())
print(xlocllm.status())
print(xlocllm.benchmark())
print(xlocllm.benchmark("LLM"))

benchmark() checks CPU/RAM/disk, native engine availability, GPU/NPU signals, and Hugging Face latency. In mode="web" it can temporarily open a paired mini browser to detect real WebGPU/WebNN/NPU support. Given a unit type, for example benchmark("LLM"), it returns two sets of model recommendations: fast and quality.

Reasoning-capable LLMs can be configured at creation and hot-updated while the runtime is running:

llm = xlocllm.unit("LLM", "Qwen-3.5-0.8b-fp32", reasoning=False)
runtime.set_reasoning(llm.id, True)

For native GGUF LLMs, quant= accepts values such as q2, q4, q8, fp16, and fp32. If omitted, xlocllm requests q4 and falls back to the next available quantization without changing the model name.
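
The fallback rule can be pictured with a small selection function. This is a sketch under stated assumptions, not the package's logic: the text does not define what "next available" means, so the code reads it as "the closest level in the order q2, q4, q8, fp16, fp32, preferring the larger one on a tie". pick_quant and QUANT_ORDER are hypothetical names.

```python
# Quantization names listed in the text, ordered from smallest to largest.
QUANT_ORDER = ["q2", "q4", "q8", "fp16", "fp32"]


def pick_quant(available, requested="q4"):
    """Return `requested` if it is available, else the nearest available level.

    Assumption: "next available" means the closest entry in QUANT_ORDER,
    with the larger level winning when two levels are equally close.
    """
    if requested not in QUANT_ORDER:
        raise ValueError(f"unknown quantization: {requested!r}")
    candidates = [q for q in QUANT_ORDER if q in available]
    if not candidates:
        raise LookupError("no known quantization is available")
    start = QUANT_ORDER.index(requested)
    return min(
        candidates,
        # Sort by distance first, then prefer the higher index on ties.
        key=lambda q: (abs(QUANT_ORDER.index(q) - start), -QUANT_ORDER.index(q)),
    )


print(pick_quant({"q8", "fp16"}))
```

The model name passed to unit() stays the same in every case; only the artifact that gets loaded differs.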

Custom ONNX models can run as service units:

reg = xlocllm.unit(
    "regression",
    "local-sklearn-regression",
    options={"model_path": "model.onnx", "input_name": "float_input"},
)
with xlocllm.runtime([reg]) as runtime:
    runtime.run()
    print(reg.predict([[1.0, 2.0, 3.0]]))

The native catalog now contains 300 curated public Hugging Face model entries across LLM, embeddings, rerankers, VLM, speech, vision, OCR, translation, and text tasks. New filters include subtype=, modality=, use_case=, license=, and min_context=.

Mode switches such as xlocllm.webgpu and xlocllm.web work as both decorators and context managers:

@xlocllm.webgpu
def run_webgpu():
    llm = xlocllm.unit("LLM", "SmolLM2-360M-Instruct-q4f16_1-MLC")
    ...

with xlocllm.web:
    clf = xlocllm.unit("text-classification", "Xenova/distilbert-base-uncased-finetuned-sst-2-english")

unit() also accepts catalog objects and custom local models:

info = xlocllm.model("Qwen-3.5-0.8b", unit="LLM")
llm = xlocllm.unit(info)

classifier = xlocllm.unit(sklearn_model, type="text-classification", name="clf", labels=["no", "yes"])
onnx_unit = xlocllm.unit("model.onnx", type="regression", name="reg")

Custom sklearn/torch models are exported to ONNX. Native mode runs them through ONNX Runtime; web mode serves the artifact to the browser and runs it through ONNX Runtime Web/WASM.

CLI:

xlocllm status
xlocllm benchmark
xlocllm benchmark LLM
xlocllm benchmark LLM --mode web
xlocllm models --unit LLM
xlocllm models --unit LLM --mode web --no-webgpu
xlocllm run --unit LLM --model "Qwen-3.5-0.8b"
xlocllm run --unit LLM --model "Qwen-3.5-0.8b" --mode web
xlocllm cache delete --mode native --unit LLM --model "Qwen-3.5-0.8b" --yes
xlocllm cache clear --mode native --yes

Documentation

Repository: mgg789/xlocllm
Python Unit wiki for AI/tools: Python-Unit
Full English SDK docs: docs.md
Full Russian SDK docs: docs_ru.md
Ready-to-run Russian recipes: recipes_ru.md
English model catalog: models.md
Russian model catalog: models_ru.md

These URLs are also exposed from the Python package for agents and tooling:

import xlocllm

print(xlocllm.PROJECT_URLS)
print(xlocllm.DOCUMENTATION_URL)

Model Lookup

Look up a model by its exact modelId, its label, or one of its aliases:

xlocllm.unit("LLM", "Qwen-3.5-0.8b")
xlocllm.unit("LLM", "Qwen3.5-0.8B-q4f16_1-MLC")
xlocllm.unit("embedding", "multilingual-e5-small")

Browse the complete catalog in models.md.

Local State

By default, xlocllm stores bridge metadata, native engine/model cache, vector stores, and browser profiles under:

Windows: %LOCALAPPDATA%\xlocllm
Linux/macOS: $XDG_STATE_HOME/xlocllm or ~/.local/state/xlocllm

Environment variables:

XLOCLLM_HOME - override local state directory.
XLOCLLM_WEB_URL - use a custom web runtime URL.
XLOCLLM_LOG_LEVEL - uvicorn log level.
XLOCLLM_NATIVE_DISABLE_INSTALL=1 - disable managed native dependency installation and fail with a diagnostic error instead.
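
To find the state directory from a script (for example, to check disk usage), the lookup order described above can be approximated as follows. This is a sketch of the documented rule, not the package's own code: resolve_state_dir is a hypothetical helper, and the AppData/Local fallback used when LOCALAPPDATA is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def resolve_state_dir(env=None, platform=None, home=None):
    """Approximate the documented location of the xlocllm state directory."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    # XLOCLLM_HOME overrides every platform default.
    override = env.get("XLOCLLM_HOME")
    if override:
        return Path(override)

    if platform.startswith("win"):
        local_app_data = env.get("LOCALAPPDATA")
        # Assumption: fall back to the usual AppData/Local location.
        base = Path(local_app_data) if local_app_data else home / "AppData" / "Local"
        return base / "xlocllm"

    # Linux/macOS: $XDG_STATE_HOME/xlocllm or ~/.local/state/xlocllm.
    xdg_state_home = env.get("XDG_STATE_HOME")
    base = Path(xdg_state_home) if xdg_state_home else home / ".local" / "state"
    return base / "xlocllm"


print(resolve_state_dir())
```

The env, platform, and home parameters exist only so the function can be exercised without touching the real environment.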

Development Checks

python -m pytest python/xlocllm/tests
python -m ruff check python/xlocllm/src python/xlocllm/tests
python -m mypy python/xlocllm/src

Build the Python package:

cd python/xlocllm
python -m build

Notes

The bridge binds to loopback only. In native mode, the dashboard window is only a monitor/control surface; it does not run model weights in a browser. In web mode, the browser window must remain open while browser-backed models are running. Without WebGPU, web mode exposes only the CPU/WASM-compatible Transformers.js subset and rejects heavier models before loading.
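
Client code that must never send prompts off the machine can check the loopback property before using a base URL. The helper below is a small standard-library sketch; is_loopback_url is a hypothetical name and not part of xlocllm.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url):
    """Return True if the URL's host is a loopback address or 'localhost'."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name), so treat it as non-local.
        return False


print(is_loopback_url("http://127.0.0.1:1146/v1"))
```

A caller could run this check against runtime.url before creating a client.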

Release history

1.1.0 - May 28, 2026 (this version)
1.0.1 - May 25, 2026
1.0.0 - May 25, 2026

Download files

Source Distribution

xlocllm-1.1.0.tar.gz (14.4 MB), uploaded May 28, 2026, Source

Built Distribution

xlocllm-1.1.0-py3-none-any.whl (14.6 MB), uploaded May 28, 2026, Python 3

File details

Details for the file xlocllm-1.1.0.tar.gz.

File metadata

Download URL: xlocllm-1.1.0.tar.gz
Upload date: May 28, 2026
Size: 14.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for xlocllm-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`42be1af134c6a4de07d161cfaa7173bd41c872e4cf7a35731b2b24a0e04d6beb`
MD5	`8198f7e331bf116dfdbbc9c7d573650a`
BLAKE2b-256	`1a627451716ca15de3e137ea46261086383d7d21a2b83144362a37134c25d179`
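
A downloaded file can be checked against the SHA256 digests listed for each distribution with a few lines of standard-library Python. This is a generic sketch; sha256_of_file and matches_sha256 are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Return True if the file's digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

Usage: call matches_sha256 with the path to the saved archive and the SHA256 value from the table for that file; a False result means the download should not be installed.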

File details

Details for the file xlocllm-1.1.0-py3-none-any.whl.

File metadata

Download URL: xlocllm-1.1.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 14.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for xlocllm-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a2e01e107d1ffed539454b535244bd5f710684a6fc6998c1ea0f50832aea18b9`
MD5	`1df92f759b2faee4723f3f3ba0a06394`
BLAKE2b-256	`f2b26530e2a1738425feeed1462559a34b56ce0cf31400f8e45831a4b780e5ea`
