ローカルLLM（mlx / mlx-vlm / llama.cpp）を OpenAI 互換 API で束ねるマルチモデルゲートウェイ。gateway.toml のカタログを 1 ポートで配信し、model で振り分け・遅延起動・LRU 退避・MTP 高速化（Ollama 流の共有デーモン）

These details have not been verified by PyPI

Project links

Repository

Project description

local-llm-server

ローカルLLM（mlx / mlx-vlm / llama.cpp）を束ねるマルチモデルゲートウェイ。

gateway.toml（モデルカタログ）を書いて 1 プロセス起動するだけ。
1 つの公開ポートで複数モデルを配信し、リクエストの model で振り分ける。
モデルは初回リクエスト時に遅延起動、max_resident 超過で LRU 退避、idle_timeout で自動アンロード。
クライアントは公開ポートに繋いで model を選ぶだけ。

インストール

uvを使用する。

uv add "local-llm-server[mlx]"

extras 指定はクォート必須（zsh の glob 展開回避）。内訳:

extra	入るもの	用途
`mlx`	`mlx-lm` / `mlx-vlm`	Apple Silicon で実際に推論する

使い方

1. `gateway.toml`（モデルカタログ）

カレントディレクトリに gateway.toml を置く。これがサーバーの唯一の設定。リポジトリ直下にすぐ使える例を同梱（→ gateway.toml）:

host = "127.0.0.1"
port = 8799                 # 公開ポート。クライアントの base_url はここ
max_resident = 1            # 同時常駐モデル数の上限。超えたら LRU 退避（省略時 無制限）
idle_timeout = 1200         # 20分使われないモデルは自動アンロード（0/省略で無効）
draft_model = "auto"        # MTP の既定（各 [[models]] で上書き・"off" で無効）

[[models]]
model = "mlx-community/Qwen3.6-27B-4bit"
backend = "mlx-vlm"

[[models]]
model = "mlx-community/gemma-4-26B-A4B-it-qat-4bit"
backend = "mlx-vlm"

MTP（投機的デコード）による高速化 → docs/mtp.md。

2. ゲートウェイを起動

gateway.toml のあるディレクトリで起動するだけ（管理者の唯一の操作）:

uv run local-llm-server

1 つの公開ポート（例 http://127.0.0.1:8799/v1）でカタログのモデルを束ねる。各モデルは初回リクエスト時に遅延起動し、2 回目以降は常駐して即応答。max_resident 超過は LRU 退避、 idle_timeout で自動アンロード。

3. 接続（ `model` で選ぶ）

公開ポートに繋ぎ、model で使うモデルを選ぶ。

from local_llm_server import LLMClient

llm = LLMClient(
  model="mlx-community/Qwen3.6-27B-4bit",
  base_url="http://127.0.0.1:8799/v1"
)
print(llm.respond("ローカルLLMの利点を3つ。"))

高度操作 → docs/connecting.md。

運用（status / stop）

稼働確認（カタログ＝全モデル・pid・ログパス）

uv run local-llm-server --status

ゲートウェイ停止（配下のモデルサーバーも全て停止）

uv run local-llm-server --stop

Ctrl+C / kill でも、起動済みのモデルサーバーまで一緒に止まる（孫プロセスは残らない）。

ライセンス

Apache-2.0

Project details

These details have not been verified by PyPI

Project links

Repository

Release history Release notifications | RSS feed

0.12.0

Jun 23, 2026

This version

0.6.2

Jun 23, 2026

0.6.1

Jun 23, 2026

0.6.0

Jun 23, 2026

0.5.0

Jun 23, 2026

0.4.0

Jun 23, 2026

0.2.0

Jun 23, 2026

0.1.0

Jun 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

local_llm_server-0.6.2.tar.gz (45.1 kB view details)

Uploaded Jun 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

local_llm_server-0.6.2-py3-none-any.whl (39.9 kB view details)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file local_llm_server-0.6.2.tar.gz.

File metadata

Download URL: local_llm_server-0.6.2.tar.gz
Upload date: Jun 23, 2026
Size: 45.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for local_llm_server-0.6.2.tar.gz
Algorithm	Hash digest
SHA256	`8f8df62cd0eaa529cd4e6f8598656ba91478d1c79bc70fde2bd9a995f113838b`
MD5	`36182998b1a463c13dd51d13420cfdbc`
BLAKE2b-256	`4b6de956a6ff3e2c00d516041e3628c40909e617c6b4c1fc89b44d197cd251f6`

See more details on using hashes here.

File details

Details for the file local_llm_server-0.6.2-py3-none-any.whl.

File metadata

Download URL: local_llm_server-0.6.2-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 39.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for local_llm_server-0.6.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e04333f93d19c68b53adf72606a15d4a35d20c58f168c4733bd14cb52c8d2dcc`
MD5	`20bd04a2b60646c9d8f6ccd907deb22c`
BLAKE2b-256	`d9324c7cfbfed14e40af16d77c1641b865916820b18080bc8cc1d8c9f0ed6a83`

See more details on using hashes here.

local-llm-server 0.6.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

local-llm-server

インストール

使い方

1. `gateway.toml`（モデルカタログ）

2. ゲートウェイを起動

3. 接続（ `model` で選ぶ）

運用（status / stop）

稼働確認（カタログ＝全モデル・pid・ログパス）

ゲートウェイ停止（配下のモデルサーバーも全て停止）

ライセンス

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

local-llm-server 0.6.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

local-llm-server

インストール

使い方

1. gateway.toml（モデルカタログ）

2. ゲートウェイを起動

3. 接続（ model で選ぶ）

運用（status / stop）

稼働確認（カタログ＝全モデル・pid・ログパス）

ゲートウェイ停止（配下のモデルサーバーも全て停止）

ライセンス

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

1. `gateway.toml`（モデルカタログ）

3. 接続（ `model` で選ぶ）