# MLX GPT-OSS Server

Minimal OpenAI-compatible server for GPT-OSS/Harmony models on Apple Silicon. Built with mlx-lm (inference), openai-harmony (prompt formatting), and FastAPI (HTTP API).
## Feature List

- OpenAI-style `/v1/chat/completions` endpoint
- OpenAI-style `/v1/responses` endpoint
- Streaming (SSE) and non-streaming responses
- Harmony `reasoning_effort` support (`low`, `medium`, `high`)
- OpenAI tool-calling response format
- Responses API function calling and `previous_response_id` support
- Robust Harmony tool-calling parser and stream recovery paths
- Usage token counts in responses
- `/health` queue stats and `/v1/models` compatibility endpoint
- Single-model runtime with FIFO request queueing
## Requirements

- macOS on Apple Silicon
- Python >= 3.11
## Quick Start

```bash
pip install mlx-gpt-oss
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8
```

The server binds to `http://0.0.0.0:8000` by default.
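Once the server is up, you can verify it from Python. This is a minimal sketch using only the standard library; it assumes the default bind of port 8000 reachable at `127.0.0.1`:

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure, non-2xx status, etc.
        return False


if __name__ == "__main__":
    print("healthy:", is_healthy("http://127.0.0.1:8000"))
```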
Install From Source
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8
## API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Server health + active/queued request counts |
| `/v1/models` | GET | Loaded model metadata |
| `/v1/chat/completions` | POST | OpenAI-compatible chat completion |
| `/v1/responses` | POST | OpenAI-compatible Responses API create |
| `/v1/responses/{response_id}` | GET | Retrieve stored response |
| `/v1/responses/{response_id}` | DELETE | Delete stored response |
| `/v1/responses/{response_id}/input_items` | GET | Retrieve stored request input items |
Chat Completions Notes
modelis required for compatibility, but the server always uses the single model loaded at startup.- Supports OpenAI-style
messages,stream,tools,tool_choice,stop, and common sampling params. top_kis accepted but generation remains pinned totop_k=0for GPT-OSS behavior.reasoning_effortcan be set directly, or viachat_template_kwargs.reasoning_effort.- Streaming returns
chat.completion.chunkevents and ends with[DONE].
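With `stream: true`, the response body is a standard SSE stream of `chat.completion.chunk` events terminated by `data: [DONE]`. A minimal parser for those lines might look like this (a sketch; it only handles the `data:` lines and the delta shape used by OpenAI-style chunks):

```python
import json


def parse_sse_data(line: str):
    """Parse one SSE line: return a dict for a JSON chunk, the string
    "[DONE]" for the stream terminator, or None for any other line."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return "[DONE]"
    return json.loads(data)


def delta_text(chunk: dict) -> str:
    """Extract the incremental assistant text from a chat.completion.chunk."""
    return chunk["choices"][0]["delta"].get("content") or ""
```

Feed each decoded line of the HTTP response through `parse_sse_data`, appending `delta_text(chunk)` for every dict until `"[DONE]"` is returned.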
## Responses API Notes

- Supported input types are text message items, replayed `function_call` items, and `function_call_output` items.
- Supported tools are custom `function` tools only.
- Stored responses are process-local, in-memory, and bounded by LRU eviction.
- `previous_response_id` reuses the stored conversation transcript, but does not carry forward prior `instructions`.
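A tool-call round trip over `/v1/responses` therefore replays the model's `function_call` together with a `function_call_output` item, continuing from the stored response. A sketch of the follow-up request body (all field values here are illustrative):

```python
import json


def followup_payload(previous_response_id: str, call_id: str,
                     name: str, arguments: str, output: str) -> dict:
    """Build a /v1/responses request that replays a function_call and
    supplies its output, continuing from a stored response."""
    return {
        "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",  # required but ignored
        "previous_response_id": previous_response_id,
        "input": [
            {  # replayed function_call item from the previous response
                "type": "function_call",
                "call_id": call_id,
                "name": name,
                "arguments": arguments,
            },
            {  # the tool result computed client-side
                "type": "function_call_output",
                "call_id": call_id,
                "output": output,
            },
        ],
    }


print(json.dumps(followup_payload(
    "resp_123", "call_abc", "get_weather",
    '{"city": "Paris"}', '{"temp_c": 18}'), indent=2))
```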
## Responses API Limits

- No multimodal inputs (`image`, `audio`, `file`, etc.)
- No hosted OpenAI tools such as `web_search`, `file_search`, or `code_interpreter`
- No structured output / non-plain-text `text.format`
- No `parallel_tool_calls=false`
- No named/required tool forcing; `tool_choice` supports `auto` and `none`
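A client can pre-check a request against these limits before sending it. The sketch below mirrors the restrictions above as a client-side convenience; it is not the server's own validation, and the multimodal item type names (`input_image`, etc.) are illustrative:

```python
def check_responses_request(payload: dict) -> list[str]:
    """Return a list of problems that would violate the server's
    Responses API limits; an empty list means the request looks acceptable."""
    problems = []
    # No multimodal inputs
    for item in payload.get("input", []):
        if isinstance(item, dict) and item.get("type") in (
            "input_image", "input_audio", "input_file",
        ):
            problems.append(f"multimodal input not supported: {item['type']}")
    # Custom function tools only; no hosted tools
    for tool in payload.get("tools", []):
        if tool.get("type") != "function":
            problems.append(f"hosted tool not supported: {tool.get('type')}")
    # No parallel_tool_calls=false
    if payload.get("parallel_tool_calls") is False:
        problems.append("parallel_tool_calls=false is not supported")
    # tool_choice supports only auto / none
    tc = payload.get("tool_choice")
    if tc is not None and tc not in ("auto", "none"):
        problems.append("tool_choice supports only 'auto' and 'none'")
    return problems
```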
## Tool Calling Reliability

- Uses official Harmony assistant-action stop tokens from `openai-harmony` (no hardcoded token IDs).
- Handles streaming edge cases: unfinished tool-call endings, buffered fallback dedupe, and repeated identical tool calls.
- Addresses a class of tool-calling failures seen in other MLX servers.
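One of these edge cases, repeated identical tool calls, comes down to deduplicating calls by identity. The following is an illustration of the idea only, not the server's actual implementation:

```python
def dedupe_tool_calls(calls: list[dict]) -> list[dict]:
    """Drop tool calls whose (name, arguments) pair was already seen,
    keeping the first occurrence and preserving order."""
    seen = set()
    unique = []
    for call in calls:
        key = (call.get("name"), call.get("arguments"))
        if key in seen:
            continue  # the model emitted the same call again; skip it
        seen.add(key)
        unique.append(call)
    return unique
```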
## CLI Options

| Flag | Default | Description |
|---|---|---|
| `--model` | required | Model path or Hugging Face ID |
| `--host` | `0.0.0.0` | Bind address |
| `--port` | `8000` | Bind port |
| `--context-length` | `8196` | Max KV cache context length |
| `--log-level` | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
| `--log-file` | disabled | Optional rotating file log output |
| `--debug-raw-preview-chars` | `0` | In DEBUG, preview N chars of prompts/output |
| `--http-access-log` | `False` | Emit one access log line per HTTP request |
| `--responses-store-max-items` | `256` | Max stored `/v1/responses` records kept in memory |
| `--responses-store-max-bytes` | `67108864` | Approximate max in-memory bytes for stored responses |
## Security

- No built-in auth or API key checks; access control is your responsibility.
- Default host is `0.0.0.0` for local/LAN self-hosting.
- CORS is permissive (`*`, credentials disabled).
- Use `--host 127.0.0.1` for local-only access.
## File Details: mlx_gpt_oss-1.0.3.tar.gz (source distribution)

- Size: 32.3 kB
- Uploaded using Trusted Publishing via twine/6.1.0 (CPython/3.13.7)

| Algorithm | Hash digest |
|---|---|
| SHA256 | `64891cc5ffc4bd6c5f202c9afedf39af9c467fa84ab2da96bfdd30b86ab017ff` |
| MD5 | `afad2939944c36bd0e136f02c3653b69` |
| BLAKE2b-256 | `b16ee35b22729d1e55b49a19a8d6511518221ac157e54ab0db3c03181c96a408` |

Provenance: attested by the `publish.yml` workflow on `icelaglace/mlx-gpt-oss` (statement type `https://in-toto.io/Statement/v1`, predicate type `https://docs.pypi.org/attestations/publish/v1`), built from tag `refs/tags/v1.0.3` at commit `21eff22f3579f78ca842978f74b04cd1ba31e4ec` on a GitHub-hosted runner, triggered by a `release` event; token issuer `https://token.actions.githubusercontent.com`. Sigstore transparency entry: 1101197470.
## File Details: mlx_gpt_oss-1.0.3-py3-none-any.whl (built distribution, Python 3)

- Size: 31.7 kB
- Uploaded using Trusted Publishing via twine/6.1.0 (CPython/3.13.7)

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f669375c7b5383daa7c66891a4c3d6af061763babcaa7a65f738c5a78adb128c` |
| MD5 | `bdd645aea16b9c85c0540310571cf72b` |
| BLAKE2b-256 | `cdfb78fe9c04052ab55b9253c632d040cfd326e9d714fce984990a44d856fbc6` |

Provenance: attested by the `publish.yml` workflow on `icelaglace/mlx-gpt-oss` (statement type `https://in-toto.io/Statement/v1`, predicate type `https://docs.pypi.org/attestations/publish/v1`), built from tag `refs/tags/v1.0.3` at commit `21eff22f3579f78ca842978f74b04cd1ba31e4ec` on a GitHub-hosted runner, triggered by a `release` event; token issuer `https://token.actions.githubusercontent.com`. Sigstore transparency entry: 1101197514.