
MLX GPT-OSS Server

Minimal OpenAI-compatible server for GPT-OSS/Harmony models on Apple Silicon.
Built with mlx-lm (inference), openai-harmony (prompt formatting), and FastAPI (HTTP API).

Feature List

  • OpenAI-style /v1/chat/completions endpoint
  • OpenAI-style /v1/responses endpoint
  • Streaming (SSE) and non-streaming responses
  • Harmony reasoning_effort support (low, medium, high)
  • OpenAI tool-calling response format
  • Responses API function-calling and previous_response_id support
  • Robust Harmony tool-calling parser and stream recovery paths
  • Usage token counts in responses
  • /health endpoint with queue stats and a /v1/models compatibility endpoint
  • Single-model runtime with FIFO request queueing

Requirements

  • macOS on Apple Silicon
  • Python >=3.11

Quick Start

pip install mlx-gpt-oss
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8

Default bind: http://0.0.0.0:8000
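
A minimal smoke test with curl (the model field is required for compatibility, but the server always answers with the single model loaded at startup):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",
    "messages": [{"role": "user", "content": "Hello"}]
  }'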

Install From Source

git clone https://github.com/icelaglace/mlx-gpt-oss
cd mlx-gpt-oss
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8

API Endpoints

Endpoint                                 Method  Purpose
/health                                  GET     Server health + active/queued request counts
/v1/models                               GET     Loaded model metadata
/v1/chat/completions                     POST    OpenAI-compatible chat completion
/v1/responses                            POST    OpenAI-compatible Responses API create
/v1/responses/{response_id}              GET     Retrieve a stored response
/v1/responses/{response_id}              DELETE  Delete a stored response
/v1/responses/{response_id}/input_items  GET     Retrieve stored request input items
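
The read-only endpoints can be probed directly; the exact response fields are not pinned down here, so treat them as illustrative:

curl http://localhost:8000/health
curl http://localhost:8000/v1/models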

Chat Completions Notes

  • model is required for compatibility, but the server always uses the single model loaded at startup.
  • Supports OpenAI-style messages, stream, tools, tool_choice, stop, and common sampling params.
  • top_k is accepted for compatibility, but generation stays pinned to top_k=0 to match GPT-OSS behavior.
  • reasoning_effort can be set directly or via chat_template_kwargs.reasoning_effort (see the example after this list).
  • Streaming returns chat.completion.chunk events and ends with [DONE].
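
For example, a streamed request with high reasoning effort (a sketch; curl's -N flag disables buffering so SSE chunks print as they arrive):

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss",
    "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
    "reasoning_effort": "high",
    "stream": true
  }'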

Responses API Notes

  • Supported input types are text message items, replayed function_call items, and function_call_output items.
  • Supported tools are custom function tools only.
  • Stored responses are process-local, in-memory, and bounded by LRU eviction.
  • previous_response_id reuses the stored conversation transcript but does not carry forward prior instructions (see the sketch below).
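
A sketch of chaining two turns, assuming a plain string input maps to a text message item as in the OpenAI API; the id resp_abc123 is a hypothetical value taken from the id field of the first reply:

curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss", "input": "Name three features of MLX."}'

curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss",
    "input": "Expand on the second one.",
    "previous_response_id": "resp_abc123"
  }'

curl http://localhost:8000/v1/responses/resp_abc123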

Responses API Limits

  • No multimodal inputs (image, audio, file, etc.)
  • No hosted OpenAI tools such as web_search, file_search, or code_interpreter
  • No structured output / non-plain-text text.format
  • No parallel_tool_calls=false
  • No named/required tool forcing; tool_choice supports auto and none

Tool Calling Reliability

  • Uses official Harmony assistant-action stop tokens from openai-harmony (no hardcoded token IDs).
  • Handles streaming edge cases: unfinished tool-call endings, buffered fallback dedupe, and repeated identical tool calls.
  • Addresses a class of tool-calling failures seen in other MLX servers.
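
A minimal function-tool request in the standard OpenAI tool format; the get_weather function here is a hypothetical example, not something the server provides:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'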

CLI Options

Flag                         Default   Description
--model                      required  Model path or Hugging Face ID
--host                       0.0.0.0   Bind address
--port                       8000      Bind port
--context-length             8196      Max KV cache context length
--log-level                  INFO      DEBUG, INFO, WARNING, or ERROR
--log-file                   disabled  Optional rotating file log output
--debug-raw-preview-chars    0         In DEBUG, preview N chars of prompts/output
--http-access-log            False     Emit one access log line per HTTP request
--responses-store-max-items  256       Max stored /v1/responses records kept in memory
--responses-store-max-bytes  67108864  Approximate max in-memory bytes for stored responses
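
A fuller launch line combining several flags (a sketch: it assumes --http-access-log is a boolean switch enabled by passing it, and that --log-file takes a file path):

mlx-gpt-oss \
  --model mlx-community/gpt-oss-20b-MXFP4-Q8 \
  --host 127.0.0.1 \
  --port 8000 \
  --log-level DEBUG \
  --log-file server.log \
  --http-access-log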

Security

  • No built-in auth or API key checks; securing access is your responsibility.
  • Default host is 0.0.0.0 for local/LAN self-hosting.
  • CORS is permissive (*, credentials disabled).
  • Use --host 127.0.0.1 for local-only access.
