
MLX GPT-OSS Server

Minimal OpenAI-compatible server for GPT-OSS/Harmony models on Apple Silicon.
Built with mlx-lm (inference), openai-harmony (prompt formatting), and FastAPI (HTTP API).

Feature List

  • OpenAI-style /v1/chat/completions endpoint
  • Streaming (SSE) and non-streaming responses
  • Harmony reasoning_effort support (low, medium, high)
  • OpenAI tool-calling response format
  • Robust Harmony tool-calling parser and stream recovery paths
  • Usage token counts in responses
  • /health queue stats and /v1/models compatibility endpoint
  • Single-model runtime with FIFO request queueing

Requirements

  • macOS on Apple Silicon
  • Python >=3.11

Quick Start

pip install mlx-gpt-oss
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8

Default bind: http://0.0.0.0:8000
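Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the Python standard library (the model name shown is the one from the Quick Start; per the notes below, the server ignores it and answers with whatever model it loaded):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default bind from the Quick Start

def build_request(prompt: str) -> dict:
    # "model" is required by the API shape, but the server always uses
    # the single model it loaded at startup.
    return {
        "model": "mlx-community/gpt-oss-20b-MXFP4-Q8",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str) -> str:
    """Send one non-streaming chat completion and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```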

Install From Source

python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8

API Endpoints

Endpoint               Method   Purpose
/health                GET      Server health + active/queued request counts
/v1/models             GET      Loaded model metadata
/v1/chat/completions   POST     OpenAI-compatible chat completion
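The two GET endpoints can be polled with the standard library; a sketch (the exact field names in each response are not documented here, so inspect them on a live server):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default bind

def get_json(path: str) -> dict:
    """GET a JSON endpoint on the running server."""
    with urllib.request.urlopen(f"{BASE_URL}{path}") as resp:
        return json.load(resp)

def health() -> dict:
    """Server health plus active/queued request counts for the FIFO queue."""
    return get_json("/health")

def models() -> dict:
    """Loaded model metadata via the OpenAI-compatible /v1/models."""
    return get_json("/v1/models")
```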

Chat Completions Notes

  • model is required for compatibility, but the server always uses the single model loaded at startup.
  • Supports OpenAI-style messages, stream, tools, tool_choice, stop, and common sampling params.
  • top_k is accepted for compatibility, but generation stays pinned to top_k=0 to match GPT-OSS reference behavior.
  • reasoning_effort can be set directly, or via chat_template_kwargs.reasoning_effort.
  • Streaming returns chat.completion.chunk events and ends with [DONE].
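A sketch of consuming the SSE stream, assuming each data line carries a chat.completion.chunk in the OpenAI format and the stream ends with [DONE] as described above (parse_sse_line is a hypothetical helper, not part of the package):

```python
import json
import urllib.request

def parse_sse_line(line: str):
    """Return the content delta from one SSE line, or None for
    non-data lines and the [DONE] terminator."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":  # stream terminator; the server closes after this
        return None
    chunk = json.loads(payload)  # a chat.completion.chunk object
    return chunk["choices"][0].get("delta", {}).get("content")

def stream_chat(prompt: str, base_url: str = "http://localhost:8000"):
    """Yield content deltas from a streaming completion on a running server."""
    body = json.dumps({
        "model": "gpt-oss",  # required field; the loaded model is used regardless
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            piece = parse_sse_line(raw.decode().strip())
            if piece is not None:
                yield piece
```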

Tool Calling Reliability

  • Uses official Harmony assistant-action stop tokens from openai-harmony (no hardcoded token IDs).
  • Handles streaming edge cases: unfinished tool-call endings, buffered fallback dedupe, and repeated identical tool calls.
  • Addresses a class of tool-calling failures seen in other MLX servers.
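A sketch of a tool-calling request body in the OpenAI format the server accepts; get_weather is a hypothetical example tool for illustration, not something the server provides:

```python
def tool_call_request(prompt: str) -> dict:
    """Build an OpenAI-style chat completion body with one example tool."""
    return {
        "model": "gpt-oss",  # required field; the loaded model is used regardless
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }
```

If the model decides to call the tool, the response message carries a tool_calls array in the OpenAI format rather than plain content.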

CLI Options

Flag                        Default   Description
--model                     required  Model path or Hugging Face ID
--host                      0.0.0.0   Bind address
--port                      8000      Bind port
--context-length            8196      Max KV cache context length
--log-level                 INFO      One of DEBUG, INFO, WARNING, ERROR
--log-file                  disabled  Optional rotating file log output
--debug-raw-preview-chars   0         In DEBUG, preview N chars of prompts/output
--http-access-log           False     Emit one access-log line per HTTP request

Security

  • No built-in auth or API key checks; securing the endpoint is your responsibility.
  • Default host is 0.0.0.0 for local/LAN self-hosting.
  • CORS is permissive (*, credentials disabled).
  • Use --host 127.0.0.1 for local-only access.

Download files


Source Distribution

mlx_gpt_oss-1.0.0.tar.gz (23.1 kB)


Built Distribution


mlx_gpt_oss-1.0.0-py3-none-any.whl (22.3 kB)


File details

Details for the file mlx_gpt_oss-1.0.0.tar.gz.

File metadata

  • Download URL: mlx_gpt_oss-1.0.0.tar.gz
  • Upload date:
  • Size: 23.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_gpt_oss-1.0.0.tar.gz
Algorithm     Hash digest
SHA256        067f9cef93084c92b936bfe9c8593d1f2ac2d4b18329cad63b2026c58938c17a
MD5           bd451b8cf596ddd5f6738900b7a474bb
BLAKE2b-256   855d95550c165e066e2bc6b5cc4e418fe52d1fc924764c25453f07e4a4da8ad2


Provenance

The following attestation bundles were made for mlx_gpt_oss-1.0.0.tar.gz:

Publisher: publish.yml on icelaglace/mlx-gpt-oss

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mlx_gpt_oss-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: mlx_gpt_oss-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 22.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mlx_gpt_oss-1.0.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        34fccb348c2f8bc94b794a566d122e8cf1633c171e4d935d34505add2f799a78
MD5           787878d2fb859cc5fbeb749c6f501633
BLAKE2b-256   3e8fe7b89aa256521be4170952ca2641d794f9221019e789eae1e811df0bce3a


Provenance

The following attestation bundles were made for mlx_gpt_oss-1.0.0-py3-none-any.whl:

Publisher: publish.yml on icelaglace/mlx-gpt-oss

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
