
A server to serve MLX models as an OpenAI-compatible API

Project description

mlx-llm-server

This guide will help you set up the MLX-LLM server to serve an MLX model as an OpenAI-compatible API.

Quick Start

Installation

Before starting the MLX-LLM server, install the server package from PyPI:

pip install mlx-llm-server

Start the Server

mlx-llm-server --model <path-to-your-model>

Arguments

  • --model: The path to the MLX model weights, tokenizer, and config. This argument is required.
  • --adapter-file: (Optional) The path to the trained adapter weights (see the example below).
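
For example, to serve local model weights together with an adapter file (both paths below are placeholders):

mlx-llm-server --model ./models/my-mlx-model --adapter-file ./adapters/adapters.npz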

Host and Port Configuration

The server will start on the host and port specified by the environment variables HOST and PORT. If these are not set, it defaults to 127.0.0.1:8080.

To start the server on a different host or port, set the HOST and PORT environment variables before starting the server. For example:

export HOST=0.0.0.0
export PORT=5000
mlx-llm-server --model <path-to-your-model>

The MLX-LLM server can serve both Hugging Face-format models and quantized MLX models. You can find the latter in the MLX Community organization on Hugging Face.
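
For example, a quantized model can be fetched from the MLX Community with the Hugging Face CLI and then served from its local directory (the repository id below is only illustrative):

huggingface-cli download mlx-community/Mistral-7B-Instruct-v0.2-4bit --local-dir ./mistral-7b-4bit
mlx-llm-server --model ./mistral-7b-4bit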

API Spec

API Endpoint: /v1/chat/completions

Method: POST

Request Headers

  • Content-Type: Must be application/json.

Request Body (JSON)

  • messages: An array of message objects representing the conversation history. Each message object should have a role (e.g., user, assistant) and content (the message text).
  • role_mapping: (Optional) A dictionary to customize the role prefixes in the generated prompt. If not provided, default mappings are used.
  • stop: (Optional) An array of strings or a single string representing stopping conditions for the generation. These are sequences of tokens where the generation should stop.
  • max_tokens: (Optional) An integer specifying the maximum number of tokens to generate. Defaults to 100.
  • stream: (Optional) A boolean indicating if the response should be streamed. If true, responses are sent as they are generated. Defaults to false.
  • model: (Optional) A string naming the model to use for generation. The server does not currently use this field, but it could be used to select among multiple models.
  • temperature: (Optional) A float specifying the sampling temperature. Defaults to 1.0.
  • top_p: (Optional) A float specifying the nucleus sampling parameter. Defaults to 1.0.
  • repetition_penalty: (Optional) A float applying a penalty to repeated tokens.
  • repetition_context_size: (Optional) The number of recent tokens considered when applying the repetition penalty.
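
Putting these fields together, a request body that exercises most of the optional parameters might look like the following sketch (all values are illustrative):

{
  "messages": [
    {"role": "user", "content": "Summarize the benefits of unit testing."}
  ],
  "stop": ["<|im_end|>"],
  "max_tokens": 200,
  "stream": false,
  "temperature": 0.7,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "repetition_context_size": 20
}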

Development Setup Guide

Miniforge Installation

For Apple Silicon users, install Miniforge (a minimal conda distribution with native arm64 support) using these commands:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh

Conda Environment Setup

After installing Miniforge, create a dedicated conda environment for MLX-LLM:

conda create -n mlx-llm python=3.10
conda activate mlx-llm

Installing Dependencies

With the mlx-llm environment activated, install the required dependencies:

pip install -r requirements.txt

Testing the API with curl

You can test the API using the curl command. Here's an example:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
  "model": "gpt-3.5-turbo",
  "stop":["<|im_end|>"],
  "messages": [
    {
      "role": "user",
      "content": "Write a limerick about python exceptions"
    }
  ]
}'
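
To try streaming, set stream to true in the request body; the server then sends partial responses as they are generated rather than a single final reply:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
  "stream": true,
  "max_tokens": 200,
  "messages": [
    {
      "role": "user",
      "content": "Write a haiku about type errors"
    }
  ]
}'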

Download files

Download the file for your platform.

Source Distribution

mlx-llm-server-0.1.10.tar.gz (6.8 kB, Source)

Built Distribution

mlx_llm_server-0.1.10-py3-none-any.whl (7.1 kB, Python 3)

File details

Details for the file mlx-llm-server-0.1.10.tar.gz.

File metadata

  • Download URL: mlx-llm-server-0.1.10.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for mlx-llm-server-0.1.10.tar.gz
  • SHA256: 0833c43e7b624d54c3b0a15d924c460ceaea1c3e03cad802a8a8c9447171eb97
  • MD5: 6f15a13fb28de09595c60eaedd4fd9d7
  • BLAKE2b-256: da4a94d3dd245ba746122eb8974008c38078ac25acda464ff4ad7ad90b77e742


File details

Details for the file mlx_llm_server-0.1.10-py3-none-any.whl.

File hashes

Hashes for mlx_llm_server-0.1.10-py3-none-any.whl
  • SHA256: 5c23c19259e1e2c3d13046691ca47b378fb155e7f1d0f6d04388199aed30f789
  • MD5: df6c7195469df49feb810e77e0d73640
  • BLAKE2b-256: 47d8ff2afff2d57e66a9c6d46ad961ba35a279a42bbb22d08f754039132938fb

