A server that serves MLX models through an OpenAI-compatible API.
mlx-llm-server
This guide will help you set up the MLX-LLM server to serve an MLX model as an OpenAI-compatible API.
Quick Start
Installation
Before starting the MLX-LLM server, install the server package from PyPI:
pip install mlx-llm-server
Start the Server
mlx-llm-server --model <path-to-your-model>
Arguments
- --model: The path to the MLX model weights, tokenizer, and config. This argument is required.
- --adapter-file: (Optional) The path to the trained adapter weights.
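For example, to serve a model with trained adapter weights applied (both placeholders stand in for real paths):
mlx-llm-server --model <path-to-your-model> --adapter-file <path-to-adapter-weights>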
Host and Port Configuration
The server will start on the host and port specified by the environment variables HOST and PORT. If these are not set, it defaults to 127.0.0.1:8080.
To start the server on a different host or port, set the HOST and PORT environment variables before starting the server. For example:
export HOST=0.0.0.0
export PORT=5000
mlx-llm-server --model <path-to-your-model>
The MLX-LLM server can serve both Hugging Face format models and quantized MLX models. You can find these models at the MLX Community on Hugging Face.
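For example, one way to fetch a quantized model locally is with the huggingface-cli tool from the huggingface_hub package (pip install huggingface_hub). The repository name below is illustrative; any model from the MLX Community should work the same way:
huggingface-cli download mlx-community/Mistral-7B-Instruct-v0.2-4bit --local-dir ./mistral-7b-mlx
mlx-llm-server --model ./mistral-7b-mlx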
API Spec
API Endpoint: /v1/chat/completions
Method: POST
Request Headers
Content-Type: Must be application/json.
Request Body (JSON)
- messages: An array of message objects representing the conversation history. Each message object should have a role (e.g., user, assistant) and content (the message text).
- role_mapping: (Optional) A dictionary to customize the role prefixes in the generated prompt. If not provided, default mappings are used.
- stop: (Optional) An array of strings or a single string representing stopping conditions for the generation. These are sequences of tokens where the generation should stop.
- max_tokens: (Optional) An integer specifying the maximum number of tokens to generate. Defaults to 100.
- stream: (Optional) A boolean indicating if the response should be streamed. If true, responses are sent as they are generated. Defaults to false.
- model: (Optional) A string specifying the model to use for generation. This is not utilized in the provided code but could be used for selecting among multiple models.
- temperature: (Optional) A float specifying the sampling temperature. Defaults to 1.0.
- top_p: (Optional) A float specifying the nucleus sampling parameter. Defaults to 1.0.
- repetition_penalty: (Optional) A penalty applied to repeated tokens.
- repetition_context_size: (Optional) The size of the context window for applying the repetition penalty.
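As an example, the request below exercises several of the optional parameters; the values and the stop token are illustrative, not recommendations:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain nucleus sampling in one sentence"}],
    "max_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "stop": ["<|im_end|>"]
  }'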
Development Setup Guide
Miniforge Installation
For Apple Silicon users, install conda natively with Miniforge:
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
Conda Environment Setup
After the Miniforge installation, create a dedicated conda environment for MLX-LLM:
conda create -n mlx-llm python=3.10
conda activate mlx-llm
Installing Dependencies
With the mlx-llm environment activated, install the necessary dependencies using the following command:
pip install -r requirements.txt
Testing the API with curl
You can test the API using the curl command. Here's an example:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-3.5-turbo",
"stop":["<|im_end|>"],
"messages": [
{
"role": "user",
"content": "Write a limerick about python exceptions"
}
]
}'
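To try streaming, set stream to true; responses then arrive incrementally as they are generated. The -N flag below simply disables curl's output buffering so the chunks appear as they stream:
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [{"role": "user", "content": "Count to five"}]
  }'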
File details
Details for the file mlx-llm-server-0.1.10.tar.gz.
File metadata
- Download URL: mlx-llm-server-0.1.10.tar.gz
- Upload date:
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0833c43e7b624d54c3b0a15d924c460ceaea1c3e03cad802a8a8c9447171eb97
MD5 | 6f15a13fb28de09595c60eaedd4fd9d7
BLAKE2b-256 | da4a94d3dd245ba746122eb8974008c38078ac25acda464ff4ad7ad90b77e742
File details
Details for the file mlx_llm_server-0.1.10-py3-none-any.whl.
File metadata
- Download URL: mlx_llm_server-0.1.10-py3-none-any.whl
- Upload date:
- Size: 7.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5c23c19259e1e2c3d13046691ca47b378fb155e7f1d0f6d04388199aed30f789
MD5 | df6c7195469df49feb810e77e0d73640
BLAKE2b-256 | 47d8ff2afff2d57e66a9c6d46ad961ba35a279a42bbb22d08f754039132938fb