Standalone MLX-based LLM inference service with OpenAI compatible API
plllm-mlx
A standalone MLX-based LLM inference service with OpenAI compatible API, designed specifically for Apple Silicon.
Overview
plllm-mlx is a production-ready inference service that provides:
- Native Apple Silicon optimization via MLX framework
- OpenAI-compatible API for seamless integration
- Zero external dependencies (no database, Redis, or external services)
- Process isolation for stable multi-model serving
- Efficient KV cache with prefix caching
- Real-time streaming support
Key Features
🍎 Apple Silicon Native
- Optimized for M-series chips (M1/M2/M3/M4)
- Leverages MLX framework for efficient inference
- Hardware-accelerated operations
🔌 OpenAI Compatible API
- Drop-in replacement for OpenAI API
- Chat completions with streaming support
- Model listing and health check endpoints
📦 Zero External Dependencies
- No database required
- No Redis or message queues
- Standalone operation
- Simple deployment
🔄 Process Isolation
- Each model runs in separate subprocess
- Isolated memory management
- Fault tolerance and stability
- Clean shutdown handling
💾 Intelligent KV Cache
- Prefix-based caching for efficiency
- Message-level cache matching
- Incremental prefill optimization
- Memory-aware eviction
🎯 Extensible Architecture
- Pluggable model loaders (MLX-LM, MLX-VLM)
- Customizable step processors
- Easy to extend for new models
Installation
Using uv tool (Recommended)
# Install as a standalone tool
uv tool install plllm-mlx
# Start the service
plllm-mlx serve
# Or with options
plllm-mlx serve --port 8000 --config ~/.plllm-mlx/config.yaml
From Source
# Clone the repository
git clone https://github.com/littlepush/plllm-mlx.git
cd plllm-mlx
# Install dependencies
uv sync
# Run the service
uv run plllm-mlx serve
VLM Support (Vision Language Models)
For VLM models (e.g., Qwen2.5-VL, Qwen3.5-VL), install with VLM dependencies:
# Using uv tool
uv tool install 'plllm-mlx[vlm]'
# Using pip
pip install 'plllm-mlx[vlm]'
# If already installed, add VLM support
uv tool install 'plllm-mlx[vlm]' --force
# or
pip install torch torchvision
Quick Start
1. Start Service
# Start service (registers as LaunchAgent)
plllm-mlx serve
# Check status
plllm-mlx status
# Stop service
plllm-mlx stop
2. Manage Models
# List loaded models
plllm-mlx ps
# List all local models
plllm-mlx ls
# Search HuggingFace
plllm-mlx search qwen
# Download a model
plllm-mlx download mlx-community/Qwen2.5-7B-8bit
# Load/unload models
plllm-mlx load Qwen2.5-7B-8bit
plllm-mlx unload Qwen2.5-7B-8bit
# Configure model
plllm-mlx config Qwen2.5-7B-8bit temperature=0.8 max_tokens=2048
3. Use API
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-8bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
4. Start with a Config File
plllm-mlx serve --config config.yaml
5. Make API Requests
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
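The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `stream_chat` are hypothetical, and the default model name and port are simply the values used in the curl example above.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen2.5-7B-8bit",
                       base_url="http://localhost:8000", stream=True):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_chat(prompt, **kwargs):
    """Send the request and yield raw response lines (needs a running service)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        for raw_line in response:
            yield raw_line.decode("utf-8").rstrip()
```

With the service running, `for line in stream_chat("Hello!"): print(line)` would print the streamed lines as they arrive.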
Architecture
plllm_mlx/
├── cli.py                      # Command-line interface
├── config.py                   # Configuration management
├── logging_config.py           # Logging setup
├── exceptions.py               # Custom exceptions
│
├── models/                     # Model loading and inference
│   ├── model_loader.py         # Base loader class
│   ├── mlx_loader.py           # MLX-LM implementation
│   ├── mlxvlm_loader.py        # MLX-VLM implementation
│   ├── model_subprocess.py     # Subprocess execution
│   ├── process_manager.py      # Process lifecycle management
│   ├── kv_cache.py             # KV cache implementation
│   ├── local_models.py         # Model manager
│   ├── model_detector.py       # Model type detection
│   │
│   ├── step_processor.py       # Base processor
│   ├── base_step_processor.py
│   ├── default_step_processor.py
│   ├── openai_step_processor.py
│   └── qwen3_thinking_step_processor.py
│
├── helpers/                    # Utility modules
│   ├── chain_cache.py          # Chain-based cache
│   ├── chat_helper.py          # Chat completion builder
│   ├── chunk_helper.py         # Chunk data structures
│   ├── step_info.py            # Step metadata
│   ├── toolcall_helper.py      # Tool call parsing
│   └── clz_helper.py           # Class registry
│
└── routers/                    # FastAPI endpoints
    ├── chat.py                 # Chat completions
    ├── models.py               # Model listing
    ├── loader.py               # Loader management
    ├── stepprocessor.py        # Processor management
    └── model_manager.py        # Model operations
Core Concepts
Model Loaders
Model loaders handle model loading, inference, and streaming:
- MLX-LM Loader: For standard language models
- MLX-VLM Loader: For vision-language models
Each loader implements:
class PlModelLoader:
    async def ensure_model_loaded(self): ...
    async def stream_generate(self, session_object): ...
    async def prepare_prompt(self, body): ...
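To make the call order concrete, here is a toy loader with the same three method names. It is only a sketch: `EchoLoader`, `run_once`, and the shape of `session_object` (a dict with a `prompt` key) are assumptions for illustration, and no MLX model is involved.

```python
import asyncio


class EchoLoader:
    """Toy loader exposing the three methods listed above (hypothetical)."""

    def __init__(self):
        self.loaded = False

    async def ensure_model_loaded(self):
        # A real loader would read model weights here; the sketch flips a flag.
        self.loaded = True

    async def prepare_prompt(self, body):
        # Flatten the chat messages into a single prompt string.
        return "\n".join(f"{m['role']}: {m['content']}" for m in body["messages"])

    async def stream_generate(self, session_object):
        # Yield one whitespace-separated "token" at a time.
        for token in session_object["prompt"].split():
            await asyncio.sleep(0)  # hand control back to the event loop
            yield token


async def run_once(loader, body):
    """Load, prepare the prompt, then collect the streamed tokens."""
    await loader.ensure_model_loaded()
    prompt = await loader.prepare_prompt(body)
    return [token async for token in loader.stream_generate({"prompt": prompt})]


body = {"messages": [{"role": "user", "content": "Hello there"}]}
tokens = asyncio.run(run_once(EchoLoader(), body))
print(tokens)  # ['user:', 'Hello', 'there']
```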
Step Processors
Step processors transform raw generation results:
- Base: Basic text generation
- Default: Standard processing
- OpenAI: OpenAI-compatible formatting
- Qwen3Thinking: Qwen3 thinking mode support
Processors handle:
- Token accumulation
- Tool call parsing
- Thinking/reasoning content
- Finish reason detection
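The responsibilities listed above can be sketched with a small token-by-token processor. This is not the project's `PlStepProcessor`; `ThinkingSplitter` is a hypothetical class, and it assumes the model wraps reasoning in `<think>` and `</think>` markers that arrive as whole tokens.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThinkingSplitter:
    """Toy processor: routes tokens between <think> and </think> to `reasoning`."""

    reasoning: list = field(default_factory=list)
    content: list = field(default_factory=list)
    finish_reason: Optional[str] = None
    _in_think: bool = False

    def step(self, token: str, is_last: bool = False) -> None:
        if token == "<think>":
            self._in_think = True
        elif token == "</think>":
            self._in_think = False
        elif self._in_think:
            self.reasoning.append(token)  # accumulate reasoning tokens
        else:
            self.content.append(token)    # accumulate answer tokens
        if is_last:
            self.finish_reason = "stop"   # assumed value for a normal finish


processor = ThinkingSplitter()
stream = ["<think>", "2+2", "=4", "</think>", "The answer ", "is 4."]
for index, token in enumerate(stream):
    processor.step(token, is_last=(index == len(stream) - 1))

print("".join(processor.reasoning))  # 2+2=4
print("".join(processor.content))    # The answer is 4.
```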
KV Cache
Prefix-based KV cache for efficient inference:
How it works:
- Split prompt into message segments
- Calculate MD5 hash for each message
- Match against cached message chains
- Skip prefill for matched prefix
- Only process incremental messages
Benefits:
- Faster multi-turn conversations
- Reduced memory usage
- Lower latency for repeated prompts
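The matching steps above can be sketched as a hash-chain comparison. The function names are hypothetical, and hashing the JSON-serialized message is an assumption; the project may hash a different representation of each message.

```python
import hashlib
import json


def message_hash(message):
    """MD5 digest of one message (serialization format is an assumption)."""
    payload = json.dumps(message, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.md5(payload, usedforsecurity=False).hexdigest()


def hash_chain(messages):
    """One hash per message, in conversation order."""
    return [message_hash(m) for m in messages]


def matched_prefix_len(cached_chain, new_chain):
    """Number of leading messages whose hashes agree."""
    count = 0
    for cached, new in zip(cached_chain, new_chain):
        if cached != new:
            break
        count += 1
    return count


def split_by_cache(cached_chains, messages):
    """Return (longest matched prefix length, messages that still need prefill)."""
    new_chain = hash_chain(messages)
    best = max((matched_prefix_len(c, new_chain) for c in cached_chains), default=0)
    return best, messages[best:]


turn_1 = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi!"},
]
turn_2 = turn_1 + [{"role": "user", "content": "How are you?"}]

cached = [hash_chain(turn_1)]
matched, remaining = split_by_cache(cached, turn_2)
print(matched, len(remaining))  # 3 1
```

In this sketch only the final user message of the second turn would need prefill, since the first three messages match a cached chain.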
Process Isolation
Each model runs in a separate subprocess:
Main Process
├── API Server
├── Process Manager
│   ├── Model A Subprocess
│   │   ├── Model Loader
│   │   └── KV Cache
│   └── Model B Subprocess
│       ├── Model Loader
│       └── KV Cache
Advantages:
- Memory isolation
- Fault tolerance
- Clean resource cleanup
- Parallel model serving
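The layout above can be illustrated with a small standard-library sketch. It is not the project's process manager: `ProcessManagerSketch` and the echo worker are hypothetical stand-ins, and the JSON-lines protocol over stdin/stdout is an assumption chosen to keep the example short.

```python
import json
import subprocess
import sys

# Hypothetical worker body: stands in for a model subprocess and echoes prompts.
WORKER_SOURCE = """
import json, sys
name = sys.argv[1]
for line in sys.stdin:
    request = json.loads(line)
    print(json.dumps({"model": name, "echo": request["prompt"]}), flush=True)
"""


class ProcessManagerSketch:
    """Keeps one child process per model name."""

    def __init__(self):
        self._workers = {}

    def load(self, name):
        self._workers[name] = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE, name],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def ask(self, name, prompt):
        worker = self._workers[name]
        worker.stdin.write(json.dumps({"prompt": prompt}) + "\n")
        worker.stdin.flush()
        return json.loads(worker.stdout.readline())

    def unload(self, name):
        worker = self._workers.pop(name)
        worker.stdin.close()     # EOF ends the worker's read loop
        worker.wait(timeout=10)  # memory is released when the process exits
        worker.stdout.close()


manager = ProcessManagerSketch()
manager.load("model-a")
manager.load("model-b")
print(manager.ask("model-a", "hi"))
manager.unload("model-a")
print(manager.ask("model-b", "still here"))  # model-b is unaffected
manager.unload("model-b")
```

Because each worker is its own operating-system process, unloading or losing one does not disturb the others.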
API Reference
Chat Completions
POST /v1/chat/completions
Request:
{
  "model": "model-name",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 4096,
  "stream": true
}
Streaming response:
data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Hi"}}]}
data: [DONE]
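A client reads this stream line by line. The sketch below extracts the text fragments from `data:` lines shaped like the sample above and stops at `[DONE]`; the function name `iter_content_deltas` is hypothetical.

```python
import json


def iter_content_deltas(lines):
    """Yield the text fragments carried by 'data:' lines of a streamed reply."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Hi"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # Hi
```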
Models
GET /v1/models
Returns available models.
Health Check
GET /health
Returns service status.
Model Management
GET /api/v1/model/list
POST /api/v1/model/load
POST /api/v1/model/unload
Configuration
Server Options
| Option | Type | Default | Description |
|---|---|---|---|
| server.host | string | "0.0.0.0" | Bind address |
| server.port | int | 8000 | Server port |
| server.log_level | string | "info" | Log level |
Model Options
| Option | Type | Default | Description |
|---|---|---|---|
| name | string | required | Model identifier |
| model_id | string | required | HuggingFace model ID |
| loader | string | "mlx" | Loader type (mlx/mlxvlm) |
| max_tokens | int | 4096 | Maximum output tokens |
| temperature | float | 0.7 | Sampling temperature |
| enable_prefix_cache | bool | true | Enable KV cache |
Cache Options
| Option | Type | Default | Description |
|---|---|---|---|
| enable_prefix_cache | bool | true | Enable prefix cache |
| max_memory_ratio | float | 0.9 | Memory threshold |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| PLLLM_MLX_CONFIG | config.yaml | Config file path |
| PLLLM_MLX_HOST | 0.0.0.0 | Server host |
| PLLLM_MLX_PORT | 8000 | Server port |
| PLLLM_MLX_LOG_LEVEL | info | Log level |
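As an illustration of how these variables and their defaults fit together, the sketch below resolves them into a settings dictionary. `read_settings` is a hypothetical helper, not part of plllm-mlx, and it does not model how the service combines environment variables with the config file.

```python
import os

# Defaults copied from the Environment Variables table.
DEFAULTS = {
    "PLLLM_MLX_CONFIG": "config.yaml",
    "PLLLM_MLX_HOST": "0.0.0.0",
    "PLLLM_MLX_PORT": "8000",
    "PLLLM_MLX_LOG_LEVEL": "info",
}


def read_settings(env=None):
    """Resolve the documented variables, falling back to the table defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "config": raw["PLLLM_MLX_CONFIG"],
        "host": raw["PLLLM_MLX_HOST"],
        "port": int(raw["PLLLM_MLX_PORT"]),  # the port is documented as an int
        "log_level": raw["PLLLM_MLX_LOG_LEVEL"],
    }


settings = read_settings({"PLLLM_MLX_PORT": "9000"})
print(settings["host"], settings["port"])  # 0.0.0.0 9000
```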
Development
Setup
git clone https://github.com/littlepush/plllm-mlx.git
cd plllm-mlx
uv sync --extra dev
Run Tests
uv run pytest
Code Quality
uv run ruff format .
uv run ruff check .
Extending plllm-mlx
Add a New Model Loader
- Create models/my_loader.py
- Inherit from PlModelLoader
- Implement required methods:
class MyLoader(PlModelLoader):
    @staticmethod
    def model_loader_name() -> str:
        return "my_loader"

    async def ensure_model_loaded(self):
        ...  # Load model

    async def stream_generate(self, session_object):
        ...  # Generate tokens
Add a New Step Processor
- Create models/my_step_processor.py
- Inherit from PlStepProcessor
- Implement processing logic:
from typing import Optional

class MyProcessor(PlStepProcessor):
    @staticmethod
    def step_clz_name() -> str:
        return "my_processor"

    def step(self, generate_response) -> Optional[PlChunk]:
        ...  # Process token
Performance Tips
Memory Management
- Use process isolation for multiple models
- Monitor memory with PLLLM_MEMORY_THRESHOLD
- Adjust prefill_step_size for large prompts
KV Cache Optimization
- Enable prefix cache for multi-turn conversations
- Monitor cache hit rates in logs
- Adjust kv_bits for memory/speed tradeoff
Streaming Performance
- Use streaming ("stream": true) for better UX
- Monitor first token latency
- Check KV cache effectiveness
Troubleshooting
Model Loading Issues
# Check model availability
ls ~/.cache/huggingface/hub/
# Verify model format
python -c "from transformers import AutoModel; AutoModel.from_pretrained('model-id')"
Memory Issues
# Monitor memory
top -l 1 | grep PhysMem
# Reduce memory usage
# - Use quantized models (4bit/8bit)
# - Reduce max_tokens
# - Disable unused models
Streaming Issues
- Check if streaming is enabled in request
- Verify SSE support in client
- Check logs for generation errors
Requirements
- Python 3.12+
- macOS with Apple Silicon (M1/M2/M3/M4)
- MLX framework (pip install mlx mlx-lm)
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
Acknowledgments
- MLX - Apple's ML framework
- mlx-lm - MLX language models
- mlx-vlm - MLX vision-language models
- FastAPI - Modern async web framework
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for Apple Silicon