GPU-Accelerated LLM Terminal for Apple Silicon
Project description
Cortex - LLM Terminal Client for Apple Silicon
Cortex is an LLM terminal interface designed for Apple Silicon, using MLX and PyTorch MPS frameworks for GPU-accelerated inference.
What It Does
- GPU-accelerated inference via MLX (primary) and PyTorch MPS backends
- Apple Silicon required - leverages unified memory architecture
- Multiple model formats - MLX, GGUF, SafeTensors, PyTorch, GPTQ, AWQ
- Built-in fine-tuning - LoRA-based model customization via interactive wizard
- Chat template auto-detection - automatic format detection with confidence scoring
- Conversation persistence - SQLite-backed chat history with branching
Features
- GPU-Accelerated Inference - Delegates to MLX and PyTorch MPS for Metal-based execution
- Apple Silicon Only - Requires Metal GPU; exits if GPU acceleration is unavailable
- Model Format Support:
- MLX (Apple's format, loaded via
mlx_lm) - GGUF (via
llama-cpp-pythonwith Metal backend) - SafeTensors (via HuggingFace
transformers) - PyTorch models (via HuggingFace
transformerswith MPS device) - GPTQ quantized (via
auto-gptq) - AWQ quantized (via
awq)
- MLX (Apple's format, loaded via
- Quantization - 4-bit, 5-bit, 8-bit, and mixed-precision quantization via MLX conversion pipeline
- Model Conversion - Convert HuggingFace models to MLX format with configurable quantization recipes
- Template Registry - Automatic detection of chat templates (ChatML, Llama, Alpaca, Gemma, Reasoning) with confidence scoring and real-time token filtering for reasoning models
- Rotating KV Cache - MLX-based KV cache for long context handling (default 4096 tokens)
- Fine-Tuning - LoRA-based model customization with interactive 6-step wizard
- Terminal UI - ANSI terminal interface with streaming output
Installation
# Clone and install
git clone https://github.com/faisalmumtaz/Cortex.git
cd Cortex
./install.sh
The installer:
- Checks for Apple Silicon (arm64) compatibility
- Creates a Python virtual environment
- Installs dependencies via
pip install -e .(frompyproject.toml) - Sets up the
cortexcommand in your PATH
Quick Install (pipx)
If you just want the CLI without cloning the repo, use pipx:
pipx install cortex-llm
Quick Start
# After installation, just run:
cortex
Downloading Models
# Inside Cortex, use the download command:
cortex
# Then type: /download
The download feature:
- HuggingFace integration - download any model by repository ID
- Automatic loading - option to load model immediately after download
Documentation
User Documentation
- Installation Guide - Complete setup instructions
- CLI Reference - Commands and user interface
- Configuration - System settings and optimization
- Model Management - Loading and managing models
- Template Registry - Automatic chat template detection and management
- Fine-Tuning Guide - Customize models with LoRA
- Troubleshooting - Common issues and solutions
Technical Documentation
- MLX Acceleration - MLX framework integration and optimization
- GPU Validation - Hardware requirements and detection
- Inference Engine - Text generation architecture
- Conversation Management - Chat history and persistence
- Development Guide - Contributing and architecture
System Requirements
- Apple Silicon Mac (M1/M2/M3/M4 - all variants supported)
- macOS 13.3+ (required by MLX framework)
- Python 3.11+
- 16GB+ unified memory (24GB+ recommended for larger models)
- Xcode Command Line Tools
Performance
Performance depends on your Apple Silicon chip, model size, and quantization level. The inference engine measures tokens/second, first-token latency, and memory usage at runtime.
To check that GPU acceleration is working:
source venv/bin/activate
python tests/test_apple_silicon.py
You should see:
- All validation checks passing
- Measured GFLOPS from matrix operations
- Confirmation of Metal and MLX availability
GPU Acceleration Architecture
Cortex uses a multi-layer approach, delegating all GPU computation to established frameworks:
-
MLX Framework (Primary Backend)
- Apple's ML framework with native Metal support
- Quantization support (4-bit, 5-bit, 8-bit, mixed-precision)
- Rotating KV cache for long contexts
- JIT compilation via
mx.compile - Operation fusion for reduced kernel launches
-
PyTorch MPS Backend
- Metal Performance Shaders for PyTorch models
- FP16 optimization and channels-last tensor format
-
llama.cpp (GGUF Backend)
- Metal-accelerated inference for GGUF models
-
Memory Management
- Pre-allocated memory pools with best-fit/first-fit allocation strategies
- Automatic pool sizing (60% of available memory, capped at 75% of total)
- Defragmentation support
Understanding "Skipping Kernel" Messages
When loading GGUF models, you may see messages like:
ggml_metal_init: skipping kernel_xxx_bf16 (not supported)
These are NORMAL! They indicate:
- BF16 kernels being skipped (your GPU uses FP16 instead)
- GPU acceleration is still fully active
- The system automatically uses optimal alternatives
Troubleshooting
If you suspect GPU isn't being used:
- Run validation:
python tests/test_apple_silicon.py - Check output: Should see passing checks and measured GFLOPS
- Monitor tokens/sec: Displayed during inference
- Verify Metal: Ensure Xcode Command Line Tools installed
Common issues:
- Low performance: Run
python tests/test_apple_silicon.pyto diagnose - Memory errors: Reduce
gpu_memory_fractionin config.yaml
MLX Model Conversion
Cortex includes an MLX model converter:
from cortex.metal.mlx_converter import MLXConverter, ConversionConfig, QuantizationRecipe
converter = MLXConverter()
config = ConversionConfig(
quantization=QuantizationRecipe.SPEED_4BIT, # 4-bit quantization
compile_model=True # JIT compilation
)
success, message, output_path = converter.convert_model(
"microsoft/DialoGPT-medium",
config=config
)
Quantization Options
- 4-bit: Maximum speed, 75% size reduction
- 5-bit: Balanced speed and quality
- 8-bit: Higher quality, 50% size reduction
- Mixed Precision: Custom per-layer quantization
MLX as Primary Backend
Cortex uses MLX (Apple's machine learning framework) as the primary acceleration backend:
- Metal Support: GPU execution via MLX's built-in Metal operations
- Quantization: Support for 4-bit, 5-bit, 8-bit, and mixed-precision quantization
- Model Conversion: Convert HuggingFace models to MLX format
Built With
- MLX - Apple's machine learning framework
- mlx-lm - LLM utilities and LoRA fine-tuning for MLX
- PyTorch - With Metal Performance Shaders backend
- llama.cpp - Metal-accelerated GGUF support
- Rich - Terminal formatting
- HuggingFace - Model hub and transformers
Contributing
We welcome contributions! Please see the Development Guide for contributing guidelines and setup instructions.
License
MIT License - See LICENSE for details.
Note: Cortex requires Apple Silicon. Intel Macs are not supported.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cortex_llm-1.0.2.tar.gz.
File metadata
- Download URL: cortex_llm-1.0.2.tar.gz
- Upload date:
- Size: 152.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3c21982b1605b56889dcaad9e1686aad96ee0711099d9a59615a4884d1b9941b
|
|
| MD5 |
0e52c61a52f49ecfdb0a5cb4ff425ae4
|
|
| BLAKE2b-256 |
f0cd85456f270c242f9356d2dd4c0d33283864640586a2a165ac148244c1aeb7
|
File details
Details for the file cortex_llm-1.0.2-py3-none-any.whl.
File metadata
- Download URL: cortex_llm-1.0.2-py3-none-any.whl
- Upload date:
- Size: 163.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5b7f98e8f8eb6c48c81e31d806177c819e604c576d51bac9339923fdad2848f1
|
|
| MD5 |
1c8b3924b8ec35db7a39de52370997d9
|
|
| BLAKE2b-256 |
420aec4e93e09579b1d9587f02216c619516283814286f68e9fe13bf6124259a
|