GPU-Accelerated LLM Terminal for Apple Silicon

These details have not been verified by PyPI

Project links

Project description

Cortex - LLM Terminal Client for Apple Silicon

Cortex is an LLM terminal interface designed for Apple Silicon, using MLX and PyTorch MPS frameworks for GPU-accelerated inference.

What It Does

GPU-accelerated inference via MLX (primary) and PyTorch MPS backends
Apple Silicon required - leverages unified memory architecture
Multiple model formats - MLX, GGUF, SafeTensors, PyTorch, GPTQ, AWQ
Built-in fine-tuning - LoRA-based model customization via interactive wizard
Chat template auto-detection - automatic format detection with confidence scoring
Conversation persistence - SQLite-backed chat history with branching

Features

GPU-Accelerated Inference - Delegates to MLX and PyTorch MPS for Metal-based execution
Apple Silicon Only - Requires Metal GPU; exits if GPU acceleration is unavailable
Model Format Support:
- MLX (Apple's format, loaded via mlx_lm)
- GGUF (via llama-cpp-python with Metal backend)
- SafeTensors (via HuggingFace transformers)
- PyTorch models (via HuggingFace transformers with MPS device)
- GPTQ quantized (via auto-gptq)
- AWQ quantized (via awq)
Quantization - 4-bit, 5-bit, 8-bit, and mixed-precision quantization via MLX conversion pipeline
Model Conversion - Convert HuggingFace models to MLX format with configurable quantization recipes
Template Registry - Automatic detection of chat templates (ChatML, Llama, Alpaca, Gemma, Reasoning) with confidence scoring and real-time token filtering for reasoning models
Rotating KV Cache - MLX-based KV cache for long context handling (default 4096 tokens)
Fine-Tuning - LoRA-based model customization with interactive 6-step wizard
Terminal UI - ANSI terminal interface with streaming output

Installation

# Clone and install
git clone https://github.com/faisalmumtaz/Cortex.git
cd Cortex
./install.sh

The installer:

Checks for Apple Silicon (arm64) compatibility
Creates a Python virtual environment
Installs dependencies via pip install -e . (from pyproject.toml)
Sets up the cortex command in your PATH

Quick Install (pipx)

If you just want the CLI without cloning the repo, use pipx:

pipx install cortex-llm

Quick Start

# After installation, just run:
cortex

Downloading Models

# Inside Cortex, use the download command:
cortex
# Then type: /download

The download feature:

HuggingFace integration - download any model by repository ID
Automatic loading - option to load model immediately after download

Documentation

User Documentation

Installation Guide - Complete setup instructions
CLI Reference - Commands and user interface
Configuration - System settings and optimization
Model Management - Loading and managing models
Template Registry - Automatic chat template detection and management
Fine-Tuning Guide - Customize models with LoRA
Troubleshooting - Common issues and solutions

Technical Documentation

MLX Acceleration - MLX framework integration and optimization
GPU Validation - Hardware requirements and detection
Inference Engine - Text generation architecture
Conversation Management - Chat history and persistence
Development Guide - Contributing and architecture

System Requirements

Apple Silicon Mac (M1/M2/M3/M4 - all variants supported)
macOS 13.3+ (required by MLX framework)
Python 3.11+
16GB+ unified memory (24GB+ recommended for larger models)
Xcode Command Line Tools

Performance

Performance depends on your Apple Silicon chip, model size, and quantization level. The inference engine measures tokens/second, first-token latency, and memory usage at runtime.

To check that GPU acceleration is working:

source venv/bin/activate
python tests/test_apple_silicon.py

You should see:

All validation checks passing
Measured GFLOPS from matrix operations
Confirmation of Metal and MLX availability

GPU Acceleration Architecture

Cortex uses a multi-layer approach, delegating all GPU computation to established frameworks:

MLX Framework (Primary Backend)
- Apple's ML framework with native Metal support
- Quantization support (4-bit, 5-bit, 8-bit, mixed-precision)
- Rotating KV cache for long contexts
- JIT compilation via mx.compile
- Operation fusion for reduced kernel launches
PyTorch MPS Backend
- Metal Performance Shaders for PyTorch models
- FP16 optimization and channels-last tensor format
llama.cpp (GGUF Backend)
- Metal-accelerated inference for GGUF models
Memory Management
- Pre-allocated memory pools with best-fit/first-fit allocation strategies
- Automatic pool sizing (60% of available memory, capped at 75% of total)
- Defragmentation support

Understanding "Skipping Kernel" Messages

When loading GGUF models, you may see messages like:

ggml_metal_init: skipping kernel_xxx_bf16 (not supported)

These are NORMAL! They indicate:

BF16 kernels being skipped (your GPU uses FP16 instead)
GPU acceleration is still fully active
The system automatically uses optimal alternatives

Troubleshooting

If you suspect GPU isn't being used:

Run validation: python tests/test_apple_silicon.py
Check output: Should see passing checks and measured GFLOPS
Monitor tokens/sec: Displayed during inference
Verify Metal: Ensure Xcode Command Line Tools installed

Common issues:

Low performance: Run python tests/test_apple_silicon.py to diagnose
Memory errors: Reduce gpu_memory_fraction in config.yaml

MLX Model Conversion

Cortex includes an MLX model converter:

from cortex.metal.mlx_converter import MLXConverter, ConversionConfig, QuantizationRecipe

converter = MLXConverter()
config = ConversionConfig(
    quantization=QuantizationRecipe.SPEED_4BIT,  # 4-bit quantization
    compile_model=True  # JIT compilation
)

success, message, output_path = converter.convert_model(
    "microsoft/DialoGPT-medium",
    config=config
)

Quantization Options

4-bit: Maximum speed, 75% size reduction
5-bit: Balanced speed and quality
8-bit: Higher quality, 50% size reduction
Mixed Precision: Custom per-layer quantization

MLX as Primary Backend

Cortex uses MLX (Apple's machine learning framework) as the primary acceleration backend:

Metal Support: GPU execution via MLX's built-in Metal operations
Quantization: Support for 4-bit, 5-bit, 8-bit, and mixed-precision quantization
Model Conversion: Convert HuggingFace models to MLX format

Built With

MLX - Apple's machine learning framework
mlx-lm - LLM utilities and LoRA fine-tuning for MLX
PyTorch - With Metal Performance Shaders backend
llama.cpp - Metal-accelerated GGUF support
Rich - Terminal formatting
HuggingFace - Model hub and transformers

Contributing

We welcome contributions! Please see the Development Guide for contributing guidelines and setup instructions.

License

MIT License - See LICENSE for details.

Note: Cortex requires Apple Silicon. Intel Macs are not supported.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.0.20

Feb 4, 2026

1.0.19

Feb 4, 2026

1.0.18

Feb 4, 2026

1.0.17

Feb 4, 2026

1.0.16

Feb 4, 2026

1.0.15

Feb 4, 2026

1.0.14

Feb 4, 2026

1.0.13

Feb 4, 2026

1.0.12

Feb 4, 2026

1.0.11

Feb 3, 2026

1.0.10

Feb 2, 2026

1.0.9

Feb 2, 2026

1.0.8

Feb 2, 2026

1.0.7

Feb 2, 2026

1.0.6

Feb 1, 2026

1.0.5

Feb 1, 2026

1.0.4

Feb 1, 2026

This version

1.0.3

Feb 1, 2026

1.0.2

Feb 1, 2026

1.0.1

Feb 1, 2026

1.0.0

Feb 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cortex_llm-1.0.3.tar.gz (153.8 kB view details)

Uploaded Feb 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

cortex_llm-1.0.3-py3-none-any.whl (165.0 kB view details)

Uploaded Feb 1, 2026 Python 3

File details

Details for the file cortex_llm-1.0.3.tar.gz.

File metadata

Download URL: cortex_llm-1.0.3.tar.gz
Upload date: Feb 1, 2026
Size: 153.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cortex_llm-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`60063f51d23d3db72a4783b2decf042177742d7020eae786d04897783def7b9c`
MD5	`1e0dce12fefb7ced3b01c4485745a880`
BLAKE2b-256	`eff04cd39defcd465630b16ab34cf61e80703f11e851a3055a5e84b4dce63686`

See more details on using hashes here.

File details

Details for the file cortex_llm-1.0.3-py3-none-any.whl.

File metadata

Download URL: cortex_llm-1.0.3-py3-none-any.whl
Upload date: Feb 1, 2026
Size: 165.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for cortex_llm-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b04cfd0cf9d10a7947e9d0d75255c5a1eeec9085e1fdf391c2f3e928864794f1`
MD5	`51a76d15e54503cf560ef7b82adf55ea`
BLAKE2b-256	`9de6c4df8885edda20296be172f7c2d84ca87f7d1485675bf2ccffb6c8cd3092`

See more details on using hashes here.

cortex-llm 1.0.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Cortex - LLM Terminal Client for Apple Silicon

What It Does

Features

Installation

Quick Install (pipx)

Quick Start

Downloading Models

Documentation

User Documentation

Technical Documentation

System Requirements

Performance

GPU Acceleration Architecture

Understanding "Skipping Kernel" Messages

Troubleshooting

MLX Model Conversion

Quantization Options

MLX as Primary Backend

Built With

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes