Python wrapper for llama.cpp - OpenAI API compatible LLM backend with auto GPU detection

MoXing

A Python wrapper for llama.cpp that provides an OpenAI API compatible LLM backend with automatic GPU detection and model downloading.

Key Features:

  • 🚀 Faster than Ollama - Direct llama.cpp execution without an intermediate layer (~15 vs ~10 tokens/s in the benchmark under Performance)
  • 🔧 OpenAI Compatible - Drop-in replacement for OpenAI API
  • 🎮 Multi-GPU Support - CUDA, Vulkan, ROCm, Metal backends
  • 📦 Auto Download - Models from HuggingFace, ModelScope, or Ollama
  • 💾 GGUF Compression - Save disk space with transparent decompression

Installation

pip install moxing

Binaries are downloaded automatically on first use (~60 MB). No manual setup required.

Quick Start

# Serve an Ollama model
moxing ollama serve llama3.2

# Serve with specific device and backend
moxing ollama serve llama3.2 -d gpu0 -b vulkan

# Serve from HuggingFace
moxing serve Qwen/Qwen2.5-7B-Instruct-GGUF

# Serve a local GGUF file with specific device
moxing serve ./model.gguf --port 8080 -d gpu1 -b cuda

# List available devices
moxing devices

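Then call the server with the OpenAI Python client. This example assumes the server is listening on port 8080, as in the --port 8080 command above: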

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
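
If the openai package is not installed, the same call can be made over plain HTTP. The sketch below uses only the standard library and assumes the server exposes the usual OpenAI-style /v1/chat/completions route on port 8080; the helper names build_chat_request and send_chat are illustrative and not part of MoXing.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, model: str = "local") -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with a server already running:
#   req = build_chat_request("http://localhost:8080/v1", "Hello!")
#   print(send_chat(req))
```

Building and sending are kept separate so the payload can be inspected or logged before any network traffic happens.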

CLI Commands

Model Management

Command                        Description
moxing serve <model>           Start server with a model
moxing run <model> "prompt"    Quick inference
moxing chat <model>            Interactive chat
moxing download <repo>         Download model from HuggingFace/ModelScope
moxing models                  List available models

Ollama Integration

Command                        Description
moxing ollama list             List Ollama models
moxing ollama serve <model>    Serve Ollama model
moxing ollama info <model>     Show model details

GGUF Compression

Command                            Description
moxing compress pack <file>        Compress GGUF file
moxing compress unpack <file>      Decompress file
moxing compress split <file>       Split into chunks
moxing compress merge <pattern>    Merge chunks

System & Diagnostics

Command                 Description
moxing devices          List GPU devices
moxing diagnose         System diagnostics
moxing bench <model>    Benchmark performance
moxing --version        Show version info

GPU Backends

MoXing automatically detects and uses the best available backend:

Platform       CPU   CUDA   Vulkan   ROCm   Metal
Linux x64      ✓     ✓      ✓        ✓      -
Windows x64    ✓     ✓      ✓        -      -
macOS ARM64    ✓     -      -        -      ✓

Install with a specific backend via pip extras:

pip install moxing[cuda]   # NVIDIA GPU
pip install moxing[vulkan] # Cross-platform GPU
pip install moxing[metal]  # Apple Silicon
pip install moxing[rocm]   # AMD GPU
pip install moxing[cpu]    # CPU only

Device Selection

List available devices and their IDs:

moxing devices

Select a specific device and backend when serving:

# Use GPU 0 with Vulkan backend
moxing serve model.gguf -d gpu0 -b vulkan

# Use GPU 1 with CUDA backend
moxing serve model.gguf -d gpu1 -b cuda

# Use CPU only
moxing serve model.gguf -d cpu

# Run multiple instances on different devices (auto port)
moxing serve model1.gguf -d gpu0 --auto-port &
moxing serve model2.gguf -d gpu1 --auto-port &
moxing serve model3.gguf -d cpu --auto-port &

# Or specify ports manually
moxing serve model1.gguf -d gpu0 -p 8080 &
moxing serve model2.gguf -d gpu1 -p 8081 &
moxing serve model3.gguf -d cpu -p 8082 &

Device options:

  • -d gpu0, -d gpu1, ... - Select GPU by index
  • -d cpu - Use CPU only
  • -d auto - Auto-select best device (default)

Port options:

  • -p 8080 - Use specific port
  • -p 0 or --auto-port - Auto-find available port
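
How MoXing finds a free port is not described here. A common technique, sketched below under that caveat, is to bind a socket to port 0 and let the operating system pick an unused port; find_free_port is a hypothetical helper, not a MoXing function.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() returns (host, port); the port is the one the OS chose.
        return sock.getsockname()[1]
```

The socket is closed before the port is returned, so another process could claim the port in the meantime. A server that binds to port 0 itself and reports the result avoids that race.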

Backend options:

  • -b vulkan - Cross-platform GPU (AMD, Intel, NVIDIA)
  • -b cuda - NVIDIA GPU
  • -b rocm - AMD GPU (Linux)
  • -b metal - Apple Silicon (macOS)
  • -b cpu - CPU only
  • -b auto - Auto-detect (default)
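
When launching servers from a script, the device, port, and backend options above can be assembled programmatically. This is a minimal sketch that uses only the flags listed in this section (-d, -b, -p); build_serve_command is an illustrative helper and does not check whether a given device and backend combination is valid on your machine.

```python
from typing import List, Optional

# Backend names taken from the "Backend options" list.
BACKENDS = {"auto", "vulkan", "cuda", "rocm", "metal", "cpu"}


def is_valid_device(device: str) -> bool:
    """Accept 'auto', 'cpu', or 'gpu<N>' with N a non-negative integer."""
    if device in ("auto", "cpu"):
        return True
    return device.startswith("gpu") and device[3:].isdigit()


def build_serve_command(
    model: str,
    device: str = "auto",
    backend: str = "auto",
    port: Optional[int] = None,
) -> List[str]:
    """Return the argv list for a `moxing serve` invocation."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if not is_valid_device(device):
        raise ValueError(f"unknown device: {device!r}")
    cmd = ["moxing", "serve", model, "-d", device, "-b", backend]
    if port is not None:
        if not 0 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
        cmd += ["-p", str(port)]
    return cmd


# Usage: one server per device, each on its own port.
#   import subprocess
#   for i, model in enumerate(["model1.gguf", "model2.gguf"]):
#       subprocess.Popen(build_serve_command(model, device=f"gpu{i}", port=8080 + i))
```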

Download Multiple Backend Binaries

Download binaries for all supported backends to enable device switching:

# List available backends
moxing download-binaries --list

# Download specific backend
moxing download-binaries --backend vulkan

# Download all backends for multi-device support
moxing download-binaries --backend all

Model Sources

Ollama Models

moxing ollama list                  # List installed models
moxing ollama serve llama3.2        # Serve with OpenAI API
moxing ollama serve llama3.2 --skip-check  # Skip compatibility check
moxing ollama serve --select        # Interactive selection

HuggingFace

moxing download Qwen/Qwen2.5-7B-Instruct-GGUF -q Q4_K_M
moxing serve Qwen/Qwen2.5-7B-Instruct-GGUF

ModelScope (China Mirror)

moxing download Qwen/Qwen2.5-7B-Instruct-GGUF --source modelscope

Python API

from moxing import quick_run, quick_server, LlamaServer

# Quick inference
result = quick_run("llama3.2", "Write a haiku about coding")
print(result)

# Start server
with quick_server("llama3.2", port=8080) as server:
    # Use OpenAI API at http://localhost:8080/v1
    pass

# Full control
server = LlamaServer(
    model="model.gguf",
    backend="cuda",
    ctx_size=8192,
    gpu_layers=99
)
server.start()

GGUF Compression

Save disk space by compressing GGUF files:

# Compress (typically 3-5% smaller)
moxing compress pack model.gguf
# Creates: model.gguf.zst

# Serve compressed file (auto-decompresses)
moxing serve model.gguf.zst

# Split large files
moxing compress split model.gguf --size 512  # 512 MB chunks

# Merge back
moxing compress merge "model.gguf-part-*" merged.gguf

# Manage cache
moxing compress cache --size
moxing compress cache --clear
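
To illustrate what splitting and merging amount to, here is a standard-library sketch of the idea. It is not MoXing's implementation: the zero-padded numeric suffix is an assumption chosen to match the "model.gguf-part-*" pattern above, and the real chunk format may differ.

```python
from pathlib import Path
from typing import List


def split_file(path: Path, chunk_bytes: int) -> List[Path]:
    """Split path into numbered parts of at most chunk_bytes each."""
    parts: List[Path] = []
    with path.open("rb") as src:
        index = 0
        while True:
            # Reads one whole chunk into memory; fine for a sketch,
            # but a streaming copy is kinder to RAM for large chunks.
            data = src.read(chunk_bytes)
            if not data:
                break
            part = path.with_name(f"{path.name}-part-{index:03d}")
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def merge_files(parts: List[Path], dest: Path) -> None:
    """Concatenate parts, in sorted name order, into dest."""
    with dest.open("wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())


# Usage:
#   parts = split_file(Path("model.gguf"), 512 * 1024 * 1024)
#   merge_files(parts, Path("merged.gguf"))
```

Zero-padding the index keeps lexical order equal to numeric order, which is what makes a glob-style merge pattern safe for up to 1000 parts.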

Performance

On Apple M4 with carstenuhlig/omnicoder-9b:

Framework    Speed
Ollama       ~10 tokens/s
MoXing       ~15 tokens/s

Results vary by model and hardware. MoXing removes Ollama's abstraction layer for direct llama.cpp execution.
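
As a quick worked example of what these approximate figures imply, the snippet below computes the relative speedup and the time to generate a response. The 300-token response length is a hypothetical value chosen for illustration.

```python
def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Throughput ratio of candidate over baseline."""
    return candidate_tps / baseline_tps


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady decoding rate."""
    return tokens / tokens_per_second


OLLAMA_TPS = 10.0  # approximate figure from the table above
MOXING_TPS = 15.0  # approximate figure from the table above

print(speedup(OLLAMA_TPS, MOXING_TPS))      # 1.5
print(generation_seconds(300, OLLAMA_TPS))  # 30.0
print(generation_seconds(300, MOXING_TPS))  # 20.0
```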

Environment Variables

Variable                  Description
MOXING_BINARY_SOURCE      Binary source: github, modelscope, auto
MOXING_BINARY_MIRROR      Custom binary mirror URL
MOXING_NO_UPDATE_CHECK    Skip binary update check (set to 1)
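
When MoXing is started from another Python program, these variables can be set per process rather than globally. The sketch below builds such an environment; moxing_env is an illustrative helper, and the accepted source values are taken from the table above.

```python
import os
from typing import Dict, Optional

VALID_SOURCES = ("github", "modelscope", "auto")


def moxing_env(
    source: str = "auto",
    mirror: Optional[str] = None,
    no_update_check: bool = False,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with MoXing variables applied."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown binary source: {source!r}")
    env = dict(os.environ if base is None else base)
    env["MOXING_BINARY_SOURCE"] = source
    if mirror:
        env["MOXING_BINARY_MIRROR"] = mirror
    if no_update_check:
        env["MOXING_NO_UPDATE_CHECK"] = "1"
    return env


# Usage:
#   import subprocess
#   subprocess.run(["moxing", "diagnose"], env=moxing_env(source="modelscope", no_update_check=True))
```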

Building from Source

Build Wheel

./generate_wheel.sh --version 0.1.9

Build Binaries

# Build all available backends
./generate_binaries.sh

# Build specific backend
./generate_binaries.sh --backend cuda

# Build specific llama.cpp version
./generate_binaries.sh --version b8468

Upload

./upload_binaries.sh  # Upload to GitHub
./upload_pypi.sh      # Upload to PyPI

How It Works

User Request → MoXing → llama.cpp (GPU accelerated) → OpenAI API Response
                  ↓
           Auto-download model if needed
                  ↓
           Auto-download binaries if needed

Compressed files are transparently decompressed:

model.gguf.zst → ~/.cache/moxing/decompressed/model.gguf → llama.cpp
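
The path mapping in that diagram can be expressed in a few lines. This sketch assumes the cache is keyed by file name only, which the diagram suggests but does not confirm; decompressed_path is a hypothetical helper, not part of the MoXing API.

```python
from pathlib import Path
from typing import Optional


def decompressed_path(compressed: Path, cache_dir: Optional[Path] = None) -> Path:
    """Map model.gguf.zst to <cache_dir>/model.gguf."""
    if compressed.suffix != ".zst":
        raise ValueError(f"not a .zst file: {compressed}")
    if cache_dir is None:
        # Default location shown in the diagram above.
        cache_dir = Path.home() / ".cache" / "moxing" / "decompressed"
    # Path.stem drops only the final suffix, so "model.gguf.zst" -> "model.gguf".
    return cache_dir / compressed.stem
```

A cache keyed only by file name would confuse two different models that share a name, so a real implementation might also include a hash or the source directory in the key.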

Compatibility

Tested Models:

  • Qwen2.5 series
  • Llama 3.x series
  • Mistral series
  • Phi-3 series
  • carstenuhlig/omnicoder-9b

Try your model directly - if it works with llama.cpp, it works with MoXing.

Troubleshooting

# Check system status
moxing diagnose

# Check binary version
moxing --version

# Re-download binaries
moxing download-binaries --force

# Clear all caches
moxing clear-cache --all

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please open an issue or PR on GitHub.
