Python wrapper for llama.cpp - OpenAI API compatible LLM backend with auto GPU detection

MoXing

A Python wrapper for llama.cpp that provides an OpenAI API compatible LLM backend with automatic GPU detection and model downloading.

Key Features:

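🚀 Faster than Ollama - Direct llama.cpp execution without Ollama's abstraction layer (see Performance for measured numbers)
🔧 OpenAI Compatible - Drop-in replacement for the OpenAI API; existing clients only need a new base URL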
🎮 Multi-GPU Support - CUDA, Vulkan, ROCm, Metal backends
📦 Auto Download - Models from HuggingFace, ModelScope, or Ollama
💾 GGUF Compression - Save disk space with transparent decompression

Installation

pip install moxing

Binaries are downloaded automatically on first use (~60 MB). No manual setup required.

Quick Start

# Serve an Ollama model
moxing ollama serve llama3.2

# Serve with specific device and backend
moxing ollama serve llama3.2 -d gpu0 -b vulkan

# Serve from HuggingFace
moxing serve Qwen/Qwen2.5-7B-Instruct-GGUF

# Serve a local GGUF file with specific device
moxing serve ./model.gguf --port 8080 -d gpu1 -b cuda

# List available devices
moxing devices

Then query the server with the OpenAI Python client (this example assumes the server is listening on port 8080):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
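
The same endpoint can also be reached without the `openai` package. The following is a minimal sketch using only the standard library, assuming the server exposes the usual OpenAI-style `/v1/chat/completions` route at `http://localhost:8080`. The helper names `build_chat_request` and `chat` are illustrative and not part of MoXing.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="local"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, prompt, model="local", timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Response layout follows the OpenAI chat completion schema.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("http://localhost:8080/v1", "Hello!")` should return the same text as the client example above.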

CLI Commands

Model Management

Command	Description
`moxing serve <model>`	Start server with a model
`moxing run <model> "prompt"`	Quick inference
`moxing chat <model>`	Interactive chat
`moxing download <repo>`	Download model from HF/ModelScope
`moxing models`	List available models

Ollama Integration

Command	Description
`moxing ollama list`	List Ollama models
`moxing ollama serve <model>`	Serve Ollama model
`moxing ollama info <model>`	Show model details

GGUF Compression

Command	Description
`moxing compress pack <file>`	Compress GGUF file
`moxing compress unpack <file>`	Decompress file
`moxing compress split <file>`	Split into chunks
`moxing compress merge <pattern>`	Merge chunks

System & Diagnostics

Command	Description
`moxing devices`	List GPU devices
`moxing diagnose`	System diagnostics
`moxing bench <model>`	Benchmark performance
`moxing --version`	Show version info

GPU Backends

MoXing automatically detects and uses the best available backend:

Platform	CPU	CUDA	Vulkan	ROCm	Metal
Linux x64	✅	✅	✅	✅	-
Windows x64	✅	✅	✅	-	-
macOS ARM64	✅	-	-	-	✅
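
The support matrix above can be read as a simple lookup. The sketch below transcribes it into Python. The preference order within each list and the CPU fallback for unlisted platforms are assumptions made for illustration; MoXing's actual detection logic is not described here and presumably also checks for installed drivers.

```python
import platform

# Backend support per platform, transcribed from the table above.
# The order inside each list is an assumed preference, not MoXing's.
SUPPORTED_BACKENDS = {
    ("Linux", "x86_64"): ["cuda", "rocm", "vulkan", "cpu"],
    ("Windows", "AMD64"): ["cuda", "vulkan", "cpu"],
    ("Darwin", "arm64"): ["metal", "cpu"],
}


def candidate_backends(system=None, machine=None):
    """Return the backends the table lists for a platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Platforms missing from the table fall back to CPU (an assumption).
    return SUPPORTED_BACKENDS.get((system, machine), ["cpu"])
```

Calling `candidate_backends()` with no arguments inspects the current machine; passing explicit values is useful for checking what another platform would get.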

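Force a specific backend at install time with the matching extra (quoted so that shells such as zsh do not expand the brackets):

pip install "moxing[cuda]"   # NVIDIA GPU
pip install "moxing[vulkan]" # Cross-platform GPU
pip install "moxing[metal]"  # Apple Silicon
pip install "moxing[rocm]"   # AMD GPU
pip install "moxing[cpu]"    # CPU only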

Device Selection

List available devices and their IDs:

moxing devices

Select a specific device and backend when serving:

# Use GPU 0 with Vulkan backend
moxing serve model.gguf -d gpu0 -b vulkan

# Use GPU 1 with CUDA backend
moxing serve model.gguf -d gpu1 -b cuda

# Use CPU only
moxing serve model.gguf -d cpu

# Run multiple instances on different devices (auto port)
moxing serve model1.gguf -d gpu0 --auto-port &
moxing serve model2.gguf -d gpu1 --auto-port &
moxing serve model3.gguf -d cpu --auto-port &

# Or specify ports manually
moxing serve model1.gguf -d gpu0 -p 8080 &
moxing serve model2.gguf -d gpu1 -p 8081 &
moxing serve model3.gguf -d cpu -p 8082 &

Device options:

-d gpu0, -d gpu1, ... - Select GPU by index
-d cpu - Use CPU only
-d auto - Auto-select best device (default)

Port options:

-p 8080 - Use specific port
-p 0 or --auto-port - Auto-find available port
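
Auto-port selection is commonly implemented by binding to port 0 and letting the operating system choose. The sketch below shows that general technique with the standard library; it is not taken from MoXing's source, and `find_free_port` is a hypothetical helper name.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        # getsockname() reports the port the OS actually assigned.
        return sock.getsockname()[1]
```

The port is released when the socket closes, so another process could claim it before the server binds. Tools that need to avoid that race usually keep the socket open and hand it to the server instead.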

Backend options:

-b vulkan - Cross-platform GPU (AMD, Intel, NVIDIA)
-b cuda - NVIDIA GPU
-b rocm - AMD GPU (Linux)
-b metal - Apple Silicon (macOS)
-b cpu - CPU only
-b auto - Auto-detect (default)
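
When launching several instances from a script, the options above can be composed programmatically. This is a small sketch that uses only the flags listed in this section (`-d`, `-b`, `-p`, `--auto-port`); `build_serve_command` is a hypothetical helper, not part of MoXing.

```python
import shlex


def build_serve_command(model, device="auto", backend="auto", port=None):
    """Compose a `moxing serve` argument list from the options listed above."""
    command = ["moxing", "serve", model, "-d", device, "-b", backend]
    if port is None:
        # No explicit port: let MoXing pick one.
        command.append("--auto-port")
    else:
        command += ["-p", str(port)]
    return command


# One instance per device, mirroring the manual-port example above.
jobs = [("model1.gguf", "gpu0", 8080), ("model2.gguf", "gpu1", 8081), ("model3.gguf", "cpu", 8082)]
commands = [build_serve_command(model, device=device, port=port) for model, device, port in jobs]
for command in commands:
    print(shlex.join(command))
```

Each list can be passed directly to `subprocess.Popen` to start the instance in the background.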

Download Multiple Backend Binaries

Download binaries for all supported backends to enable device switching:

# List available backends
moxing download-binaries --list

# Download specific backend
moxing download-binaries --backend vulkan

# Download all backends for multi-device support
moxing download-binaries --backend all

Model Sources

Ollama Models

moxing ollama list                  # List installed models
moxing ollama serve llama3.2        # Serve with OpenAI API
moxing ollama serve llama3.2 --skip-check  # Skip compatibility check
moxing ollama serve --select        # Interactive selection

HuggingFace

moxing download Qwen/Qwen2.5-7B-Instruct-GGUF -q Q4_K_M
moxing serve Qwen/Qwen2.5-7B-Instruct-GGUF

ModelScope (China Mirror)

moxing download Qwen/Qwen2.5-7B-Instruct-GGUF --source modelscope

Python API

from moxing import quick_run, quick_server, LlamaServer

# Quick inference
result = quick_run("llama3.2", "Write a haiku about coding")
print(result)

# Start server
with quick_server("llama3.2", port=8080) as server:
    # Use OpenAI API at http://localhost:8080/v1
    pass

# Full control
server = LlamaServer(
    model="model.gguf",
    backend="cuda",
    ctx_size=8192,
    gpu_layers=99
)
server.start()
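
This README does not state whether `server.start()` blocks until the model is loaded. If it returns early, a client may need to wait before sending requests. The sketch below polls the TCP port with the standard library; `wait_for_port` is a hypothetical helper, and an open port only shows that the server is accepting connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection succeeds, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; retry after a short pause.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 8080)` could be called right after `server.start()` and before creating the OpenAI client.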

GGUF Compression

Save disk space by compressing GGUF files:

# Compress (typically 3-5% smaller)
moxing compress pack model.gguf
# Creates: model.gguf.zst

# Serve compressed file (auto-decompresses)
moxing serve model.gguf.zst

# Split large files
moxing compress split model.gguf --size 512  # 512 MB chunks

# Merge back
moxing compress merge "model.gguf-part-*" merged.gguf

# Manage cache
moxing compress cache --size
moxing compress cache --clear
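
Splitting and merging amount to cutting a file into fixed-size pieces and concatenating them again. The sketch below illustrates the idea with the standard library. The `-part-` prefix follows the merge pattern shown above, but the zero-padded numeric suffix is an assumption, and the files MoXing writes may be named or framed differently.

```python
from pathlib import Path


def split_file(path, chunk_mb=512):
    """Write `path` out as numbered chunk files and return their paths."""
    path = Path(path)
    chunk_bytes = int(chunk_mb * 1024 * 1024)
    parts = []
    with path.open("rb") as source:
        index = 0
        while True:
            # Holds one chunk in memory at a time.
            chunk = source.read(chunk_bytes)
            if not chunk:
                break
            # Zero-padded suffix keeps lexical order equal to numeric order.
            part = path.with_name(f"{path.name}-part-{index:04d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def merge_files(parts, output):
    """Concatenate chunk files, in sorted order, into `output`."""
    output = Path(output)
    with output.open("wb") as target:
        for part in sorted(Path(p) for p in parts):
            target.write(part.read_bytes())
    return output
```

A round trip through `split_file` and `merge_files` should reproduce the original file byte for byte.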

Performance

On Apple M4 with carstenuhlig/omnicoder-9b:

Framework	Speed
Ollama	~10 tokens/s
MoXing	~15 tokens/s

Results vary by model and hardware. MoXing removes Ollama's abstraction layer for direct llama.cpp execution.
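
The figures in the table correspond to roughly a 1.5x throughput ratio. For comparing runs on other hardware, the arithmetic can be sketched as follows; both helper names are illustrative.

```python
def tokens_per_second(generated_tokens, elapsed_seconds):
    """Throughput of one generation run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


ratio = speedup(10, 15)  # 1.5 for the approximate figures in the table
```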

Environment Variables

Variable	Description
`MOXING_BINARY_SOURCE`	Binary source: `github`, `modelscope`, `auto`
`MOXING_BINARY_MIRROR`	Custom binary mirror URL
`MOXING_NO_UPDATE_CHECK`	Skip binary update check (set to `1`)
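
As an illustration of how these variables might be interpreted, the sketch below reads them into a settings dictionary. The default of `auto` for the source and the treatment of an empty mirror as unset are assumptions; `read_binary_settings` is a hypothetical helper, not MoXing's own code.

```python
import os

VALID_SOURCES = {"github", "modelscope", "auto"}


def read_binary_settings(env=None):
    """Interpret the MOXING_* variables from the table above."""
    env = os.environ if env is None else env
    # Defaulting to "auto" is an assumption; the table only lists valid values.
    source = env.get("MOXING_BINARY_SOURCE", "auto")
    if source not in VALID_SOURCES:
        raise ValueError(f"unsupported MOXING_BINARY_SOURCE: {source!r}")
    return {
        "source": source,
        "mirror": env.get("MOXING_BINARY_MIRROR") or None,
        # The update check is skipped only when the variable is exactly "1".
        "update_check": env.get("MOXING_NO_UPDATE_CHECK") != "1",
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without changing the real environment.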

Building from Source

Build Wheel

./generate_wheel.sh --version 0.1.9

Build Binaries

# Build all available backends
./generate_binaries.sh

# Build specific backend
./generate_binaries.sh --backend cuda

# Build specific llama.cpp version
./generate_binaries.sh --version b8468

Upload

./upload_binaries.sh  # Upload to GitHub
./upload_pypi.sh      # Upload to PyPI

How It Works

User Request → MoXing → llama.cpp (GPU accelerated) → OpenAI API Response
                  ↓
           Auto-download model if needed
                  ↓
           Auto-download binaries if needed

Compressed files are transparently decompressed:

model.gguf.zst → ~/.cache/moxing/decompressed/model.gguf → llama.cpp
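
The path mapping in the line above can be expressed in a few lines. This sketch only computes where the decompressed file would live, based on the cache directory shown; it does not perform decompression, and `decompressed_path` is a hypothetical helper name.

```python
from pathlib import Path

# Cache location as shown in the mapping above.
CACHE_DIR = Path.home() / ".cache" / "moxing" / "decompressed"


def decompressed_path(model_path, cache_dir=CACHE_DIR):
    """Map `name.gguf.zst` to its cache location; leave other paths unchanged."""
    model_path = Path(model_path)
    if model_path.suffix != ".zst":
        return model_path  # already an uncompressed file
    # .stem drops only the final suffix: "model.gguf.zst" -> "model.gguf".
    return Path(cache_dir) / model_path.stem
```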

Compatibility

Tested Models:

Qwen2.5 series
Llama 3.x series
Mistral series
Phi-3 series
carstenuhlig/omnicoder-9b

Other models are worth trying directly: if a model works with llama.cpp, it should also work with MoXing.

Troubleshooting

# Check system status
moxing diagnose

# Check binary version
moxing --version

# Re-download binaries
moxing download-binaries --force

# Clear all caches
moxing clear-cache --all

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please open an issue or PR on GitHub.

moxing 0.1.30
