High-performance kernels for Xoron multimodal model with runtime dispatch, JIT compilation, and multi-GPU support
XTransformers
High-Performance Kernels for Xoron Multimodal Model
XTransformers is a custom kernel library designed specifically for the Xoron multimodal model, providing state-of-the-art optimizations for running large language models on consumer-grade hardware.
Features
Hardware Support
- Multi-GPU: NVIDIA CUDA, AMD ROCm, Intel oneAPI
- Apple Silicon: Metal Performance Shaders
- CPU: Intel (AVX2/AVX512/AMX), AMD (AVX2/AVX512), ARM (NEON/SVE)
- Cross-Platform: Triton JIT kernels for portability
Runtime Optimization
- Runtime Dispatch: Automatically selects optimal kernel variant at startup
- JIT Compilation: Compiles kernels optimized for your specific hardware
- NUMA Awareness: Efficient memory placement on multi-socket systems
Model Optimizations
- MoE Expert Offloading: Cold experts offloaded to CPU, hot experts on GPU
- GGUF Support: On-the-fly dequantization (Q2_K to Q8_K, FP8)
- MLA (Multi-Head Latent Attention): 4-8x KV cache compression
- Ring Attention: Efficient 128K+ context processing
- Flash Attention: Memory-efficient attention with O(N) memory
Multimodal
- Vision: SigLIP encoder with TiTok 1D tokenization
- Video: 3D-RoPE temporal encoding, VidTok compression
- Audio: Conformer encoder/decoder, raw waveform processing
- Generation: MoE-DiT image/video generation with Flow Matching
Installation
From PyPI (Recommended)
pip install xtransformers
From Source (Development)
git clone https://github.com/nigfuapp-web/xtransformers.git
cd xtransformers
pip install -e .
With GPU Support
# NVIDIA CUDA
XT_USE_CUDA=1 pip install xtransformers
# AMD ROCm
XT_USE_ROCM=1 pip install xtransformers
# With Triton (cross-platform GPU)
pip install xtransformers[triton]
Build Options
Control the build with environment variables:
# Force specific CPU variant
XT_CPU_VARIANT=AVX512_BF16 pip install xtransformers
# Enable/disable features
XT_ENABLE_AMX=ON pip install xtransformers
XT_ENABLE_GGUF=ON pip install xtransformers
XT_ENABLE_JIT=ON pip install xtransformers
# CUDA architectures (for multi-GPU support)
XT_CUDA_ARCHS="70;75;80;86;89;90" pip install xtransformers
# Parallel build
XT_PARALLEL=16 pip install xtransformers
Quick Start
import xtransformers
# Initialize with hardware detection
xtransformers.init()
# Print detected hardware
xtransformers.print_hardware_info()
# Get optimal kernel variant
variant = xtransformers.get_best_variant()
print(f"Using kernel variant: {variant}")
MoE Inference
import xtransformers
import torch
# Configure MoE kernel
config = xtransformers.MoEKernelConfig()
config.num_experts = 8
config.num_experts_per_tok = 2
config.hidden_size = 4096
config.intermediate_size = 11008
config.enable_expert_offload = True
config.quant_bits = 4 # Q4 quantization
# Create kernel with runtime dispatch
moe_kernel = xtransformers.create_moe_kernel(config)
# Load GGUF weights
moe_kernel.load_weights("/path/to/model/experts")
# Forward pass (example shapes)
batch_size, seq_len = 1, 512
expert_ids = torch.randint(0, 8, (batch_size, seq_len, 2))
routing_weights = torch.softmax(torch.randn(batch_size, seq_len, 2), dim=-1)
input_hidden = torch.randn(batch_size, seq_len, 4096, dtype=torch.bfloat16)
output = torch.zeros_like(input_hidden, dtype=torch.float32)
moe_kernel.forward(batch_size, seq_len, expert_ids, routing_weights, input_hidden, output)
Triton Kernels (Cross-Platform GPU)
from xtransformers.triton_kernels import TritonKernels
# Flash Attention
output = TritonKernels.flash_attention(Q, K, V, causal=True)
# MoE Routing
expert_ids, routing_weights = TritonKernels.moe_routing(router_logits, top_k=2)
# GGUF Dequantization
fp32_weights = TritonKernels.dequantize_q4(quantized, scales, num_elements)
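The Triton entry points above are shown standalone; the sketch below drives the Flash Attention kernel end to end. The (batch, heads, seq_len, head_dim) layout and fp16 inputs are assumptions for illustration, not a documented contract.
import torch
from xtransformers.triton_kernels import TritonKernels
# Example inputs; layout and dtype are assumptions -- check the kernel docstrings.
batch, heads, seq_len, head_dim = 1, 32, 2048, 128
Q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
K = torch.randn_like(Q)
V = torch.randn_like(Q)
# Causal (autoregressive) attention over the full sequence
output = TritonKernels.flash_attention(Q, K, V, causal=True)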
Architecture
xtransformers/
├── cmake/                     # CMake modules for CPU detection
├── xtransformers_kernel/
│   ├── cpu_backend/           # NUMA-aware worker pool, SIMD kernels
│   ├── cuda/                  # CUDA kernels (MoE, attention)
│   ├── operators/
│   │   ├── moe/               # MoE with expert offloading
│   │   ├── attention/         # Flash/Ring/MLA attention
│   │   └── multimodal/        # Vision, video, audio projectors
│   ├── python/                # Python bindings and Triton kernels
│   └── ext_bindings.cpp       # pybind11 bindings
├── third_party/               # pybind11, llama.cpp headers
├── examples/                  # Usage examples
├── tests/                     # Unit tests
├── setup.py                   # Build script with JIT detection
└── pyproject.toml             # Package configuration
Performance
MoE Expert Processing (DeepSeek-V3 style)
| Hardware | Variant | Throughput (tokens/sec) |
|---|---|---|
| Intel Sapphire Rapids | AMX | 2,500 |
| Intel Ice Lake | AVX512-BF16 | 1,800 |
| AMD EPYC Genoa | AVX512-VNNI | 1,600 |
| Apple M2 Ultra | NEON | 1,200 |
Memory Efficiency
| Feature | Memory Reduction |
|---|---|
| MLA KV Compression | 4-8x |
| Expert Offloading | ~60% less GPU memory |
| Q4 Quantization | 4x |
| Ring Attention | O(N) vs O(N²) |
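As a rough illustration of the MLA row above, the sketch below estimates KV cache size for a 128K-token context; the 32-layer model, 4096 hidden size, bf16 cache, and 6x compression factor are assumed values chosen only to make the arithmetic concrete.
# Back-of-the-envelope KV-cache estimate (assumed: 32 layers, hidden size 4096,
# bf16 entries = 2 bytes, 128K context). K and V each store layers * hidden per token.
layers, hidden, context_len, bytes_per_el = 32, 4096, 131072, 2
kv_cache_gb = 2 * layers * hidden * context_len * bytes_per_el / 1e9
print(f"Uncompressed KV cache:     {kv_cache_gb:.1f} GB")      # ~68.7 GB
print(f"With MLA at an assumed 6x: {kv_cache_gb / 6:.1f} GB")  # ~11.5 GB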
Configuration
CPU Variant Selection
XTransformers automatically selects the best kernel variant at install time and again at runtime (see the sketch after this list):
- AMX (Intel Sapphire Rapids+): Highest performance for matrix operations
- AVX512-BF16: Optimal for bfloat16 operations
- AVX512-VNNI: Accelerated INT8/INT16 operations
- AVX512: Base AVX-512 operations
- AVX2: Fallback for older x86 CPUs
- SVE: ARM Scalable Vector Extensions
- NEON: ARM SIMD (Apple Silicon, etc.)
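A minimal sketch of checking which variant was picked; init() and get_best_variant() come from the Quick Start above, while honoring XT_CPU_VARIANT at runtime (rather than only at pip install time) is an assumption.
import os
import xtransformers
# Assumption: XT_CPU_VARIANT is also read at runtime; otherwise set it at install time.
os.environ.setdefault("XT_CPU_VARIANT", "AVX512_BF16")
xtransformers.init()
print("Selected kernel variant:", xtransformers.get_best_variant())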
Expert Offloading
Configure which experts stay on GPU vs CPU:
config.enable_expert_offload = True
config.gpu_expert_ids = [0, 1, 2, 3] # Hot experts on GPU
config.cpu_expert_ids = [4, 5, 6, 7] # Cold experts on CPU
GGUF Quantization
Supported formats:
- Q2_K, Q3_K, Q4_K, Q5_K, Q6_K: K-quant formats
- Q4_0, Q8_0: Basic formats
- IQ4_XS: imatrix quantization
- FP8: 8-bit floating point
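A minimal loading sketch using the MoE API shown earlier; the expert path is a placeholder, and the assumption that quant_bits = 4 selects the Q4 family (dequantized on the fly, per the feature list above) is illustrative.
import xtransformers
config = xtransformers.MoEKernelConfig()
config.quant_bits = 4                               # assumed to map to the Q4 formats above
moe_kernel = xtransformers.create_moe_kernel(config)
moe_kernel.load_weights("/path/to/gguf/experts")    # weights dequantized on the fly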
Integration
With sglang
# Coming soon: Native sglang backend
from xtransformers.sglang import XTransformersBackend
With Xoron Model
from xoron import XoronMultimodalModel
import xtransformers
# XTransformers is automatically used when available
model = XoronMultimodalModel.from_pretrained("path/to/model")
License
Apache 2.0 License
Contributing
Contributions are welcome! Please see our Contributing Guide.
Acknowledgments
- KTransformers for inspiration
- llama.cpp for GGUF format
- Triton for cross-platform GPU kernels
- Flash Attention for attention algorithms