High-performance kernels for the Xoron multimodal model, with runtime dispatch, JIT compilation, and multi-GPU support

XTransformers

🚀 High-Performance Kernels for the Xoron Multimodal Model

XTransformers is a custom kernel library designed specifically for the Xoron multimodal model, providing state-of-the-art optimizations for running large language models on consumer-grade hardware.

✨ Features

Hardware Support

  • Multi-GPU: NVIDIA CUDA, AMD ROCm, Intel oneAPI
  • Apple Silicon: Metal Performance Shaders
  • CPU: Intel (AVX2/AVX512/AMX), AMD (AVX2/AVX512), ARM (NEON/SVE)
  • Cross-Platform: Triton JIT kernels for portability

Runtime Optimization

  • Runtime Dispatch: Automatically selects optimal kernel variant at startup
  • JIT Compilation: Compiles kernels optimized for your specific hardware
  • NUMA Awareness: Efficient memory placement on multi-socket systems

Model Optimizations

  • MoE Expert Offloading: Cold experts offloaded to CPU, hot experts on GPU
  • GGUF Support: On-the-fly dequantization (Q2_K to Q8_K, FP8)
  • MLA (Multi-Head Latent Attention): 4-8x KV cache compression (see the sizing sketch after this list)
  • Ring Attention: Efficient processing of 128K+ token contexts
  • Flash Attention: Tiled, memory-efficient attention with O(N) rather than O(N²) memory
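
To make the 4-8x MLA figure concrete, here is a back-of-the-envelope sizing comparison; every dimension below is hypothetical, chosen only to illustrate the arithmetic, not taken from Xoron's actual configuration:

num_layers, num_heads, head_dim = 32, 32, 128
seq_len, bytes_per_elem = 131072, 2   # 128K tokens, bf16

# Standard KV cache: keys and values for every head at every layer
standard_bytes = 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

# MLA caches one compressed latent vector per token per layer instead
latent_dim = 1024                     # hypothetical compression dimension
mla_bytes = num_layers * latent_dim * seq_len * bytes_per_elem

print(f"Standard KV cache: {standard_bytes / 2**30:.0f} GiB")  # 64 GiB
print(f"MLA latent cache:  {mla_bytes / 2**30:.0f} GiB")       # 8 GiB (8x smaller)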

Multimodal

  • Vision: SigLIP encoder with TiTok 1D tokenization
  • Video: 3D-RoPE temporal encoding, VidTok compression
  • Audio: Conformer encoder/decoder, raw waveform processing
  • Generation: MoE-DiT image/video generation with Flow Matching

📦 Installation

From PyPI (Recommended)

pip install xtransformers

From Source (Development)

git clone https://github.com/nigfuapp-web/xtransformers.git
cd xtransformers
pip install -e .

With GPU Support

# NVIDIA CUDA
XT_USE_CUDA=1 pip install xtransformers

# AMD ROCm
XT_USE_ROCM=1 pip install xtransformers

# With Triton (cross-platform GPU)
pip install xtransformers[triton]

Build Options

Control the build with environment variables:

# Force specific CPU variant
XT_CPU_VARIANT=AVX512_BF16 pip install xtransformers

# Enable/disable features
XT_ENABLE_AMX=ON pip install xtransformers
XT_ENABLE_GGUF=ON pip install xtransformers
XT_ENABLE_JIT=ON pip install xtransformers

# CUDA architectures (for multi-GPU support)
XT_CUDA_ARCHS="70;75;80;86;89;90" pip install xtransformers

# Parallel build
XT_PARALLEL=16 pip install xtransformers

🚀 Quick Start

import xtransformers

# Initialize with hardware detection
xtransformers.init()

# Print detected hardware
xtransformers.print_hardware_info()

# Get optimal kernel variant
variant = xtransformers.get_best_variant()
print(f"Using kernel variant: {variant}")

MoE Inference

import xtransformers
import torch

# Configure MoE kernel
config = xtransformers.MoEKernelConfig()
config.num_experts = 8
config.num_experts_per_tok = 2
config.hidden_size = 4096
config.intermediate_size = 11008
config.enable_expert_offload = True
config.quant_bits = 4  # Q4 quantization

# Create kernel with runtime dispatch
moe_kernel = xtransformers.create_moe_kernel(config)

# Load GGUF weights
moe_kernel.load_weights("/path/to/model/experts")

# Forward pass (batch_size and seq_len here are illustrative)
batch_size, seq_len = 2, 128
expert_ids = torch.randint(0, 8, (batch_size, seq_len, 2))
routing_weights = torch.softmax(torch.randn(batch_size, seq_len, 2), dim=-1)
input_hidden = torch.randn(batch_size, seq_len, 4096, dtype=torch.bfloat16)
output = torch.zeros_like(input_hidden, dtype=torch.float32)

moe_kernel.forward(batch_size, seq_len, expert_ids, routing_weights, input_hidden, output)

Triton Kernels (Cross-Platform GPU)

from xtransformers.triton_kernels import TritonKernels

# Flash Attention
output = TritonKernels.flash_attention(Q, K, V, causal=True)

# MoE Routing
expert_ids, routing_weights = TritonKernels.moe_routing(router_logits, top_k=2)

# GGUF Dequantization
fp32_weights = TritonKernels.dequantize_q4(quantized, scales, num_elements)
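
The snippets above assume Q, K, V, router_logits, and the quantized buffers already exist. As a rough usage sketch for the Flash Attention call (the (batch, heads, seq_len, head_dim) layout and fp16 dtype are assumptions, and the tensors must live on a GPU for Triton kernels to run):

import torch
from xtransformers.triton_kernels import TritonKernels

# Assumed layout: (batch, num_heads, seq_len, head_dim), fp16, on the GPU
Q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
K = torch.randn_like(Q)
V = torch.randn_like(Q)

out = TritonKernels.flash_attention(Q, K, V, causal=True)  # attention output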

🏗️ Architecture

xtransformers/
├── cmake/                     # CMake modules for CPU detection
├── xtransformers_kernel/
│   ├── cpu_backend/           # NUMA-aware worker pool, SIMD kernels
│   ├── cuda/                  # CUDA kernels (MoE, attention)
│   ├── operators/
│   │   ├── moe/               # MoE with expert offloading
│   │   ├── attention/         # Flash/Ring/MLA attention
│   │   └── multimodal/        # Vision, video, audio projectors
│   ├── python/                # Python bindings and Triton kernels
│   └── ext_bindings.cpp       # pybind11 bindings
├── third_party/               # pybind11, llama.cpp headers
├── examples/                  # Usage examples
├── tests/                     # Unit tests
├── setup.py                   # Build script with JIT detection
└── pyproject.toml             # Package configuration

📊 Performance

MoE Expert Processing (DeepSeek-V3 style)

Hardware | Variant | Throughput (tokens/sec)
Intel Sapphire Rapids | AMX | 2,500
Intel Ice Lake | AVX512-BF16 | 1,800
AMD EPYC Genoa | AVX512-VNNI | 1,600
Apple M2 Ultra | NEON | 1,200

Memory Efficiency

Feature | Memory Reduction
MLA KV Compression | 4-8x
Expert Offloading | ~60% GPU memory
Q4 Quantization | 4x
Ring Attention | O(N) vs O(N²)

🔧 Configuration

CPU Variant Selection

XTransformers automatically selects the best kernel variant at install time and at runtime, in the following priority order (a simplified selection sketch follows the list):

  1. AMX (Intel Sapphire Rapids+): Highest performance for matrix operations
  2. AVX512-BF16: Optimal for bfloat16 operations
  3. AVX512-VNNI: Accelerated INT8/INT16 operations
  4. AVX512: Base AVX-512 operations
  5. AVX2: Fallback for older x86 CPUs
  6. SVE: ARM Scalable Vector Extensions
  7. NEON: ARM SIMD (Apple Silicon, etc.)
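
Conceptually the selection is a priority-ordered fallback. A minimal sketch of the idea (an illustration, not the library's actual implementation, whose result is what xtransformers.get_best_variant() reports) looks like this:

# Illustrative priority-order fallback for picking a CPU kernel variant
PRIORITY = ["AMX", "AVX512_BF16", "AVX512_VNNI", "AVX512", "AVX2", "SVE", "NEON"]

def pick_variant(supported_features):
    """Return the highest-priority variant the current CPU supports."""
    for variant in PRIORITY:
        if variant in supported_features:
            return variant
    return "GENERIC"  # scalar fallback if nothing matches

# Example: an Ice Lake server exposing AVX2/AVX512/AVX512_BF16 -> "AVX512_BF16"
print(pick_variant({"AVX2", "AVX512", "AVX512_BF16"}))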

Expert Offloading

Configure which experts stay on GPU vs CPU:

config.enable_expert_offload = True
config.gpu_expert_ids = [0, 1, 2, 3]  # Hot experts on GPU
config.cpu_expert_ids = [4, 5, 6, 7]  # Cold experts on CPU
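
How to split experts between the two pools is up to you; one possible heuristic (our assumption, not a documented policy) is to profile routing frequencies on a small calibration batch and pin the most frequently selected experts to the GPU:

import torch

# expert_ids comes from a calibration forward pass (see the MoE example above)
counts = torch.bincount(expert_ids.flatten(), minlength=config.num_experts)
order = torch.argsort(counts, descending=True)

config.gpu_expert_ids = order[:4].tolist()  # hottest experts stay on GPU
config.cpu_expert_ids = order[4:].tolist()  # coldest experts are offloaded to CPU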

GGUF Quantization

Supported formats (a Q4_0 dequantization sketch follows the list):

  • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K: K-quant formats
  • Q4_0, Q8_0: Basic formats
  • IQ4_XS: imatrix quantization
  • FP8: 8-bit floating point
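
For intuition about what the on-the-fly dequantization involves, Q4_0 stores 32 weights per 18-byte block: one fp16 scale followed by 16 bytes of packed 4-bit quants, with each weight reconstructed as scale * (q - 8). A standalone sketch of that block format, independent of the library's actual kernels:

import numpy as np

def dequantize_q4_0_block(block: bytes) -> np.ndarray:
    """Dequantize one Q4_0 block: 2-byte fp16 scale + 16 bytes of packed 4-bit quants."""
    scale = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
    packed = np.frombuffer(block[2:18], dtype=np.uint8)
    low = (packed & 0x0F).astype(np.int8) - 8   # nibbles for weights 0..15
    high = (packed >> 4).astype(np.int8) - 8    # nibbles for weights 16..31
    return scale * np.concatenate([low, high]).astype(np.float32)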

🔗 Integration

With sglang

# Coming soon: Native sglang backend
from xtransformers.sglang import XTransformersBackend

With Xoron Model

from xoron import XoronMultimodalModel
import xtransformers

# XTransformers is automatically used when available
model = XoronMultimodalModel.from_pretrained("path/to/model")

📝 License

Apache 2.0 License

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide.

🙏 Acknowledgments
