High-performance kernels for Xoron multimodal model with runtime dispatch, JIT compilation, and multi-GPU support
XTransformers
High-Performance Kernels for Xoron Multimodal Model
XTransformers is a custom kernel library designed specifically for the Xoron multimodal model, providing state-of-the-art optimizations for running large language models on consumer-grade hardware.
Features
Hardware Support
- Multi-GPU: NVIDIA CUDA, AMD ROCm, Intel oneAPI
- Apple Silicon: Metal Performance Shaders
- CPU: Intel (AVX2/AVX512/AMX), AMD (AVX2/AVX512), ARM (NEON/SVE)
- Cross-Platform: Triton JIT kernels for portability
Runtime Optimization
- Runtime Dispatch: Automatically selects optimal kernel variant at startup
- JIT Compilation: Compiles kernels optimized for your specific hardware
- NUMA Awareness: Efficient memory placement on multi-socket systems
Model Optimizations
- MoE Expert Offloading: Cold experts offloaded to CPU, hot experts on GPU
- GGUF Support: On-the-fly dequantization (Q2_K to Q8_K, FP8)
- MLA (Multi-Head Latent Attention): 4-8x KV cache compression
- Ring Attention: Efficient 128K+ context processing
- Flash Attention: Memory-efficient attention with O(N) memory
Multimodal
- Vision: SigLIP encoder with TiTok 1D tokenization
- Video: 3D-RoPE temporal encoding, VidTok compression
- Audio: Conformer encoder/decoder, raw waveform processing
- Generation: MoE-DiT image/video generation with Flow Matching
Installation
From PyPI (Recommended)
pip install xtransformers
From Source (Development)
git clone https://github.com/nigfuapp-web/xtransformers.git
cd xtransformers
pip install -e .
With GPU Support
# NVIDIA CUDA
XT_USE_CUDA=1 pip install xtransformers
# AMD ROCm
XT_USE_ROCM=1 pip install xtransformers
# With Triton (cross-platform GPU)
pip install xtransformers[triton]
Build Options
Control the build with environment variables:
# Force specific CPU variant
XT_CPU_VARIANT=AVX512_BF16 pip install xtransformers
# Enable/disable features
XT_ENABLE_AMX=ON pip install xtransformers
XT_ENABLE_GGUF=ON pip install xtransformers
XT_ENABLE_JIT=ON pip install xtransformers
# CUDA architectures (for multi-GPU support)
XT_CUDA_ARCHS="70;75;80;86;89;90" pip install xtransformers
# Parallel build
XT_PARALLEL=16 pip install xtransformers
Quick Start
import xtransformers
# Initialize with hardware detection
xtransformers.init()
# Print detected hardware
xtransformers.print_hardware_info()
# Get optimal kernel variant
variant = xtransformers.get_best_variant()
print(f"Using kernel variant: {variant}")
MoE Inference
import xtransformers
import torch
# Configure MoE kernel
config = xtransformers.MoEKernelConfig()
config.num_experts = 8
config.num_experts_per_tok = 2
config.hidden_size = 4096
config.intermediate_size = 11008
config.enable_expert_offload = True
config.quant_bits = 4 # Q4 quantization
# Create kernel with runtime dispatch
moe_kernel = xtransformers.create_moe_kernel(config)
# Load GGUF weights
moe_kernel.load_weights("/path/to/model/experts")
# Forward pass (example shapes)
batch_size, seq_len = 1, 512
expert_ids = torch.randint(0, 8, (batch_size, seq_len, 2))
routing_weights = torch.softmax(torch.randn(batch_size, seq_len, 2), dim=-1)
input_hidden = torch.randn(batch_size, seq_len, 4096, dtype=torch.bfloat16)
output = torch.zeros_like(input_hidden, dtype=torch.float32)
moe_kernel.forward(batch_size, seq_len, expert_ids, routing_weights, input_hidden, output)
Triton Kernels (Cross-Platform GPU)
from xtransformers.triton_kernels import TritonKernels
# Flash Attention
output = TritonKernels.flash_attention(Q, K, V, causal=True)
# MoE Routing
expert_ids, routing_weights = TritonKernels.moe_routing(router_logits, top_k=2)
# GGUF Dequantization
fp32_weights = TritonKernels.dequantize_q4(quantized, scales, num_elements)
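The Triton entry points above are shown standalone; the sketch below drives the Flash Attention kernel end to end. The (batch, heads, seq_len, head_dim) layout and fp16 inputs are assumptions for illustration, not a documented contract.
import torch
from xtransformers.triton_kernels import TritonKernels
# Example inputs; layout and dtype are assumptions -- check the kernel docstrings.
batch, heads, seq_len, head_dim = 1, 32, 2048, 128
Q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
K = torch.randn_like(Q)
V = torch.randn_like(Q)
# Causal (autoregressive) attention over the full sequence
output = TritonKernels.flash_attention(Q, K, V, causal=True)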
Architecture
xtransformers/
├── cmake/                     # CMake modules for CPU detection
├── xtransformers_kernel/
│   ├── cpu_backend/           # NUMA-aware worker pool, SIMD kernels
│   ├── cuda/                  # CUDA kernels (MoE, attention)
│   ├── operators/
│   │   ├── moe/               # MoE with expert offloading
│   │   ├── attention/         # Flash/Ring/MLA attention
│   │   └── multimodal/        # Vision, video, audio projectors
│   ├── python/                # Python bindings and Triton kernels
│   └── ext_bindings.cpp       # pybind11 bindings
├── third_party/               # pybind11, llama.cpp headers
├── examples/                  # Usage examples
├── tests/                     # Unit tests
├── setup.py                   # Build script with JIT detection
└── pyproject.toml             # Package configuration
Performance
MoE Expert Processing (DeepSeek-V3 style)
| Hardware | Variant | Throughput (tokens/sec) |
|---|---|---|
| Intel Sapphire Rapids | AMX | 2,500 |
| Intel Ice Lake | AVX512-BF16 | 1,800 |
| AMD EPYC Genoa | AVX512-VNNI | 1,600 |
| Apple M2 Ultra | NEON | 1,200 |
Memory Efficiency
| Feature | Memory Reduction |
|---|---|
| MLA KV Compression | 4-8x |
| Expert Offloading | ~60% less GPU memory |
| Q4 Quantization | 4x |
| Ring Attention | O(N) vs O(N²) |
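As a rough illustration of the MLA row above, the sketch below estimates KV cache size for a 128K-token context; the 32-layer model, 4096 hidden size, bf16 cache, and 6x compression factor are assumed values chosen only to make the arithmetic concrete.
# Back-of-the-envelope KV-cache estimate (assumed: 32 layers, hidden size 4096,
# bf16 entries = 2 bytes, 128K context). K and V each store layers * hidden per token.
layers, hidden, context_len, bytes_per_el = 32, 4096, 131072, 2
kv_cache_gb = 2 * layers * hidden * context_len * bytes_per_el / 1e9
print(f"Uncompressed KV cache:     {kv_cache_gb:.1f} GB")      # ~68.7 GB
print(f"With MLA at an assumed 6x: {kv_cache_gb / 6:.1f} GB")  # ~11.5 GB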
Configuration
CPU Variant Selection
XTransformers automatically selects the best kernel variant at install time and again at runtime (see the sketch after this list):
- AMX (Intel Sapphire Rapids+): Highest performance for matrix operations
- AVX512-BF16: Optimal for bfloat16 operations
- AVX512-VNNI: Accelerated INT8/INT16 operations
- AVX512: Base AVX-512 operations
- AVX2: Fallback for older x86 CPUs
- SVE: ARM Scalable Vector Extensions
- NEON: ARM SIMD (Apple Silicon, etc.)
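A minimal sketch of checking which variant was picked; init() and get_best_variant() come from the Quick Start above, while honoring XT_CPU_VARIANT at runtime (rather than only at pip install time) is an assumption.
import os
import xtransformers
# Assumption: XT_CPU_VARIANT is also read at runtime; otherwise set it at install time.
os.environ.setdefault("XT_CPU_VARIANT", "AVX512_BF16")
xtransformers.init()
print("Selected kernel variant:", xtransformers.get_best_variant())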
Expert Offloading
Configure which experts stay on GPU vs CPU:
config.enable_expert_offload = True
config.gpu_expert_ids = [0, 1, 2, 3] # Hot experts on GPU
config.cpu_expert_ids = [4, 5, 6, 7] # Cold experts on CPU
GGUF Quantization
Supported formats:
- Q2_K, Q3_K, Q4_K, Q5_K, Q6_K: K-quant formats
- Q4_0, Q8_0: Basic formats
- IQ4_XS: imatrix quantization
- FP8: 8-bit floating point
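A minimal loading sketch using the MoE API shown earlier; the expert path is a placeholder, and the assumption that quant_bits = 4 selects the Q4 family (dequantized on the fly, per the feature list above) is illustrative.
import xtransformers
config = xtransformers.MoEKernelConfig()
config.quant_bits = 4                               # assumed to map to the Q4 formats above
moe_kernel = xtransformers.create_moe_kernel(config)
moe_kernel.load_weights("/path/to/gguf/experts")    # weights dequantized on the fly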
Integration
With sglang
# Coming soon: Native sglang backend
from xtransformers.sglang import XTransformersBackend
With Xoron Model
from xoron import XoronMultimodalModel
import xtransformers
# XTransformers is automatically used when available
model = XoronMultimodalModel.from_pretrained("path/to/model")
License
Apache 2.0 License
Contributing
Contributions are welcome! Please see our Contributing Guide.
Acknowledgments
- KTransformers for inspiration
- llama.cpp for GGUF format
- Triton for cross-platform GPU kernels
- Flash Attention for attention algorithms