
Open-Source Multimodal Vision Injection Framework for Any Language Model



OpenLLaVA is an open-source multimodal vision injection framework for adding vision capabilities to any language model. Architecture-agnostic, multi-backend, and production-ready — from research to deployment.

Quickstart · Architecture · Core Concepts · Training · Optimizations · CLI · Distributed


Overview

OpenLLaVA is a comprehensive framework for injecting vision capabilities into any HuggingFace language model. It provides a complete pipeline — from model construction through training, inference, serving, export, and evaluation — all accessible through a unified Python API and CLI.

The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, etc.) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the training and inference pipelines.

[!NOTE] OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU — with automatic hardware detection and optimal configuration selection.

Design Principles

| Principle | Description |
|-----------|-------------|
| Architecture Agnostic | Works with any HuggingFace LLM and vision encoder |
| Multi-Backend | CUDA, ROCm, TPU, MLX, XPU, CPU — auto-detected |
| Production Ready | Continuous batching, PagedAttention, speculative decoding |
| Optimization Suite | 40+ built-in optimizations for training and inference |
| Full Pipeline | Train, serve, export, evaluate — all in one framework |

Key Features

Model Construction

  • Vision Injection: Add vision capabilities to any language model in 3 lines of code
  • AnyRes Processing: Dynamic high-resolution image support with patch grouping
  • YakiProjector: MLP-based vision-to-LLM alignment with configurable depth and width
  • Token Extending: Automatic tokenizer patching with <image> special tokens (see the sketch after this list)
  • Architecture Detection: Auto-detects LLM hidden dimensions, attention heads, and vocabulary size
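
The tokenizer patching referenced above follows the standard HuggingFace recipe. A minimal sketch of that step written against plain transformers (this mirrors what the framework does automatically at construction time; it is not OpenLLaVA's internal code):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

# Register the special token and grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

image_token_id = tokenizer.convert_tokens_to_ids("<image>")  # placeholder id to splice vision embeddings over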

Training Pipeline

  • 3-Phase Training: Pretraining alignment, visual instruction tuning, RLHF/DPO alignment
  • LoRA Variants: LoRA, LoRA+, LoRAGA, LoRAFA, DoRA, QLoRA, Split LoRA
  • BitNet Training: Ternary weight training (b1.58) with absmean quantization (sketched after this list)
  • MoE + LoRA Fusion: Mixture-of-Experts with LoRA adapters per expert
  • Curriculum Learning: Progressive difficulty scheduling
  • Padding-Free Training: Variable-length sequences without padding tokens
  • Sequence Packing: Pack multiple sequences into single training examples
  • FP8 Training: Native FP8 training on H100 GPUs (Hopper architecture)
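
The BitNet b1.58 mode listed above trains ternary weights scaled by the absmean of the full-precision matrix. A self-contained sketch of the absmean quantizer, following our reading of the b1.58 recipe (illustrative, not the framework's fused kernel):

import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with an absmean scale (b1.58)."""
    scale = w.abs().mean().clamp(min=eps)     # absmean scaling factor
    w_q = (w / scale).round().clamp_(-1, 1)   # ternary weights
    return w_q, scale

w = torch.randn(4096, 4096)
w_q, scale = absmean_ternary(w)
w_hat = w_q * scale  # dequantized weights, used with a straight-through estimator during training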

Inference and Serving

  • Continuous Batching: Dynamic batching with no maximum batch size
  • PagedAttention: Block-level KV cache management for 4x memory efficiency (sketched after this list)
  • Speculative Decoding: Eagle, Medusa, NGram draft models for 2-3x throughput
  • KV Cache Optimizations: Quantization, eviction (H2O, SnapKV, FastGen, WG), compression (PackKV, SWAN)
  • Sparse Attention: Dynamic sparse attention pattern selection
  • Chunked Prefill: Split long prompts into manageable chunks
  • OpenAI-Compatible API: FastAPI server with /v1/chat/completions, streaming, and vision support
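
The PagedAttention entry above rests on one data structure: a block table that maps a sequence's logical KV positions onto fixed-size physical cache blocks, so memory is claimed one block at a time instead of preallocating the full context. A toy allocator showing the bookkeeping (an illustration only; the block size and API are assumptions, not the production kernel):

BLOCK_SIZE = 16  # tokens per KV cache block (assumed; 16 is a common default)

class BlockTable:
    """Toy mapping from a sequence's logical token positions to physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.tables: dict[int, list[int]] = {}  # seq_id -> allocated block ids

    def slot_for(self, seq_id: int, position: int) -> tuple[int, int]:
        table = self.tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):  # first token of a new block
            table.append(self.free.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        self.free.extend(self.tables.pop(seq_id, []))  # blocks return to the pool

pool = BlockTable(num_physical_blocks=1024)
print(pool.slot_for(seq_id=0, position=0))  # (physical block, slot within block)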

Optimization Suite

  • 40+ Built-in Optimizations: From FP8 training to KV cache compression
  • torch.compile: Full-graph compilation with custom backends
  • torchao Integration: Quantization-aware training, weight-only quantization, sparsity
  • GPTQ / AWQ: Post-training weight quantization
  • FP4 / NVFP4: 4-bit floating point quantization for H100
  • GaLore: Gradient Low-Rank Projection for memory-efficient full finetuning
  • EMA: Exponential Moving Average for training stability
  • Selective Checkpointing: Memory-efficient activation checkpointing

Distributed Training

  • FSDP2: Fully Sharded Data Parallel with mixed precision
  • DeepSpeed ZeRO: ZeRO stages 0-3 with CPU/NVMe offload
  • Tensor Parallelism: Megatron-style tensor parallel (1D, 2D, 3D)
  • Pipeline Parallelism: GPipe / 1F1B pipeline scheduling
  • Expert Parallelism: Distributed MoE training
  • Ring Attention: Sequence parallelism for long-context training
  • Heterogeneous Training: GPU + CPU + TPU mixed-device training
  • ZeRO++: Hierarchical ZeRO with communication compression

Multi-Backend Support

| Backend | Hardware | Status |
|---------|----------|--------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper) | Production |
| ROCm | AMD GPUs | Production |
| CPU FP32 | Any x86/x64 CPU | Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
| MLX | Apple Silicon (M1-M4) | Beta |
| XPU | Intel GPUs (Arc, Data Center) | Beta |
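
Backend auto-detection amounts to probing each accelerator runtime in priority order. A simplified sketch of such a probe (our illustration; the priority order and the actual BackendManager checks are assumptions):

import torch

def detect_backend() -> str:
    """Probe accelerator runtimes in (assumed) priority order."""
    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds of PyTorch
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    if getattr(torch, "xpu", None) is not None and torch.xpu.is_available():
        return "xpu"
    if torch.backends.mps.is_available():
        return "mlx"  # proxy check: the MLX backend targets the same Apple Silicon
    return "cpu_fp32"

print(detect_backend())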

Quickstart

Installation

# Core installation (CUDA auto-detected)
pip install openllava

# With CLI tools
pip install openllava[cli]

# With serving capabilities
pip install openllava[serve]

# Full installation
pip install openllava[all]

[!IMPORTANT] OpenLLaVA requires PyTorch 2.3 or later. Install PyTorch separately if your environment requires a specific CUDA version.

Build from Source

git clone https://github.com/OpceanAI/openllava.git
cd openllava

# Install with CUDA extensions
pip install -e .[all]

# Install without CUDA (CPU-only)
OPENLLAVA_NO_CUDA=1 pip install -e .[all]

Inject Vision Into Any LLM

from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)

# View the patched model architecture
print(model)

Train with LoRA

# Apply LoRA adapters
model.lora(r=64, alpha=128, dropout=0.05)

# Phase 1: Vision-language alignment
model.train(phase1=dict(
    dataset="liuhaotian/LLaVA-Pretrain",
    samples=100_000,
    learning_rate=1e-3,
    batch_size=128,
))

# Phase 2: Visual instruction tuning
model.train(phase2=dict(
    dataset="liuhaotian/LLaVA-Instruct-150K",
    learning_rate=2e-4,
    batch_size=32,
))

# Push to HuggingFace Hub
model.push("my-org/my-model")

Run Inference

from openllava import OpenLLaVA

model = OpenLLaVA.from_pretrained("openllava/yaki-8b")

response = model.generate(
    images=["chart.png"],
    prompt="Describe the key trends in this chart.",
    max_new_tokens=512,
    temperature=0.7,
)

print(response)

Serve as OpenAI-Compatible API

openllava serve openllava/yaki-8b --port 8000

Then query the server with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    api_key="openllava",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="yaki-8b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)

Architecture

How Vision Injection Works

  1. Vision Encoding: The input image is processed by a vision encoder (e.g., SigLIP2) producing a grid of patch embeddings.

  2. Patch Grouping: Adjacent patches are grouped (default 3x3) to reduce the visual token count and capture local spatial structure. Each group produces a single 10368-dimensional vector for SigLIP2.

  3. Projection: The YakiProjector MLP maps grouped vision features into the LLM's hidden dimension (e.g., 4096 for Llama-3-8B) through a 2-layer GELU-activated MLP.

  4. Token Patching: The tokenizer is extended with a <image> special token. During processing, this token is replaced by the projected vision embeddings, which are inserted before or interleaved with text embeddings.

  5. Generation: The LLM attends to both visual and textual tokens, enabling multimodal understanding and generation.

[!TIP] The patch grouping size is configurable. Larger groups reduce sequence length at the cost of spatial resolution. The default 3x3 grouping with SigLIP2 produces 81 visual tokens per image.
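
To make the arithmetic concrete: a 384px input through SigLIP2-so400m yields a 27x27 grid of 1152-dim patch embeddings; 3x3 grouping concatenates nine neighbors into one 10368-dim vector, and the projector maps each of the 81 groups to the LLM width. A shape-only sketch of steps 2-3 (assumed tensor layout; not the actual YakiProjector implementation):

import torch
import torch.nn as nn

B, H, W, D = 1, 27, 27, 1152   # SigLIP2-so400m patch grid and hidden size
group, llm_dim = 3, 4096       # 3x3 grouping; Llama-3-8B hidden size

patches = torch.randn(B, H, W, D)
# Group 3x3 neighborhoods: (1, 27, 27, 1152) -> (1, 81, 10368)
g = patches.reshape(B, H // group, group, W // group, group, D)
g = g.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // group) * (W // group), group * group * D)

projector = nn.Sequential(     # 2-layer GELU MLP, as in step 3
    nn.Linear(group * group * D, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
print(projector(g).shape)      # torch.Size([1, 81, 4096])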


Core Concepts

OpenLLaVA Class

The central entry point. It orchestrates model loading, patching, training, and inference.

from openllava import OpenLLaVA

model = OpenLLaVA(
    llm="OpceanAI/Yuuki-RxG",                           # HF model ID or local path
    vision_encoder="google/siglip2-so400m-patch14-384",  # HF vision encoder
    architecture="llava",                                # Architecture variant
    backend=Backend.AUTO,                                # Backend selection
    torch_dtype=torch.bfloat16,                          # Compute dtype
    device_map="auto",                                   # Device mapping strategy
)

YakiProjector

The MLP projector that aligns vision features to the LLM's embedding space.

from openllava import YakiProjector

projector = YakiProjector(
    vision_hidden_size=1152,      # SigLIP2 hidden dimension
    llm_hidden_size=4096,         # Llama-3-8B hidden dimension
    patch_group=3,                # 3x3 patch grouping
    projector_depth=2,            # MLP depth
    activation="gelu",            # Activation function
    dropout=0.0,                  # Dropout rate
)

FastVisionModel API

An Unsloth-style API for quick model loading and PEFT configuration.

from openllava.api import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "openllava/yaki-8b",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,
    alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

FastVisionModel.for_training(model)

Backend Abstraction

Hardware backends are auto-detected on import. Explicit selection is also supported.

from openllava import Backend, BackendManager

# Auto-detection (default)
manager = BackendManager()

# Explicit selection
manager = BackendManager(Backend.CUDA)

# Available backends
for backend in Backend:
    print(backend.value)
    # auto, cuda, cpu_fp32, tpu, xpu, rocm, mlx, heterogeneous

Training Pipeline

OpenLLaVA employs a 3-phase training pipeline designed for optimal vision-language alignment.

Phase 1: Vision-Language Alignment

Aligns the vision encoder and projector with the LLM's embedding space using image-caption pairs.

| Parameter | Recommended Value | Description |
|-----------|-------------------|-------------|
| Dataset | liuhaotian/LLaVA-Pretrain | 100K image-caption pairs |
| Learning Rate | 1e-3 | High LR for projector convergence |
| Batch Size | 128 | Large batches recommended |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 1 | Single pass sufficient |

Phase 2: Visual Instruction Tuning

Fine-tunes the entire model (or LoRA adapters) on visual instruction-following data.

| Parameter | Recommended Value | Description |
|-----------|-------------------|-------------|
| Dataset | liuhaotian/LLaVA-Instruct-150K | 150K visual instructions |
| Learning Rate | 2e-4 | Lower LR for instruction tuning |
| Batch Size | 32 | Moderate batch size |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 3-5 | Multiple epochs beneficial |

Phase 3: RL Alignment (Optional)

Aligns the model with human preferences using RLHF, DPO, GRPO, or ORPO.

from openllava.api import OpenLLaVATrainer
from openllava.api import TrainingConfig

config = TrainingConfig(
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    output_dir="./yaki-checkpoints",
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    learning_rate_phase1=1e-3,
    learning_rate_phase2=2e-4,
    batch_size_phase1=128,
    batch_size_phase2=32,
    num_epochs_phase2=3,
    save_steps=500,
    logging_steps=10,
    report_to="wandb",
)

trainer = OpenLLaVATrainer(config)
trainer.train()

# Or train step-by-step
trainer.train_phase1()
trainer.train_phase2()

Training Modes

| Mode | Description | Memory Usage | Speed |
|------|-------------|--------------|-------|
| lora | Low-Rank Adaptation | Low | Fast |
| qlora | 4-bit LoRA | Very Low | Moderate |
| lora_plus | Separate learning rates for A/B matrices | Low | Fast |
| dora | Weight-Decomposed Low-Rank Adaptation | Low | Fast |
| lora_ga | LoRA with Gradient Approximation | Low | Moderate |
| lora_fa | LoRA with frozen A matrices | Low | Fast |
| full_finetune | Full parameter fine-tuning | High | Slow |
| bitnet | Ternary weight training (b1.58) | Very Low | Fast |
| moe_lora | Mixture-of-Experts with LoRA | Moderate | Moderate |
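
As a concrete example of the lora_plus row, LoRA+ gives the B matrices a higher learning rate than the A matrices. A sketch using standard PyTorch optimizer parameter groups and PEFT-style parameter names (the lora_A/lora_B naming is an assumption, not OpenLLaVA's trainer code):

import torch

def loraplus_param_groups(model: torch.nn.Module, lr: float = 2e-4, lr_ratio: float = 16.0):
    """Split LoRA A/B matrices into optimizer groups with different learning rates."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_A" in name:
            a_params.append(p)
        elif "lora_B" in name:
            b_params.append(p)
    return [
        {"params": a_params, "lr": lr},             # A matrices: base LR
        {"params": b_params, "lr": lr * lr_ratio},  # B matrices: lr_ratio x higher
    ]

# optimizer = torch.optim.AdamW(loraplus_param_groups(peft_model), weight_decay=0.0)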

Optimizations

OpenLLaVA ships with 40+ built-in optimizations covering training, inference, memory, and quantization.

[!NOTE] All optimizations are opt-in and configurable. The framework applies sensible defaults based on hardware detection.

Training Optimizations

| Optimization | Description | Hardware |
|--------------|-------------|----------|
| FP8 Training | Native FP8 forward/backward pass | H100 (Hopper) |
| Padding-Free | Variable-length sequences without padding | All |
| Sequence Packing | Pack multiple sequences per example | All |
| Selective Checkpointing | Activation checkpointing with heuristics | All |
| CPU Offloading | Async CPU offload for optimizer states | All |
| GPU Memory Pooling | Pre-allocated memory pool for tensors | CUDA |
| torch.compile | Full-graph compilation | All |
| EMA | Exponential Moving Average | All |
| Curriculum Learning | Progressive difficulty scheduling | All |
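
Sequence packing from the table above can be done with a greedy first-fit pass; each packed bin also needs per-sequence boundaries so padding-free attention can avoid attending across sequence joins. A simplified sketch (the packing logic only; the framework's collator also builds attention metadata that this omits):

def pack_sequences(sequences, max_len=2048):
    """Greedy first-fit packing of token sequences into bins of <= max_len tokens.

    Returns the packed bins plus per-bin (start, end) boundaries so attention
    can still be masked per original sequence.
    """
    bins, boundaries = [], []
    for seq in sorted(sequences, key=len, reverse=True):
        for tokens, bounds in zip(bins, boundaries):
            if len(tokens) + len(seq) <= max_len:
                bounds.append((len(tokens), len(tokens) + len(seq)))
                tokens.extend(seq)
                break
        else:
            bins.append(list(seq))
            boundaries.append([(0, len(seq))])
    return bins, boundaries

packed, bounds = pack_sequences([[1] * 900, [2] * 700, [3] * 1200])
print([len(b) for b in packed], bounds)  # [1900, 900] with per-sequence boundaries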

Quantization

| Technique | Bits | Type | Use Case |
|-----------|------|------|----------|
| GPTQ | 2-4 | Post-training | Inference speedup |
| AWQ | 4 | Post-training | Inference speedup |
| FP8 | 8 | Training/Inference | H100 training |
| FP4 (NVFP4) | 4 | Inference | H100 inference |
| QAT | 2-8 | Training | Quantization-aware training |
| torchao | 2-8 | Post-training | Weight-only quantization |
| NF4 (bitsandbytes) | 4 | Training | QLoRA |
| BitNet b1.58 | 1.58 | Training | Ternary weights |

KV Cache Optimizations

| Optimization | Method | Memory Savings |
|--------------|--------|----------------|
| KV Quantization | FP8/INT8 KV cache | 50% |
| H2O Eviction | Heavy Hitter Oracle policy | 20-50% |
| SnapKV | Snapshot-based eviction | 20-50% |
| FastGen | Generation-aware eviction | 20-40% |
| WG Eviction | Window-Guided eviction | 30-50% |
| PackKV | Cache compression | 50-75% |
| SWAN | Sliding Window Attention with cache | 40-60% |
| Chunked Prefill | Split long prompts into chunks | Variable |
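
The KV Quantization row amounts to storing K/V in 8 bits with a per-head scale and dequantizing on read. A minimal symmetric INT8 sketch (illustrative; real kernels fuse this into the attention op):

import torch

def quantize_kv(kv: torch.Tensor):
    """Symmetric per-head INT8 quantization of a [batch, heads, seq, head_dim] tensor."""
    scale = kv.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / 127.0
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

k = torch.randn(1, 32, 4096, 128)
q, scale = quantize_kv(k)
print(q.element_size() / k.element_size())  # 0.25: int8 vs this fp32 example (0.5 vs bf16)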

Speculative Decoding

| Method | Description | Speedup |
|--------|-------------|---------|
| Eagle Draft | Eagle-style draft model | 2-3x |
| Medusa Heads | Multi-head speculative decoding | 2-3x |
| NGram Draft | N-gram based draft model | 1.5-2x |
| Tree Verification | Parallel verification of draft tokens | 2-3x |
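
All four methods share one loop: a cheap draft proposes k tokens, the target model scores every drafted position in a single forward pass, and the longest agreeing prefix is kept. A greedy-acceptance sketch assuming HuggingFace-style models, batch size 1, and greedy decoding (structure only; Eagle and Medusa verify whole trees of drafts):

import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy draft-and-verify step (batch size 1 assumed)."""
    ids = input_ids
    for _ in range(k):                                   # draft k tokens cheaply
        nxt = draft(ids).logits[:, -1:].argmax(-1)
        ids = torch.cat([ids, nxt], dim=-1)
    # One target forward pass scores all drafted positions at once.
    logits = target(ids).logits
    verify = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)
    drafted = ids[:, input_ids.shape[1]:]
    accept = int((verify == drafted).cumprod(dim=-1).sum())  # agreeing prefix length
    kept = ids[:, : input_ids.shape[1] + accept]
    if accept < k:
        bonus = verify[:, accept : accept + 1]           # target's correction token
    else:
        bonus = logits[:, -1:].argmax(-1)                # target's own next token
    return torch.cat([kept, bonus], dim=-1)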

Other Optimizations

| Optimization | Description |
|--------------|-------------|
| torchao Sparsity | Weight sparsification for inference |
| MXFP8 MoE | MXFP8 format for MoE layers |
| VQ Codebook EMA | YADIS VQ codebook EMA updates |
| Fused Cross-Attention | YADIS fused cross-attention |
| Adaptive MoE Routing | YADIS dynamic expert routing |
| Split LoRA | Split LoRA across devices |
| GaLore | Gradient Low-Rank Projection |
| Mixed-Precision Quantization | MicroMix per-layer precision |
| Async I/O | nvJPEG + async data loading |

Apply optimizations programmatically:
from openllava.optimizations import (
    compile_model,
    enable_fp8_training,
    gptq_quantize,
    EMAModel,
)

# Compile the model
model.model = compile_model(model.model, mode="max-autotune")

# Enable FP8 training (H100 only)
enable_fp8_training(model.model)

# GPTQ quantization
gptq_quantize(model.model, bits=4, dataset="c4")

# Enable EMA tracking
ema = EMAModel(model.model, decay=0.999)

CLI Reference

The openllava CLI provides five main commands.

openllava --help

train

Train a vision-language model.

openllava train \
  --llm meta-llama/Llama-3-8B \
  --vision-encoder google/siglip2-so400m-patch14-384 \
  --phase1-dataset liuhaotian/LLaVA-Pretrain \
  --phase2-dataset liuhaotian/LLaVA-Instruct-150K \
  --output-dir ./checkpoints \
  --lora-r 64 \
  --lora-alpha 128 \
  --batch-size 128 \
  --learning-rate 1e-3 \
  --num-epochs 3 \
  --report-to wandb

Training modes:

# QLoRA (4-bit quantized)
openllava train --mode qlora --load-in-4bit

# BitNet (ternary weights)
openllava train --mode bitnet

# Full fine-tuning
openllava train --mode full_finetune

# MoE + LoRA
openllava train --mode moe_lora --num-experts 8

serve

Launch an OpenAI-compatible inference server.

openllava serve openllava/yaki-8b --port 8000

# With advanced features
openllava serve openllava/yaki-8b \
  --port 8000 \
  --batch-size 64 \
  --max-seq-len 4096 \
  --paged-attention \
  --continuous-batching \
  --speculative-decoding \
  --kv-cache-dtype fp8

[!TIP] The inference server supports all OpenAI SDK features: streaming, vision inputs, function calling, and structured JSON output.

export

Export a model to various formats.

# HuggingFace SafeTensors
openllava export openllava/yaki-8b --format safetensors --output ./export

# GGUF (for llama.cpp)
openllava export openllava/yaki-8b --format gguf --quant q4_k_m

# ONNX
openllava export openllava/yaki-8b --format onnx --output ./export

# vLLM
openllava export openllava/yaki-8b --format vllm

# MLX (Apple Silicon)
openllava export openllava/yaki-8b --format mlx

benchmark

Benchmark model performance.

openllava benchmark openllava/yaki-8b

# Specific benchmarks
openllava benchmark openllava/yaki-8b \
  --throughput \
  --latency \
  --memory \
  --batch-sizes 1,8,32,64

info

Display system and framework information.

openllava info

API Reference

OpenLLaVA

class OpenLLaVA:
    def __init__(
        self,
        llm: str,
        vision_encoder: str = "google/siglip2-so400m-patch14-384",
        architecture: str = "llava",
        backend: Backend = Backend.AUTO,
        torch_dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
        trust_remote_code: bool = False,
    )

    # Training
    def lora(self, r: int = 64, alpha: int = 128, dropout: float = 0.05,
             target_modules: Optional[List[str]] = None) -> "OpenLLaVA":
    def dora(self, r: int = 64, alpha: int = 128, ...) -> "OpenLLaVA":
    def qlora(self, r: int = 64, ..., load_in_4bit: bool = True) -> "OpenLLaVA":
    def lora_plus(self, r: int = 64, lr_ratio: float = 16.0) -> "OpenLLaVA":
    def bitnet(self) -> "OpenLLaVA":
    def train(self, phase1: Optional[dict] = None,
              phase2: Optional[dict] = None, **kwargs):

    # RL Alignment
    def dpo(self, dataset: str, ..., beta: float = 0.1):
    def grpo(self, dataset: str, ..., group_size: int = 8):
    def orpo(self, dataset: str, ...):

    # Inference
    def generate(self, images: Union[str, List[str]], prompt: str,
                 **generate_kwargs) -> str:
    def chat(self, messages: List[dict], images: Optional[List[str]] = None,
             **generate_kwargs) -> str:

    # I/O
    def save(self, path: str, merge_lora: bool = False):
    def push(self, repo_id: str, merge_lora: bool = False):
    @classmethod
    def from_pretrained(cls, repo_id: str, **kwargs) -> "OpenLLaVA":

FastVisionModel

class FastVisionModel:
    @classmethod
    def from_pretrained(
        cls,
        model_id: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = False,
        load_in_8bit: bool = False,
        dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
    ) -> Tuple[nn.Module, AutoTokenizer]:

    @classmethod
    def get_peft_model(
        cls,
        model: nn.Module,
        r: int = 16,
        alpha: int = 32,
        target_modules: Optional[List[str]] = None,
        modules_to_save: Optional[List[str]] = None,
    ) -> nn.Module:

    @classmethod
    def for_training(cls, model: nn.Module):
    @classmethod
    def for_inference(cls, model: nn.Module):

OpenLLaVATrainer

class OpenLLaVATrainer:
    def __init__(self, config: TrainingConfig):
    def train(self):
    def train_phase1(self):
    def train_phase2(self):
    def train_rl(self, method: str = "dpo", **kwargs):
    def save(self, path: str):
    def push(self, repo_id: str):
    def evaluate(self, benchmarks: List[str] = ["scienceqa", "mmbench"]):

InferenceEngine

class OpenLLaVAInferenceEngine:
    def __init__(self, model_id: str, **kwargs):
    def generate(self, prompt: str, images: Optional[List[str]] = None,
                 max_tokens: int = 512, temperature: float = 0.7,
                 stream: bool = False) -> Union[str, Generator]:
    def chat(self, messages: List[dict], **kwargs) -> str:
    def get_stats(self) -> dict:

Server

from openllava.serve import OpenLLaVAServer

server = OpenLLaVAServer(
    model_id="openllava/yaki-8b",
    host="0.0.0.0",
    port=8000,
    api_key="sk-openllava",           # Optional auth
    rate_limit=100,                    # Requests per minute
    continuous_batching=True,
    paged_attention=True,
)

server.run()

Distributed Training

OpenLLaVA supports a comprehensive distributed training stack spanning multiple parallelism strategies.

[!WARNING] Distributed training requires a cluster with high-speed interconnects (NVLink, InfiniBand, or RoCE). The framework auto-detects topology and recommends optimal strategies.

Parallelism Strategy Comparison

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| FSDP2 | Fully Sharded Data Parallel | Single-node multi-GPU |
| DeepSpeed ZeRO-1 | Optimizer state partitioning | Large models, moderate speedup |
| DeepSpeed ZeRO-2 | Optimizer + gradient partitioning | Large models, good speedup |
| DeepSpeed ZeRO-3 | Full parameter partitioning | Very large models (>13B) |
| Tensor Parallel (1D) | Split tensors across GPUs | >13B, high-bandwidth interconnect |
| Tensor Parallel (2D/3D) | 2D/3D tensor sharding | Very large models, multi-node |
| Pipeline Parallel | Layer-level partitioning | Multi-node, deep models |
| Expert Parallel | Distribute MoE experts | MoE models |
| Ring Attention | Sequence parallelism | Long context (>32K) |
| Heterogeneous | GPU+CPU+TPU mixed | Resource-constrained environments |

FSDP2

from openllava import OpenLLaVA
from openllava.distributed import FSDPConfig

config = FSDPConfig(
    sharding_strategy="hybrid",
    cpu_offload=False,
    mixed_precision="bf16",
    activation_checkpointing=True,
    limit_all_gathers=True,
)

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
)

model.train(
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K"),
    distributed="fsdp",
    fsdp_config=config,
)

DeepSpeed ZeRO

from openllava.distributed import DeepSpeedConfig

config = DeepSpeedConfig(
    zero_stage=3,
    offload_optimizer="cpu",
    offload_params="nvme",
    gradient_accumulation_steps=4,
    gradient_clipping=1.0,
    communication_dtype="bf16",
)

Auto-Parallelism

from openllava.distributed import auto_parallel
from openllava.utils import HardwareDetector

detector = HardwareDetector()
topology = detector.detect_topology()

strategy = auto_parallel(
    model_size=8_000_000_000,    # 8B parameters
    hardware=topology,
    memory_budget_gb=80,
    target_throughput=1000,       # tokens per second
)

print(f"Recommended strategy: {strategy.name}")
print(f"World size: {strategy.world_size}")
print(f"Strategy config: {strategy.config}")

RL Alignment

OpenLLaVA supports four RL alignment methods for post-training preference optimization.

| Method | Description | Use Case |
|--------|-------------|----------|
| DPO | Direct Preference Optimization | Binary preference pairs |
| GRPO | Group Relative Policy Optimization | Multi-response ranking |
| ORPO | Odds Ratio Preference Optimization | Preference optimization without reference model |
| PPO | Proximal Policy Optimization | Full RLHF pipeline with reward model |

Example calls:
# DPO
model.dpo(
    dataset="your-dpo-dataset",
    beta=0.1,
    learning_rate=5e-6,
    batch_size=16,
)

# GRPO
model.grpo(
    dataset="your-grpo-dataset",
    group_size=8,
    learning_rate=1e-6,
)

# ORPO
model.orpo(
    dataset="your-orpo-dataset",
    lambda_weight=0.5,
    learning_rate=1e-6,
)
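
Under the hood, DPO reduces to a logistic loss on the gap between policy and reference log-ratios for chosen vs. rejected responses. A reference sketch of the loss, assuming per-example sequence log-probabilities are already computed (not OpenLLaVA's trainer code):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: logistic loss on the policy-vs-reference log-ratio gap."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/ref on preferred
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/ref on dispreferred
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()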

Reward Functions

from openllava.rl.rewards import (
    ExactMatchReward,
    F1Reward,
    FormatReward,
    SafetyReward,
    CompositeReward,
)

reward_fn = CompositeReward([
    ExactMatchReward(target="expected_answer"),
    FormatReward(pattern=r"```.*```"),
    SafetyReward(),
])

Export and Deployment

Model Export Formats

| Format | Use Case | Tool |
|--------|----------|------|
| SafeTensors | HuggingFace Hub, PyTorch | openllava export |
| GGUF | llama.cpp, Ollama, local CPU inference | openllava export --format gguf |
| ONNX | ONNX Runtime, cross-platform inference | openllava export --format onnx |
| vLLM | High-throughput production serving | openllava export --format vllm |
| MLX | Apple Silicon inference | openllava export --format mlx |

Programmatic export:
from openllava.export import export_to_gguf, export_to_onnx, push_to_hub

# Export to GGUF
export_to_gguf(model, output_path="./model.gguf", quant="q4_k_m")

# Export to ONNX
export_to_onnx(model, output_path="./model.onnx")

# Push to HuggingFace Hub
model.push("openllava/yaki-8b", private=False)

# Or via CLI
push_to_hub(
    repo_id="openllava/yaki-8b",
    local_path="./checkpoints",
    commit_message="Release Yaki-8B v1",
)

LoRA Merge

from openllava.export import merge_lora_weights

# Merge LoRA weights into base model
model = merge_lora_weights(model)
model.save("./merged-model")
model.push("my-org/my-model-merged")

Evaluation

OpenLLaVA integrates with standard multimodal benchmarks.

from openllava.eval import EvalRunner

runner = EvalRunner(
    model=model,
    benchmarks=["scienceqa", "mmbench", "textvqa"],
    batch_size=16,
)

results = runner.run()
print(results)

# Results per benchmark
{
    "scienceqa": {"accuracy": 0.912, "samples": 4241},
    "mmbench": {"accuracy": 0.763, "samples": 2975},
    "textvqa": {"accuracy": 0.684, "samples": 5000},
}
Or via the CLI:

openllava eval \
  --model openllava/yaki-8b \
  --benchmarks scienceqa,mmbench,textvqa \
  --batch-size 16

Configuration

Training Configuration

from openllava.api import TrainingConfig

config = TrainingConfig(
    # Phase 1
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase1_learning_rate=1e-3,
    phase1_batch_size=128,
    phase1_max_samples=100_000,

    # Phase 2
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    phase2_learning_rate=2e-4,
    phase2_batch_size=32,
    phase2_num_epochs=3,

    # LoRA
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    lora_target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],

    # Optimization
    optim="adamw_torch",
    warmup_ratio=0.03,
    weight_decay=0.0,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,

    # Precision
    torch_dtype="bfloat16",
    load_in_4bit=False,

    # Checkpointing
    output_dir="./checkpoints",
    save_steps=500,
    save_total_limit=5,
    logging_steps=10,
    report_to="wandb",

    # Distributed
    distributed_strategy="fsdp",
    fsdp_sharding_strategy="hybrid",
    deepspeed_zero_stage=3,
)

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| CUDA_VISIBLE_DEVICES | (unset) | GPU device IDs |
| OPENLLAVA_BACKEND | auto | Force backend selection |
| OPENLLAVA_CACHE_DIR | ~/.cache/openllava | Cache directory |
| OPENLLAVA_NO_CUDA | false | Disable CUDA detection |
| HF_TOKEN | (unset) | HuggingFace Hub token |
| WANDB_API_KEY | (unset) | Weights & Biases key |
| PJRT_DEVICE | (unset) | TPU device type |
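
Because backends are detected at import time (see Backend Abstraction above), backend-related variables must be set before importing the package. A minimal sketch:

import os

# Set backend-related variables before the package runs import-time detection.
os.environ["OPENLLAVA_BACKEND"] = "rocm"
os.environ["OPENLLAVA_CACHE_DIR"] = "/data/openllava-cache"

import openllava  # detection reads these on import, per the docs above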

Backends

OpenLLaVA supports six hardware backends with automatic device detection and operation routing.

CUDA (NVIDIA)

from openllava import Backend

model = OpenLLaVA(llm="...", backend=Backend.CUDA)

Optimized for NVIDIA Ampere (A100/A30), Ada Lovelace (RTX 4090), and Hopper (H100) architectures. Uses FlashAttention-2, FP8 training on H100, and CUDA graphs for reduced kernel launch overhead.

[!IMPORTANT] CUDA 11.8 or later is required. Ampere or newer architecture recommended. FlashAttention-2 is auto-enabled when supported.

ROCm (AMD)

model = OpenLLaVA(llm="...", backend=Backend.ROCM)

Supports AMD MI250, MI300X, and RX 7000 series GPUs. Uses ROCm-aware Triton kernels and the Composable Kernel library for optimized matmul and attention.

CPU FP32

model = OpenLLaVA(llm="...", backend=Backend.CPU_FP32)

Falls back to FP32 computation with SIMD-optimized kernels (AVX-512, AVX2, NEON). Suitable for CPU-only inference and development environments.

TPU (Google)

model = OpenLLaVA(llm="...", backend=Backend.TPU)

Requires torch_xla and jax. Supports TPU v3-v5 with SPMD (Single Program Multiple Data) for model parallelism.

MLX (Apple Silicon)

model = OpenLLaVA(llm="...", backend=Backend.MLX)

Requires mlx and mlx-lm. Optimized for Apple M1-M4 series with unified memory architecture.

XPU (Intel)

model = OpenLLaVA(llm="...", backend=Backend.XPU)

Supports Intel Arc A-series and Data Center GPU Max Series via intel-extension-for-pytorch.

Heterogeneous

model = OpenLLaVA(llm="...", backend=Backend.HETEROGENEOUS)

Distributes model layers across multiple device types (e.g., GPU + CPU + TPU) for resource-constrained environments.


Performance

Training Throughput (tokens/second, BF16)

| Model | GPU | LoRA | Full FT |
|-------|-----|------|---------|
| LLaVA-7B (Llama-2) | 1x A100-80GB | 2,850 | 1,240 |
| LLaVA-13B (Vicuna) | 1x A100-80GB | 1,620 | 680 |
| LLaVA-7B | 8x A100-80GB (FSDP) | 21,400 | 9,600 |
| LLaVA-13B | 8x A100-80GB (FSDP) | 12,800 | 5,400 |

Inference Latency (first token, ms)

| Model | GPU | FlashAttn | PagedAttn | Speculative |
|-------|-----|-----------|-----------|-------------|
| Yaki-7B | A100-80GB | 45 | 38 | 22 |
| Yaki-7B | RTX 4090 | 38 | 32 | 18 |
| Yaki-13B | A100-80GB | 72 | 61 | 35 |
| Yaki-13B | 2x A100 (TP) | 40 | 34 | 20 |

Memory Usage (GB, Yaki-7B with LoRA)

| Configuration | Peak Memory (GB) | Notes |
|---------------|------------------|-------|
| FP32 Full FT | 56.2 | Not recommended |
| BF16 Full FT | 28.8 | Recommended |
| BF16 LoRA (r=64) | 18.4 | Default |
| FP16 QLoRA (4-bit) | 10.2 | Memory-constrained |
| BitNet b1.58 | 6.8 | Maximum efficiency |

Project Structure

openllava/
├── openllava/                    # Main Python package
│   ├── core/                     # Core model, backend, patcher
│   ├── api/                      # High-level FastModel + Trainer API
│   ├── cli/                      # Click-based CLI (train, serve, export, benchmark)
│   ├── data/                     # Dataset loading, templates, collators, streaming
│   ├── training/                 # LoRA variants, BitNet, DoRA, checkpointing
│   ├── rl/                       # RL alignment (DPO, GRPO, ORPO, PPO)
│   ├── inference/                # Inference engine, continuous batching, PagedAttention
│   ├── serve/                    # FastAPI OpenAI-compatible server
│   ├── optimizations/            # 40+ optimizations (FP8, KV cache, quantization, etc.)
│   ├── experts/                  # Mixture-of-Experts layers and training
│   ├── distributed/              # FSDP, DeepSpeed, TP, PP, EP, ring attention
│   ├── backends/                 # CUDA, ROCm, MLX, TPU, XPU, CPU, ONNX, GGUF
│   ├── kernels/                  # Triton kernels + CUDA graphs
│   │   ├── triton/               # Fused attention, RoPE, SwiGLU, RMSNorm, etc.
│   │   └── cuda_graphs/          # CUDA graph capture
│   ├── export/                   # GGUF, ONNX, SafeTensors, vLLM, MLX export
│   ├── eval/                     # ScienceQA, MMBench, TextVQA benchmarks
│   └── utils/                    # Hardware detection, profiling, model cards
├── csrc/                         # C++/CUDA/CPU native extensions
│   ├── gpu/                      # CUDA kernels (projector, cross-attention, VQ)
│   ├── cpu/                      # CPU fallbacks (offload, quantization, GGUF)
│   └── tpu/                      # TPU XLA backend
├── setup.py                      # Python packaging + CMake extension build
├── pyproject.toml                # Project configuration
├── CMakeLists.txt                # C++/CUDA build system
└── LICENSE                       # Apache 2.0

License

OpenLLaVA is licensed under the Apache License 2.0.

Copyright (c) 2024-2026 OpceanAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

OpenLLaVA — Vision injection for every language model.

Built by OpceanAI Research Team
