
Open-Source Multimodal Vision Injection Framework for Any Language Model



OpenLLaVA is an open-source multimodal vision injection framework for adding vision capabilities to any language model. Architecture-agnostic, multi-backend, and production-ready — from research to deployment.

Quickstart · Architecture · Core Concepts · Training · Optimizations · CLI · Distributed


Overview

OpenLLaVA is a comprehensive framework for injecting vision capabilities into any HuggingFace language model. It provides a complete pipeline — from model construction through training, inference, serving, export, and evaluation — all accessible through a unified Python API and CLI.

The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, etc.) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the training and inference pipelines.

[!NOTE] OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU — with automatic hardware detection and optimal configuration selection.

Design Principles

| Principle | Description |
|-----------|-------------|
| Architecture Agnostic | Works with any HuggingFace LLM and vision encoder |
| Multi-Backend | CUDA, ROCm, TPU, MLX, XPU, CPU — auto-detected |
| Production Ready | Continuous batching, PagedAttention, speculative decoding |
| Optimization Suite | 40+ built-in optimizations for training and inference |
| Full Pipeline | Train, serve, export, evaluate — all in one framework |

Key Features

Model Construction

  • Vision Injection: Add vision capabilities to any language model in 3 lines of code
  • AnyRes Processing: Dynamic high-resolution image support with patch grouping
  • YakiProjector: MLP-based vision-to-LLM alignment with configurable depth and width
  • Token Extending: Automatic tokenizer patching with <image> special tokens (see the sketch after this list)
  • Architecture Detection: Auto-detects LLM hidden dimensions, attention heads, and vocabulary size
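
The tokenizer patching referenced above follows the standard HuggingFace recipe. A minimal sketch of that step written against plain transformers (this mirrors what the framework does automatically at construction time; it is not OpenLLaVA's internal code):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

# Register the special token and grow the embedding matrix to match.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

image_token_id = tokenizer.convert_tokens_to_ids("<image>")  # placeholder id to splice vision embeddings over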

Training Pipeline

  • 3-Phase Training: Pretraining alignment, visual instruction tuning, RLHF/DPO alignment
  • LoRA Variants: LoRA, LoRA+, LoRAGA, LoRAFA, DoRA, QLoRA, Split LoRA
  • BitNet Training: Ternary weight training (b1.58) with absmean quantization (sketched after this list)
  • MoE + LoRA Fusion: Mixture-of-Experts with LoRA adapters per expert
  • Curriculum Learning: Progressive difficulty scheduling
  • Padding-Free Training: Variable-length sequences without padding tokens
  • Sequence Packing: Pack multiple sequences into single training examples
  • FP8 Training: Native FP8 training on H100 GPUs (Hopper architecture)
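
The BitNet b1.58 mode listed above trains ternary weights scaled by the absmean of the full-precision matrix. A self-contained sketch of the absmean quantizer, following our reading of the b1.58 recipe (illustrative, not the framework's fused kernel):

import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with an absmean scale (b1.58)."""
    scale = w.abs().mean().clamp(min=eps)     # absmean scaling factor
    w_q = (w / scale).round().clamp_(-1, 1)   # ternary weights
    return w_q, scale

w = torch.randn(4096, 4096)
w_q, scale = absmean_ternary(w)
w_hat = w_q * scale  # dequantized weights, used with a straight-through estimator during training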

Inference and Serving

  • Continuous Batching: Dynamic batching with no maximum batch size
  • PagedAttention: Block-level KV cache management for 4x memory efficiency (sketched after this list)
  • Speculative Decoding: Eagle, Medusa, NGram draft models for 2-3x throughput
  • KV Cache Optimizations: Quantization, eviction (H2O, SnapKV, FastGen, WG), compression (PackKV, SWAN)
  • Sparse Attention: Dynamic sparse attention pattern selection
  • Chunked Prefill: Split long prompts into manageable chunks
  • OpenAI-Compatible API: FastAPI server with /v1/chat/completions, streaming, and vision support
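
The PagedAttention entry above rests on one data structure: a block table that maps a sequence's logical KV positions onto fixed-size physical cache blocks, so memory is claimed one block at a time instead of preallocating the full context. A toy allocator showing the bookkeeping (an illustration only; the block size and API are assumptions, not the production kernel):

BLOCK_SIZE = 16  # tokens per KV cache block (assumed; 16 is a common default)

class BlockTable:
    """Toy mapping from a sequence's logical token positions to physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.tables: dict[int, list[int]] = {}  # seq_id -> allocated block ids

    def slot_for(self, seq_id: int, position: int) -> tuple[int, int]:
        table = self.tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):  # first token of a new block
            table.append(self.free.pop())
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        self.free.extend(self.tables.pop(seq_id, []))  # blocks return to the pool

pool = BlockTable(num_physical_blocks=1024)
print(pool.slot_for(seq_id=0, position=0))  # (physical block, slot within block)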

Optimization Suite

  • 40+ Built-in Optimizations: From FP8 training to KV cache compression
  • torch.compile: Full-graph compilation with custom backends
  • torchao Integration: Quantization-aware training, weight-only quantization, sparsity
  • GPTQ / AWQ: Post-training weight quantization
  • FP4 / NVFP4: 4-bit floating point quantization for H100
  • GaLore: Gradient Low-Rank Projection for memory-efficient full finetuning
  • EMA: Exponential Moving Average for training stability
  • Selective Checkpointing: Memory-efficient activation checkpointing

Distributed Training

  • FSDP2: Fully Sharded Data Parallel with mixed precision
  • DeepSpeed ZeRO: ZeRO stages 0-3 with CPU/NVMe offload
  • Tensor Parallelism: Megatron-style tensor parallel (1D, 2D, 3D)
  • Pipeline Parallelism: GPipe / 1F1B pipeline scheduling
  • Expert Parallelism: Distributed MoE training
  • Ring Attention: Sequence parallelism for long-context training
  • Heterogeneous Training: GPU + CPU + TPU mixed-device training
  • ZeRO++: Hierarchical ZeRO with communication compression

Multi-Backend Support

| Backend | Hardware | Status |
|---------|----------|--------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper) | Production |
| ROCm | AMD GPUs | Production |
| CPU FP32 | Any x86/x64 CPU | Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
| MLX | Apple Silicon (M1-M4) | Beta |
| XPU | Intel GPUs (Arc, Data Center) | Beta |
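
Backend auto-detection amounts to probing each accelerator runtime in priority order. A simplified sketch of such a probe (our illustration; the priority order and the actual BackendManager checks are assumptions):

import torch

def detect_backend() -> str:
    """Probe accelerator runtimes in (assumed) priority order."""
    if torch.cuda.is_available():
        # torch.version.hip is set on ROCm builds of PyTorch
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    if getattr(torch, "xpu", None) is not None and torch.xpu.is_available():
        return "xpu"
    if torch.backends.mps.is_available():
        return "mlx"  # proxy check: the MLX backend targets the same Apple Silicon
    return "cpu_fp32"

print(detect_backend())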

Quickstart

Installation

# Core installation (CUDA auto-detected)
pip install openllava

# With CLI tools
pip install openllava[cli]

# With serving capabilities
pip install openllava[serve]

# Full installation
pip install openllava[all]

[!IMPORTANT] OpenLLaVA requires PyTorch 2.3 or later. Install PyTorch separately if your environment requires a specific CUDA version.

Build from Source

git clone https://github.com/OpceanAI/openllava.git
cd openllava

# Install with CUDA extensions
pip install -e .[all]

# Install without CUDA (CPU-only)
OPENLLAVA_NO_CUDA=1 pip install -e .[all]

Inject Vision Into Any LLM

from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)

# View the patched model architecture
print(model)

Train with LoRA

# Apply LoRA adapters
model.lora(r=64, alpha=128, dropout=0.05)

# Phase 1: Vision-language alignment
model.train(phase1=dict(
    dataset="liuhaotian/LLaVA-Pretrain",
    samples=100_000,
    learning_rate=1e-3,
    batch_size=128,
))

# Phase 2: Visual instruction tuning
model.train(phase2=dict(
    dataset="liuhaotian/LLaVA-Instruct-150K",
    learning_rate=2e-4,
    batch_size=32,
))

# Push to HuggingFace Hub
model.push("my-org/my-model")

Run Inference

from openllava import OpenLLaVA

model = OpenLLaVA.from_pretrained("openllava/yaki-8b")

response = model.generate(
    images=["chart.png"],
    prompt="Describe the key trends in this chart.",
    max_new_tokens=512,
    temperature=0.7,
)

print(response)

Serve as OpenAI-Compatible API

openllava serve openllava/yaki-8b --port 8000

Then query the server with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    api_key="openllava",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="yaki-8b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)

Architecture

How Vision Injection Works

  1. Vision Encoding: The input image is processed by a vision encoder (e.g., SigLIP2) producing a grid of patch embeddings.

  2. Patch Grouping: Adjacent patches are grouped (default 3x3) to reduce the visual token count and capture local spatial structure. Each group produces a single 10368-dimensional vector for SigLIP2.

  3. Projection: The YakiProjector MLP maps grouped vision features into the LLM's hidden dimension (e.g., 4096 for Llama-3-8B) through a 2-layer GELU-activated MLP.

  4. Token Patching: The tokenizer is extended with a <image> special token. During processing, this token is replaced by the projected vision embeddings, which are inserted before or interleaved with text embeddings.

  5. Generation: The LLM attends to both visual and textual tokens, enabling multimodal understanding and generation.

[!TIP] The patch grouping size is configurable. Larger groups reduce sequence length at the cost of spatial resolution. The default 3x3 grouping with SigLIP2 produces 81 visual tokens per image.
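
To make the arithmetic concrete: a 384px input through SigLIP2-so400m yields a 27x27 grid of 1152-dim patch embeddings; 3x3 grouping concatenates nine neighbors into one 10368-dim vector, and the projector maps each of the 81 groups to the LLM width. A shape-only sketch of steps 2-3 (assumed tensor layout; not the actual YakiProjector implementation):

import torch
import torch.nn as nn

B, H, W, D = 1, 27, 27, 1152   # SigLIP2-so400m patch grid and hidden size
group, llm_dim = 3, 4096       # 3x3 grouping; Llama-3-8B hidden size

patches = torch.randn(B, H, W, D)
# Group 3x3 neighborhoods: (1, 27, 27, 1152) -> (1, 81, 10368)
g = patches.reshape(B, H // group, group, W // group, group, D)
g = g.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // group) * (W // group), group * group * D)

projector = nn.Sequential(     # 2-layer GELU MLP, as in step 3
    nn.Linear(group * group * D, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
print(projector(g).shape)      # torch.Size([1, 81, 4096])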


Core Concepts

OpenLLaVA Class

The central entry point. It orchestrates model loading, patching, training, and inference.

from openllava import OpenLLaVA

model = OpenLLaVA(
    llm="OpceanAI/Yuuki-RxG",                           # HF model ID or local path
    vision_encoder="google/siglip2-so400m-patch14-384",  # HF vision encoder
    architecture="llava",                                # Architecture variant
    backend=Backend.AUTO,                                # Backend selection
    torch_dtype=torch.bfloat16,                          # Compute dtype
    device_map="auto",                                   # Device mapping strategy
)

YakiProjector

The MLP projector that aligns vision features to the LLM's embedding space.

from openllava import YakiProjector

projector = YakiProjector(
    vision_hidden_size=1152,      # SigLIP2 hidden dimension
    llm_hidden_size=4096,         # Llama-3-8B hidden dimension
    patch_group=3,                # 3x3 patch grouping
    projector_depth=2,            # MLP depth
    activation="gelu",            # Activation function
    dropout=0.0,                  # Dropout rate
)

FastVisionModel API

An Unsloth-style API for quick model loading and PEFT configuration.

from openllava.api import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "openllava/yaki-8b",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,
    alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

FastVisionModel.for_training(model)

Backend Abstraction

Hardware backends are auto-detected on import. Explicit selection is also supported.

from openllava import Backend, BackendManager

# Auto-detection (default)
manager = BackendManager()

# Explicit selection
manager = BackendManager(Backend.CUDA)

# Available backends
for backend in Backend:
    print(backend.value)
    # auto, cuda, cpu_fp32, tpu, xpu, rocm, mlx, heterogeneous

Training Pipeline

OpenLLaVA employs a 3-phase training pipeline designed for optimal vision-language alignment.

Phase 1: Vision-Language Alignment

Aligns the vision encoder and projector with the LLM's embedding space using image-caption pairs.

| Parameter | Recommended Value | Description |
|-----------|-------------------|-------------|
| Dataset | liuhaotian/LLaVA-Pretrain | 100K image-caption pairs |
| Learning Rate | 1e-3 | High LR for projector convergence |
| Batch Size | 128 | Large batches recommended |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 1 | Single pass sufficient |

Phase 2: Visual Instruction Tuning

Fine-tunes the entire model (or LoRA adapters) on visual instruction-following data.

| Parameter | Recommended Value | Description |
|-----------|-------------------|-------------|
| Dataset | liuhaotian/LLaVA-Instruct-150K | 150K visual instructions |
| Learning Rate | 2e-4 | Lower LR for instruction tuning |
| Batch Size | 32 | Moderate batch size |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 3-5 | Multiple epochs beneficial |

Phase 3: RL Alignment (Optional)

Aligns the model with human preferences using RLHF, DPO, GRPO, or ORPO.

from openllava.api import OpenLLaVATrainer
from openllava.api import TrainingConfig

config = TrainingConfig(
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    output_dir="./yaki-checkpoints",
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    learning_rate_phase1=1e-3,
    learning_rate_phase2=2e-4,
    batch_size_phase1=128,
    batch_size_phase2=32,
    num_epochs_phase2=3,
    save_steps=500,
    logging_steps=10,
    report_to="wandb",
)

trainer = OpenLLaVATrainer(config)
trainer.train()

# Or train step-by-step
trainer.train_phase1()
trainer.train_phase2()

Training Modes

| Mode | Description | Memory Usage | Speed |
|------|-------------|--------------|-------|
| lora | Low-Rank Adaptation | Low | Fast |
| qlora | 4-bit LoRA | Very Low | Moderate |
| lora_plus | Separate learning rates for A/B matrices | Low | Fast |
| dora | Weight-Decomposed Low-Rank Adaptation | Low | Fast |
| lora_ga | LoRA with Gradient Approximation | Low | Moderate |
| lora_fa | LoRA with frozen A matrices | Low | Fast |
| full_finetune | Full parameter fine-tuning | High | Slow |
| bitnet | Ternary weight training (b1.58) | Very Low | Fast |
| moe_lora | Mixture-of-Experts with LoRA | Moderate | Moderate |
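
As a concrete example of the lora_plus row, LoRA+ gives the B matrices a higher learning rate than the A matrices. A sketch using standard PyTorch optimizer parameter groups and PEFT-style parameter names (the lora_A/lora_B naming is an assumption, not OpenLLaVA's trainer code):

import torch

def loraplus_param_groups(model: torch.nn.Module, lr: float = 2e-4, lr_ratio: float = 16.0):
    """Split LoRA A/B matrices into optimizer groups with different learning rates."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_A" in name:
            a_params.append(p)
        elif "lora_B" in name:
            b_params.append(p)
    return [
        {"params": a_params, "lr": lr},             # A matrices: base LR
        {"params": b_params, "lr": lr * lr_ratio},  # B matrices: lr_ratio x higher
    ]

# optimizer = torch.optim.AdamW(loraplus_param_groups(peft_model), weight_decay=0.0)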

Optimizations

OpenLLaVA ships with 40+ built-in optimizations covering training, inference, memory, and quantization.

[!NOTE] All optimizations are opt-in and configurable. The framework applies sensible defaults based on hardware detection.

Training Optimizations

| Optimization | Description | Hardware |
|--------------|-------------|----------|
| FP8 Training | Native FP8 forward/backward pass | H100 (Hopper) |
| Padding-Free | Variable-length sequences without padding | All |
| Sequence Packing | Pack multiple sequences per example | All |
| Selective Checkpointing | Activation checkpointing with heuristics | All |
| CPU Offloading | Async CPU offload for optimizer states | All |
| GPU Memory Pooling | Pre-allocated memory pool for tensors | CUDA |
| torch.compile | Full-graph compilation | All |
| EMA | Exponential Moving Average | All |
| Curriculum Learning | Progressive difficulty scheduling | All |
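
Sequence packing from the table above can be done with a greedy first-fit pass; each packed bin also needs per-sequence boundaries so padding-free attention can avoid attending across sequence joins. A simplified sketch (the packing logic only; the framework's collator also builds attention metadata that this omits):

def pack_sequences(sequences, max_len=2048):
    """Greedy first-fit packing of token sequences into bins of <= max_len tokens.

    Returns the packed bins plus per-bin (start, end) boundaries so attention
    can still be masked per original sequence.
    """
    bins, boundaries = [], []
    for seq in sorted(sequences, key=len, reverse=True):
        for tokens, bounds in zip(bins, boundaries):
            if len(tokens) + len(seq) <= max_len:
                bounds.append((len(tokens), len(tokens) + len(seq)))
                tokens.extend(seq)
                break
        else:
            bins.append(list(seq))
            boundaries.append([(0, len(seq))])
    return bins, boundaries

packed, bounds = pack_sequences([[1] * 900, [2] * 700, [3] * 1200])
print([len(b) for b in packed], bounds)  # [1900, 900] with per-sequence boundaries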

Quantization

| Technique | Bits | Type | Use Case |
|-----------|------|------|----------|
| GPTQ | 2-4 | Post-training | Inference speedup |
| AWQ | 4 | Post-training | Inference speedup |
| FP8 | 8 | Training/Inference | H100 training |
| FP4 (NVFP4) | 4 | Inference | H100 inference |
| QAT | 2-8 | Training | Quantization-aware training |
| torchao | 2-8 | Post-training | Weight-only quantization |
| NF4 (bitsandbytes) | 4 | Training | QLoRA |
| BitNet b1.58 | 1.58 | Training | Ternary weights |

KV Cache Optimizations

| Optimization | Method | Memory Savings |
|--------------|--------|----------------|
| KV Quantization | FP8/INT8 KV cache | 50% |
| H2O Eviction | Heavy Hitter Oracle policy | 20-50% |
| SnapKV | Snapshot-based eviction | 20-50% |
| FastGen | Generation-aware eviction | 20-40% |
| WG Eviction | Window-Guided eviction | 30-50% |
| PackKV | Cache compression | 50-75% |
| SWAN | Sliding Window Attention with cache | 40-60% |
| Chunked Prefill | Split long prompts into chunks | Variable |
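
The KV Quantization row amounts to storing K/V in 8 bits with a per-head scale and dequantizing on read. A minimal symmetric INT8 sketch (illustrative; real kernels fuse this into the attention op):

import torch

def quantize_kv(kv: torch.Tensor):
    """Symmetric per-head INT8 quantization of a [batch, heads, seq, head_dim] tensor."""
    scale = kv.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8) / 127.0
    q = (kv / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

k = torch.randn(1, 32, 4096, 128)
q, scale = quantize_kv(k)
print(q.element_size() / k.element_size())  # 0.25: int8 vs this fp32 example (0.5 vs bf16)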

Speculative Decoding

| Method | Description | Speedup |
|--------|-------------|---------|
| Eagle Draft | Eagle-style draft model | 2-3x |
| Medusa Heads | Multi-head speculative decoding | 2-3x |
| NGram Draft | N-gram based draft model | 1.5-2x |
| Tree Verification | Parallel verification of draft tokens | 2-3x |
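
All four methods share one loop: a cheap draft proposes k tokens, the target model scores every drafted position in a single forward pass, and the longest agreeing prefix is kept. A greedy-acceptance sketch assuming HuggingFace-style models, batch size 1, and greedy decoding (structure only; Eagle and Medusa verify whole trees of drafts):

import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One greedy draft-and-verify step (batch size 1 assumed)."""
    ids = input_ids
    for _ in range(k):                                   # draft k tokens cheaply
        nxt = draft(ids).logits[:, -1:].argmax(-1)
        ids = torch.cat([ids, nxt], dim=-1)
    # One target forward pass scores all drafted positions at once.
    logits = target(ids).logits
    verify = logits[:, input_ids.shape[1] - 1 : -1].argmax(-1)
    drafted = ids[:, input_ids.shape[1]:]
    accept = int((verify == drafted).cumprod(dim=-1).sum())  # agreeing prefix length
    kept = ids[:, : input_ids.shape[1] + accept]
    if accept < k:
        bonus = verify[:, accept : accept + 1]           # target's correction token
    else:
        bonus = logits[:, -1:].argmax(-1)                # target's own next token
    return torch.cat([kept, bonus], dim=-1)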

Other Optimizations

| Optimization | Description |
|--------------|-------------|
| torchao Sparsity | Weight sparsification for inference |
| MXFP8 MoE | MXFP8 format for MoE layers |
| VQ Codebook EMA | YADIS VQ codebook EMA updates |
| Fused Cross-Attention | YADIS fused cross-attention |
| Adaptive MoE Routing | YADIS dynamic expert routing |
| Split LoRA | Split LoRA across devices |
| GaLore | Gradient Low-Rank Projection |
| Mixed-Precision Quantization | MicroMix per-layer precision |
| Async I/O | nvJPEG + async data loading |

Apply optimizations programmatically:
from openllava.optimizations import (
    compile_model,
    enable_fp8_training,
    gptq_quantize,
    EMAModel,
)

# Compile the model
model.model = compile_model(model.model, mode="max-autotune")

# Enable FP8 training (H100 only)
enable_fp8_training(model.model)

# GPTQ quantization
gptq_quantize(model.model, bits=4, dataset="c4")

# Enable EMA tracking
ema = EMAModel(model.model, decay=0.999)

CLI Reference

The openllava CLI provides five main commands.

openllava --help

train

Train a vision-language model.

openllava train \
  --llm meta-llama/Llama-3-8B \
  --vision-encoder google/siglip2-so400m-patch14-384 \
  --phase1-dataset liuhaotian/LLaVA-Pretrain \
  --phase2-dataset liuhaotian/LLaVA-Instruct-150K \
  --output-dir ./checkpoints \
  --lora-r 64 \
  --lora-alpha 128 \
  --batch-size 128 \
  --learning-rate 1e-3 \
  --num-epochs 3 \
  --report-to wandb

Training modes:

# QLoRA (4-bit quantized)
openllava train --mode qlora --load-in-4bit

# BitNet (ternary weights)
openllava train --mode bitnet

# Full fine-tuning
openllava train --mode full_finetune

# MoE + LoRA
openllava train --mode moe_lora --num-experts 8

serve

Launch an OpenAI-compatible inference server.

openllava serve openllava/yaki-8b --port 8000

# With advanced features
openllava serve openllava/yaki-8b \
  --port 8000 \
  --batch-size 64 \
  --max-seq-len 4096 \
  --paged-attention \
  --continuous-batching \
  --speculative-decoding \
  --kv-cache-dtype fp8

[!TIP] The inference server supports all OpenAI SDK features: streaming, vision inputs, function calling, and structured JSON output.

export

Export a model to various formats.

# HuggingFace SafeTensors
openllava export openllava/yaki-8b --format safetensors --output ./export

# GGUF (for llama.cpp)
openllava export openllava/yaki-8b --format gguf --quant q4_k_m

# ONNX
openllava export openllava/yaki-8b --format onnx --output ./export

# vLLM
openllava export openllava/yaki-8b --format vllm

# MLX (Apple Silicon)
openllava export openllava/yaki-8b --format mlx

benchmark

Benchmark model performance.

openllava benchmark openllava/yaki-8b

# Specific benchmarks
openllava benchmark openllava/yaki-8b \
  --throughput \
  --latency \
  --memory \
  --batch-sizes 1,8,32,64

info

Display system and framework information.

openllava info

API Reference

OpenLLaVA

class OpenLLaVA:
    def __init__(
        self,
        llm: str,
        vision_encoder: str = "google/siglip2-so400m-patch14-384",
        architecture: str = "llava",
        backend: Backend = Backend.AUTO,
        torch_dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
        trust_remote_code: bool = False,
    )

    # Training
    def lora(self, r: int = 64, alpha: int = 128, dropout: float = 0.05,
             target_modules: Optional[List[str]] = None) -> "OpenLLaVA":
    def dora(self, r: int = 64, alpha: int = 128, ...) -> "OpenLLaVA":
    def qlora(self, r: int = 64, ..., load_in_4bit: bool = True) -> "OpenLLaVA":
    def lora_plus(self, r: int = 64, lr_ratio: float = 16.0) -> "OpenLLaVA":
    def bitnet(self) -> "OpenLLaVA":
    def train(self, phase1: Optional[dict] = None,
              phase2: Optional[dict] = None, **kwargs):

    # RL Alignment
    def dpo(self, dataset: str, ..., beta: float = 0.1):
    def grpo(self, dataset: str, ..., group_size: int = 8):
    def orpo(self, dataset: str, ...):

    # Inference
    def generate(self, images: Union[str, List[str]], prompt: str,
                 **generate_kwargs) -> str:
    def chat(self, messages: List[dict], images: Optional[List[str]] = None,
             **generate_kwargs) -> str:

    # I/O
    def save(self, path: str, merge_lora: bool = False):
    def push(self, repo_id: str, merge_lora: bool = False):
    @classmethod
    def from_pretrained(cls, repo_id: str, **kwargs) -> "OpenLLaVA":

FastVisionModel

class FastVisionModel:
    @classmethod
    def from_pretrained(
        cls,
        model_id: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = False,
        load_in_8bit: bool = False,
        dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
    ) -> Tuple[nn.Module, AutoTokenizer]:

    @classmethod
    def get_peft_model(
        cls,
        model: nn.Module,
        r: int = 16,
        alpha: int = 32,
        target_modules: Optional[List[str]] = None,
        modules_to_save: Optional[List[str]] = None,
    ) -> nn.Module:

    @classmethod
    def for_training(cls, model: nn.Module):
    @classmethod
    def for_inference(cls, model: nn.Module):

OpenLLaVATrainer

class OpenLLaVATrainer:
    def __init__(self, config: TrainingConfig):
    def train(self):
    def train_phase1(self):
    def train_phase2(self):
    def train_rl(self, method: str = "dpo", **kwargs):
    def save(self, path: str):
    def push(self, repo_id: str):
    def evaluate(self, benchmarks: List[str] = ["scienceqa", "mmbench"]):

InferenceEngine

class OpenLLaVAInferenceEngine:
    def __init__(self, model_id: str, **kwargs):
    def generate(self, prompt: str, images: Optional[List[str]] = None,
                 max_tokens: int = 512, temperature: float = 0.7,
                 stream: bool = False) -> Union[str, Generator]:
    def chat(self, messages: List[dict], **kwargs) -> str:
    def get_stats(self) -> dict:

Server

from openllava.serve import OpenLLaVAServer

server = OpenLLaVAServer(
    model_id="openllava/yaki-8b",
    host="0.0.0.0",
    port=8000,
    api_key="sk-openllava",           # Optional auth
    rate_limit=100,                    # Requests per minute
    continuous_batching=True,
    paged_attention=True,
)

server.run()

Distributed Training

OpenLLaVA supports a comprehensive distributed training stack spanning multiple parallelism strategies.

[!WARNING] Distributed training requires a cluster with high-speed interconnects (NVLink, InfiniBand, or RoCE). The framework auto-detects topology and recommends optimal strategies.

Parallelism Strategy Comparison

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| FSDP2 | Fully Sharded Data Parallel | Single-node multi-GPU |
| DeepSpeed ZeRO-1 | Optimizer state partitioning | Large models, moderate speedup |
| DeepSpeed ZeRO-2 | Optimizer + gradient partitioning | Large models, good speedup |
| DeepSpeed ZeRO-3 | Full parameter partitioning | Very large models (>13B) |
| Tensor Parallel (1D) | Split tensors across GPUs | >13B, high-bandwidth interconnect |
| Tensor Parallel (2D/3D) | 2D/3D tensor sharding | Very large models, multi-node |
| Pipeline Parallel | Layer-level partitioning | Multi-node, deep models |
| Expert Parallel | Distribute MoE experts | MoE models |
| Ring Attention | Sequence parallelism | Long context (>32K) |
| Heterogeneous | GPU+CPU+TPU mixed | Resource-constrained environments |

FSDP2

from openllava import OpenLLaVA
from openllava.distributed import FSDPConfig

config = FSDPConfig(
    sharding_strategy="hybrid",
    cpu_offload=False,
    mixed_precision="bf16",
    activation_checkpointing=True,
    limit_all_gathers=True,
)

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
)

model.train(
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K"),
    distributed="fsdp",
    fsdp_config=config,
)

DeepSpeed ZeRO

from openllava.distributed import DeepSpeedConfig

config = DeepSpeedConfig(
    zero_stage=3,
    offload_optimizer="cpu",
    offload_params="nvme",
    gradient_accumulation_steps=4,
    gradient_clipping=1.0,
    communication_dtype="bf16",
)

Auto-Parallelism

from openllava.distributed import auto_parallel
from openllava.utils import HardwareDetector

detector = HardwareDetector()
topology = detector.detect_topology()

strategy = auto_parallel(
    model_size=8_000_000_000,    # 8B parameters
    hardware=topology,
    memory_budget_gb=80,
    target_throughput=1000,       # tokens per second
)

print(f"Recommended strategy: {strategy.name}")
print(f"World size: {strategy.world_size}")
print(f"Strategy config: {strategy.config}")

RL Alignment

OpenLLaVA supports four RL alignment methods for post-training preference optimization.

| Method | Description | Use Case |
|--------|-------------|----------|
| DPO | Direct Preference Optimization | Binary preference pairs |
| GRPO | Group Relative Policy Optimization | Multi-response ranking |
| ORPO | Odds Ratio Preference Optimization | Preference optimization without reference model |
| PPO | Proximal Policy Optimization | Full RLHF pipeline with reward model |

Example calls:
# DPO
model.dpo(
    dataset="your-dpo-dataset",
    beta=0.1,
    learning_rate=5e-6,
    batch_size=16,
)

# GRPO
model.grpo(
    dataset="your-grpo-dataset",
    group_size=8,
    learning_rate=1e-6,
)

# ORPO
model.orpo(
    dataset="your-orpo-dataset",
    lambda_weight=0.5,
    learning_rate=1e-6,
)
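
Under the hood, DPO reduces to a logistic loss on the gap between policy and reference log-ratios for chosen vs. rejected responses. A reference sketch of the loss, assuming per-example sequence log-probabilities are already computed (not OpenLLaVA's trainer code):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: logistic loss on the policy-vs-reference log-ratio gap."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/ref on preferred
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/ref on dispreferred
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()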

Reward Functions

from openllava.rl.rewards import (
    ExactMatchReward,
    F1Reward,
    FormatReward,
    SafetyReward,
    CompositeReward,
)

reward_fn = CompositeReward([
    ExactMatchReward(target="expected_answer"),
    FormatReward(pattern=r"```.*```"),
    SafetyReward(),
])

Export and Deployment

Model Export Formats

| Format | Use Case | Tool |
|--------|----------|------|
| SafeTensors | HuggingFace Hub, PyTorch | openllava export |
| GGUF | llama.cpp, Ollama, local CPU inference | openllava export --format gguf |
| ONNX | ONNX Runtime, cross-platform inference | openllava export --format onnx |
| vLLM | High-throughput production serving | openllava export --format vllm |
| MLX | Apple Silicon inference | openllava export --format mlx |

Programmatic export:
from openllava.export import export_to_gguf, export_to_onnx, push_to_hub

# Export to GGUF
export_to_gguf(model, output_path="./model.gguf", quant="q4_k_m")

# Export to ONNX
export_to_onnx(model, output_path="./model.onnx")

# Push to HuggingFace Hub
model.push("openllava/yaki-8b", private=False)

# Or via CLI
push_to_hub(
    repo_id="openllava/yaki-8b",
    local_path="./checkpoints",
    commit_message="Release Yaki-8B v1",
)

LoRA Merge

from openllava.export import merge_lora_weights

# Merge LoRA weights into base model
model = merge_lora_weights(model)
model.save("./merged-model")
model.push("my-org/my-model-merged")

Evaluation

OpenLLaVA integrates with standard multimodal benchmarks.

from openllava.eval import EvalRunner

runner = EvalRunner(
    model=model,
    benchmarks=["scienceqa", "mmbench", "textvqa"],
    batch_size=16,
)

results = runner.run()
print(results)

# Results per benchmark
{
    "scienceqa": {"accuracy": 0.912, "samples": 4241},
    "mmbench": {"accuracy": 0.763, "samples": 2975},
    "textvqa": {"accuracy": 0.684, "samples": 5000},
}
Or via the CLI:

openllava eval \
  --model openllava/yaki-8b \
  --benchmarks scienceqa,mmbench,textvqa \
  --batch-size 16

Configuration

Training Configuration

from openllava.api import TrainingConfig

config = TrainingConfig(
    # Phase 1
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase1_learning_rate=1e-3,
    phase1_batch_size=128,
    phase1_max_samples=100_000,

    # Phase 2
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    phase2_learning_rate=2e-4,
    phase2_batch_size=32,
    phase2_num_epochs=3,

    # LoRA
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    lora_target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],

    # Optimization
    optim="adamw_torch",
    warmup_ratio=0.03,
    weight_decay=0.0,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,

    # Precision
    torch_dtype="bfloat16",
    load_in_4bit=False,

    # Checkpointing
    output_dir="./checkpoints",
    save_steps=500,
    save_total_limit=5,
    logging_steps=10,
    report_to="wandb",

    # Distributed
    distributed_strategy="fsdp",
    fsdp_sharding_strategy="hybrid",
    deepspeed_zero_stage=3,
)

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| CUDA_VISIBLE_DEVICES | (unset) | GPU device IDs |
| OPENLLAVA_BACKEND | auto | Force backend selection |
| OPENLLAVA_CACHE_DIR | ~/.cache/openllava | Cache directory |
| OPENLLAVA_NO_CUDA | false | Disable CUDA detection |
| HF_TOKEN | (unset) | HuggingFace Hub token |
| WANDB_API_KEY | (unset) | Weights & Biases key |
| PJRT_DEVICE | (unset) | TPU device type |
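
Because backends are detected at import time (see Backend Abstraction above), backend-related variables must be set before importing the package. A minimal sketch:

import os

# Set backend-related variables before the package runs import-time detection.
os.environ["OPENLLAVA_BACKEND"] = "rocm"
os.environ["OPENLLAVA_CACHE_DIR"] = "/data/openllava-cache"

import openllava  # detection reads these on import, per the docs above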

Backends

OpenLLaVA supports six hardware backends with automatic device detection and operation routing.

CUDA (NVIDIA)

from openllava import Backend

model = OpenLLaVA(llm="...", backend=Backend.CUDA)

Optimized for NVIDIA Ampere (A100/A30), Ada Lovelace (RTX 4090), and Hopper (H100) architectures. Uses FlashAttention-2, FP8 training on H100, and CUDA graphs for reduced kernel launch overhead.

[!IMPORTANT] CUDA 11.8 or later is required. Ampere or newer architecture recommended. FlashAttention-2 is auto-enabled when supported.

ROCm (AMD)

model = OpenLLaVA(llm="...", backend=Backend.ROCM)

Supports AMD MI250, MI300X, and RX 7000 series GPUs. Uses ROCm-aware Triton kernels and the Composable Kernel library for optimized matmul and attention.

CPU FP32

model = OpenLLaVA(llm="...", backend=Backend.CPU_FP32)

Falls back to FP32 computation with SIMD-optimized kernels (AVX-512, AVX2, NEON). Suitable for CPU-only inference and development environments.

TPU (Google)

model = OpenLLaVA(llm="...", backend=Backend.TPU)

Requires torch_xla and jax. Supports TPU v3-v5 with SPMD (Single Program Multiple Data) for model parallelism.

MLX (Apple Silicon)

model = OpenLLaVA(llm="...", backend=Backend.MLX)

Requires mlx and mlx-lm. Optimized for Apple M1-M4 series with unified memory architecture.

XPU (Intel)

model = OpenLLaVA(llm="...", backend=Backend.XPU)

Supports Intel Arc A-series and Data Center GPU Max Series via intel-extension-for-pytorch.

Heterogeneous

model = OpenLLaVA(llm="...", backend=Backend.HETEROGENEOUS)

Distributes model layers across multiple device types (e.g., GPU + CPU + TPU) for resource-constrained environments.


Performance

Training Throughput (tokens/second, BF16)

| Model | GPU | LoRA | Full FT |
|-------|-----|------|---------|
| LLaVA-7B (Llama-2) | 1x A100-80GB | 2,850 | 1,240 |
| LLaVA-13B (Vicuna) | 1x A100-80GB | 1,620 | 680 |
| LLaVA-7B | 8x A100-80GB (FSDP) | 21,400 | 9,600 |
| LLaVA-13B | 8x A100-80GB (FSDP) | 12,800 | 5,400 |

Inference Latency (first token, ms)

| Model | GPU | FlashAttn | PagedAttn | Speculative |
|-------|-----|-----------|-----------|-------------|
| Yaki-7B | A100-80GB | 45 | 38 | 22 |
| Yaki-7B | RTX 4090 | 38 | 32 | 18 |
| Yaki-13B | A100-80GB | 72 | 61 | 35 |
| Yaki-13B | 2x A100 (TP) | 40 | 34 | 20 |

Memory Usage (GB, Yaki-7B with LoRA)

| Configuration | Peak Memory (GB) | Notes |
|---------------|------------------|-------|
| FP32 Full FT | 56.2 | Not recommended |
| BF16 Full FT | 28.8 | Recommended |
| BF16 LoRA (r=64) | 18.4 | Default |
| FP16 QLoRA (4-bit) | 10.2 | Memory-constrained |
| BitNet b1.58 | 6.8 | Maximum efficiency |

Project Structure

openllava/
├── openllava/                    # Main Python package
│   ├── core/                     # Core model, backend, patcher
│   ├── api/                      # High-level FastModel + Trainer API
│   ├── cli/                      # Click-based CLI (train, serve, export, benchmark)
│   ├── data/                     # Dataset loading, templates, collators, streaming
│   ├── training/                 # LoRA variants, BitNet, DoRA, checkpointing
│   ├── rl/                       # RL alignment (DPO, GRPO, ORPO, PPO)
│   ├── inference/                # Inference engine, continuous batching, PagedAttention
│   ├── serve/                    # FastAPI OpenAI-compatible server
│   ├── optimizations/            # 40+ optimizations (FP8, KV cache, quantization, etc.)
│   ├── experts/                  # Mixture-of-Experts layers and training
│   ├── distributed/              # FSDP, DeepSpeed, TP, PP, EP, ring attention
│   ├── backends/                 # CUDA, ROCm, MLX, TPU, XPU, CPU, ONNX, GGUF
│   ├── kernels/                  # Triton kernels + CUDA graphs
│   │   ├── triton/               # Fused attention, RoPE, SwiGLU, RMSNorm, etc.
│   │   └── cuda_graphs/          # CUDA graph capture
│   ├── export/                   # GGUF, ONNX, SafeTensors, vLLM, MLX export
│   ├── eval/                     # ScienceQA, MMBench, TextVQA benchmarks
│   └── utils/                    # Hardware detection, profiling, model cards
├── csrc/                         # C++/CUDA/CPU native extensions
│   ├── gpu/                      # CUDA kernels (projector, cross-attention, VQ)
│   ├── cpu/                      # CPU fallbacks (offload, quantization, GGUF)
│   └── tpu/                      # TPU XLA backend
├── setup.py                      # Python packaging + CMake extension build
├── pyproject.toml                # Project configuration
├── CMakeLists.txt                # C++/CUDA build system
└── LICENSE                       # Apache 2.0

License

OpenLLaVA is licensed under the Apache License 2.0.

Copyright (c) 2024-2026 OpceanAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

OpenLLaVA — Vision injection for every language model.

Built by OpceanAI Research Team
