
Open-source PyTorch reference implementation of MoonViT, the native-resolution vision encoder from the Kimi-VL Technical Report (arXiv:2504.07491)


MoonViT - PyTorch


This is an ultra-simple, single-file PyTorch implementation of MoonViT, the native-resolution vision encoder from Kimi-VL. I implemented this model because I think it's a great ViT variant: it ingests images at arbitrary sizes and aspect ratios, with no resizing or padding, and scales to large batches.

Install

$ git clone https://github.com/kyegomez/open-moonvit
$ cd open-moonvit
$ pip install torch

FlashAttention is optional. If flash_attn is importable and you're on CUDA, the variable-length (varlen) kernel is used automatically. Otherwise a block-diagonal SDPA fallback runs on CPU / MPS / CUDA with no extra dependencies.

$ pip install flash-attn --no-build-isolation  # optional
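The block-diagonal SDPA fallback can be sketched as a per-image loop over the packed sequence, so each image only attends to itself. This is an illustrative standalone sketch, not the package's actual code; the function name and shapes are assumptions, though the cu_seqlens convention matches the README.

```python
# Hedged sketch of the block-diagonal SDPA fallback: slice the packed
# sequence at cu_seqlens boundaries and run SDPA per image.
import torch
import torch.nn.functional as F

def varlen_sdpa(q, k, v, cu_seqlens):
    """q, k, v: (L_total, n_heads, head_dim) packed over all images.
    cu_seqlens: (n_images + 1,) int32 cumulative token boundaries."""
    out = torch.empty_like(q)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        # one image at a time; SDPA wants (n_heads, L_i, head_dim)
        qi, ki, vi = (t[start:end].transpose(0, 1) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        out[start:end] = oi.transpose(0, 1)
    return out

q = k = v = torch.randn(6, 2, 8)                  # 6 packed tokens, 2 heads
cu = torch.tensor([0, 4, 6], dtype=torch.int32)   # two images: 4 + 2 tokens
print(varlen_sdpa(q, k, v, cu).shape)             # torch.Size([6, 2, 8])
```

FlashAttention's varlen kernel does the same slicing in a single fused kernel; the loop above trades speed for zero extra dependencies.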

Usage

import torch
from main import MoonViT, MoonViTConfig, MLPProjector

encoder = MoonViT(MoonViTConfig())    # ~413M params, SigLIP-SO-400M defaults

# a batch of images at different resolutions, no padding, no resizing
images = [
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
]

out = encoder(images)
out.last_hidden_state    # (L_total, 1152)   packed patch tokens
out.cu_seqlens           # (4,) int32        image boundaries in the packed seq
out.grid_shapes          # [(16,20), (10,14), (24,24)]
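The bookkeeping above is plain arithmetic: with patch size 14, each image contributes (H/14) x (W/14) tokens, and cu_seqlens is the running total with a leading zero. A quick sanity check for the example batch:

```python
# Reproduce the grid shapes and cu_seqlens for the example batch (patch size 14).
patch = 14
sizes = [(224, 280), (140, 196), (336, 336)]

grids = [(h // patch, w // patch) for h, w in sizes]
print(grids)   # [(16, 20), (10, 14), (24, 24)]

cu, total = [0], 0
for gh, gw in grids:
    total += gh * gw
    cu.append(total)
print(cu)      # [0, 320, 460, 1036] -> last_hidden_state is (1036, 1152)
```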

To feed an LLM, compose with the MLP projector (2×2 pixel-shuffle then a two-layer MLP):

projector = MLPProjector(
    vision_hidden_size = 1152,
    llm_hidden_size    = 2048,
)

tokens, grids, cu = projector(out.last_hidden_state, out.grid_shapes, out.cu_seqlens)
tokens.shape   # (L_total // 4, 2048)
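The 2x2 pixel shuffle inside the projector is a space-to-depth move: four neighboring patch tokens merge into one token with 4x the channels. A minimal single-image sketch (illustrative only, not the package's exact implementation; tokens are assumed row-major over the grid):

```python
# 2x2 pixel shuffle (space-to-depth) on one image's patch tokens.
import torch

def pixel_shuffle_2x2(tokens, grid):
    """tokens: (H*W, D) row-major patch tokens; grid: (H, W), both even."""
    h, w = grid
    d = tokens.shape[-1]
    x = tokens.reshape(h // 2, 2, w // 2, 2, d)   # split both axes into 2x2 cells
    x = x.permute(0, 2, 1, 3, 4)                  # (H/2, W/2, 2, 2, D)
    return x.reshape(h * w // 4, 4 * d)           # fold each cell into channels

tokens = torch.randn(16 * 20, 1152)               # first example image, grid (16, 20)
out = pixel_shuffle_2x2(tokens, (16, 20))
print(out.shape)                                  # torch.Size([80, 4608])
```

The two-layer MLP then maps the 4x-wide channels (4608 here) down to the LLM hidden size.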

How it works

flowchart TD
    A["list of native-res images<br/>(3, H_i, W_i)"] --> B["patch embed<br/>Conv2d stride=14"]
    B --> C["+ interpolated<br/>SigLIP abs-pos-embed<br/>(bicubic, per image)"]
    C --> D["flatten &amp; pack<br/>→ (L_total, D)<br/>cu_seqlens tracks boundaries"]
    D --> E["27× Transformer block<br/>pre-norm · QKV-bias"]
    E -.->|inside attn| F["2D RoPE<br/>head_dim/2 for H<br/>head_dim/2 for W"]
    E -.->|inside attn| G["varlen attention<br/>FlashAttn or<br/>block-diagonal SDPA"]
    E --> H["post LayerNorm"]
    H --> I["MLP Projector<br/>2×2 pixel-shuffle · 2-layer MLP"]
    I --> J["LLM-space tokens<br/>(L_total/4, D_llm)"]
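The "interpolated SigLIP abs-pos-embed" step in the diagram amounts to bicubically resizing the pretrained positional-embedding grid to each image's patch grid. A sketch, assuming a SigLIP-style 27x27 pretrained grid (the grid size and tensor names are assumptions, not read from the package):

```python
# Resize a pretrained positional-embedding grid to one image's patch grid.
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 1152, 27, 27)   # pretrained grid as (1, D, H, W)
gh, gw = 16, 20                            # target patch grid for a 224x280 image

resized = F.interpolate(pos_embed, size=(gh, gw), mode="bicubic", align_corners=False)
per_patch = resized.flatten(2).transpose(1, 2).squeeze(0)   # (gh*gw, D)
print(per_patch.shape)                     # torch.Size([320, 1152])
```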

Four things to internalize:

  1. Packing, not padding. Images of different shapes become one long sequence. No wasted compute on pad tokens.
  2. Two positional embeddings, added together. The paper is insistent on this. Interpolated SigLIP absolute pos embed preserves the pretrained prior; 2D RoPE supplies the fine-grained, resolution-robust signal.
  3. Varlen attention is what makes (1) safe. cu_seqlens slices the packed sequence so image i only attends to itself. FlashAttention does this in one kernel; the fallback loops per-image over SDPA.
  4. The projector lives outside the encoder. Pixel shuffle is a 2×2 space-to-depth: four tokens become one, channels 4×. Then a plain two-layer MLP projects into LLM space.
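The 2D RoPE split in point 2 can be sketched as follows: the first half of each head's channels is rotated by the patch's row index, the second half by its column index. Frequencies and function names here are illustrative, not the package's actual parameterization.

```python
# Toy 2D RoPE: rotate half the head channels by row position, half by column.
import torch

def rope_2d(x, rows, cols):
    """x: (L, head_dim), head_dim divisible by 4; rows, cols: (L,) patch coords."""
    d = x.shape[-1] // 2                       # channels allotted to each axis
    def rotate(v, pos):
        freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))
        ang = pos[:, None] * freqs[None, :]    # (L, d/2) rotation angles
        cos, sin = ang.cos(), ang.sin()
        v1, v2 = v[:, 0::2], v[:, 1::2]
        out = torch.empty_like(v)
        out[:, 0::2] = v1 * cos - v2 * sin
        out[:, 1::2] = v1 * sin + v2 * cos
        return out
    return torch.cat([rotate(x[:, :d], rows), rotate(x[:, d:], cols)], dim=-1)

gh, gw, head_dim = 4, 5, 8
rows = torch.arange(gh).repeat_interleave(gw).float()   # row index per packed token
cols = torch.arange(gw).repeat(gh).float()              # column index per packed token
x = torch.randn(gh * gw, head_dim)
print(rope_2d(x, rows, cols).shape)           # torch.Size([20, 8])
```

Because each pair of channels is just rotated, token norms are preserved, and relative offsets in H and W enter attention independently.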

Citations

@article{kimivl2025,
    title   = {Kimi-VL Technical Report},
    author  = {{Kimi Team}},
    journal = {arXiv preprint arXiv:2504.07491},
    year    = {2025},
    url     = {https://arxiv.org/abs/2504.07491}
}
@article{dehghani2023navit,
    title   = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
    author  = {Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lucic, Mario and Houlsby, Neil},
    journal = {arXiv preprint arXiv:2307.06304},
    year    = {2023}
}
@article{zhai2023siglip,
    title   = {Sigmoid Loss for Language Image Pre-Training},
    author  = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
    journal = {arXiv preprint arXiv:2303.15343},
    year    = {2023}
}
@article{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Su, Jianlin and Lu, Yu and Pan, Shengfeng and Murtadha, Ahmed and Wen, Bo and Liu, Yunfeng},
    journal = {arXiv preprint arXiv:2104.09864},
    year    = {2021}
}

License

Apache License 2.0
