Open-source PyTorch reference implementation of MoonViT, the native-resolution vision encoder from the Kimi-VL Technical Report (arXiv:2504.07491)

Project description

MoonViT - PyTorch

Model Architecture

This is an ultra-simple, single-file PyTorch implementation of MoonViT, the native-resolution vision encoder from Kimi-VL. I implemented this model because it's a compelling ViT variant: it ingests images at their native sizes and aspect ratios, at scale, with no resizing or padding.

Install

$ pip install open-moonvit

Or from source:

$ git clone https://github.com/kyegomez/open-moonvit
$ cd open-moonvit
$ pip install -e .

FlashAttention is optional. If flash_attn is importable and you're on CUDA, the var-length kernel is used automatically. Otherwise a block-diagonal SDPA fallback runs on CPU / MPS / CUDA with no extra dependencies.

$ pip install flash-attn --no-build-isolation  # optional
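The fallback path can be sketched as a per-image loop over PyTorch's `scaled_dot_product_attention`. This is a hypothetical illustration of block-diagonal varlen attention, not the library's exact implementation; the function name and tensor layout here are assumptions:

```python
import torch
import torch.nn.functional as F

def sdpa_varlen_fallback(q, k, v, cu_seqlens):
    """Block-diagonal attention over a packed sequence.

    q, k, v: (L_total, n_heads, head_dim), packed across images.
    cu_seqlens: (n_images + 1,) int32 cumulative boundaries,
        e.g. [0, 320, 460, 1036] for three images.
    Each image attends only to its own slice of the packed sequence.
    Sketch only: open_moonvit's internal kernel dispatch may differ.
    """
    out = torch.empty_like(q)
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        # (L_i, H, D) -> (1, H, L_i, D): SDPA expects batch, heads first
        qi = q[s:e].transpose(0, 1).unsqueeze(0)
        ki = k[s:e].transpose(0, 1).unsqueeze(0)
        vi = v[s:e].transpose(0, 1).unsqueeze(0)
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        out[s:e] = oi.squeeze(0).transpose(0, 1)
    return out
```

Because each slice is attended independently, editing one image's keys never changes another image's output.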

Usage

import torch
from open_moonvit import MoonViT, MoonViTConfig, MLPProjector

encoder = MoonViT(MoonViTConfig())    # ~413M params, SigLIP-SO-400M defaults

# a batch of images at different resolutions, no padding, no resizing
images = [
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
]

out = encoder(images)
out.last_hidden_state    # (L_total, 1152)   packed patch tokens
out.cu_seqlens           # (4,) int32        image boundaries in the packed seq
out.grid_shapes          # [(16,20), (10,14), (24,24)]
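The bookkeeping behind those outputs is simple arithmetic: with the stride-14 patch embed, each image contributes an `(H//14, W//14)` grid of tokens, and `cu_seqlens` is the running total with a leading zero. The helper names below are illustrative, not part of the library's API:

```python
# Patch-grid bookkeeping for the three example images above.
# Patch size 14 comes from the Conv2d stride=14 patch embed.
PATCH = 14

def grid_shape(h, w, patch=PATCH):
    """(H, W) pixels -> (rows, cols) of 14x14 patches."""
    return h // patch, w // patch

sizes = [(224, 280), (140, 196), (336, 336)]
grids = [grid_shape(h, w) for h, w in sizes]   # [(16, 20), (10, 14), (24, 24)]

# cu_seqlens is the cumulative token count, with a leading 0:
cu_seqlens = [0]
for r, c in grids:
    cu_seqlens.append(cu_seqlens[-1] + r * c)
# cu_seqlens == [0, 320, 460, 1036]; image i owns tokens [cu[i], cu[i+1])
```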

To feed an LLM, compose with the MLP projector (2×2 pixel-shuffle then a two-layer MLP):

projector = MLPProjector(
    vision_hidden_size = 1152,
    llm_hidden_size    = 2048,
)

tokens, grids, cu = projector(out.last_hidden_state, out.grid_shapes, out.cu_seqlens)
tokens.shape   # (L_total // 4, 2048)
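The 4× length reduction comes from the pixel shuffle: a 2×2 space-to-depth that merges every 2×2 patch neighbourhood into one token with 4× the channels, before the MLP maps those channels into LLM space. A hypothetical single-image sketch (the library's internal layout may differ):

```python
import torch

def pixel_shuffle_2x2(tokens, grid):
    """2x2 space-to-depth on one image's patch tokens.

    tokens: (H*W, D) row-major patch tokens; grid: (H, W), both even.
    Returns (H//2 * W//2, 4*D). Sketch of the merge step that precedes
    the two-layer MLP, not necessarily open_moonvit's exact code.
    """
    h, w = grid
    d = tokens.shape[-1]
    x = tokens.view(h, w, d)
    # gather each 2x2 neighbourhood into the channel dimension
    x = x.view(h // 2, 2, w // 2, 2, d)          # (h/2, 2, w/2, 2, d)
    x = x.permute(0, 2, 1, 3, 4).contiguous()    # (h/2, w/2, 2, 2, d)
    return x.view((h // 2) * (w // 2), 4 * d)

x = torch.randn(16 * 20, 1152)                   # the 224x280 image's tokens
merged = pixel_shuffle_2x2(x, (16, 20))
merged.shape   # (80, 4608): four tokens became one, channels x4
```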

How it works

flowchart TD
    A["list of native-res images<br/>(3, H_i, W_i)"] --> B["patch embed<br/>Conv2d stride=14"]
    B --> C["+ interpolated<br/>SigLIP abs-pos-embed<br/>(bicubic, per image)"]
    C --> D["flatten & pack<br/>→ (L_total, D)<br/>cu_seqlens tracks boundaries"]
    D --> E["27× Transformer block<br/>pre-norm · QKV-bias"]
    E -.->|inside attn| F["2D RoPE<br/>head_dim/2 for H<br/>head_dim/2 for W"]
    E -.->|inside attn| G["varlen attention<br/>FlashAttn or<br/>block-diagonal SDPA"]
    E --> H["post LayerNorm"]
    H --> I["MLP Projector<br/>2×2 pixel-shuffle · 2-layer MLP"]
    I --> J["LLM-space tokens<br/>(L_total/4, D_llm)"]
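The "interpolated SigLIP abs-pos-embed" step resizes the pretrained position table to each image's patch grid. A sketch with `F.interpolate`, assuming a 27×27 pretrained table (as in SigLIP so400m/14 at 384 px); the encoder's exact resize logic may differ:

```python
import torch
import torch.nn.functional as F

# Bicubic resize of a pretrained absolute position embedding to a new
# patch grid, done per image. Numbers are illustrative.
pos_embed = torch.randn(27 * 27, 1152)            # pretrained (27x27, D) table
h, w = 16, 20                                     # target grid for a 224x280 image

p = pos_embed.view(1, 27, 27, 1152).permute(0, 3, 1, 2)    # (1, D, 27, 27)
p = F.interpolate(p, size=(h, w), mode="bicubic", align_corners=False)
p = p.permute(0, 2, 3, 1).reshape(h * w, 1152)    # (320, D), added to patch tokens
```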

Four things to internalize:

  1. Packing, not padding. Images of different shapes become one long sequence. No wasted compute on pad tokens.
  2. Two positional embeddings, added together. The paper is insistent on this. Interpolated SigLIP absolute pos embed preserves the pretrained prior; 2D RoPE supplies the fine-grained, resolution-robust signal.
  3. Varlen attention is what makes (1) safe. cu_seqlens slices the packed sequence so image i only attends to itself. FlashAttention does this in one kernel; the fallback loops per-image over SDPA.
  4. The projector lives outside the encoder. Pixel shuffle is a 2×2 space-to-depth: four tokens become one, channels 4×. Then a plain two-layer MLP projects into LLM space.
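Point 2's "fine-grained, resolution-robust signal" can be made concrete: half the rotary pairs in each head rotate by a token's row index, the other half by its column index. A hypothetical sketch of that split (the library's frequency base and pair layout may differ):

```python
import torch

def rope_2d(q, rows, cols, theta=10000.0):
    """Apply 2D rotary position embedding to packed tokens.

    q: (L, n_heads, head_dim), head_dim divisible by 4;
    rows, cols: (L,) integer patch coordinates per token.
    Half of head_dim encodes the row, half the column.
    Illustrative sketch, not open_moonvit's exact layout.
    """
    L, H, D = q.shape
    d4 = D // 4                                            # rotary pairs per axis
    freqs = 1.0 / theta ** (torch.arange(d4) / d4)         # (d4,)
    ang_r = rows[:, None].float() * freqs                  # (L, d4) row angles
    ang_c = cols[:, None].float() * freqs                  # (L, d4) col angles
    ang = torch.cat([ang_r, ang_c], dim=-1)[:, None, :]    # (L, 1, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = q[..., 0::2], q[..., 1::2]                    # interleaved pairs
    out = torch.empty_like(q)
    out[..., 0::2] = x1 * cos - x2 * sin                   # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because it is a pure rotation, token norms are preserved and the token at grid position (0, 0) passes through unchanged.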

Citations

@article{kimivl2025,
    title   = {Kimi-VL Technical Report},
    author  = {{Kimi Team}},
    journal = {arXiv preprint arXiv:2504.07491},
    year    = {2025},
    url     = {https://arxiv.org/abs/2504.07491}
}
@article{dehghani2023navit,
    title   = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
    author  = {Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lucic, Mario and Houlsby, Neil},
    journal = {arXiv preprint arXiv:2307.06304},
    year    = {2023}
}
@article{zhai2023siglip,
    title   = {Sigmoid Loss for Language Image Pre-Training},
    author  = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
    journal = {arXiv preprint arXiv:2303.15343},
    year    = {2023}
}
@article{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Su, Jianlin and Lu, Yu and Pan, Shengfeng and Murtadha, Ahmed and Wen, Bo and Liu, Yunfeng},
    journal = {arXiv preprint arXiv:2104.09864},
    year    = {2021}
}

License

Apache License 2.0
