Open-source PyTorch reference implementation of MoonViT, the native-resolution vision encoder from the Kimi-VL Technical Report (arXiv:2504.07491)
MoonViT - PyTorch
This is an ultra-simple, single-file PyTorch implementation of MoonViT, the native-resolution vision encoder from Kimi-VL. I implemented this model because I think it's a great ViT variant, with the ability to ingest images of varying sizes and resolutions at scale.
Install
$ pip install open-moonvit
Or from source:
$ git clone https://github.com/kyegomez/open-moonvit
$ cd open-moonvit
$ pip install -e .
FlashAttention is optional. If flash_attn is importable and you're on CUDA, the var-length kernel is used automatically. Otherwise a block-diagonal SDPA fallback runs on CPU / MPS / CUDA with no extra dependencies.
$ pip install flash-attn --no-build-isolation # optional
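The dispatch described above can be sketched in a few lines; this is illustrative, not the package's actual internals (the function name here is hypothetical):

```python
import importlib.util


def use_flash_varlen() -> bool:
    """Illustrative check: prefer the flash-attn varlen kernel only when the
    package imports cleanly AND a CUDA device is available; otherwise the
    block-diagonal SDPA fallback is used."""
    if importlib.util.find_spec("flash_attn") is None:
        return False
    import torch  # deferred so this sketch runs even without torch installed

    return torch.cuda.is_available()
```

On CPU or MPS this returns False and the SDPA path runs, so the dependency stays truly optional.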
Usage
import torch
from open_moonvit import MoonViT, MoonViTConfig, MLPProjector
encoder = MoonViT(MoonViTConfig()) # ~413M params, SigLIP-SO-400M defaults
# a batch of images at different resolutions, no padding, no resizing
images = [
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
]
out = encoder(images)
out.last_hidden_state # (L_total, 1152) packed patch tokens
out.cu_seqlens # (4,) int32 image boundaries in the packed seq
out.grid_shapes # [(16,20), (10,14), (24,24)]
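Those shapes follow directly from the 14-pixel patch size; a quick sanity check in plain Python (patch size taken from the SigLIP-SO-400M defaults noted above):

```python
PATCH = 14  # SigLIP-SO-400M patch size

sizes = [(224, 280), (140, 196), (336, 336)]          # (H, W) of the example images
grids = [(h // PATCH, w // PATCH) for h, w in sizes]  # patch-grid shape per image

# cu_seqlens: cumulative token counts with a leading 0,
# hence length 4 for a batch of 3 images
cu = [0]
for gh, gw in grids:
    cu.append(cu[-1] + gh * gw)

print(grids)  # [(16, 20), (10, 14), (24, 24)]
print(cu)     # [0, 320, 460, 1036]
```

So L_total here is 1036 packed tokens, with image boundaries at 320 and 460.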
To feed an LLM, compose with the MLP projector (2×2 pixel-shuffle then a two-layer MLP):
projector = MLPProjector(
    vision_hidden_size = 1152,
    llm_hidden_size = 2048,
)
tokens, grids, cu = projector(out.last_hidden_state, out.grid_shapes, out.cu_seqlens)
tokens.shape # (L_total // 4, 2048)
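The 2×2 pixel shuffle is just space-to-depth on each image's patch grid; here is a minimal NumPy sketch of the operation (the real projector applies this per image to the packed sequence before the MLP):

```python
import numpy as np


def pixel_shuffle_2x2(x: np.ndarray) -> np.ndarray:
    """Space-to-depth: merge each 2x2 block of patch tokens into one token
    with 4x the channels. H and W are assumed even."""
    H, W, D = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D)
    x = x.transpose(0, 2, 1, 3, 4)          # group each 2x2 neighbourhood
    return x.reshape(H // 2, W // 2, 4 * D)


grid = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
out = pixel_shuffle_2x2(grid)
print(out.shape)  # (2, 2, 12): 16 tokens -> 4 tokens, channels 3 -> 12
```

This is why the projected sequence has L_total // 4 tokens: every four spatial neighbours collapse into one LLM-space token.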
How it works
flowchart TD
A["list of native-res images<br/>(3, H_i, W_i)"] --> B["patch embed<br/>Conv2d stride=14"]
B --> C["+ interpolated<br/>SigLIP abs-pos-embed<br/>(bicubic, per image)"]
C --> D["flatten & pack<br/>→ (L_total, D)<br/>cu_seqlens tracks boundaries"]
D --> E["27× Transformer block<br/>pre-norm · QKV-bias"]
E -.->|inside attn| F["2D RoPE<br/>head_dim/2 for H<br/>head_dim/2 for W"]
E -.->|inside attn| G["varlen attention<br/>FlashAttn or<br/>block-diagonal SDPA"]
E --> H["post LayerNorm"]
H --> I["MLP Projector<br/>2×2 pixel-shuffle · 2-layer MLP"]
I --> J["LLM-space tokens<br/>(L_total/4, D_llm)"]
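The 2D RoPE split in the diagram can be sketched in NumPy: half of each head's rotation pairs are driven by the patch's row index, the other half by its column index. The frequency base and the head count (16 heads on a 1152-dim model, so head_dim 72) are assumptions following common defaults, not values read from the package:

```python
import numpy as np


def rope_2d_angles(grid_h: int, grid_w: int, head_dim: int, base: float = 10000.0):
    """Per-position rotation angles: head_dim/2 pairs in total,
    the first half from the row index, the second half from the column index."""
    quarter = head_dim // 4                       # rotation pairs per axis
    freqs = base ** (-np.arange(quarter) / quarter)
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    ang_h = rows[..., None] * freqs               # (H, W, head_dim/4)
    ang_w = cols[..., None] * freqs
    return np.concatenate([ang_h, ang_w], axis=-1)  # (H, W, head_dim/2)


ang = rope_2d_angles(16, 20, 72)  # the (16, 20) grid from the usage example
print(ang.shape)  # (16, 20, 36)
```

Because the angles depend only on grid coordinates, the encoding extends to any resolution without retraining.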
Four things to internalize:
- Packing, not padding. Images of different shapes become one long sequence. No wasted compute on pad tokens.
- Two positional embeddings, added together. The paper is insistent on this. Interpolated SigLIP absolute pos embed preserves the pretrained prior; 2D RoPE supplies the fine-grained, resolution-robust signal.
- Varlen attention is what makes packing safe. cu_seqlens slices the packed sequence so image i only attends to itself. FlashAttention does this in one kernel; the fallback loops per-image over SDPA.
- The projector lives outside the encoder. Pixel shuffle is a 2×2 space-to-depth: four tokens become one, channels 4×. Then a plain two-layer MLP projects into LLM space.
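The block-diagonal structure is easy to picture: from cu_seqlens one can build a boolean mask where query i may attend to key j only if both fall inside the same image's slice. This NumPy sketch shows the mask, not the package's actual kernel:

```python
import numpy as np


def block_diagonal_mask(cu_seqlens: list[int]) -> np.ndarray:
    """True where query i and key j belong to the same image."""
    L = cu_seqlens[-1]
    mask = np.zeros((L, L), dtype=bool)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        mask[start:end, start:end] = True
    return mask


mask = block_diagonal_mask([0, 3, 5])  # two tiny "images": 3 and 2 tokens
print(mask.astype(int))
```

FlashAttention's varlen kernel never materializes this L×L mask; it gets the same effect from cu_seqlens directly, which is what makes packing cheap at scale.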
Citations
@article{kimivl2025,
  title   = {Kimi-VL Technical Report},
  author  = {{Kimi Team}},
  journal = {arXiv preprint arXiv:2504.07491},
  year    = {2025},
  url     = {https://arxiv.org/abs/2504.07491}
}

@article{dehghani2023navit,
  title   = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
  author  = {Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lucic, Mario and Houlsby, Neil},
  journal = {arXiv preprint arXiv:2307.06304},
  year    = {2023}
}

@article{zhai2023siglip,
  title   = {Sigmoid Loss for Language Image Pre-Training},
  author  = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
  journal = {arXiv preprint arXiv:2303.15343},
  year    = {2023}
}

@article{su2021roformer,
  title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
  author  = {Su, Jianlin and Lu, Yu and Pan, Shengfeng and Murtadha, Ahmed and Wen, Bo and Liu, Yunfeng},
  journal = {arXiv preprint arXiv:2104.09864},
  year    = {2021}
}
License
Apache License 2.0