
Open-source PyTorch reference implementation of MoonViT, the native-resolution vision encoder from the Kimi-VL Technical Report (arXiv:2504.07491)


MoonViT - PyTorch


This is an ultra-simple, single-file PyTorch implementation of MoonViT, the native-resolution vision encoder from Kimi-VL. I implemented this model because I think it's a great ViT variant: it ingests images at arbitrary sizes and aspect ratios, with no resizing or padding, and scales to large batches.

Install

$ git clone https://github.com/kyegomez/open-moonvit
$ cd open-moonvit
$ pip install torch

FlashAttention is optional. If flash_attn is importable and you're on CUDA, the variable-length (varlen) kernel is used automatically. Otherwise a block-diagonal SDPA fallback runs on CPU / MPS / CUDA with no extra dependencies.

$ pip install flash-attn --no-build-isolation  # optional
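The block-diagonal SDPA fallback can be sketched as a per-image loop over the packed sequence, so each image only attends to itself. This is an illustrative standalone sketch, not the package's actual code; the function name and shapes are assumptions, though the cu_seqlens convention matches the README.

```python
# Hedged sketch of the block-diagonal SDPA fallback: slice the packed
# sequence at cu_seqlens boundaries and run SDPA per image.
import torch
import torch.nn.functional as F

def varlen_sdpa(q, k, v, cu_seqlens):
    """q, k, v: (L_total, n_heads, head_dim) packed over all images.
    cu_seqlens: (n_images + 1,) int32 cumulative token boundaries."""
    out = torch.empty_like(q)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        # one image at a time; SDPA wants (n_heads, L_i, head_dim)
        qi, ki, vi = (t[start:end].transpose(0, 1) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        out[start:end] = oi.transpose(0, 1)
    return out

q = k = v = torch.randn(6, 2, 8)                  # 6 packed tokens, 2 heads
cu = torch.tensor([0, 4, 6], dtype=torch.int32)   # two images: 4 + 2 tokens
print(varlen_sdpa(q, k, v, cu).shape)             # torch.Size([6, 2, 8])
```

FlashAttention's varlen kernel does the same slicing in a single fused kernel; the loop above trades speed for zero extra dependencies.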

Usage

import torch
from main import MoonViT, MoonViTConfig, MLPProjector

encoder = MoonViT(MoonViTConfig())    # ~413M params, SigLIP-SO-400M defaults

# a batch of images at different resolutions, no padding, no resizing
images = [
    torch.randn(3, 224, 280),
    torch.randn(3, 140, 196),
    torch.randn(3, 336, 336),
]

out = encoder(images)
out.last_hidden_state    # (L_total, 1152)   packed patch tokens
out.cu_seqlens           # (4,) int32        image boundaries in the packed seq
out.grid_shapes          # [(16,20), (10,14), (24,24)]
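The bookkeeping above is plain arithmetic: with patch size 14, each image contributes (H/14) x (W/14) tokens, and cu_seqlens is the running total with a leading zero. A quick sanity check for the example batch:

```python
# Reproduce the grid shapes and cu_seqlens for the example batch (patch size 14).
patch = 14
sizes = [(224, 280), (140, 196), (336, 336)]

grids = [(h // patch, w // patch) for h, w in sizes]
print(grids)   # [(16, 20), (10, 14), (24, 24)]

cu, total = [0], 0
for gh, gw in grids:
    total += gh * gw
    cu.append(total)
print(cu)      # [0, 320, 460, 1036] -> last_hidden_state is (1036, 1152)
```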

To feed an LLM, compose with the MLP projector (2×2 pixel-shuffle then a two-layer MLP):

projector = MLPProjector(
    vision_hidden_size = 1152,
    llm_hidden_size    = 2048,
)

tokens, grids, cu = projector(out.last_hidden_state, out.grid_shapes, out.cu_seqlens)
tokens.shape   # (L_total // 4, 2048)
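The 2x2 pixel shuffle inside the projector is a space-to-depth move: four neighboring patch tokens merge into one token with 4x the channels. A minimal single-image sketch (illustrative only, not the package's exact implementation; tokens are assumed row-major over the grid):

```python
# 2x2 pixel shuffle (space-to-depth) on one image's patch tokens.
import torch

def pixel_shuffle_2x2(tokens, grid):
    """tokens: (H*W, D) row-major patch tokens; grid: (H, W), both even."""
    h, w = grid
    d = tokens.shape[-1]
    x = tokens.reshape(h // 2, 2, w // 2, 2, d)   # split both axes into 2x2 cells
    x = x.permute(0, 2, 1, 3, 4)                  # (H/2, W/2, 2, 2, D)
    return x.reshape(h * w // 4, 4 * d)           # fold each cell into channels

tokens = torch.randn(16 * 20, 1152)               # first example image, grid (16, 20)
out = pixel_shuffle_2x2(tokens, (16, 20))
print(out.shape)                                  # torch.Size([80, 4608])
```

The two-layer MLP then maps the 4x-wide channels (4608 here) down to the LLM hidden size.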

How it works

flowchart TD
    A["list of native-res images<br/>(3, H_i, W_i)"] --> B["patch embed<br/>Conv2d stride=14"]
    B --> C["+ interpolated<br/>SigLIP abs-pos-embed<br/>(bicubic, per image)"]
    C --> D["flatten &amp; pack<br/>→ (L_total, D)<br/>cu_seqlens tracks boundaries"]
    D --> E["27× Transformer block<br/>pre-norm · QKV-bias"]
    E -.->|inside attn| F["2D RoPE<br/>head_dim/2 for H<br/>head_dim/2 for W"]
    E -.->|inside attn| G["varlen attention<br/>FlashAttn or<br/>block-diagonal SDPA"]
    E --> H["post LayerNorm"]
    H --> I["MLP Projector<br/>2×2 pixel-shuffle · 2-layer MLP"]
    I --> J["LLM-space tokens<br/>(L_total/4, D_llm)"]
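The "interpolated SigLIP abs-pos-embed" step in the diagram amounts to bicubically resizing the pretrained positional-embedding grid to each image's patch grid. A sketch, assuming a SigLIP-style 27x27 pretrained grid (the grid size and tensor names are assumptions, not read from the package):

```python
# Resize a pretrained positional-embedding grid to one image's patch grid.
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 1152, 27, 27)   # pretrained grid as (1, D, H, W)
gh, gw = 16, 20                            # target patch grid for a 224x280 image

resized = F.interpolate(pos_embed, size=(gh, gw), mode="bicubic", align_corners=False)
per_patch = resized.flatten(2).transpose(1, 2).squeeze(0)   # (gh*gw, D)
print(per_patch.shape)                     # torch.Size([320, 1152])
```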

Four things to internalize:

  1. Packing, not padding. Images of different shapes become one long sequence. No wasted compute on pad tokens.
  2. Two positional embeddings, added together. The paper is insistent on this. Interpolated SigLIP absolute pos embed preserves the pretrained prior; 2D RoPE supplies the fine-grained, resolution-robust signal.
  3. Varlen attention is what makes (1) safe. cu_seqlens slices the packed sequence so image i only attends to itself. FlashAttention does this in one kernel; the fallback loops per-image over SDPA.
  4. The projector lives outside the encoder. Pixel shuffle is a 2×2 space-to-depth: four tokens become one, channels 4×. Then a plain two-layer MLP projects into LLM space.
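The 2D RoPE split in point 2 can be sketched as follows: the first half of each head's channels is rotated by the patch's row index, the second half by its column index. Frequencies and function names here are illustrative, not the package's actual parameterization.

```python
# Toy 2D RoPE: rotate half the head channels by row position, half by column.
import torch

def rope_2d(x, rows, cols):
    """x: (L, head_dim), head_dim divisible by 4; rows, cols: (L,) patch coords."""
    d = x.shape[-1] // 2                       # channels allotted to each axis
    def rotate(v, pos):
        freqs = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))
        ang = pos[:, None] * freqs[None, :]    # (L, d/2) rotation angles
        cos, sin = ang.cos(), ang.sin()
        v1, v2 = v[:, 0::2], v[:, 1::2]
        out = torch.empty_like(v)
        out[:, 0::2] = v1 * cos - v2 * sin
        out[:, 1::2] = v1 * sin + v2 * cos
        return out
    return torch.cat([rotate(x[:, :d], rows), rotate(x[:, d:], cols)], dim=-1)

gh, gw, head_dim = 4, 5, 8
rows = torch.arange(gh).repeat_interleave(gw).float()   # row index per packed token
cols = torch.arange(gw).repeat(gh).float()              # column index per packed token
x = torch.randn(gh * gw, head_dim)
print(rope_2d(x, rows, cols).shape)           # torch.Size([20, 8])
```

Because each pair of channels is just rotated, token norms are preserved, and relative offsets in H and W enter attention independently.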

Citations

@article{kimivl2025,
    title   = {Kimi-VL Technical Report},
    author  = {{Kimi Team}},
    journal = {arXiv preprint arXiv:2504.07491},
    year    = {2025},
    url     = {https://arxiv.org/abs/2504.07491}
}
@article{dehghani2023navit,
    title   = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
    author  = {Dehghani, Mostafa and Mustafa, Basil and Djolonga, Josip and Heek, Jonathan and Minderer, Matthias and Caron, Mathilde and Steiner, Andreas and Puigcerver, Joan and Geirhos, Robert and Alabdulmohsin, Ibrahim and Oliver, Avital and Padlewski, Piotr and Gritsenko, Alexey and Lucic, Mario and Houlsby, Neil},
    journal = {arXiv preprint arXiv:2307.06304},
    year    = {2023}
}
@article{zhai2023siglip,
    title   = {Sigmoid Loss for Language Image Pre-Training},
    author  = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas},
    journal = {arXiv preprint arXiv:2303.15343},
    year    = {2023}
}
@article{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Su, Jianlin and Lu, Yu and Pan, Shengfeng and Murtadha, Ahmed and Wen, Bo and Liu, Yunfeng},
    journal = {arXiv preprint arXiv:2104.09864},
    year    = {2021}
}

License

Apache License 2.0
