mmdit

MMDiT

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.9
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

MMDiT

Implementation of a single layer of the MMDiT, proposed by Esser et al. in Stable Diffusion 3, in PyTorch.

Besides a straight reproduction, this will also generalize to more than two modalities, as I can envision an MMDiT for images, audio, and text.
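To make the multi-modality idea concrete, here is a minimal sketch of joint attention over any number of token streams: each modality keeps its own projection weights, the projected tokens are concatenated along the sequence axis, attended over jointly, then split back. `JointAttentionSketch` and its arguments are hypothetical names chosen for illustration; this is not the code shipped in the package, and it omits conditioning, normalization, masking, and feedforward layers.

```python
import torch
import torch.nn.functional as F
from torch import nn

class JointAttentionSketch(nn.Module):
    """Illustrative only: joint attention over N modalities with per-modality weights."""

    def __init__(self, dim_modalities, dim_joint_attn = 64, heads = 4):
        super().__init__()
        assert dim_joint_attn % heads == 0
        self.heads = heads
        # each modality gets its own query/key/value and output projections
        self.to_qkv = nn.ModuleList([nn.Linear(d, dim_joint_attn * 3, bias = False) for d in dim_modalities])
        self.to_out = nn.ModuleList([nn.Linear(dim_joint_attn, d, bias = False) for d in dim_modalities])

    def forward(self, modality_tokens):
        lengths = [t.shape[1] for t in modality_tokens]

        # project per modality, then concatenate along the sequence dimension
        qkv = torch.cat([proj(t) for proj, t in zip(self.to_qkv, modality_tokens)], dim = 1)
        b, n, _ = qkv.shape
        q, k, v = qkv.chunk(3, dim = -1)

        # split heads: (b, n, h * d) -> (b, h, n, d)
        q, k, v = (x.reshape(b, n, self.heads, -1).transpose(1, 2) for x in (q, k, v))

        # every token attends to every token of every modality
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, -1)

        # split back into modalities and project to each modality's own width
        outs = out.split(lengths, dim = 1)
        return tuple(proj(o) for proj, o in zip(self.to_out, outs))

# three mock modalities with different widths and lengths
attn = JointAttentionSketch(dim_modalities = (48, 32, 24))
tokens = (torch.randn(1, 8, 48), torch.randn(1, 16, 32), torch.randn(1, 4, 24))
outs = attn(tokens)
```

Each output keeps the shape of its input, so a layer like this can be stacked.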

It will also offer an improvised variant of the attention that adaptively selects which weights to use through learned gating.
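The description does not spell out how the gating works. One possible reading, sketched below purely as an assumption, is a small bank of candidate weight matrices with a learned per-token softmax gate that mixes their outputs. `GatedWeightSelection` and `num_weights` are hypothetical names and may not match what the package actually does.

```python
import torch
from torch import nn

class GatedWeightSelection(nn.Module):
    """Illustrative only: mix several candidate projections with a learned gate."""

    def __init__(self, dim, dim_out, num_weights = 2):
        super().__init__()
        # bank of candidate projection matrices: (k, dim, dim_out)
        self.weights = nn.Parameter(torch.randn(num_weights, dim, dim_out) * dim ** -0.5)
        # gate predicts, per token, how much of each candidate to use
        self.to_gates = nn.Linear(dim, num_weights)

    def forward(self, x):
        # x: (batch, seq, dim)
        gates = self.to_gates(x).softmax(dim = -1)                     # (b, n, k)
        candidates = torch.einsum('bnd,kde->bnke', x, self.weights)    # (b, n, k, e)
        return torch.einsum('bnk,bnke->bne', gates, candidates)        # (b, n, e)

proj = GatedWeightSelection(dim = 16, dim_out = 32, num_weights = 3)
x = torch.randn(2, 5, 16)
out = proj(x)
```

With a single candidate the gate is always one, so the module reduces to an ordinary linear projection.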

Install

$ pip install mmdit

Usage

import torch
from mmdit import MMDiTBlock

# define mm dit block

block = MMDiTBlock(
    dim_joint_attn = 512,
    dim_cond = 256,
    dim_text = 768,
    dim_image = 512,
    qk_rmsnorm = True
)

# mock inputs

time_cond = torch.randn(1, 256)

text_tokens = torch.randn(1, 512, 768)
text_mask = torch.ones((1, 512)).bool()

image_tokens = torch.randn(1, 1024, 512)

# single block forward

text_tokens_next, image_tokens_next = block(
    time_cond = time_cond,
    text_tokens = text_tokens,
    text_mask = text_mask,
    image_tokens = image_tokens
)
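The usage above passes `time_cond` without saying how the block consumes it. DiT-style blocks commonly let the conditioning vector predict a per-channel scale and shift that is applied after a parameter-free layer norm (adaptive layer norm). The following is a minimal sketch under that assumption; `AdaptiveLayerNormSketch` is a hypothetical name and is not part of the mmdit API.

```python
import torch
from torch import nn

class AdaptiveLayerNormSketch(nn.Module):
    """Illustrative only: layer norm whose scale and shift come from a condition."""

    def __init__(self, dim, dim_cond):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine = False)
        self.to_scale_shift = nn.Linear(dim_cond, dim * 2)
        # zero init so the module starts out as a plain layer norm
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, tokens, cond):
        # cond: (batch, dim_cond) -> scale, shift: (batch, 1, dim), broadcast over the sequence
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim = -1)
        return self.norm(tokens) * (1 + scale) + shift

# same mock shapes as the usage example above
ada_norm = AdaptiveLayerNormSketch(dim = 512, dim_cond = 256)
time_cond = torch.randn(1, 256)
image_tokens = torch.randn(1, 1024, 512)
normed = ada_norm(image_tokens, time_cond)
```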

A generalized version can be used like so:

import torch
from mmdit.mmdit_generalized_pytorch import MMDiT

mmdit = MMDiT(
    depth = 2, 
    dim_modalities = (768, 512, 384),
    dim_joint_attn = 512,
    dim_cond = 256,
    qk_rmsnorm = True
)

# mock inputs

time_cond = torch.randn(1, 256)

text_tokens = torch.randn(1, 512, 768)
text_mask = torch.ones((1, 512)).bool()

video_tokens = torch.randn(1, 1024, 512)

audio_tokens = torch.randn(1, 256, 384)

# forward

text_tokens, video_tokens, audio_tokens = mmdit(
    modality_tokens = (text_tokens, video_tokens, audio_tokens),
    modality_masks = (text_mask, None, None),
    time_cond = time_cond,
)
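In the call above, `modality_masks` mixes a boolean mask with `None`. A natural interpretation, assumed here rather than confirmed by the source, is that `None` means every token of that modality is valid. The sketch below shows how such per-modality masks could be merged into one mask over the concatenated sequence; `build_joint_mask` is a hypothetical helper, not a function exported by the package.

```python
import torch

def build_joint_mask(modality_tokens, modality_masks):
    """Concatenate per-modality masks along the sequence axis, treating None as all valid."""
    masks = []
    for tokens, mask in zip(modality_tokens, modality_masks):
        if mask is None:
            # no mask given: mark every position of this modality as valid
            mask = torch.ones(tokens.shape[:2], dtype = torch.bool, device = tokens.device)
        masks.append(mask)
    return torch.cat(masks, dim = 1)

# small mock example: 4 text tokens (last one is padding) and 6 video tokens
text = torch.randn(1, 4, 8)
video = torch.randn(1, 6, 8)
text_pad_mask = torch.tensor([[True, True, True, False]])

joint_mask = build_joint_mask((text, video), (text_pad_mask, None))

# shape (batch, 1, 1, total_len), broadcastable against attention scores of shape
# (batch, heads, total_len, total_len) to hide padded keys
attn_mask = joint_mask[:, None, None, :]
```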

Citations

@article{Esser2024ScalingRF,
    title   = {Scaling Rectified Flow Transformers for High-Resolution Image Synthesis},
    author  = {Patrick Esser and Sumith Kulal and A. Blattmann and Rahim Entezari and Jonas Muller and Harry Saini and Yam Levi and Dominik Lorenz and Axel Sauer and Frederic Boesel and Dustin Podell and Tim Dockhorn and Zion English and Kyle Lacey and Alex Goodwin and Yannik Marek and Robin Rombach},
    journal = {ArXiv},
    year    = {2024},
    volume  = {abs/2403.03206},
    url     = {https://api.semanticscholar.org/CorpusID:268247980}
}

@inproceedings{Darcet2023VisionTN,
    title   = {Vision Transformers Need Registers},
    author  = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:263134283}
}

Release history

- 0.1.0: May 10, 2024 (this version)
- 0.0.11: May 6, 2024
- 0.0.10: May 6, 2024
- 0.0.9: May 5, 2024
- 0.0.8: May 5, 2024
- 0.0.7: May 5, 2024
- 0.0.6: May 5, 2024
- 0.0.5: May 5, 2024
- 0.0.4: May 4, 2024
- 0.0.3: May 4, 2024
- 0.0.2: May 4, 2024
- 0.0.1: May 4, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mmdit-0.1.0.tar.gz (148.2 kB)

Uploaded May 10, 2024 Source

Built Distribution

mmdit-0.1.0-py3-none-any.whl (9.6 kB)

Uploaded May 10, 2024 Python 3

Hashes for mmdit-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`bd23bcc83eced0b362135851ea5b1aca1678a8edfc91f34e0d2a3ae96bb46fa1`
MD5	`68322c3ef0e9e13ab66eb8259f5bfd72`
BLAKE2b-256	`950554ca19c0d43a845a755543cebc538260b6aa41a0995db33adf6dd8ebdfdb`

Hashes for mmdit-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`5a5f9939d9e9ae395bfd973f2356ab3b4024b8246b917127abb5092e5d3ad5a3`
MD5	`8617b3ad4a6491433396d5a81ea83a86`
BLAKE2b-256	`e54d33e357493ccd921063e25192623f72b6bf9d6465ea995667dcfde4520d82`