
Gemini - PyTorch

Project description

Gemini

The open-source implementation of Gemini, the model said to "eclipse ChatGPT". It appears to work by feeding all modalities at once into a single transformer, with special decoders for text or image generation.

Join the Agora Discord channel to help with the implementation, and check out the project board.

The input sequences for Gemini consist of texts, audio, images, and videos. These inputs are transformed into tokens, which are then processed by a transformer. Subsequently, conditional decoding takes place to generate image outputs.

Interestingly, the architecture of Gemini bears resemblance to Fuyu's architecture but is expanded to encompass multiple modalities. Instead of using a vision transformer (ViT) encoder, Gemini simply feeds image embeddings directly into the transformer.

For Gemini, the token inputs will likely be indicated by special modality tokens such as [IMG] or [AUDIO]. CoDi, a component of Gemini, also employs conditional generation and makes use of the tokenized outputs.

To implement this model effectively, I intend to initially focus on the image embeddings to ensure their smooth integration. Subsequently, I will proceed with incorporating audio embeddings and then video embeddings.
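To make the interleaving idea concrete, here is a minimal sketch of how a modality marker token and a run of embedding slots could be spliced into a text token sequence. The marker ids and the `interleave` helper are hypothetical, not part of this package; only the vocabulary size (50432) comes from the examples below.

```python
import torch

# Hypothetical special-token ids; the real vocabulary ids are not
# specified in this README.
IMG_TOKEN = 50430

def interleave(text_ids: torch.Tensor, num_img_tokens: int) -> torch.Tensor:
    """Prepend an [IMG] marker and a run of image-embedding slots
    to a batch of text token ids, yielding one flat sequence."""
    b = text_ids.shape[0]
    img_marker = torch.full((b, 1), IMG_TOKEN, dtype=text_ids.dtype)
    img_slots = torch.zeros(b, num_img_tokens, dtype=text_ids.dtype)
    return torch.cat([img_marker, img_slots, text_ids], dim=-1)

text = torch.randint(0, 50000, (1, 16))
seq = interleave(text, num_img_tokens=4)
print(seq.shape)  # torch.Size([1, 21])
```

In a real model the zero slots would be overwritten by the projected image embeddings rather than looked up in the token embedding table.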

Install

pip3 install gemini-torch

Usage

Gemini Transformer Usage

  • Base transformer
  • Multi-grouped-query attention / Flash Attention
  • RoPE
  • ALiBi
  • xPos
  • QK norm
  • No absolute positional embeddings
  • KV cache
import torch
from gemini_torch.model import Gemini

# Initialize model with smaller dimensions
model = Gemini(
    num_tokens=50432,
    max_seq_len=4096,  # Reduced from 8192
    dim=1280,  # Reduced from 2560
    depth=16,  # Reduced from 32
    dim_head=64,  # Reduced from 128
    heads=12,  # Reduced from 24
    use_abs_pos_emb=False,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Text token ids: [batch, seq_len]
text = torch.randint(0, 50432, (1, 4096))  # Reduced seq_len from 8192

# Img shape: [batch, channels, height, width]
img = torch.randn(1, 3, 128, 128)  # Reduced height and width from 256

# Audio shape: [batch, audio_seq_len]
audio = torch.randn(1, 64)  # Reduced audio_seq_len from 128

# Apply model to text and img
y = model(text, img, audio)

# Output shape: [batch, seq_len, dim]
print(y.shape)
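The feature list above mentions ALiBi. As a quick illustration of what that positional bias looks like, here is a standalone sketch, with slopes as in the ALiBi paper (assuming a power-of-two head count); `alibi_bias` is a hypothetical helper, not this package's API.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slopes 2^(-8/n), 2^(-16/n), ..., assuming num_heads
    # is a power of two (e.g. 8 heads -> 1/2, 1/4, ..., 1/256).
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Linear distance penalty added to causal attention logits,
    # shape [heads, queries, keys]; zero on the diagonal, increasingly
    # negative for more distant (earlier) keys.
    slopes = alibi_slopes(num_heads)
    distance = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    return slopes[:, None, None] * distance[None].clamp(max=0)

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)  # torch.Size([8, 4, 4])
```

Because the bias depends only on distance, ALiBi needs no learned positional embeddings, which is why `use_abs_pos_emb=False` above.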

Multi-Modal with Imgs + Audio

  • Img processing through a specially crafted module that takes an image, patches it, and reshapes it to the shape of the text tensors, [B, seq_len, dim], to align with the text tokens
import torch
from gemini_torch.model import Gemini

# Initialize model
model = Gemini(
    num_tokens=50432,
    max_seq_len=8192,
    dim=2560,
    depth=32,
    dim_head=128,
    heads=24,
    use_abs_pos_emb=False,
    alibi_pos_bias=True,
    alibi_num_heads=12,
    rotary_xpos=True,
    attn_flash=True,
    attn_kv_heads=2,
    qk_norm=True,
    attn_qk_norm=True,
    attn_qk_norm_dim_scale=True,
)

# Text token ids: [batch, seq_len]
text = torch.randint(0, 50432, (1, 8192))

# Img shape: [batch, channels, height, width]
img = torch.randn(1, 3, 256, 256)

# Audio shape: [batch, audio_seq_len]
audio = torch.randn(1, 128)

# Apply model to text and img
y = model(text, img, audio)

# Output shape: [batch, seq_len, dim]
print(y.shape)

Tokenizer

  • SentencePiece tokenizer
  • We're using the same tokenizer as LLaMA, with special tokens denoting the beginning and end of the multimodal token spans.
  • It does not yet fully process images, audio, or video; we need help with that.
from gemini_torch.tokenizer import MultimodalSentencePieceTokenizer

# Example usage
tokenizer_name = "hf-internal-testing/llama-tokenizer"
tokenizer = MultimodalSentencePieceTokenizer(tokenizer_name=tokenizer_name)

# Encoding and decoding examples
encoded_audio = tokenizer.encode("Audio description", modality="audio")
decoded_audio = tokenizer.decode(encoded_audio)

print("Encoded audio:", encoded_audio)
print("Decoded audio:", decoded_audio)
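The begin/end special tokens mentioned above can be sketched in isolation. The marker strings below are hypothetical; the library's actual special tokens may differ.

```python
# Hypothetical begin/end marker strings per modality.
MODALITY_MARKERS = {
    "image": ("<img>", "</img>"),
    "audio": ("<audio>", "</audio>"),
}

def wrap_with_modality(text: str, modality: str) -> str:
    """Surround a description with begin/end modality markers
    before tokenization, as described above."""
    start, end = MODALITY_MARKERS[modality]
    return f"{start}{text}{end}"

print(wrap_with_modality("Audio description", "audio"))
# <audio>Audio description</audio>
```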

ImgToEmbeddings

  • Takes an image, patches it, and reshapes it to [B, seq_len, dim] to align with the transformer
import torch
from gemini_torch.utils import ImgToEmbeddings

# Example usage
num_patches = 16
patch_size = 16
transformer_dim = 512
img_channels = 3
seq_len = 50000
reduced_dim = 256  # Reduced dimension after dimensionality reduction

model = ImgToEmbeddings(
    num_patches, patch_size, transformer_dim, img_channels, seq_len, reduced_dim
)

# Dummy image input [BATCH, CHANNELS, HEIGHT, WIDTH]
dummy_img = torch.randn(1, 3, 64, 64)  # Batch size of 1, 64x64 RGB image

# Forward pass
seq_space_output = model(dummy_img)
print(seq_space_output.shape)  # Expected shape: [1, 50000, 256]
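For intuition, the patch-then-reshape step that ImgToEmbeddings performs can be sketched with a strided Conv2d. This is the generic patch-embedding pattern, not the package's actual implementation.

```python
import torch
from torch import nn

class PatchEmbed(nn.Module):
    """Minimal patch-embedding sketch: split an image into non-overlapping
    patches with a strided Conv2d and flatten to [B, num_patches, dim]."""

    def __init__(self, patch_size: int = 16, in_chans: int = 3, dim: int = 512):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)                   # [B, dim, H/p, W/p]
        return x.flatten(2).transpose(1, 2)  # [B, (H/p)*(W/p), dim]

embed = PatchEmbed()
out = embed(torch.randn(1, 3, 64, 64))  # 64x64 image, 16x16 patches
print(out.shape)  # torch.Size([1, 16, 512])
```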

AudioToEmbeddings

  • Transforms audio into the same shape as text tensors.
import torch 
from gemini_torch.utils import AudioToEmbeddings

# Example usage
audio_seq_len = 32000  # Input audio sequence length
seqlen = 512  # Sequence length to align with the language transformer
dim = 512  # Embedding dimension

model = AudioToEmbeddings(audio_seq_len, seqlen, dim)
audio_input = torch.randn(1, audio_seq_len)  # Example input tensor
output = model(audio_input)

print("Output shape:", output.shape)  # Should be [1, 512, 512]
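The shape transform AudioToEmbeddings describes can be sketched with a single linear projection. The module below is a hypothetical stand-in, not the package's implementation, and the dimensions are reduced so the example stays small (the Linear weight would be huge at the full 32000 -> 512x512 sizes).

```python
import torch
from torch import nn

class AudioProject(nn.Module):
    """Sketch: project a raw 1-D audio signal of length audio_seq_len
    to [B, seqlen, dim] with one linear layer plus a reshape."""

    def __init__(self, audio_seq_len: int, seqlen: int, dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_seq_len, seqlen * dim)
        self.seqlen, self.dim = seqlen, dim

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        b = audio.shape[0]
        return self.proj(audio).view(b, self.seqlen, self.dim)

model = AudioProject(audio_seq_len=1000, seqlen=32, dim=64)
out = model(torch.randn(1, 1000))
print(out.shape)  # torch.Size([1, 32, 64])
```

A real module would more likely operate on learned audio features (e.g. USM features, per the Todo section) than on the raw waveform.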

References

  • Combine reinforcement learning with a modular pretrained transformer and multi-modal capabilities (image, audio)
  • Self-improving mechanisms like RoboCat
  • PPO? or MPO
  • Get good at backtracking and exploring alternative paths
  • speculative decoding
  • Algorithm of Thoughts
  • RLHF
  • Gemini Report
  • Gemini Landing Page

Todo

  • Check out the project board for more todos

  • Implement the img feature embedder and align imgs with text and pass into transformer: Gemini models are trained to accommodate textual input interleaved with a wide variety of audio and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).

  • Implement the audio processing using USM by Google: In addition, Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the model to capture nuances that are typically lost when the audio is naively mapped to a text input (for example, see audio understanding demo on the website).

  • Video Processing Technique: "Video understanding is accomplished by encoding the video as a sequence of frames in the large context window. Video frames or images can be interleaved naturally with text or audio as part of the model input"
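The frame-sequence encoding described in that quote can be sketched as a simple reshape over per-frame token embeddings; the helper below is illustrative only.

```python
import torch

def video_to_tokens(frame_embeds: torch.Tensor) -> torch.Tensor:
    """Sketch: fold a video's per-frame token embeddings
    [B, frames, tokens_per_frame, dim] into one long sequence
    [B, frames * tokens_per_frame, dim] for the context window."""
    b, f, t, d = frame_embeds.shape
    return frame_embeds.reshape(b, f * t, d)

frames = torch.randn(1, 8, 16, 512)  # 8 frames, 16 patch tokens each
seq = video_to_tokens(frames)
print(seq.shape)  # torch.Size([1, 128, 512])
```

Once flattened, these frame tokens can be interleaved with text or audio tokens exactly like the image tokens above.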

  • Prompting Technique: We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought. We refer the reader to appendix for a detailed breakdown of how this approach compares with only chain-of-thought prompting or only greedy sampling.
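That uncertainty-routed scheme can be sketched as majority voting with a threshold fallback. The threshold value and helper below are illustrative, not the ones selected in the report (which tuned the threshold on a validation split).

```python
from collections import Counter

def uncertainty_routed_answer(cot_samples, greedy_answer, threshold=0.6):
    """Take k sampled chain-of-thought answers; if the majority answer's
    vote share clears the threshold, return it, otherwise fall back to
    the greedy (no chain-of-thought) answer."""
    counts = Counter(cot_samples)
    answer, votes = counts.most_common(1)[0]
    if votes / len(cot_samples) >= threshold:
        return answer
    return greedy_answer

samples = ["42", "42", "42", "17", "42", "42", "9", "42"]  # k = 8 samples
print(uncertainty_routed_answer(samples, greedy_answer="17"))  # 42
```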

  • Train 1.8B and 3.25B models: Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B parameters respectively. Despite their size, they show exceptionally strong performance on factuality, i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gemini_torch-0.1.2.tar.gz (25.9 kB)

Uploaded Source

Built Distribution

gemini_torch-0.1.2-py3-none-any.whl (23.3 kB)

Uploaded Python 3

File details

Details for the file gemini_torch-0.1.2.tar.gz.

File metadata

  • Download URL: gemini_torch-0.1.2.tar.gz
  • Size: 25.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for gemini_torch-0.1.2.tar.gz
  • SHA256: 7d3b01e9e55a9419380b7f302946f7b1661ea39fab8b8beb48a3745605461ea1
  • MD5: 537faf3c5f73f840e371036e3c3ee99b
  • BLAKE2b-256: 3eccafb0c219c9fc9e58ca467f038d14ede1f6c9b414fc48222eb643501a8e15


File details

Details for the file gemini_torch-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: gemini_torch-0.1.2-py3-none-any.whl
  • Size: 23.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for gemini_torch-0.1.2-py3-none-any.whl
  • SHA256: 9eb9707174f7dea0695d2c871578913d83e847e3bbb996d44e956a04d63be4ce
  • MD5: 9b701894a5c8a01762a2b2a33f6cbd95
  • BLAKE2b-256: aab157d315aeb8894d68cc1dfb5d445baded593436f27588a908758509b3dd12

