MobileVLM - Pytorch

MobileVLM

Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices"

Install

pip3 install mobilevlm

Usage

# Import the necessary libraries
import torch
from mobilevlm import LDP

# Create an instance of the LDP model
ldp = LDP(in_channels=128, out_channels=128, depth=3)

# Create an example input tensor
input_tensor = torch.randn(1, 128, 64, 64)

# Pass the input tensor through the LDP model to get the output
output = ldp(input_tensor)

# Print the shape of the output tensor
print(output.shape)

Lightweight Downsample Projector (LDP) Layer

The Lightweight Downsample Projector (LDP) layer is a component designed for efficient feature extraction and dimensionality reduction in convolutional neural networks. It is particularly suited to mobile and edge devices, where computational resources are limited.

The LDP layer combines depthwise convolutions with pointwise (1x1) convolutions and skip connections, which reduces the parameter count relative to standard convolutions while maintaining a rich feature representation. Layer Normalization is included to stabilize training, which can also speed up convergence.
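As a rough illustration of the two building blocks named above, the sketch below expresses a depthwise and a pointwise convolution with plain `torch.nn.Conv2d`. The channel count, kernel size, and variable names are illustrative assumptions, not values taken from the `mobilevlm` package.

```python
import torch
import torch.nn as nn

channels = 128  # assumed channel count, chosen to match the usage example

# Depthwise convolution: groups == channels, so each channel gets its own 3x3 filter
depthwise = nn.Conv2d(
    channels, channels, kernel_size=3, padding=1, groups=channels, bias=False
)

# Pointwise convolution: a 1x1 kernel that only mixes information across channels
pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

x = torch.randn(1, channels, 64, 64)
y = pointwise(depthwise(x))

print(depthwise.weight.shape)  # one single-channel 3x3 filter per input channel
print(pointwise.weight.shape)  # a channel-mixing matrix with 1x1 spatial extent
print(y.shape)                 # spatial size is unchanged at stride 1 with padding 1
```

Setting `groups` equal to the number of input channels is what turns a regular `Conv2d` into a depthwise convolution.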

Architecture

The LDP layer is structured as follows:

  1. Initial Pointwise Convolution: This is a 1x1 convolution that transforms the input feature map to the desired number of channels. It is computationally efficient and serves as a channel-wise feature transformation.

  2. GELU Activation: After the initial pointwise convolution, we apply a Gaussian Error Linear Unit (GELU) activation function. GELU provides non-linearity to the model, allowing it to learn more complex patterns.

  3. First Depthwise Convolution: A depthwise convolution with a stride of 1 follows, which applies a single filter per input channel. It is used for spatial feature extraction and changes neither the spatial size nor the channel count of the feature map.

  4. First Skip Connection: The output of the first depthwise convolution is added back to the output of the initial pointwise convolution. This skip connection allows gradients to flow directly through the network, mitigating the vanishing gradient problem and enabling deeper architectures.

  5. Second Pointwise Convolution: Another 1x1 convolution is applied to further mix the channel-wise features.

  6. Layer Normalization: Normalization is applied over the channel dimension to stabilize the mean and variance of activations, leading to improved training dynamics.

  7. Second GELU Activation: A second GELU activation function is applied for additional non-linearity.

  8. Second Depthwise Convolution: This depthwise convolution has a stride of 2, halving the spatial dimensions of the feature map and effectively downsampling the input.

  9. Second Skip Connection: A pixel-wise addition combines the downsampled input to the block with the output of the second depthwise convolution. This connection helps to preserve information that would otherwise be lost in downsampling.

  10. Third Pointwise Convolution: A final 1x1 convolution adjusts the channel dimensions if necessary and refines the features before passing them to subsequent layers.

  11. Layer Normalization: Another layer normalization is applied to the output of the final pointwise convolution.
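The eleven steps above can be sketched in PyTorch roughly as follows. This is a minimal sketch written from the description only, not the code shipped in `mobilevlm`: the class names `ChannelLayerNorm` and `LDPSketch`, the 3x3 kernel size, the use of average pooling to downsample the second skip path, and the choice of the step 4 output as "the input to the block" are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of an NCHW tensor (hypothetical helper)."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # NCHW -> NHWC, normalize the last (channel) axis, then back to NCHW
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)


class LDPSketch(nn.Module):
    """Illustrative block following the listed steps; not the package's LDP class."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        c = out_channels
        self.pw1 = nn.Conv2d(in_channels, c, kernel_size=1)
        # Assumed 3x3 depthwise kernels; groups=c gives one filter per channel
        self.dw1 = nn.Conv2d(c, c, kernel_size=3, stride=1, padding=1, groups=c)
        self.pw2 = nn.Conv2d(c, c, kernel_size=1)
        self.norm1 = ChannelLayerNorm(c)
        self.dw2 = nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1, groups=c)
        self.pw3 = nn.Conv2d(c, c, kernel_size=1)
        self.norm2 = ChannelLayerNorm(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip1 = self.pw1(x)                  # step 1: initial pointwise conv
        y = self.dw1(F.gelu(skip1))          # steps 2-3: GELU, depthwise conv (stride 1)
        y = y + skip1                        # step 4: first skip connection
        skip2 = y                            # assumed input to the downsampling half
        y = F.gelu(self.norm1(self.pw2(y)))  # steps 5-7: pointwise conv, LayerNorm, GELU
        y = self.dw2(y)                      # step 8: depthwise conv (stride 2)
        # Step 9: pool the skip path with the same kernel/stride/padding as dw2
        # so both tensors have identical spatial size, even for odd inputs.
        y = y + F.avg_pool2d(skip2, kernel_size=3, stride=2, padding=1)
        return self.norm2(self.pw3(y))       # steps 10-11: pointwise conv, LayerNorm


block = LDPSketch(in_channels=128, out_channels=128)
out = block(torch.randn(1, 128, 64, 64))
print(out.shape)  # spatial dimensions are halved by the stride-2 convolution
```

Pooling the skip path with the same kernel, stride, and padding as the stride-2 convolution is one simple way to keep the two branches shape-compatible; a strided 1x1 convolution would be another reasonable choice.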

Why It Works

The LDP layer is designed to retain the salient input features while reducing spatial resolution at low computational cost. A depthwise convolution followed by a pointwise convolution uses far fewer parameters than a standard convolution with the same kernel size and channel counts, which reduces the computational cost and can lower the risk of overfitting.
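To make the parameter saving concrete, the following sketch counts the weights of a standard 3x3 convolution and of a depthwise plus pointwise pair. The channel count of 128 and the 3x3 kernel are assumed for illustration, bias terms are ignored, and `conv_params` is a hypothetical helper.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Number of weights in a 2D convolution without bias."""
    return (c_in // groups) * c_out * k * k


c, k = 128, 3  # assumed channel count and kernel size

standard = conv_params(c, c, k)             # dense 3x3 convolution
depthwise = conv_params(c, c, k, groups=c)  # one 3x3 filter per channel
pointwise = conv_params(c, c, 1)            # 1x1 channel mixing
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 2))
# 147456 17536 8.41
```

Under these assumptions the separable pair needs roughly 8.4 times fewer weights than the dense convolution.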

Skip connections not only help to preserve information throughout the layer but also improve gradient flow during backpropagation, allowing for deeper network architectures. Layer Normalization is known to accelerate training and make the model less sensitive to initialization and learning rate choices.
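The normalization effect can be checked with a small experiment. The sketch below applies `torch.nn.LayerNorm` over the channel axis of a deliberately shifted and scaled feature map, then inspects the per-pixel statistics. The tensor sizes and the shift and scale values are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
channels = 128
# NCHW feature map with mean near 3 and standard deviation near 5
x = 5.0 * torch.randn(2, channels, 8, 8) + 3.0

norm = nn.LayerNorm(channels)
# LayerNorm normalizes the last axis, so move channels there and back
y = norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

per_pixel_mean = y.mean(dim=1)
per_pixel_var = y.var(dim=1, unbiased=False)

print(per_pixel_mean.abs().max().item())       # close to 0
print((per_pixel_var - 1).abs().max().item())  # close to 0, so variance is near 1
```

With the default affine parameters (weight 1, bias 0), every spatial position ends up with near-zero mean and near-unit variance across channels, regardless of the scale of the input.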

This combination of efficiency and robustness makes the LDP layer a versatile component in designing neural networks for resource-constrained environments.

Citation

@misc{chu2023mobilevlm,
    title={MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices}, 
    author={Xiangxiang Chu and Limeng Qiao and Xinyang Lin and Shuang Xu and Yang Yang and Yiming Hu and Fei Wei and Xinyu Zhang and Bo Zhang and Xiaolin Wei and Chunhua Shen},
    year={2023},
    eprint={2312.16886},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

License

MIT
