Skip to main content

kosmos-2 - Pytorch

Project description

Multi-Modality

Kosmos-2

My personal implementation of Kosmos-2, Kosmos-2: Grounding Multimodal Large Language Models to the World, much simpler codebase

Install

pip3 install qwen


Usage

import torch
from qwen.model import QwenVL

#usage
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

model = QwenVL()
output = model(img, caption)
print(output.shape)

Inference

from qwen.inference import QwenVLChat


qwen_chat = QwenVLChat(model_name="Qwen/Qwen-VL-Chat", device_map="cuda")
response = qwen_chat.chat([
    {"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"text": "这是什么?"}
])
print(response)

Training

from qwen.train import Train


def train():
    os.environ['MASTER_ADDR'] #'localhost'
    os.environ['MASTER_PORT'] #= '9994'
    
    # # [CRITICAL] Pay attention to this when scaling to multiple GPUs and clusters
    os.environ['RANK']       #= str(0) # Number of nodes (servers)
    os.environ['WORLD_SIZE'] # = str(torch.cuda.device_count())

    dist.init_process_group(backend='nccl') #init_method="env://")
    
    Train()

if __name__ == '__main__':
    train()
  1. Set the environment variables:

    • ENTITY_NAME: Your wandb project name
    • OUTPUT_DIR: Directory to save the weights (e.g., ./weights)
    • MASTER_ADDR: For distributed training
    • MASTER_PORT For master port distributed training
    • RANK- Number of nodes services
    • WORLD_SIZE Number of gpus
  2. Configure the training:

    • Accelerate Config
    • Enable Deepspeed 3
    • Accelerate launch train_distributed_accelerate.py

For more information, refer to the Training SOP.


Todo

  • Position aware vision language adapter, compresses image features. Singer layer cross attention module inited randomly => group of trainable embeddings as query vectors + image features from the visual encoder as keys for cross attention ops => OUTPUT: compresses visual feature sequence to a fixed lnegth of 256, 2d absolute positional encodings are integrated into the cross attentions mechanisms query key pairs => compressed feature sequence of length of 256 => fed into decoder llm

  • Bounding Boxes, for any given accurate bounding box, a norm process is applied in the range [0, 1000] and transformed into a string format (Xtope, Ytople)(Xottomright, Ybottomright) -> the string is tokenized as text and does not require positional vocabulary. Detection strings and regular text strings, two special tokens and are added to the beginning and end of the bounding box string. + another sed of special tokens ( and ) is introduced.

Citations

Please use the following to cite this work:

@article{bai2023qwen,
  title={Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2308.12966}
}

For more details, please refer to the full paper.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kosmos_2-0.1.0.tar.gz (24.2 kB view details)

Uploaded Source

Built Distribution

kosmos_2-0.1.0-py3-none-any.whl (23.2 kB view details)

Uploaded Python 3

File details

Details for the file kosmos_2-0.1.0.tar.gz.

File metadata

  • Download URL: kosmos_2-0.1.0.tar.gz
  • Upload date:
  • Size: 24.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for kosmos_2-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9e920d7f8365f5e42f87831058542ce9921e3c4dc6c60400bb07e370e6cca7cc
MD5 959cc7a88ebbe936831392c1d265e162
BLAKE2b-256 559bb741084a44d63278ddb68a9b2dcad2668feea8d21439fd479e56b428fe78

See more details on using hashes here.

File details

Details for the file kosmos_2-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: kosmos_2-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for kosmos_2-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5f3e5777745fc909b381405f9a3b69aa816a0d4eb3c52023e000562c7d66d9a0
MD5 ac364beff954b9e28c26db9db1e31f17
BLAKE2b-256 f8dd5c959ec87491cb2f9a17e8de871b1b29b5f4d8d5aa3b80c9a575846a9716

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page