
A collection of multimodal datasets for research.

Project description

multimodal


A collection of multimodal (vision and language) datasets and visual features for deep learning research. See the Documentation.

Pretrained models

  • ALBEF
from multimodal.models import ALBEF
albef = ALBEF.from_pretrained()

Visual Features

Currently, it supports the following visual features (downloaded automatically): Bottom-Up and Top-Down features extracted from COCO images, available through COCOBottomUpFeatures.

Datasets

It also supports the following datasets, with their evaluation metrics: VQA, VQA v2, VQA-CP, VQA-CP v2 (VQA evaluation metric), and CLEVR.

Note that when instantiating those datasets, large files might be downloaded. You can always specify the dir_data argument when instantiating, or set the MULTIMODAL_DATA_DIR environment variable so that all data goes to the specified directory.
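For example, both of the following keep all downloads in one place (a minimal sketch; the /data/multimodal path is just an illustration):

import os
from multimodal.datasets import VQA

# Option 1: point every download to one directory via the environment variable
# (set it before instantiating any dataset or feature class).
os.environ["MULTIMODAL_DATA_DIR"] = "/data/multimodal"

# Option 2: pass dir_data explicitly when instantiating.
dataset = VQA(split="train", dir_data="/data/multimodal")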

Models

  • Bottom-Up and Top-Down attention (UpDown)
  • ALBEF (pretrained model)

WordEmbeddings

It also provides word embeddings (either trained from scratch, or pretrained from torchtext) that can be fine-tuned.

Simple Usage

To install the library, run pip install multimodal. Python 3.6 and 3.7 are supported.

Visual Features

Available features are provided by COCOBottomUpFeatures:

>>> from multimodal.features import COCOBottomUpFeatures
>>> bottomup = COCOBottomUpFeatures(features="trainval_36", dir_data="/tmp")
>>> image_id = 13455
>>> feats = bottomup[image_id]
>>> print(feats.keys())
['image_w', 'image_h', 'num_boxes', 'boxes', 'features']
>>> print(feats["features"].shape)  # numpy array
(36, 2048)

Datasets

VQA

Available VQA datasets are VQA, VQA v2, VQA-CP, and VQA-CP v2, along with their associated pytorch-lightning data modules.

You can run a simple evaluation of predictions using the following commands. Data will be downloaded and processed if necessary. Predictions must have the same format as the official VQA result format (see https://visualqa.org/evaluation.html).

# vqa 1.0
python -m multimodal vqa-eval -p <path/to/predictions> -s "val"
# vqa 2.0
python -m multimodal vqa2-eval -p <path/to/predictions> -s "val"
# vqa-cp 1.0
python -m multimodal vqacp-eval -p <path/to/predictions> -s "val"
# vqa-cp 2.0
python -m multimodal vqacp2-eval -p <path/to/predictions> -s "val"
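For reference, a predictions file is a JSON list with one question_id/answer entry per question, following the official VQA result format linked above. A minimal sketch of writing such a file (the question ids below are placeholders):

import json

# Predictions in the official VQA result format:
# a list of {"question_id", "answer"} entries, one per question of the split.
predictions = [
    {"question_id": 1, "answer": "yes"},
    {"question_id": 2, "answer": "2"},
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f)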

To use the datasets for your training runs, use the following:

# Visual Question Answering
import torch
from multimodal.datasets import VQA, VQA2, VQACP, VQACP2

dataset = VQA(split="train", features="coco-bottomup", dir_data="/tmp")
item = dataset[0]

dataloader = torch.utils.data.DataLoader(dataset, collate_fn=VQA.collate_fn)

for batch in dataloader:
    out = model(batch)
    # training code...

We also provide a pytorch_lightning data module, available as multimodal.datasets.lightning.VQADataModule, and similarly for the other VQA datasets. See the documentation.

CLEVR

from multimodal.datasets import CLEVR

# Warning: this will download an 18 GB file.
# You can specify the multimodal data directory
# by providing the dir_data argument.
clevr = CLEVR(split="train") 

Pretrained Tokenizer and Word embeddings

Word embeddings are implemented as pytorch modules. They are therefore trainable if needed, but can also be frozen.

Pretrained embedding weights are downloaded with torchtext. The following pretrained embeddings are available: charngram.100d, fasttext.en.300d, fasttext.simple.300d, glove.42B.300d, glove.6B.100d, glove.6B.200d, glove.6B.300d, glove.6B.50d, glove.840B.300d, glove.twitter.27B.100d, glove.twitter.27B.200d, glove.twitter.27B.25d, glove.twitter.27B.50d

Usage

from multimodal.text import PretrainedWordEmbedding
from multimodal.text import BasicTokenizer

# tokenizer converts words to tokens, and to token_ids. Pretrained tokenizers 
# save token_ids from an existing vocabulary.
tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa")

# Pretrained word embedding, frozen. A list of tokens is given as input to initialize the embeddings.
wemb = PretrainedWordEmbedding.from_pretrained("glove.840B.300d", tokens=tokenizer.tokens, freeze=True)

embeddings = wemb(tokenizer(["Inputs are batched, and padded. This is the first batch item", "This is the second batch item."]))

Models

The Bottom-Up and Top-Down Attention for VQA model (UpDown) is implemented. To train it, run python multimodal/models/updown.py --dir-data <path_to_multimodal_data> --dir-exp logs/vqa2/updown

It uses PyTorch Lightning, with the class multimodal.models.updown.VQALightningModule.

You can check the code to see other parameters.

You can train the model manually:

import torch
import torch.nn.functional as F

from multimodal.models import UpDownModel
from multimodal.datasets import VQA2
from multimodal.text import BasicTokenizer

vqa_tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")

train_dataset = VQA2(split="train", features="coco-bottomup", dir_data="/tmp")
train_loader = torch.utils.data.DataLoader(train_dataset, collate_fn=VQA2.collate_fn)

updown = UpDownModel(num_ans=len(train_dataset.answers))
optimizer = torch.optim.Adam(updown.parameters())

for batch in train_loader:
    batch["question_tokens"] = vqa_tokenizer(batch["question"])
    out = updown(batch)
    logits = out["logits"]
    loss = F.binary_cross_entropy_with_logits(logits, batch["label"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Or train it with PyTorch Lightning:

from multimodal.datasets.lightning import VQA2DataModule
from multimodal.models import UpDownModel
from multimodal.models.lightning import VQALightningModule
from multimodal.text import BasicTokenizer
import pytorch_lightning as pl

tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")

vqa2 = VQA2DataModule(
    features="coco-bottomup-36",
    batch_size=512,
    num_workers=4,
)

vqa2.prepare_data()
num_ans = vqa2.num_ans  # number of answer classes

updown = UpDownModel(
    num_ans=num_ans,
    tokens=tokenizer.tokens,  # to init word embeddings
)

lightningmodel = VQALightningModule(
    updown,
    train_dataset=vqa2.train_dataset,
    val_dataset=vqa2.val_dataset,
    tokenizer=tokenizer,
)

trainer = pl.Trainer(
    gpus=1,
    max_epochs=30,
    gradient_clip_val=0.25,
    default_root_dir="logs/updown",
)

trainer.fit(lightningmodel, datamodule=vqa2)

API

Features

features = COCOBottomUpFeatures(
    features="test2014_36",   # one of [trainval2014, trainval2014_36, test2014, test2014_36, test2015, test2015_36]
    dir_data=None             # directory for multimodal data. By default, in the application directory for multimodal.
)

Then, to get the features for a specific image:

feats = features[image_id]

The features have the following keys:

{
    "image_id": int,
    "image_w": int,
    "image_h": int,
    "num_boxes": int,
    "boxes": np.array(N, 4),
    "features": np.array(N, 2048),
}

Datasets

# Datasets
dataset = VQA(
    dir_data=None,       # dir where multimodal data will be downloaded. Default is HOME/.multimodal
    features=None,       # which visual features should be used. Choices: coco-bottomup or coco-bottomup-36
    split="train",       # "train", "val" or "test"
    min_ans_occ=8,       # Minimum number of occurrences for an answer to be kept.
    dir_features=None,   # Specific directory for features. By default, they will be located in dir_data/features.
    label="multilabel",  # "multilabel", or "best". This changes the shape of the ground truth label (class number for best, or tensor of scores for multilabel)
)
item = dataset[0]

The item will contain the following keys:

>>> print(item.keys())
{'image_id',
'question_id',
'question_type',
'question',                 # full question (not tokenized, tokenization is done in the WordEmbedding class)
'answer_type',              # yes/no, number or other
'multiple_choice_answer',
'answers',
'label',                    # either class label (if label="best") or target class scores (tensor of N classes).
'scores',                   # VQA scores for every answer
}
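The label argument mainly determines which loss applies. A minimal sketch with dummy tensors (the binary cross-entropy case mirrors the UpDown training loop above; using cross-entropy for label="best" is an assumption about typical usage, not something prescribed here):

import torch
import torch.nn.functional as F

num_answers = 3000                      # hypothetical number of answer classes
logits = torch.randn(8, num_answers)    # model output for a batch of 8 questions

# label="multilabel": the target is a (batch, num_answers) tensor of scores.
multilabel_target = torch.rand(8, num_answers)
loss = F.binary_cross_entropy_with_logits(logits, multilabel_target)

# label="best": the target is a single class index per question.
best_target = torch.randint(0, num_answers, (8,))
loss = F.cross_entropy(logits, best_target)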

Word embeddings

# Word embedding from scratch, and trainable.
wemb = WordEmbedding(
    tokens,   # Token list. We recommend using torchtext basic_english tokenizer.
    dim=50,   # Dimension for word embeddings.
    freeze=False   # freeze=True means that word embeddings will be set with `requires_grad=False`. 
)



wemb = WordEmbedding.from_pretrained(
    name="glove.840B.300d",  # embedding name (from torchtext)
    tokens=tokens,           # tokens to load from the word embedding.
    max_tokens=None,         # if set to N, only the N most common tokens will be loaded.
    freeze=True,             # same meaning as for the from-scratch embedding above.
    dir_data=None,           # dir where data will be downloaded. Default is the multimodal directory in the apps dir.
)

# Forward pass
sentences = ["How many people are in the picture?", "What color is the car?"]
wemb(
    sentences, 
    tokenized=False  # set tokenized to True if sentence is already tokenized.
)
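If the input has already been tokenized elsewhere, the same forward call accepts token lists (a sketch continuing the wemb defined above; the exact expected format, lists of token strings, is an assumption based on the tokenized flag):

tokenized_sentences = [
    ["how", "many", "people", "are", "in", "the", "picture", "?"],
    ["what", "color", "is", "the", "car", "?"],
]
embeddings = wemb(tokenized_sentences, tokenized=True)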
