Text classification datasets

Textbook: Universal NLP Datasets

Dependencies

  • av==6.2.0
  • jsonnet==0.14.0
  • opencv_python==4.1.1.26
  • torch==1.3.1
  • torchvision==0.4.2
  • numpy==1.17.4

Download raw datasets

bash fetch.sh

This downloads the alphanli, hellaswag, physicaliqa, and socialiqa datasets from AWS.
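
To confirm that a download succeeded, it can help to look at the first record of one of the files. The sketch below assumes the script stores JSON Lines files under data_cache/, as the path in the usage example suggests; `peek_jsonl` is a hypothetical helper and not part of textbook.

```python
import json
from pathlib import Path


def peek_jsonl(path, n=1):
    """Return up to the first n records of a JSON Lines file as Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumed location, taken from the usage example in this README.
    path = Path("data_cache/alphanli/eval.jsonl")
    if path.exists():
        print(peek_jsonl(path))
    else:
        print(f"{path} not found; run `bash fetch.sh` first")
```

The printed record shows which fields the file contains, which is useful when comparing against the dataset configuration.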

Usage

Load a multiple-choice dataset

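from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer

# TextRenderer, Configuration, the dataset classes, DynamicBatchSampler,
# collate_fn, and the *_config objects are assumed to come from these star imports.
from textbook.config import *
from textbook.transforms_video import *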

text_renderer = TextRenderer(
    tokenizer=GPT2Tokenizer.from_pretrained('distilgpt2'),
    special_tokens={'cls_token': '[CLS]', 'pad_token': '[PAD]', 'mask_token': '[MASK]'},
)
config_alphanli = Configuration(alphanli_config)
alphanli_dataset = ClassificationDataset("data_cache/alphanli/eval.jsonl", config_alphanli, [text_renderer])

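# Wrap the DataLoader in iter() so batches can be pulled one at a time;
# the result is an iterator over batches, not a DataLoader.
alphanli_dev_dataloader = iter(
    DataLoader(
        alphanli_dataset, batch_sampler=DynamicBatchSampler(alphanli_dataset),
        collate_fn=collate_fn))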

Load a multimodal dataset

import torchvision

upscale_size = int(84 * 1.1)  # 92: scale slightly above the 84-pixel crop size
transform_pre = ComposeMix([
    [Scale(upscale_size), "img"],
    [RandomCropVideo(84), "vid"],
])

transform_post = ComposeMix([
    [torchvision.transforms.ToTensor(), "img"],
])

video_renderer = VisionRenderer(
    nframe=72,
    nclip=1,
    nstep=2,
    transform_pre=transform_pre,
    transform_post=transform_post,
    data_dir="data_cache/smthsmth/20bn-something-something-v2"
)

config_smthsmth = Configuration(smthsmth_config)

smthsmth_dataset = ClassificationVisionDataset(
    "data_cache/smthsmth/something-something-v2-validation.json", config_smthsmth, [text_renderer, video_renderer])

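# iter() yields batches one at a time; collate_fn is configured here for
# masked language modeling (mlm=True).
smthsmth_dev_dataloader = iter(DataLoader(
    smthsmth_dataset, batch_size=16, collate_fn=lambda x: collate_fn(
        x, mlm=True, mlm_probability=0.15, tokenizer=text_renderer.tokenizer)))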
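
The `mlm=True, mlm_probability=0.15` arguments above point to masked-language-model collation. How textbook's `collate_fn` implements it is not shown here, so the following is only a simplified sketch of the general idea: each non-special token is independently replaced by the mask token with the given probability, and only masked positions keep a label. `mask_tokens`, its parameters, and the token ids in the usage lines are hypothetical.

```python
import random

# Default ignore_index of torch.nn.CrossEntropyLoss, so unmasked positions
# would not contribute to the loss.
IGNORE_INDEX = -100


def mask_tokens(input_ids, mask_token_id, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one token sequence.

    Simplification: a selected token is always replaced by mask_token_id.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in input_ids:
        if tok not in special_ids and rng.random() < mlm_probability:
            masked.append(mask_token_id)
            labels.append(tok)  # the model should predict the original token
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)
    return masked, labels


if __name__ == "__main__":
    # Made-up ids: 101 and 0 stand in for special tokens, 103 for the mask token.
    ids = [101, 5, 6, 7, 0]
    print(mask_tokens(ids, mask_token_id=103, special_ids={101, 0}, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the masking reproducible, which helps when debugging a collate function.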

Let's multitask

multitask_dataloader = MultiTaskDataset([alphanli_dev_dataloader, smthsmth_dev_dataloader])

# alternate through different dataloaders
for batch in multitask_dataloader:
    print(batch["input_ids"].shape)
    print(batch["images"].shape)
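
The comment above says the multitask loader alternates through the dataloaders. The exact policy of `MultiTaskDataset` (for example, whether it stops at the shortest loader) is not stated here, so the sketch below shows one plausible reading: a round-robin over plain iterators that keeps going until every one is exhausted. `round_robin` is a hypothetical helper, not part of textbook.

```python
def round_robin(*iterables):
    """Yield one item from each iterable in turn, dropping each as it runs out."""
    active = [iter(it) for it in iterables]
    while active:
        still_active = []
        for it in active:
            try:
                item = next(it)
            except StopIteration:
                continue  # this source is exhausted; leave it out of the next round
            still_active.append(it)
            yield item
        active = still_active


if __name__ == "__main__":
    # Stand-ins for two dataloaders of different lengths.
    text_batches = ["text-0", "text-1", "text-2"]
    video_batches = ["video-0", "video-1"]
    for batch in round_robin(text_batches, video_batches):
        print(batch)
```

Continuing until all sources are exhausted means no batch is dropped, at the cost of the last few steps coming from a single task.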
