Text classification datasets

Textbook: Universal NLP Datasets

Dependencies

  • av==6.2.0
  • jsonnet==0.14.0
  • opencv_python==4.1.1.26
  • torch==1.3.1
  • torchvision==0.4.2
  • numpy==1.17.4

Download raw datasets

bash fetch.sh

The script downloads the alphanli, hellaswag, physicaliqa, and socialiqa datasets from AWS.
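Before building a dataset, it can help to check that the download produced readable files. The following is a minimal sketch, not part of textbook: `peek_jsonl` is a hypothetical helper, and the path is taken from the usage example in this README on the assumption that `fetch.sh` writes into `data_cache/`.

```python
import json
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return up to the first n records of a JSON Lines file as Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if len(records) >= n:
                break
            line = line.strip()
            if not line:
                # Skip blank lines instead of failing on them.
                continue
            records.append(json.loads(line))
    return records


# Assumed location; adjust it if fetch.sh stores the files elsewhere.
sample_path = Path("data_cache/alphanli/eval.jsonl")
if sample_path.exists():
    for record in peek_jsonl(sample_path):
        print(record)
```

Printing a few records shows which fields each example carries before the dataset configuration is applied.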

Usage

Load a multiple-choice dataset

import torchvision
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer

from textbook.config import *
from textbook.transforms_video import *

text_renderer = TextRenderer(
    tokenizer=GPT2Tokenizer.from_pretrained('distilgpt2'),
    special_tokens={'cls_token': '[CLS]', 'pad_token': '[PAD]', 'mask_token': '[MASK]'},
)
config_alphanli = Configuration(alphanli_config)
alphanli_dataset = ClassificationDataset("data_cache/alphanli/eval.jsonl", config_alphanli, [text_renderer])

# iter() returns an iterator over batches, so no DataLoader annotation here.
alphanli_dev_dataloader = iter(
    DataLoader(
        alphanli_dataset, batch_sampler=DynamicBatchSampler(alphanli_dataset),
        collate_fn=collate_fn))

Load a multimodal dataset

# int(84 * 1.1) == 92, slightly larger than the crop size of 84 used below.
upscale_size = int(84 * 1.1)
transform_pre = ComposeMix([
    [Scale(upscale_size), "img"],
    [RandomCropVideo(84), "vid"],
])

transform_post = ComposeMix([
    [torchvision.transforms.ToTensor(), "img"],
])

video_renderer = VisionRenderer(
    nframe=72,
    nclip=1,
    nstep=2,
    transform_pre=transform_pre,
    transform_post=transform_post,
    data_dir="data_cache/smthsmth/20bn-something-something-v2"
)

config_smthsmth = Configuration(smthsmth_config)

smthsmth_dataset = ClassificationVisionDataset(
    "data_cache/smthsmth/something-something-v2-validation.json", config_smthsmth, [text_renderer, video_renderer])

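# iter() returns an iterator over batches, so no DataLoader annotation here.
# The lambda forwards the masked-language-modelling options to collate_fn.
smthsmth_dev_dataloader = iter(DataLoader(
    smthsmth_dataset, batch_size=16, collate_fn=lambda x: collate_fn(
        x, mlm=True, mlm_probability=0.15, tokenizer=text_renderer.tokenizer)))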
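The `mlm=True, mlm_probability=0.15` arguments above suggest that `collate_fn` applies masked-language-modelling noise to the text. How textbook implements this is not shown in this README. The sketch below illustrates the general technique only, in plain Python, with a hypothetical `mask_tokens` helper. It always substitutes the mask token, whereas BERT-style recipes also leave some selected tokens unchanged or replace them with random ones.

```python
import random


def mask_tokens(token_ids, mask_id, mlm_probability=0.15,
                special_ids=frozenset(), rng=None):
    """Return (inputs, labels) for a simplified masked-language-modelling step.

    Each non-special token is selected independently with probability
    mlm_probability. Selected tokens become mask_id in inputs and keep their
    original id in labels. All other positions get the label -100, the
    default ignore index of PyTorch's cross-entropy loss.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in token_ids:
        if token not in special_ids and rng.random() < mlm_probability:
            inputs.append(mask_id)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(-100)
    return inputs, labels


# Illustrative ids only; real ids come from the tokenizer.
inputs, labels = mask_tokens([101, 7, 8, 9, 102], mask_id=103,
                             special_ids={101, 102}, rng=random.Random(0))
print(inputs, labels)
```

Passing a seeded `random.Random` makes the masking reproducible, which is convenient when debugging a data pipeline.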

Let's multitask

multitask_dataloader = MultiTaskDataset([alphanli_dev_dataloader, smthsmth_dev_dataloader])

# alternate through different dataloaders
for batch in multitask_dataloader:
    print(batch["input_ids"].shape)
    print(batch["images"].shape)
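The comment above says that `MultiTaskDataset` alternates through the dataloaders. As a rough illustration of that idea, and not the library's actual implementation, the hypothetical `round_robin` generator below interleaves several iterables and keeps going until all of them are exhausted. `MultiTaskDataset` may stop earlier or sample differently.

```python
def round_robin(iterables):
    """Yield one item from each iterable in turn until all are exhausted."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        still_active = []
        for iterator in iterators:
            try:
                item = next(iterator)
            except StopIteration:
                # This source is finished; drop it from later rounds.
                continue
            still_active.append(iterator)
            yield item
        iterators = still_active


# Stand-ins for batches from a text-only and a multimodal dataloader.
text_batches = [{"input_ids": "t0"}, {"input_ids": "t1"}]
video_batches = [{"input_ids": "v0", "images": "frames0"}]

for batch in round_robin([text_batches, video_batches]):
    # Only look up "images" when the batch actually carries that key.
    print(batch["input_ids"], batch.get("images"))
```

Using `batch.get("images")` in the loop avoids a `KeyError` if batches from a text-only task do not include image tensors.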
