
Text classification datasets


Textbook: Universal NLP Datasets

Dependencies

  • av==6.2.0
  • jsonnet==0.14.0
  • opencv_python==4.1.1.26
  • torch==1.3.1
  • torchvision==0.4.2
  • numpy==1.17.4
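
The pins above can be compared against the current environment with a short script. This is only a sketch: the check_pins helper is hypothetical and not part of textbook, and the package names are copied from the list as written (a distribution may be registered under a slightly different spelling, such as opencv-python). The usage examples also import transformers, which is not pinned in this list and is therefore not checked here.

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {
    "av": "6.2.0",
    "jsonnet": "0.14.0",
    "opencv_python": "4.1.1.26",
    "torch": "1.3.1",
    "torchvision": "0.4.2",
    "numpy": "1.17.4",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```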

Download raw datasets

bash fetch.sh

This downloads the alphanli, hellaswag, physicaliqa and socialiqa datasets from AWS.
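
To look at a few records after downloading, a generic JSON Lines reader is enough. The sketch below assumes the files are JSON Lines with one object per line, as the .jsonl path in the usage section suggests. read_jsonl is a hypothetical helper, not a textbook function, and the field names inside each record are not documented here.

```python
import json
from itertools import islice


def read_jsonl(path, limit=None):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            yield json.loads(line)
```

For example, list(read_jsonl("data_cache/alphanli/eval.jsonl", limit=3)) would return the first three records of the evaluation file, provided fetch.sh has placed it at that path.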

Usage

Load a multiple-choice dataset

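# TextRenderer, Configuration, alphanli_config, ClassificationDataset,
# DynamicBatchSampler and collate_fn are assumed to come from the star imports.
from textbook.config import *
from textbook.transforms_video import *

from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer

# Render text with the distilgpt2 tokenizer plus the extra special tokens.
text_renderer = TextRenderer(
    tokenizer=GPT2Tokenizer.from_pretrained('distilgpt2'),
    special_tokens={'cls_token': '[CLS]', 'pad_token': '[PAD]', 'mask_token': '[MASK]'},
)
config_alphanli = Configuration(alphanli_config)
alphanli_dataset = ClassificationDataset("data_cache/alphanli/eval.jsonl", config_alphanli, [text_renderer])

# iter() returns an iterator over batches, not a DataLoader, so no DataLoader annotation.
alphanli_dev_dataloader = iter(
    DataLoader(
        alphanli_dataset, batch_sampler=DynamicBatchSampler(alphanli_dataset),
        collate_fn=collate_fn))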
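
How DynamicBatchSampler forms its batches is not described here. As an illustration of one common strategy only, the sketch below groups examples of similar length under a padded-token budget. TokenBudgetBatchSampler, lengths and max_tokens are hypothetical names, and this is not the library's implementation. It relies on the fact that a PyTorch DataLoader accepts any iterable that yields lists of indices as batch_sampler.

```python
class TokenBudgetBatchSampler:
    """Hypothetical stand-in for a dynamic batch sampler (not textbook's own)."""

    def __init__(self, lengths, max_tokens):
        self.lengths = list(lengths)  # token count of each example
        self.max_tokens = max_tokens  # budget for one padded batch

    def __iter__(self):
        # Visit examples from shortest to longest so batches pad evenly.
        order = sorted(range(len(self.lengths)), key=self.lengths.__getitem__)
        batch, longest = [], 0
        for index in order:
            candidate = max(longest, self.lengths[index])
            # Padded cost = longest example * number of examples in the batch.
            if batch and candidate * (len(batch) + 1) > self.max_tokens:
                yield batch
                batch, candidate = [], self.lengths[index]
            batch.append(index)
            longest = candidate
        if batch:
            yield batch
```

With lengths [5, 1, 3, 2, 8] and max_tokens=8, the short examples at indices 1 and 3 share a batch and every longer example gets a batch of its own.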

Load a multimodal dataset

import torchvision

# Upscale frames by 10% and then random-crop back to 84 pixels.
upscale_size = int(84 * 1.1)
transform_pre = ComposeMix([
    [Scale(upscale_size), "img"],
    [RandomCropVideo(84), "vid"],
])

# Convert each frame to a tensor after the spatial transforms.
transform_post = ComposeMix([
    [torchvision.transforms.ToTensor(), "img"],
])

video_renderer = VisionRenderer(
    nframe=72,
    nclip=1,
    nstep=2,
    transform_pre=transform_pre,
    transform_post=transform_post,
    data_dir="data_cache/smthsmth/20bn-something-something-v2"
)

config_smthsmth = Configuration(smthsmth_config)

# Reuses text_renderer from the multiple-choice example above.
smthsmth_dataset = ClassificationVisionDataset(
    "data_cache/smthsmth/something-something-v2-validation.json", config_smthsmth, [text_renderer, video_renderer])

# iter() returns an iterator over batches; collate_fn applies masked language modelling.
smthsmth_dev_dataloader = iter(DataLoader(
    smthsmth_dataset, batch_size=16, collate_fn=lambda x: collate_fn(
        x, mlm=True, mlm_probability=0.15, tokenizer=text_renderer.tokenizer)))
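
The mlm=True and mlm_probability=0.15 arguments suggest that collate_fn applies masked language modelling to the text. What it does internally is not documented here, so the following is a simplified sketch of the general idea, not the library's code. mask_tokens, mask_id and special_ids are hypothetical names, and common refinements (such as replacing some selected tokens with random tokens) are left out.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_tokens(input_ids, mask_id, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in input_ids:
        # Special tokens such as [CLS] and [PAD] are never masked.
        if token not in special_ids and rng.random() < mlm_probability:
            masked.append(mask_id)
            labels.append(token)  # the model must predict the original token
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)  # position is ignored by the loss
    return masked, labels
```

Each non-special token is selected independently, so with mlm_probability=0.15 roughly 15% of them are replaced by the mask id on average.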

Let's multitask

multitask_dataloader = MultiTaskDataset([alphanli_dev_dataloader, smthsmth_dev_dataloader])

# alternate through different dataloaders
for batch in multitask_dataloader:
    print(batch["input_ids"].shape)
    print(batch["images"].shape)
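
The comment in the loop above says the batches alternate between dataloaders. A round-robin over several iterators can be sketched in plain Python as follows. alternate is a hypothetical helper, not the MultiTaskDataset implementation, and continuing with the remaining iterators once one is exhausted is an assumption of this sketch.

```python
def alternate(*iterables):
    """Yield one item from each iterable in turn until all are exhausted."""
    active = [iter(iterable) for iterable in iterables]
    while active:
        still_active = []
        for iterator in active:
            try:
                item = next(iterator)
            except StopIteration:
                continue  # drop exhausted iterators, keep the others going
            still_active.append(iterator)
            yield item
        active = still_active
```

Called with two dataloader iterators, this would yield a batch from the first, then one from the second, and so on, which is why downstream code should expect batches from different tasks to carry different keys and shapes.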

Source Distribution

textbook-0.0.9.tar.gz (8.9 kB)
