Text classification datasets

Project description

Textbook: Universal NLP Datasets

Dependencies

  • av==6.2.0
  • jsonnet==0.14.0
  • opencv_python==4.1.1.26
  • torch==1.3.1
  • torchvision==0.4.2
  • numpy==1.17.4
  • transformers==2.1.1
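
The pins are exact, so it can help to confirm the installed environment matches before running anything; a minimal sketch using pkg_resources (from setuptools, not part of this package):

import pkg_resources

# Pins copied from the list above.
pins = {
    "av": "6.2.0",
    "jsonnet": "0.14.0",
    "opencv-python": "4.1.1.26",
    "torch": "1.3.1",
    "torchvision": "0.4.2",
    "numpy": "1.17.4",
    "transformers": "2.1.1",
}
for name, version in pins.items():
    installed = pkg_resources.get_distribution(name).version
    assert installed == version, f"{name}: expected {version}, got {installed}"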

Download raw datasets

bash fetch.sh

This downloads the alphanli, hellaswag, physicaliqa, and socialiqa datasets from AWS. If you want to use something-something, please download the dataset from 20bn's website.
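
After fetch.sh finishes, the usage example below expects the evaluation files under data_cache/; a quick sanity check (paths taken from the example below, not an official layout guarantee):

from pathlib import Path

# Files the Usage section reads; adjust if you relocated data_cache/.
expected = [
    "data_cache/alphanli/eval.jsonl",
    "data_cache/hellaswag/eval.jsonl",
    "data_cache/socialiqa/eval.jsonl",
    "data_cache/physicaliqa/eval.jsonl",
]
missing = [p for p in expected if not Path(p).exists()]
if missing:
    raise FileNotFoundError(f"missing after fetch.sh: {missing}")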

Usage

from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import BertTokenizer

from textbook import *

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_renderer = TextRenderer(tokenizer)

# Each task pairs a BatchTool (tokenization and collation) with a TextDataset
# that parses the task's JSONL file through the shared text renderer.
anli_tool = BatchTool(tokenizer, max_seq_len=128, source="anli")
anli_dataset = TextDataset(path='data_cache/alphanli/eval.jsonl',
                           config=ANLIConfiguration(), renderers=[text_renderer])
anli_iter = DataLoader(anli_dataset, batch_size=2, collate_fn=anli_tool.collate_fn)

hellaswag_tool = BatchTool(tokenizer, max_seq_len=128, source="hellaswag")
hellaswag_dataset = TextDataset(path='data_cache/hellaswag/eval.jsonl',
                                config=HellaswagConfiguration(), renderers=[text_renderer])
hellaswag_iter = DataLoader(hellaswag_dataset, batch_size=2, collate_fn=hellaswag_tool.collate_fn)

siqa_tool = BatchTool(tokenizer, max_seq_len=128, source="siqa")
siqa_dataset = TextDataset(path='data_cache/socialiqa/eval.jsonl',
                           config=SIQAConfiguration(), renderers=[text_renderer])
siqa_iter = DataLoader(siqa_dataset, batch_size=2, collate_fn=siqa_tool.collate_fn)

piqa_tool = BatchTool(tokenizer, max_seq_len=128, source="piqa")
piqa_dataset = TextDataset(path='data_cache/physicaliqa/eval.jsonl',
                           config=PIQAConfiguration(), renderers=[text_renderer])
piqa_iter = DataLoader(piqa_dataset, batch_size=2, collate_fn=piqa_tool.collate_fn)

# Optional: something-something video data (requires the manual 20bn download).
# video_renderer = VideoRenderer(data_dir="data_cache/smthsmth/20bn-something-something-v2")
# smth_tool = BatchTool(tokenizer, max_seq_len=128, source="smth", mlm=True)
# smth_config = SMTHSMTHConfiguration()
# smth_dataset = VideoDataset("data_cache/smthsmth/something-something-v2-validation.json",
#                             smth_config, [text_renderer, video_renderer])
# smth_iter = DataLoader(smth_dataset, batch_size=2, collate_fn=smth_tool.uncollate_fn)

# Mix the per-task loaders into a single shuffled multi-task stream.
dataset = MultiTaskDataset([anli_iter, hellaswag_iter, siqa_iter, piqa_iter], shuffle=True)

for batch in tqdm(DataLoader(dataset, batch_size=1, collate_fn=BatchTool.uncollate_fn)):
    pass
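
The batch layout is produced by BatchTool and is not documented on this page; a structure-agnostic way to peek at one batch (reusing anli_iter from the example above):

# Pull a single batch and report its shape(s) without assuming a schema.
batch = next(iter(anli_iter))
if isinstance(batch, dict):
    for key, value in batch.items():
        print(key, getattr(value, "shape", type(value)))
elif isinstance(batch, (list, tuple)):
    for index, value in enumerate(batch):
        print(index, getattr(value, "shape", type(value)))
else:
    print(type(batch))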

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textbook-0.1.0.tar.gz (8.7 kB)

File details

Details for the file textbook-0.1.0.tar.gz.

File metadata

  • Download URL: textbook-0.1.0.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.3

File hashes

Hashes for textbook-0.1.0.tar.gz

  • SHA256: 2cd6b8c733eacd9efd9bba60b86fb695946af02377035fdabad726b5e613ad93
  • MD5: b9ed16de780986fb6813c58aec2f0c0e
  • BLAKE2b-256: 6f7d72c10b4564aeb195824f0a16408dda6b6104d924f50ebda7cdc77b827de8
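
To check a downloaded archive against the published digest, a minimal standard-library sketch (the filename and hash are taken from the metadata above):

import hashlib

# Compare the local archive against the SHA256 digest listed above.
expected = "2cd6b8c733eacd9efd9bba60b86fb695946af02377035fdabad726b5e613ad93"
with open("textbook-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch: the download may be corrupted"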

