
Text classification datasets

Project description

Textbook: Universal NLP Datasets

Dependencies

  • av==6.2.0
  • jsonnet==0.14.0
  • opencv_python==4.1.1.26
  • torch==1.3.1
  • torchvision==0.4.2
  • numpy==1.17.4
  • transformers==2.1.1
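The pins above can be captured in a requirements file and installed in one step with pip; this requirements.txt simply mirrors the list:

```
av==6.2.0
jsonnet==0.14.0
opencv_python==4.1.1.26
torch==1.3.1
torchvision==0.4.2
numpy==1.17.4
transformers==2.1.1
```

Install with `pip install -r requirements.txt`.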

Download raw datasets

bash fetch.sh

This downloads alphanli, hellaswag, physicaliqa, and socialiqa from AWS. If you want to use something-something, please download the dataset from 20bn's website.

Usage

from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import BertTokenizer
from textbook import *

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_renderer = TextRenderer(tokenizer)

anli_tool = BatchTool(tokenizer, max_seq_len=128, source="anli")
anli_dataset = TextDataset(path='data_cache/alphanli/eval.jsonl',
                            config=ANLIConfiguration(), renderers=[text_renderer])
anli_iter = DataLoader(anli_dataset, batch_size=2, collate_fn=anli_tool.collate_fn)

hellaswag_tool = BatchTool(tokenizer, max_seq_len=128, source="hellaswag")
hellaswag_dataset = TextDataset(path='data_cache/hellaswag/eval.jsonl',
                                config=HellaswagConfiguration(), renderers=[text_renderer])
hellaswag_iter = DataLoader(hellaswag_dataset, batch_size=2, collate_fn=hellaswag_tool.collate_fn)

siqa_tool = BatchTool(tokenizer, max_seq_len=128, source="siqa")
siqa_dataset = TextDataset(path='data_cache/socialiqa/eval.jsonl',
                            config=SIQAConfiguration(), renderers=[text_renderer])
siqa_iter = DataLoader(siqa_dataset, batch_size=2, collate_fn=siqa_tool.collate_fn)

piqa_tool = BatchTool(tokenizer, max_seq_len=128, source="piqa")
piqa_dataset = TextDataset(path='data_cache/physicaliqa/eval.jsonl',
                            config=PIQAConfiguration(), renderers=[text_renderer])
piqa_iter = DataLoader(piqa_dataset, batch_size=2, collate_fn=piqa_tool.collate_fn)

# video_renderer = VideoRenderer(data_dir="data_cache/smthsmth/20bn-something-something-v2")
# smth_tool = BatchTool(tokenizer, max_seq_len=128, source="smth", mlm=True)
# smth_config = SMTHSMTHConfiguration()
# smth_dataset = VideoDataset("data_cache/smthsmth/something-something-v2-validation.json",
#                             smth_config, [text_renderer, video_renderer])
# smth_iter = DataLoader(smth_dataset, batch_size=2, collate_fn=smth_tool.uncollate_fn)

dataset = MultiTaskDataset([anli_iter, hellaswag_iter, siqa_iter, piqa_iter], shuffle=True)

for batch in tqdm(DataLoader(dataset, batch_size=1, collate_fn=BatchTool.uncollate_fn)):
    pass
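`MultiTaskDataset` above mixes batches from several per-task loaders into one stream, with `shuffle=True` randomizing which task each batch is drawn from. A pure-Python sketch of that interleaving idea (the function and its exact behavior are illustrative assumptions, not the textbook implementation):

```python
import random

def interleave_batches(loaders, shuffle=True, seed=None):
    """Yield batches from several iterables, mixing tasks.

    Each element of `loaders` is an iterable of batches (e.g. a DataLoader).
    Batches within each task keep their order; with shuffle=True the task
    sampled at each step is random, otherwise tasks alternate round-robin.
    """
    rng = random.Random(seed)
    iters = [iter(loader) for loader in loaders]
    while iters:
        idx = rng.randrange(len(iters)) if shuffle else 0
        try:
            batch = next(iters[idx])
        except StopIteration:
            iters.pop(idx)  # this task is exhausted
            continue
        yield batch
        if not shuffle:
            iters.append(iters.pop(0))  # rotate for round-robin order
```

Either way, every batch from every task is yielded exactly once, and the within-task order is preserved.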

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textbook-0.1.1.tar.gz (8.7 kB)

Uploaded Source

File details

Details for the file textbook-0.1.1.tar.gz.

File metadata

  • Download URL: textbook-0.1.1.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.7.3

File hashes

Hashes for textbook-0.1.1.tar.gz
Algorithm Hash digest
SHA256 243a2a1d4cd5da905e5da802858cb7417e150d294474823f561f8559abf6e7cc
MD5 03a8903527fee469a37647b44243e735
BLAKE2b-256 4ad8db4b10a563d35cfdcd93cd203dd5572a132ff4e40e450b9469cad492fefe

See more details on using hashes here.
