Text classification datasets
Textbook: Universal NLP Datasets
Dependencies
av==6.2.0
jsonnet==0.14.0
opencv_python==4.1.1.26
torch==1.3.1
torchvision==0.4.2
numpy==1.17.4
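Because the pins above are exact, it can help to compare them against the current environment before running anything. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are hypothetical and not part of `textbook`.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINS = {
    "av": "6.2.0",
    "jsonnet": "0.14.0",
    "opencv_python": "4.1.1.26",
    "torch": "1.3.1",
    "torchvision": "0.4.2",
    "numpy": "1.17.4",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) tuple. `lookup` is injectable for testing."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, found) in find_mismatches(PINS).items():
        print(f"{name}: pinned {pinned}, installed {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.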
Download raw datasets
bash fetch.sh
It downloads alphanli, hellaswag, physicaliqa, and socialiqa from AWS.
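After the download finishes, a quick look at one file can confirm that it arrived intact. The sketch below assumes the files are in JSON Lines format (one JSON object per line), as the `.jsonl` extension in the usage section suggests, and that they land under `data_cache/`. The helpers `read_jsonl` and `preview` are illustrative, not part of `textbook`, and no field names are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def preview(path, limit=3):
    """Return the first `limit` records and the set of keys they use."""
    head = read_jsonl(path)[:limit]
    keys = set()
    for record in head:
        keys.update(record)
    return head, keys


if __name__ == "__main__":
    # Path taken from the usage section; adjust if fetch.sh writes elsewhere.
    head, keys = preview("data_cache/alphanli/eval.jsonl")
    print(f"fields: {sorted(keys)}")
```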
Usage
Load multichoice dataset
from torch.utils.data import DataLoader
from transformers import GPT2Tokenizer

from textbook.config import *
from textbook.transforms_video import *

# distilgpt2 defines no CLS, PAD, or MASK tokens, so they are supplied here.
text_renderer = TextRenderer(
    tokenizer=GPT2Tokenizer.from_pretrained('distilgpt2'),
    special_tokens={'cls_token': '[CLS]', 'pad_token': '[PAD]', 'mask_token': '[MASK]'},
)

config_alphanli = Configuration(alphanli_config)
alphanli_dataset = ClassificationDataset("data_cache/alphanli/eval.jsonl", config_alphanli, [text_renderer])

# iter() returns an iterator over batches, so the name is not annotated as a DataLoader.
alphanli_dev_dataloader = iter(
    DataLoader(
        alphanli_dataset,
        batch_sampler=DynamicBatchSampler(alphanli_dataset),
        collate_fn=collate_fn,
    )
)
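How `DynamicBatchSampler` forms its batches is not described here. As an illustration only of what a dynamic batch sampler commonly does, the sketch below groups examples of similar length so that the padded size of each batch stays within a token budget. The function `token_budget_batches` and its `max_tokens` parameter are hypothetical and should not be read as the package's implementation.

```python
def token_budget_batches(lengths, max_tokens):
    """Group example indices into batches whose padded size
    (batch_size * longest_example) stays within max_tokens."""
    # Sort indices by length so each batch holds similarly sized examples.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for index in order:
        # Lengths are ascending, so the newest example is the longest so far.
        padded_size = (len(current) + 1) * lengths[index]
        if current and padded_size > max_tokens:
            batches.append(current)
            current = []
        current.append(index)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(token_budget_batches([5, 1, 3, 2, 8], max_tokens=8))
```

A list of index lists like this one can be passed to a PyTorch `DataLoader` through its `batch_sampler` argument, which accepts any iterable that yields batches of indices.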
Load multimodal dataset
import torchvision

# Upscale to 110% of the 84-pixel crop size before cropping.
upscale_size = int(84 * 1.1)  # 92

transform_pre = ComposeMix([
    [Scale(upscale_size), "img"],
    [RandomCropVideo(84), "vid"],
])
transform_post = ComposeMix([
    [torchvision.transforms.ToTensor(), "img"],
])

video_renderer = VisionRenderer(
    nframe=72,
    nclip=1,
    nstep=2,
    transform_pre=transform_pre,
    transform_post=transform_post,
    data_dir="data_cache/smthsmth/20bn-something-something-v2"
)

config_smthsmth = Configuration(smthsmth_config)
smthsmth_dataset = ClassificationVisionDataset(
    "data_cache/smthsmth/something-something-v2-validation.json",
    config_smthsmth,
    [text_renderer, video_renderer],
)

# The lambda forwards the masked-language-modeling settings to collate_fn.
smthsmth_dev_dataloader = iter(DataLoader(
    smthsmth_dataset,
    batch_size=16,
    collate_fn=lambda x: collate_fn(
        x, mlm=True, mlm_probability=0.15, tokenizer=text_renderer.tokenizer),
))
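The `mlm=True, mlm_probability=0.15` arguments suggest that `collate_fn` applies masked-language-model masking to roughly 15% of tokens. Its exact recipe is not shown here, so the following is a simplified sketch of the general idea, not the package's code. The function `mask_tokens` is hypothetical, and unlike the usual BERT-style recipe it neither substitutes random tokens nor skips special tokens.

```python
import random


def mask_tokens(token_ids, mask_id, mlm_probability=0.15,
                ignore_index=-100, rng=None):
    """Replace each token with mask_id with probability mlm_probability.

    Returns (inputs, labels). Labels hold the original token at masked
    positions and ignore_index elsewhere, so a loss function can skip
    the positions that were left unmasked.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(mask_id)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(ignore_index)
    return inputs, labels


if __name__ == "__main__":
    tokens = list(range(100, 120))
    inputs, labels = mask_tokens(tokens, mask_id=0, rng=random.Random(0))
    print(inputs)
    print(labels)
```

The default `ignore_index=-100` matches the default of PyTorch's `CrossEntropyLoss`, which excludes those positions from the loss.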
Let's multitask
multitask_dataloader = MultiTaskDataset([alphanli_dev_dataloader, smthsmth_dev_dataloader])

# alternate through different dataloaders
for batch in multitask_dataloader:
    print(batch["input_ids"].shape)
    print(batch["images"].shape)
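The comment above says the loop alternates through the dataloaders, but the scheduling policy of `MultiTaskDataset` is not specified. One plausible reading is plain round-robin, sketched below over ordinary Python iterators. The `round_robin` helper is hypothetical, and what happens when one task runs out of batches is an assumption: here the exhausted iterator is dropped and the others continue.

```python
def round_robin(*iterators):
    """Yield one item from each iterator in turn, dropping an
    iterator once it is exhausted, until all are empty."""
    active = list(iterators)
    while active:
        still_active = []
        for iterator in active:
            try:
                yield next(iterator)
            except StopIteration:
                # This iterator is finished; leave it out of the next round.
                continue
            still_active.append(iterator)
        active = still_active


if __name__ == "__main__":
    text_batches = iter(["text-0", "text-1", "text-2"])
    video_batches = iter(["video-0"])
    for batch in round_robin(text_batches, video_batches):
        print(batch)
```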
Source Distribution
textbook-0.0.9.tar.gz (8.9 kB)
File metadata
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.37.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1e9841302186abfa78bbda52850e7a5d145ee8d89b1c979625954bbadfb77107
MD5 | a1771e72669355b0f6abd8cec3d663bd
BLAKE2b-256 | 6a69ff3602799bb2dc0b6b06b8b37fedc507dc70b9fcae113bbdc6980096575a