Textbook
Universal NLU/NLI Dataset Processing Framework
It is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code. It uses pandas and modin[ray] multiprocessing for loading and processing the datasets.
Architecture
Dependency
pip install -r requirements.txt
Download raw datasets
./fetch.sh
It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache.
If you want to use something-something, please download the dataset from 20bn's website.
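A quick way to confirm the fetch worked is to check for the file the first usage example below reads (the data_cache/&lt;dataset&gt;/train.jsonl layout is inferred from the example code, not documented separately):
import os
# the alphanli training split used in the usage examples below
assert os.path.exists("data_cache/alphanli/train.jsonl")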
Usage
The following examples show how to load a dataset or create a multitask dataset from multiple datasets.
Load a dataset in parallel with modin[ray]
from transformers import BertTokenizer
from textbook import *
import modin.pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x[0], tokenizer)],
parallel=True
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
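To sanity-check the loader end to end, you can iterate a few batches (a minimal sketch; the exact batch layout comes from BatchTool.collate_fn):
# pull a few batches to confirm tokenization, sampling, and collation all run
for step, batch in enumerate(i1):
    print(step, type(batch))
    if step == 2:
        break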
Load a dataset with plain pandas
from transformers import BertTokenizer
from textbook import *
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer)],
parallel=False
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
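The plain-pandas path is convenient for quick debugging; for example, you can check how many examples were loaded before building batches (a small sketch, assuming MultiModalDataset exposes a standard dataset-style __len__):
# number of rendered training examples in the dataset
print(len(d1))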
Create a multitask dataset from multiple datasets
from transformers import BertTokenizer
from textbook import *
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# add an additional special token per task, used as that task's `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
"[ANLI]", "[HELLASWAG]"
]})
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
parallel=False
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
d2 = MultiModalDataset(
df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
parallel=False
)
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)
d = MultiTaskDataset([i1, i2], shuffle=False)
# NOTE: batch_size must be 1 for MultiTaskDataset, because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
pass
# {
# "source": "anli" or "hellaswag",
# "labels": ...,
# "input_ids": ...,
# "attentions": ...,
# "token_type_ids": ...,
# "images": ...,
# }
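Since every uncollated batch carries its source, a multitask training loop can route batches to per-task heads; the dispatch below is only an illustrative sketch (the head modules are hypothetical, not part of textbook):
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    # route the batch by its originating dataset
    if batch["source"] == "anli":
        ...  # forward through a (hypothetical) ANLI-specific head
    elif batch["source"] == "hellaswag":
        ...  # forward through a (hypothetical) HellaSwag-specific head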