Text classification datasets

Textbook

Universal NLU/NLI Dataset Processing Framework

It is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.
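As a rough illustration of what "a few lines of code" could look like for a new dataset, here is a sketch of a custom template function. It only mirrors the call shape `template_anli(x, LABEL2INT['anli'])` from the usage examples. The function name, the row fields, the label map, and the returned keys are all hypothetical, and the structure the library actually expects from a template may differ.

```python
# Hypothetical label map, in the spirit of LABEL2INT from the usage examples
MY_LABEL2INT = {"A": 0, "B": 1}


def template_my_dataset(row, label2int):
    """Hypothetical template: pair the question with each choice and encode the answer.

    The returned keys ("text", "label") are placeholders, not the library's format.
    """
    return {
        "text": [[row["question"], choice] for choice in row["choices"]],
        "label": label2int[row["answer"]],
    }


# Made-up row for illustration only
row = {"question": "Where do fish live?", "choices": ["in water", "in trees"], "answer": "A"}
example = template_my_dataset(row, MY_LABEL2INT)
```

Such a function would then be passed as `template=lambda x: template_my_dataset(x, MY_LABEL2INT)`, assuming the library accepts any callable with this shape.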

Architecture

Architecture Image

Dependencies

pip install -r requirements.txt

Download raw datasets

./fetch.sh

It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache. If you want to use something-something, please download the dataset from 20bn's website.
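Before moving on, it can be useful to check that the download produced what you expect. The sketch below assumes each dataset ends up in its own subdirectory of data_cache. The usage examples show this layout for alphanli and hellaswag, but it is an assumption for the other datasets. The helper `missing_datasets` is not part of the package.

```python
from pathlib import Path

# Dataset names as listed above; one directory per dataset is an assumption
DATASETS = [
    "alphanli", "hellaswag", "physicaliqa", "socialiqa",
    "codah", "cosmosqa", "commonsenseqa",
]


def missing_datasets(root="data_cache", names=DATASETS):
    """Return the dataset names that have no directory under `root`."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]
```

Calling `missing_datasets()` from the project root returns an empty list when every expected directory is present.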

Usage

The following examples show how to load a single dataset or build a multitask dataset from several datasets.

Load a dataset in parallel with modin[ray]

from torch.utils.data import DataLoader  # explicit import; assumed to be PyTorch's DataLoader
from transformers import BertTokenizer
from textbook import *
import modin.pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
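The `lines=True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line. The sketch below shows the same parsing with only the standard library, which can help when inspecting a raw file. The records are made up for illustration, and real alphanli rows have their own fields.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Made-up records for illustration only
sample = [
    '{"id": 1, "text": "first example", "label": "1"}',
    "",
    '{"id": 2, "text": "second example", "label": "2"}',
]
rows = read_jsonl(sample)
```

An open file object can be passed in place of the list, since iterating over a file yields one line at a time.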

Load a dataset with plain pandas

from torch.utils.data import DataLoader  # explicit import; assumed to be PyTorch's DataLoader
from transformers import BertTokenizer
from textbook import *
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

Create a multitask dataset with multiple datasets

from torch.utils.data import DataLoader  # explicit import; assumed to be PyTorch's DataLoader
from transformers import BertTokenizer
from textbook import *
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Register one extra special token per task; each is used as that task's `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
    "[ANLI]", "[HELLASWAG]"
]})

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

d2 = MultiModalDataset(
    df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
    template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
# The sampler must be built from d2, the dataset this loader iterates over
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)

d = MultiTaskDataset([i1, i2], shuffle=False)

# Batch size must be 1 for MultiTaskDataset, because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    # Each batch is a dict of the form:
    # {
    #     "source": "anli" or "hellaswag",
    #     "labels": ...,
    #     "input_ids": ...,
    #     "attentions": ...,
    #     "token_type_ids": ...,
    #     "images": ...,
    # }
    pass
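To see why the outer batch size has to be 1, here is a toy model of the setup in plain Python. It is not the library's implementation. The names `toy_loader`, `toy_multitask`, and `toy_uncollate` are made up, and the sketch assumes that each item of the multitask dataset is already a full batch from one of the per-task loaders.

```python
import random


def toy_loader(source, examples, batch_size):
    """Yield pre-batched dicts, mimicking one per-task DataLoader."""
    for start in range(0, len(examples), batch_size):
        yield {"source": source, "input_ids": examples[start:start + batch_size]}


def toy_multitask(loaders, shuffle=False, seed=0):
    """Collect every batch from every loader; each item is already a full batch."""
    batches = [batch for loader in loaders for batch in loader]
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches


def toy_uncollate(items):
    """An outer batch of size 1 arrives as a one-element list; unwrap it."""
    assert len(items) == 1, "a larger outer batch would mix pre-built batches"
    return items[0]


anli = toy_loader("anli", [[1, 2], [3], [4, 5, 6]], batch_size=2)
hellaswag = toy_loader("hellaswag", [[7], [8, 9]], batch_size=2)
dataset = toy_multitask([anli, hellaswag])

# Outer loop with batch size 1: wrap each pre-built batch in a list, then unwrap it
seen = [toy_uncollate([item]) for item in dataset]
sources = [batch["source"] for batch in seen]
```

With an outer batch size above 1, the collate step would receive several pre-built batches at once, possibly from different tasks and with different padded lengths, so unwrapping a single item is the simplest way to pass each batch through unchanged. The `source` key then tells the training loop which task a batch belongs to.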
