
Text classification datasets


The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.

Architecture

(Architecture diagram)

Dependencies

conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .

# or

pip install textbook

Download raw datasets

./fetch.sh

This downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache. If you want to use something-something, please download the dataset from 20bn's website.

Usage

Template

The goal of a template is to transform raw text into an intermediate datum where abstract information is provided for later use.

Ideally, the template should do the following things:

  • construct text: a list of lists. The outer list is for multiple-choice situations and each inner list holds the input pair/triplet (e.g. context, question, and choice);
  • construct label: an integer representing a zero-indexed label for the truth, or None;
  • construct token_type_id and attention: abstract representations of the segment id and attention. In the anli example below, both token_type_id and attention have three digits, one for each of the three components of each row of text;
  • construct image: any form of image id/path you want to read later.

One example of anli is as follows:

# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
        "obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
        "hyp2": "Ron's boss called him an idiot.", "label": "1"}

# target intermediate datum
target = {
    'text':
    [['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
        'Ron is immediately fired for insubordination.'],
        ['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
        'Ron is immediately fired for insubordination.']],
    'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
    'attention': [1, 1, 1]}

LABEL2INT = {
    "anli": {
        "1": 0,
        "2": 1,
    },
}
assert template_anli(case, LABEL2INT['anli']) == target

Renderer

A renderer transforms your intermediate datum into a fully fledged datum. Each renderer takes care of a different part of the datum. For example, renderer_text renders the text into input_id and generates all token-based attention and token_type_id values, while renderer_video renders the image path into an image tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
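
For illustration only, a minimal renderer could look like the sketch below. It assumes the intermediate datum layout shown in the anli example above; the output field names and the expansion logic are assumptions and do not necessarily match the library's renderer_text.

def renderer_text_sketch(datum, tokenizer):
    # Hypothetical sketch: tokenize each (context, question/hypothesis, choice) row of `text`
    # and expand the per-component token_type_id/attention flags to token level.
    rendered = []
    for row in datum["text"]:
        input_ids, token_type_ids, attention_mask = [], [], []
        for component, segment, attend in zip(row, datum["token_type_id"], datum["attention"]):
            ids = tokenizer.encode(component, add_special_tokens=False)
            input_ids += ids
            token_type_ids += [segment] * len(ids)
            attention_mask += [attend] * len(ids)
        rendered.append({"input_ids": input_ids,
                         "token_type_ids": token_type_ids,
                         "attention_mask": attention_mask})
    datum["rendered_text"] = rendered  # hypothetical output field
    return datum

In the multitask example below, renderer_text also receives a task token such as "[ANLI]"; presumably it is used as a task-specific cls_token for each row.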

BatchTool

We provide a BatchTool that makes MLM masking and padding easy; check the class documentation for more information.

Load a dataset with pandas

from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

Create a multitask dataset with multiple datasets

from transformers import BertTokenizer
from textbook import *
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
        "[ANLI]", "[HELLASWAG]"
]})

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

d2 = MultiModalDataset(
        df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
        template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
        renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
    )
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)

d = MultiTaskDataset([i1, i2], shuffle=False)

#! Batch size must be 1 for MultiTaskDataset, because batching already happened inside each sub-dataset.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):

    pass

    # {
    #     "source": "anli" or "hellaswag",
    #     "labels": ...,
    #     "input_ids": ...,
    #     "attentions": ...,
    #     "token_type_ids": ...,
    #     "images": ...,
    # }
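
Since the framework is designed with BERT in mind, each uncollated batch can be fed to a multiple-choice head. The following is only a sketch under assumptions, not the project's training loop: it supposes the padded tensors are shaped (batch, num_choices, seq_len) as transformers' BertForMultipleChoice expects, and it reuses d and tokenizer from the example above.

from torch.utils.data import DataLoader
from transformers import BertForMultipleChoice

model = BertForMultipleChoice.from_pretrained('bert-base-cased')
model.resize_token_embeddings(len(tokenizer))  # account for the added [ANLI]/[HELLASWAG] tokens

for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attentions"],
        token_type_ids=batch["token_type_ids"],
        labels=batch["labels"],
    )
    loss = outputs[0]  # batch["source"] tells you which task this batch came from
    loss.backward()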

Implement a new template or renderer

It is advised to follow the conventions below, but you can do whatever you like, since templates and renderers can always be wrapped in a lambda.

def template_xxx(raw_datum, *args, **kwargs):
    pass

def renderer_xxx(intermediate_datum, *args, **kwargs):
    pass
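
For instance, a template for a hypothetical two-choice dataset (the dataset and its field names are made up for illustration) would produce the same intermediate layout as the anli example above:

def template_mytask(raw_datum, label2int):
    # Hypothetical dataset with fields: context, question, choice1, choice2, label.
    return {
        "text": [
            [raw_datum["context"], raw_datum["question"], raw_datum["choice1"]],
            [raw_datum["context"], raw_datum["question"], raw_datum["choice2"]],
        ],
        "label": label2int.get(raw_datum.get("label")),  # zero-indexed int, or None for unlabeled data
        "image": None,
        "token_type_id": [0, 1, 1],  # one segment id per component in each row
        "attention": [1, 1, 1],      # one attention flag per component in each row
    }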

Contact

Author: Chenghao Mou Email: mouchenghao@gmail.com

