Text classification datasets



The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.

Architecture

Architecture Image

Dependencies

conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .

# or

pip install textbook

Download raw datasets

./fetch.sh

This downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache. If you want to use something-something, please download the dataset from 20bn's website.

Usage

Template

The goal of a template is to transform raw text into an intermediate datum in which abstractive information is provided for later use.

Ideally, the template should do the following things:

  • construct text: a list of lists. The outer list is ideal for multi-choice situations, and each inner list holds one input pair/triplet (e.g., context, question, and choice);
  • construct label: an integer representing a zero-indexed label for the truth, or None;
  • construct token_type_id and attention: abstractive representations of the segment id and attention. In the following anli example, both token_type_id and attention have three digits, one for each of the three components of each row of text;
  • construct image: any form of image id/path you want to read later.

One example of anli is as follows:

# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
        "obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
        "hyp2": "Ron's boss called him an idiot.", "label": "1"}

# target intermediate datum
target = {
    'text':
    [['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
        'Ron is immediately fired for insubordination.'],
        ['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
        'Ron is immediately fired for insubordination.']],
    'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
    'attention': [1, 1, 1]}

LABEL2INT = {
    "anli": {
        "1": 0,
        "2": 1,
    },
}
assert template_anli(case, LABEL2INT['anli']) == target
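Based solely on the raw/target example above, such a template function might look like the following sketch (hypothetical, not the library's actual implementation of template_anli):

```python
# Hypothetical sketch of an anli template, reconstructed from the
# raw/target example above. Each hypothesis is paired with the two
# observations to form one candidate row of `text`.
def template_anli_sketch(datum, label2int):
    return {
        "text": [
            [datum["obs1"], datum["hyp1"], datum["obs2"]],
            [datum["obs1"], datum["hyp2"], datum["obs2"]],
        ],
        # Map the raw string label to a zero-indexed integer, or None if absent.
        "label": None if datum.get("label") is None else label2int[datum["label"]],
        "image": None,
        # One abstract digit per component of each row of `text`.
        "token_type_id": [0, 1, 0],
        "attention": [1, 1, 1],
    }
```

Applied to the `case` above with `LABEL2INT['anli']`, this reproduces the `target` intermediate datum.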

Renderer

A renderer transforms your intermediate datum into a fully blown datum. Each renderer takes care of a different part of the datum. For example, renderer_text renders the text into input_id and generates all token-based attention and token_type_id values, while renderer_video renders the image path into an image tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
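To make the expansion from per-segment digits to per-token sequences concrete, here is a hypothetical text-renderer sketch (the names `renderer_text_sketch` and `tokenize` are illustrative; the library's real renderer_text works with a BERT tokenizer and adds special tokens):

```python
# Hypothetical renderer sketch: expands the per-segment token_type_id and
# attention digits of an intermediate datum into per-token sequences.
# `tokenize` is any callable mapping a string to a list of token ids.
def renderer_text_sketch(datum, tokenize):
    rendered_choices = []
    for segments in datum["text"]:  # one entry per answer choice
        input_ids, token_types, attention = [], [], []
        for seg, tt, att in zip(segments, datum["token_type_id"], datum["attention"]):
            ids = tokenize(seg)
            input_ids.extend(ids)
            # Repeat the segment-level digit for every token in the segment.
            token_types.extend([tt] * len(ids))
            attention.extend([att] * len(ids))
        rendered_choices.append({
            "input_ids": input_ids,
            "token_type_ids": token_types,
            "attention": attention,
        })
    datum["rendered"] = rendered_choices
    return datum
```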

BatchTool

We provide a BatchTool with which MLM or padding can be applied easily; check the class docstring for more information.
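The padding part of such a collate step can be sketched in a few lines (a minimal illustration, not BatchTool's actual code; `pad_batch` is a hypothetical name):

```python
# Hypothetical padding helper illustrating what a collate step typically does:
# pad variable-length id lists to the batch maximum and build a mask.
def pad_batch(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask
```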

Load a dataset with pandas

from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

Create a multitask dataset with multiple datasets

from transformers import BertTokenizer
from textbook import *
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
        "[ANLI]", "[HELLASWAG]"
]})

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

d2 = MultiModalDataset(
        df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
        template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
        renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
    )
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)

d = MultiTaskDataset([i1, i2], shuffle=False)

#! batch size must be 1 for MultiTaskDataset, because batching is already done in each sub-dataset.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):

    pass

    # {
    #     "source": "anli" or "hellaswag",
    #     "labels": ...,
    #     "input_ids": ...,
    #     "attentions": ...,
    #     "token_type_ids": ...,
    #     "images": ...,
    # }

Implement a new template or renderer

It is advisable to follow the conventions below, but you can do whatever you like, since you can wrap calls in a lambda anywhere.

def template_xxx(raw_datum, *args, **kwargs):
    pass

def renderer_xxx(intermediate_datum, *args, **kwargs):
    pass

e.g. For Quora question pairs dataset:

def template_qqp(datum, label2int={"0": 0, "1": 1}):

    result = {
        "text": [
            [datum['question1'], datum['question2']]
        ],
        "image": None,
        "label": None if datum.get('is_duplicate') is None else label2int[str(datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }

    return result

Contact

Author: Chenghao Mou

Email: mouchenghao@gmail.com
