Preprocessings to prepare datasets for a task
tasksource: 400+ dataset preprocessing annotations for extreme multitask learning
Hugging Face Datasets is a great library, but it lacks standardization: each dataset needs its own preprocessing before datasets can be used interchangeably.
tasksource is a collection of task preprocessings that facilitates multi-task learning and reproducibility.
import tasksource
from datasets import load_dataset
tasksource.bigbench(load_dataset('bigbench', 'movie_recommendation'))
Each dataset is mapped to a MultipleChoice, Classification, or TokenClassification task with standardized fields.
We do not support generation tasks as they are addressed by promptsource.
All implemented preprocessings can be found in tasks.py. Each preprocessing is a function that takes a dataset as input and returns a standardized dataset. The preprocessing code is designed to be human-readable: adding a new preprocessing takes only a few lines, e.g.:
cos_e = tasksource.MultipleChoice(
    'question',
    choices_list='choices',
    labels=lambda x: x['choices_list'].index(x['answer']),
    config_name='v1.0')
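To make the idea of a declarative preprocessing concrete, here is a minimal, self-contained sketch that does not use tasksource at all. ToyMultipleChoice, resolve, the output field names, and the sample rows are all hypothetical and chosen for illustration; the real MultipleChoice class in tasks.py may behave differently (for instance, in this sketch the labels function receives the raw row, so it indexes 'choices' rather than 'choices_list').

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union

Row = Dict[str, Any]
# A field spec is either a source column name or a function of the raw row.
FieldSpec = Union[str, Callable[[Row], Any]]


def resolve(spec: FieldSpec, row: Row) -> Any:
    """Look up a column by name, or call the function on the raw row."""
    return spec(row) if callable(spec) else row[spec]


@dataclass
class ToyMultipleChoice:
    """Hypothetical stand-in for a declarative preprocessing (not tasksource's class)."""
    inputs: FieldSpec
    choices_list: FieldSpec
    labels: FieldSpec

    def __call__(self, rows: List[Row]) -> List[Row]:
        # Map every raw row onto the same three output fields.
        return [
            {
                "inputs": resolve(self.inputs, row),
                "choices_list": resolve(self.choices_list, row),
                "labels": resolve(self.labels, row),
            }
            for row in rows
        ]


# Made-up rows imitating a question-answering dataset.
raw_rows = [
    {"question": "Where do fish live?", "choices": ["desert", "water", "sky"], "answer": "water"},
    {"question": "What do bees make?", "choices": ["honey", "milk"], "answer": "honey"},
]

toy_preprocessing = ToyMultipleChoice(
    inputs="question",
    choices_list="choices",
    # The label is the index of the gold answer among the raw choices.
    labels=lambda row: row["choices"].index(row["answer"]),
)

standardized = toy_preprocessing(raw_rows)
print(standardized[0])
```

The point of the design is that the per-dataset part shrinks to a handful of column names and one small function, while the mapping logic is written once.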
Installation and usage:
pip install tasksource
List tasks:
from tasksource import list_tasks, load_task
df = list_tasks()
Iterate over tasks:
for _, x in df[df.task_type == "MultipleChoice"].iterrows():
    dataset = load_task(x.dataset_name, x.config_name, x.task_name)
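When looping over many tasks, it can help to keep every loaded dataset and to record failures instead of stopping at the first one. The sketch below shows that pattern on plain records with a stub loader so that it runs offline. load_all, fake_load_task, and the sample listing are hypothetical; only the column names (task_type, dataset_name, config_name, task_name) come from the loop above.

```python
from typing import Any, Callable, Dict, List, Tuple

TaskKey = Tuple[str, str, str]  # (dataset_name, config_name, task_name)


def load_all(
    listing: List[Dict[str, str]],
    task_type: str,
    loader: Callable[[str, str, str], Any],
) -> Tuple[Dict[TaskKey, Any], List[TaskKey]]:
    """Load every task of one type; return (loaded datasets, keys that failed)."""
    loaded: Dict[TaskKey, Any] = {}
    failed: List[TaskKey] = []
    for record in listing:
        if record["task_type"] != task_type:
            continue
        key = (record["dataset_name"], record["config_name"], record["task_name"])
        try:
            loaded[key] = loader(*key)
        except Exception:
            # Remember the failure and keep going with the remaining tasks.
            failed.append(key)
    return loaded, failed


# Made-up listing standing in for the table of tasks.
sample_listing = [
    {"task_type": "MultipleChoice", "dataset_name": "ds_a", "config_name": "default", "task_name": "task_a"},
    {"task_type": "Classification", "dataset_name": "ds_b", "config_name": "default", "task_name": "task_b"},
    {"task_type": "MultipleChoice", "dataset_name": "ds_broken", "config_name": "default", "task_name": "task_c"},
]


def fake_load_task(dataset_name: str, config_name: str, task_name: str) -> str:
    """Stub loader (hypothetical): fails for one dataset to exercise the error path."""
    if dataset_name == "ds_broken":
        raise RuntimeError("simulated download failure")
    return f"dataset<{dataset_name}/{config_name}/{task_name}>"


loaded, failed = load_all(sample_listing, "MultipleChoice", fake_load_task)
print(sorted(loaded), failed)
```

With the real library, the records could presumably come from df.to_dict("records") and the loader could be load_task, but that combination is an untested assumption here.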
See the 420 supported tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks). Feel free to request or propose a new task.
Contact
I can help you integrate tasksource in your experiments. damien.sileo@inria.fr
@misc{sileod23-tasksource,
  author = {Sileo, Damien},
  doi = {10.5281/zenodo.7473446},
  month = {01},
  title = {{tasksource: preprocessings for reproducibility and multitask-learning}},
  url = {https://github.com/sileod/tasksource},
  version = {1.5.0},
  year = {2023}
}
Hashes for tasksource-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | cd50b0eda3c2e006b18cf08b76af2de383f4fda82bf5c47fead19020d6c1c5e9
MD5 | f72ec373d5cc82a91a3c957f070c5e0c
BLAKE2b-256 | 97455c44acb44586b8cab5a2a6e8f81f301ca7df0903e0f716ee8694d8251bc8