Preprocessings to prepare datasets for a task
tasksource: 400+ dataset preprocessing annotations for extreme multitask learning
Hugging Face Datasets is a great library, but it lacks standardization: datasets require different preprocessing steps before they can be used interchangeably.
tasksource is a collection of task preprocessings that facilitates multi-task learning and reproducibility.
import tasksource
from datasets import load_dataset
tasksource.bigbench(load_dataset('bigbench', 'movie_recommendation'))
Each dataset is mapped to a MultipleChoice, Classification, or TokenClassification task with standardized fields.
We do not support generation tasks as they are addressed by promptsource.
All implemented preprocessings can be found in tasks.py. Each preprocessing is a function that takes a dataset as input and returns a standardized dataset. The preprocessing code is designed to be human-readable, and adding a new preprocessing only takes a few lines, e.g.:
cos_e = tasksource.MultipleChoice(
    'question',
    choices_list='choices',
    labels=lambda x: x['choices_list'].index(x['answer']),
    config_name='v1.0')
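To make the idea of a declarative preprocessing concrete, here is a minimal sketch in plain Python. It is not tasksource's implementation: the helper names (`resolve`, `multiple_choice`), the output field names (`inputs`, `choices_list`, `labels`), and the sample row are all assumptions chosen for illustration. In this sketch, each field is either a column name or a callable that receives the raw row.

```python
from typing import Any, Callable, Dict, List, Union

Row = Dict[str, Any]
# A field spec is either a column name or a function of the raw row.
Field = Union[str, Callable[[Row], Any]]


def resolve(field: Field, row: Row) -> Any:
    """Look up a column by name, or call the function on the row."""
    return field(row) if callable(field) else row[field]


def multiple_choice(inputs: Field, choices_list: Field, labels: Field):
    """Build a preprocessing function (hypothetical helper, not the tasksource API)."""
    def preprocess(rows: List[Row]) -> List[Row]:
        # Every output row carries the same three standardized fields.
        return [
            {
                "inputs": resolve(inputs, row),
                "choices_list": list(resolve(choices_list, row)),
                "labels": resolve(labels, row),
            }
            for row in rows
        ]
    return preprocess


# Made-up sample row in the shape of a question-answering dataset.
raw = [{"question": "Where do fish live?",
        "choices": ["desert", "water", "sky"],
        "answer": "water"}]

cos_e_like = multiple_choice(
    "question",
    choices_list="choices",
    labels=lambda x: x["choices"].index(x["answer"]),
)
standardized = cos_e_like(raw)
```

Because the declaration only names columns and small functions, the same helper can be reused for any dataset whose rows contain a question, a list of options, and a gold answer.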
Installation and usage:
pip install tasksource
List tasks:
from tasksource import list_tasks, load_task
df = list_tasks()
Iterate over tasks:
for _, x in df[df.task_type == "MultipleChoice"].iterrows():
    dataset = load_task(x.dataset_name, x.config_name, x.task_name)
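When iterating over hundreds of tasks, a single failing download stops the whole loop. The sketch below shows one possible way to keep going and collect failures. `load_all` is a hypothetical helper, not part of tasksource; it only assumes that each row exposes the `dataset_name`, `config_name`, and `task_name` fields used above and that the loader accepts them positionally.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

TaskKey = Tuple[str, str, str]


def load_all(rows: Iterable[Dict[str, Any]],
             loader: Callable[[str, str, str], Any]):
    """Load every task, recording failures instead of aborting (hypothetical helper)."""
    loaded: Dict[TaskKey, Any] = {}
    failed: List[Tuple[TaskKey, str]] = []
    for row in rows:
        key = (row["dataset_name"], row["config_name"], row["task_name"])
        try:
            loaded[key] = loader(*key)
        except Exception as err:  # broad on purpose: any failure is just recorded
            failed.append((key, repr(err)))
    return loaded, failed


# Possible usage with the task table from list_tasks() (not executed here):
# rows = df[df.task_type == "MultipleChoice"].to_dict("records")
# loaded, failed = load_all(rows, load_task)
```

Passing the loader as an argument keeps the helper independent of tasksource, so it can be tried out with a stub function before launching real downloads.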
See the 420 supported tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks). Feel free to request or propose a new task.
Contact
I can help you integrate tasksource in your experiments. damien.sileo@inria.fr
@misc{sileod23-tasksource,
author = {Sileo, Damien},
doi = {10.5281/zenodo.7473446},
month = {01},
title = {{tasksource: preprocessings for reproducibility and multitask-learning}},
url = {https://github.com/sileod/tasksource},
version = {1.5.0},
year = {2023}}