Preprocessings to prepare datasets for a task
tasksource: 480+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation
Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably. tasksource automates this preprocessing and makes multi-task learning easier to scale and reproduce.
import tasksource
from datasets import load_dataset
tasksource.bigbench(load_dataset('bigbench', 'movie_recommendation')) # returns standardized MultipleChoice dataset
Each dataset is mapped to either a MultipleChoice, Classification, or TokenClassification dataset with standardized fields.
We do not support generation tasks as they are addressed by promptsource.
All implemented preprocessings can be found in tasks.py or tasks.md. Each preprocessing is a function that takes a dataset as input and returns a standardized dataset. The preprocessing annotations are designed to be human-readable, and adding a new preprocessing only takes a few lines, e.g.:
cos_e = tasksource.MultipleChoice(
    'question',
    choices_list='choices',
    labels=lambda x: x['choices_list'].index(x['answer']),
    config_name='v1.0')
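To make the annotation above more concrete, here is a minimal plain-Python sketch of how such a declarative mapping could be applied to a single raw example. It is not the tasksource implementation: the `harmonize` helper, the `inputs` field name, and the toy example are all hypothetical. The sketch also assumes that string-valued fields are renamed before callables are evaluated, which is one reading that would explain why the lambda above can refer to `choices_list`.

```python
from typing import Any, Callable, Dict, Union

# A field spec is either a source column name or a function of the example.
FieldSpec = Union[str, Callable[[Dict[str, Any]], Any]]


def harmonize(example: Dict[str, Any], **fields: FieldSpec) -> Dict[str, Any]:
    """Hypothetical helper: map one raw example to standardized field names."""
    out: Dict[str, Any] = {}
    # Pass 1: a string spec names the source column to copy under the standard name.
    for target, spec in fields.items():
        if isinstance(spec, str):
            out[target] = example[spec]
    # Pass 2 (assumption): callables see the raw columns plus the renamed fields.
    merged = {**example, **out}
    for target, spec in fields.items():
        if callable(spec):
            out[target] = spec(merged)
    return out


# Toy raw example, invented for illustration only.
raw = {
    "question": "Where do you keep milk?",
    "choices": ["oven", "fridge", "shelf"],
    "answer": "fridge",
}

standardized = harmonize(
    raw,
    inputs="question",  # hypothetical standardized name for the input text
    choices_list="choices",
    labels=lambda x: x["choices_list"].index(x["answer"]),
)
print(standardized)
```

The label is derived from the position of the answer string in the list of choices, so the standardized example carries an integer index rather than the raw answer text.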
Installation and usage:
pip install tasksource
Get the task index and iterate over harmonized tasks:
from tasksource import list_tasks, load_task
df = list_tasks()
for task_id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(task_id)
    # all yielded datasets can be used interchangeably
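As an illustration of what "interchangeably" buys you, the following sketch runs one evaluation function over several datasets that share the same fields. The datasets here are toy in-memory lists, not real tasksource outputs, and the `inputs`, `choices_list`, and `labels` field names are assumed from the example annotation rather than taken from the library's documentation.

```python
from collections import Counter
from typing import Any, Dict, List

# Toy stand-ins for harmonized MultipleChoice datasets (invented data).
toy_tasks: Dict[str, List[Dict[str, Any]]] = {
    "toy_task_a": [
        {"inputs": "q1", "choices_list": ["a", "b"], "labels": 0},
        {"inputs": "q2", "choices_list": ["a", "b"], "labels": 0},
        {"inputs": "q3", "choices_list": ["a", "b"], "labels": 1},
    ],
    "toy_task_b": [
        {"inputs": "q1", "choices_list": ["x", "y", "z"], "labels": 2},
        {"inputs": "q2", "choices_list": ["x", "y", "z"], "labels": 2},
        {"inputs": "q3", "choices_list": ["x", "y", "z"], "labels": 2},
        {"inputs": "q4", "choices_list": ["x", "y", "z"], "labels": 1},
    ],
}


def majority_baseline(rows: List[Dict[str, Any]]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(row["labels"] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)


# Because every dataset exposes the same fields, one function covers all tasks.
scores = {name: majority_baseline(rows) for name, rows in toy_tasks.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The same pattern would apply to any per-task routine (tokenization, training, evaluation) once the field names are shared across datasets.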
See the 480+ supported tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks). Feel free to request or propose a new task.
Pretrained model:
I pretrained models on tasksource and obtained state-of-the-art results: https://huggingface.co/sileod/deberta-v3-base-tasksource-nli
Contact
I can help you integrate tasksource in your experiments. damien.sileo@inria.fr
More details in this article:
@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url={https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}