Preprocessings to prepare datasets for a task

tasksource: 600+ curated datasets and preprocessings for instant and interchangeable use

Hugging Face Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing before they can be used interchangeably. tasksource streamlines interchangeable dataset usage to scale evaluation or multi-task learning.

Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification template with canonical fields. Our annotations focus on discriminative tasks (that is, tasks with negative examples or classes), but we also provide a SequenceToSequence template. All implemented preprocessings are listed in tasks.py and tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
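
To make the idea of a preprocessing concrete, here is a minimal sketch in plain Python that does not use the tasksource classes. The function name `standardize_multiple_choice` and the output field names (`inputs`, `choice0`, `choice1`, ..., `labels`) are assumptions chosen for illustration, not the library's documented schema.

```python
from typing import Any


def standardize_multiple_choice(
    rows: list[dict[str, Any]],
    inputs: str,
    choices: list[str],
    labels: str,
) -> list[dict[str, Any]]:
    """Map dataset-specific column names onto one shared layout.

    The output keys (inputs, choice0, choice1, ..., labels) are
    hypothetical names used only in this sketch.
    """
    standardized = []
    for row in rows:
        out = {"inputs": row[inputs]}
        for i, column in enumerate(choices):
            out[f"choice{i}"] = row[column]
        out["labels"] = row[labels]
        standardized.append(out)
    return standardized


# Two toy datasets with different native column names.
dataset_a = [{"sentence": "The sky is _.", "option1": "blue", "option2": "loud", "answer": 0}]
dataset_b = [{"question": "2 + 2 = ?", "a": "4", "b": "5", "gold": 0}]

aligned_a = standardize_multiple_choice(dataset_a, "sentence", ["option1", "option2"], "answer")
aligned_b = standardize_multiple_choice(dataset_b, "question", ["a", "b"], "gold")
```

After the mapping, both toy datasets expose the same keys, so the same training or evaluation code can consume either one.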

Installation and usage:

pip install tasksource

from tasksource import list_tasks, load_task
df = list_tasks(multilingual=False) # takes some time

for id in df[df.task_type=="MultipleChoice"].id:
    dataset = load_task(id) # all yielded datasets can be used interchangeably

Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so ensure you have more than 100GB of space available.
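
Before loading many tasks, it may help to check how much space is free at the cache location. The sketch below uses only the standard library; the helper name `free_space_gb` is hypothetical, and the fallback path assumes the default Hugging Face cache layout when `HF_DATASETS_CACHE` is unset.

```python
import os
import shutil
from pathlib import Path


def free_space_gb(path: Path) -> float:
    """Return free disk space in GB for the nearest existing parent of path."""
    probe = path.expanduser()
    # The cache directory may not exist yet, so walk up to a directory that does.
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


# Assumed default location when HF_DATASETS_CACHE is not set.
cache_dir = Path(os.environ.get("HF_DATASETS_CACHE", "~/.cache/huggingface/datasets"))
available = free_space_gb(cache_dir)
if available < 100:
    print(f"Only {available:.0f} GB free at {cache_dir}; loading every task may fail.")
```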

You can now also use:

from datasets import load_dataset
load_dataset("tasksource/data", "glue/rte", max_rows=30_000)

Pretrained models:

A text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling or any kind of classification, including zero-shot. You can also find a large and a multilingual version.

tasksource-instruct

The repo also contains some recasting code to convert tasksource datasets to instructions, providing one of the richest instruction-tuning datasets: 🤗/tasksource-instruct-v0
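
As a rough illustration of what recasting to instructions can look like, the sketch below turns one multiple-choice example into a prompt and answer pair. The function `to_instruction` and the prompt layout are assumptions made for this example; the actual recasting code in the repo may format prompts differently.

```python
def to_instruction(inputs: str, choices: list[str], label: int) -> dict[str, str]:
    """Turn one multiple-choice example into a prompt/answer pair (hypothetical format)."""
    # Letter the options A, B, C, ... in the order they are given.
    options = "\n".join(
        f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(choices)
    )
    prompt = f"{inputs}\nOptions:\n{options}\nAnswer:"
    return {"prompt": prompt, "answer": chr(ord("A") + label)}


example = to_instruction(
    "The trophy didn't fit in the suitcase because the _ was too big.",
    ["trophy", "suitcase"],
    0,
)
```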

tasksource-label-nli

We also recast all classification tasks as natural language inference to improve entailment-based zero-shot classification: 🤗/zero-shot-label-nli
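
The general idea of recasting classification as natural language inference can be sketched as follows: the input text becomes the premise, and each candidate label is verbalized into a hypothesis that is entailed only for the true label. The function name, the hypothesis template, and the label strings below are assumptions for illustration, not the exact recipe used to build the dataset.

```python
def classification_to_nli(
    text: str,
    label: str,
    label_names: list[str],
    template: str = "This example is {}.",
) -> list[dict[str, str]]:
    """Create one premise/hypothesis pair per candidate label (hypothetical template)."""
    pairs = []
    for name in label_names:
        pairs.append(
            {
                "premise": text,
                "hypothesis": template.format(name),
                # Only the hypothesis built from the true label is entailed.
                "label": "entailment" if name == label else "not_entailment",
            }
        )
    return pairs


pairs = classification_to_nli("I loved this movie.", "positive", ["positive", "negative"])
```

At inference time, an entailment model can then score each candidate hypothesis for a new text and pick the label with the highest entailment score.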

Write and use custom preprocessings

from tasksource import MultipleChoice

codah = MultipleChoice('question_propmt',choices_list='candidate_answers',
    labels='correct_answer_idx',
    dataset_name='codah', config_name='codah')
    
winogrande = MultipleChoice('sentence',['option1','option2'],'answer',
    dataset_name='winogrande',config_name='winogrande_xl',
    splits=['train','validation',None]) # test labels are not usable
    
tasks = [winogrande.load(), codah.load()] # Aligned datasets (same columns) can be used interchangeably

Citation and contact

For more details, refer to this article:

@inproceedings{sileo-2024-tasksource,
    title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework",
    author = "Sileo, Damien",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1361",
    pages = "15655--15684",
}

For help integrating tasksource into your experiments, please contact damien.sileo@inria.fr.
