A convenient and fast Python package to find the best datasets for intermediate fine-tuning for your task.

hf-dataset-selector

Why hf-dataset-selector?

You don't have enough training data for your problem

If you don't have enough training data for your problem, use hf-dataset-selector to find more. You can supplement model training by including publicly available datasets in the training process:

  1. Fine-tune a language model on a suitable intermediate dataset.
  2. Fine-tune the resulting model on your target dataset.

This workflow is called intermediate task transfer learning and it can significantly improve the target performance.
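
The two steps can be illustrated with a toy sketch. It is not how hf-dataset-selector or a language model is trained: a small logistic regression stands in for the language model, and both tasks are synthetic. All names (`train_logreg`, `make_task`, the task sizes) are hypothetical and chosen for illustration only. The point is the warm start: step 2 begins from the weights produced by step 1.

```python
import numpy as np


def train_logreg(X, y, w=None, lr=0.1, steps=200):
    """Full-batch gradient descent for logistic regression.

    Passing `w` warm-starts training from existing weights.
    """
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w


def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y == 1)))


rng = np.random.default_rng(0)
dim = 10
# Assumption: the intermediate and target tasks are related (similar true weights).
w_intermediate = rng.normal(size=dim)
w_target = w_intermediate + 0.3 * rng.normal(size=dim)


def make_task(w_true, n):
    """Draw n synthetic examples labelled by the sign of a linear function."""
    X = rng.normal(size=(n, dim))
    return X, (X @ w_true > 0).astype(float)


X_int, y_int = make_task(w_intermediate, 2000)  # large intermediate dataset
X_tgt, y_tgt = make_task(w_target, 20)          # small target dataset
X_test, y_test = make_task(w_target, 2000)      # held-out target data

# Step 1: train on the intermediate dataset.
w_stage1 = train_logreg(X_int, y_int)
# Step 2: continue training the resulting model on the target dataset.
w_transfer = train_logreg(X_tgt, y_tgt, w=w_stage1, steps=50)
# Baseline: train on the target dataset only.
w_scratch = train_logreg(X_tgt, y_tgt, steps=50)

acc_transfer = accuracy(w_transfer, X_test, y_test)
acc_scratch = accuracy(w_scratch, X_test, y_test)
print(f"with intermediate training: {acc_transfer:.3f}")
print(f"target data only:           {acc_scratch:.3f}")
```

Whether the intermediate step helps depends on how closely the two tasks are related, which is the question hf-dataset-selector is meant to answer.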

But what is a suitable dataset for your problem? hf-dataset-selector enables you to quickly rank thousands of datasets on the Hugging Face Hub by how well they are expected to transfer to your target task. Just specify a base language model and your target dataset, and hf-dataset-selector produces a ranking of intermediate datasets.

You want to find similar datasets to your target dataset

hf-dataset-selector can be used like a search engine for the Hugging Face Hub. You can find tasks similar to your target task without having to rely on heuristics. hf-dataset-selector estimates how much language models fine-tuned on each intermediate task would benefit your target task. This quantitative approach combines the effects of domain similarity and task similarity.

How to install

hf-dataset-selector is available on PyPI:

$ pip install hf-dataset-selector

Quickstart

How to find suitable datasets for your problem

from hfselect import Dataset, compute_task_ranking

# Load target dataset from the Hugging Face Hub
dataset = Dataset.from_hugging_face(
    name="stanfordnlp/imdb",
    split="train",
    text_col="text",
    label_col="label",
    is_regression=False,
    num_examples=1000,
    seed=42
)

# Fetch ESMs and rank tasks
task_ranking = compute_task_ranking(
    dataset=dataset,
    model_name="bert-base-multilingual-uncased"
)

# Display top 5 recommendations
print(task_ranking[:5])
1.   davanstrien/test_imdb_embedd2                     Score: -0.618529
2.   davanstrien/test_imdb_embedd                      Score: -0.618644
3.   davanstrien/test1                                 Score: -0.619334
4.   stanfordnlp/imdb                                  Score: -0.619454
5.   stanfordnlp/sst                                   Score: -0.62995

Rank  Task ID                        Task Subset      Text Column  Label Column  Task Split  Num Examples  ESM Architecture  Score
1     davanstrien/test_imdb_embedd2  default          text         label         train       10000         linear            -0.618529
2     davanstrien/test_imdb_embedd   default          text         label         train       10000         linear            -0.618644
3     davanstrien/test1              default          text         label         train       10000         linear            -0.619334
4     stanfordnlp/imdb               plain_text       text         label         train       10000         linear            -0.619454
5     stanfordnlp/sst                dictionary       phrase       label         dictionary  10000         linear            -0.62995
6     stanfordnlp/sst                default          sentence     label         train       8544          linear            -0.63312
7     kuroneko5943/snap21            CDs_and_Vinyl_5  sentence     label         train       6974          linear            -0.634365
8     kuroneko5943/snap21            Video_Games_5    sentence     label         train       6997          linear            -0.638787
9     kuroneko5943/snap21            Movies_and_TV_5  sentence     label         train       6989          linear            -0.639068
10    fancyzhx/amazon_polarity       amazon_polarity  content      label         train       10000         linear            -0.639718

Tutorials

We provide tutorials for finding intermediate datasets and for training your own ESM so that others can rank your dataset.

Documentation

The documentation is hosted on Read the Docs.

How it works

hf-dataset-selector enables you to find good datasets from the Hugging Face Hub for intermediate fine-tuning before training on your task. It downloads one small neural network (~2.4 MB) per intermediate task from the Hugging Face Hub. These neural networks are called Embedding Space Maps (ESMs) and transform embeddings produced by the language model. The transformed embeddings are ranked using LogME.
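
The flow described above (embed the target data with the base model, transform the embeddings with each ESM, score the result with LogME, sort by score) can be sketched in NumPy. This is a minimal illustration, not the package's implementation: the embeddings are synthetic, the two "ESMs" are handcrafted linear maps with hypothetical task names rather than trained networks from the Hub, and `logme` is a compact re-implementation of the published LogME fixed-point iteration.

```python
import numpy as np


def logme(features, labels, n_iter=20):
    """LogME score: log evidence per sample of a Bayesian linear regression
    from features to one-hot labels, averaged over classes. Higher is better."""
    n, d = features.shape
    # Eigendecomposition of F^T F, shared by all classes.
    sigma, v = np.linalg.eigh(features.T @ features)
    sigma = np.clip(sigma, 0.0, None)
    scores = []
    for cls in np.unique(labels):
        y = (labels == cls).astype(float)
        z = v.T @ (features.T @ y)  # F^T y expressed in the eigenbasis
        alpha, beta = 1.0, 1.0
        for _ in range(n_iter):
            m = beta * z / (alpha + beta * sigma)  # posterior mean (eigenbasis)
            gamma = np.sum(beta * sigma / (alpha + beta * sigma))
            res2 = y @ y - 2.0 * (m @ z) + np.sum(sigma * m**2)  # ||F m - y||^2
            alpha = gamma / (m @ m + 1e-12)
            beta = (n - gamma) / (res2 + 1e-12)
        m = beta * z / (alpha + beta * sigma)
        res2 = y @ y - 2.0 * (m @ z) + np.sum(sigma * m**2)
        evidence = (
            0.5 * n * np.log(beta)
            + 0.5 * d * np.log(alpha)
            - 0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * beta * res2
            - 0.5 * alpha * (m @ m)
            - 0.5 * np.sum(np.log(alpha + beta * sigma))
        )
        scores.append(evidence / n)
    return float(np.mean(scores))


def rank_tasks(base_embeddings, labels, esms):
    """Apply each linear ESM to the base embeddings and sort tasks by LogME."""
    scored = []
    for task_name, (weight, bias) in esms.items():
        transformed = base_embeddings @ weight + bias
        scored.append((task_name, logme(transformed, labels)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Synthetic stand-in for base-model embeddings of the target dataset:
# only dimension 0 carries information about the label.
rng = np.random.default_rng(0)
n, dim = 300, 8
labels = rng.integers(0, 2, size=n)
base_embeddings = rng.normal(size=(n, dim))
base_embeddings[:, 0] += 3.0 * (labels - 0.5)

# Handcrafted stand-ins for ESMs (hypothetical task names): one keeps the
# informative dimension, the other discards it.
esms = {
    "hypothetical/keeps_signal": (np.eye(dim)[:, :4], np.zeros(4)),
    "hypothetical/drops_signal": (np.eye(dim)[:, 4:], np.zeros(4)),
}

ranking = rank_tasks(base_embeddings, labels, esms)
for rank, (task_name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {task_name:<28} Score: {score:.6f}")
```

Because each task only requires applying a small map to embeddings that are computed once, the cost per intermediate task stays low even when many tasks are ranked.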

hf-dataset-selector ranks only datasets with a corresponding ESM on the Hugging Face Hub. We encourage you to train and publish your own ESMs for your datasets to enable others to rank them.

What are Embedding Space Maps?

Embedding Space Maps (ESMs) are neural networks that approximate the effect of fine-tuning a language model on a task. They can be used to quickly transform embeddings from a base model to approximate how a fine-tuned model would embed the input text. ESMs can be used for intermediate task selection with the ESM-LogME workflow.
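
To make the idea concrete, the sketch below fits a linear map from base-model embeddings to fine-tuned-model embeddings and checks how well it predicts held-out fine-tuned embeddings. It rests on assumptions: the embeddings are synthetic, the "fine-tuned" embeddings are generated as a noisy linear shift of the base ones, and the map is fitted in closed form with least squares, which may differ from how the package trains ESMs. The function names `fit_linear_esm` and `apply_esm` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 500, 16

# Synthetic stand-ins: base embeddings and the embeddings a fine-tuned model
# would produce for the same texts (here a noisy linear shift of the base ones).
base = rng.normal(size=(n, dim))
true_weight = np.eye(dim) + 0.2 * rng.normal(size=(dim, dim))
true_bias = rng.normal(size=dim)
finetuned = base @ true_weight + true_bias + 0.05 * rng.normal(size=(n, dim))


def fit_linear_esm(base_emb, tuned_emb):
    """Least-squares fit of weight and bias so that base_emb @ W + b ~ tuned_emb."""
    design = np.hstack([base_emb, np.ones((len(base_emb), 1))])
    solution, *_ = np.linalg.lstsq(design, tuned_emb, rcond=None)
    return solution[:-1], solution[-1]


def apply_esm(emb, weight, bias):
    """Transform base embeddings into approximate fine-tuned embeddings."""
    return emb @ weight + bias


# Fit on the first 400 texts, evaluate on the remaining 100.
weight, bias = fit_linear_esm(base[:400], finetuned[:400])
predicted = apply_esm(base[400:], weight, bias)

mse_esm = float(np.mean((predicted - finetuned[400:]) ** 2))
mse_identity = float(np.mean((base[400:] - finetuned[400:]) ** 2))
print(f"ESM approximation error:  {mse_esm:.4f}")
print(f"Untransformed base error: {mse_identity:.4f}")
```

Once fitted, the map replaces a forward pass through the fine-tuned model with a single matrix multiplication on embeddings that the base model has already produced.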

How to cite

If you use hf-dataset-selector, please cite our paper.

BibTeX:

@inproceedings{schulte-etal-2024-less,
    title = "Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning",
    author = "Schulte, David  and
      Hamborg, Felix  and
      Akbik, Alan",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.529/",
    doi = "10.18653/v1/2024.emnlp-main.529",
    pages = "9431--9442",
    abstract = "Intermediate task transfer learning can greatly improve model performance. If, for example, one has little training data for emotion detection, first fine-tuning a language model on a sentiment classification dataset may improve performance strongly. But which task to choose for transfer learning? Prior methods producing useful task rankings are infeasible for large source pools, as they require forward passes through all source language models. We overcome this by introducing Embedding Space Maps (ESMs), light-weight neural networks that approximate the effect of fine-tuning a language model. We conduct the largest study on NLP task transferability and task selection with 12k source-target pairs. We find that applying ESMs on a prior method reduces execution time and disk space usage by factors of 10 and 278, respectively, while retaining high selection performance (avg. regret@5 score of 2.95)."
}

APA:

Schulte, D., Hamborg, F., & Akbik, A. (2024, November). Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 9431-9442).

How to reproduce the results from the paper

To reproduce the results of our paper, please refer to the emnlp-submission branch.

Acknowledgements

Our method extends the LogME method for intermediate task selection. We adapt the implementation by the authors: https://github.com/tuvuumass/task-transferability
