Load any mixture of text-to-text data in one line of code
Unitxt is a Python library for getting data ready to use. In one line of code, it prepares a dataset or a mixture of datasets in an input-output format for training and evaluation. We aspire to be simple, adaptable, and transparent.
Unitxt builds on separation. Separation allows adding a dataset without knowing anything about the models that use it. It allows training without caring about preprocessing, switching models without loading the data differently, and changing formats (instruction, ICL, etc.) without changing anything else.
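To make the idea of separation concrete, here is a minimal sketch in plain Python. It does not use the Unitxt API; every name in it (`load_examples`, `plain_format`, `instruction_format`, `prepare`) is hypothetical and the two examples are made up for illustration. It only shows that when the data and the format live in separate pieces, either one can be swapped without touching the other.

```python
from typing import Callable, Dict, Iterable, List

# NOTE: hypothetical names throughout; this is an illustration, not the Unitxt API.
Example = Dict[str, str]
Formatter = Callable[[Example], Dict[str, str]]


def load_examples() -> List[Example]:
    """Stand-in for a dataset: raw fields only, no knowledge of models or prompts."""
    return [
        {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "entailment"},
        {"premise": "A dog runs.", "hypothesis": "A cat sleeps.", "label": "not entailment"},
    ]


def plain_format(ex: Example) -> Dict[str, str]:
    """Concatenate the raw fields into a bare input-output pair."""
    return {"source": f"{ex['premise']} {ex['hypothesis']}", "target": ex["label"]}


def instruction_format(ex: Example) -> Dict[str, str]:
    """Wrap the same fields in an instruction-style prompt."""
    prompt = (
        "Does the premise entail the hypothesis?\n"
        f"Premise: {ex['premise']}\n"
        f"Hypothesis: {ex['hypothesis']}\n"
        "Answer:"
    )
    return {"source": prompt, "target": ex["label"]}


def prepare(examples: Iterable[Example], formatter: Formatter) -> List[Dict[str, str]]:
    """Apply any formatter to any dataset; neither side knows about the other."""
    return [formatter(ex) for ex in examples]


data = load_examples()
plain = prepare(data, plain_format)
instructed = prepare(data, instruction_format)
```

Switching from `plain_format` to `instruction_format` changes only the `source` text; the data loader and the targets stay exactly as they were.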
Where to start? 🦄
Why Unitxt? 🦄
🦄 Simplicity
Everything in Unitxt is simple and designed to feel natural and self-explanatory.
🦄 Adaptability
Adding new datasets, loading recipes, instructions and formatters is possible and encouraged!
🦄 Transparency
The resources and formatters of Unitxt are stored as shared datasets and can therefore easily be reviewed by the community. Moreover, when a dataset is assembled with Unitxt, it is clear to others exactly what is in it.
Contributors
Please install Unitxt from source:
git clone git@github.com:IBM/unitxt.git
cd unitxt
pip install -e ".[dev]"
pre-commit install