Load any mixture of text-to-text data in one line of code

Unitxt is a Python library for preparing data for use. In one line of code, it prepares a dataset or a mixture of datasets in an input-output format for training and evaluation. We aspire to be simple, adaptable and transparent.

Unitxt builds on separation. Separation allows adding a dataset without knowing anything about the models that will use it. It also allows training without caring about preprocessing, switching models without loading the data differently, and changing formats (instruction, ICL, etc.) without changing anything else.
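To make the idea of separation concrete, here is a minimal sketch in plain Python. It does not use the Unitxt API; every name in it (`RAW_RECORDS`, `instruction_format`, `plain_format`, `prepare`) is hypothetical and exists only for illustration. It shows how the same raw records can be rendered into different input-output formats without touching the data or the code that consumes the result.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical raw records; a real dataset would come from a loader.
RAW_RECORDS: List[Record] = [
    {"premise": "A man is sleeping.", "hypothesis": "A man is awake.", "label": "contradiction"},
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "entailment"},
]


def instruction_format(record: Record) -> Record:
    """Render a record as an instruction-style input-output pair."""
    source = (
        "Does the premise entail the hypothesis?\n"
        f"premise: {record['premise']}\n"
        f"hypothesis: {record['hypothesis']}"
    )
    return {"source": source, "target": record["label"]}


def plain_format(record: Record) -> Record:
    """Render the same record as a bare sentence-pair input."""
    source = f"{record['premise']} | {record['hypothesis']}"
    return {"source": source, "target": record["label"]}


def prepare(records: List[Record], formatter: Callable[[Record], Record]) -> List[Record]:
    """Apply a formatter to every record; the data and the consumer stay unchanged."""
    return [formatter(record) for record in records]


if __name__ == "__main__":
    # Swapping the format is a one-argument change.
    for formatter in (instruction_format, plain_format):
        for example in prepare(RAW_RECORDS, formatter):
            print(example)
```

Because the formatter is the only thing that changes, the targets stay identical across formats while the model input differs, which is the property the paragraph above describes.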

[Figure: Unitxt Flow]

Why Unitxt? 🦄

🦄 Simplicity

Everything in Unitxt is simple and designed to feel natural and self-explanatory.

🦄 Adaptability

Adding new datasets, loading recipes, instructions and formatters is possible and encouraged!

🦄 Transparency

The resources and formatters of Unitxt are stored as shared datasets and can therefore be easily reviewed by the community. Moreover, when a dataset is assembled with Unitxt, it is clear to others what it contains.

Contributors

Please install Unitxt from source:

git clone git@github.com:IBM/unitxt.git
cd unitxt
pip install -e ".[dev]"
pre-commit install

Ensuring a Linear Git History

Configure your Git to maintain a linear history with these commands:

  1. Automatic Rebasing for Pulls:

    • Command: git config --global pull.rebase true
    • This sets git pull to rebase changes, keeping your history linear without unnecessary merge commits.
  2. Fast-Forward Merges Only:

    • Command: git config --global merge.ff only
    • This allows only fast-forward merges: if the branches have diverged, git merge refuses to proceed instead of creating a merge commit, which keeps the history linear.


Download files

Source distribution: unitxt-1.2.0.tar.gz (75.7 kB)
Built distribution: unitxt-1.2.0-py3-none-any.whl (87.4 kB, Python 3)
