Load any mixture of text to text data in one line of code

In generative NLP, traditional text processing pipelines limit research flexibility and reproducibility because each one is tailored to a specific combination of dataset, task, and model. The growing complexity of model inputs (system prompts, model-specific formats, instructions, and more) calls for a structured, modular, and customizable solution.

Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
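
To make the modular idea concrete, here is a minimal plain-Python sketch. It does not use Unitxt's API: the names Template, Format, and render, the prompt wording, and the "<user>"/"<assistant>" markers are all hypothetical. It only illustrates how a task prompt and a model-specific format can be defined separately and then combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical task prompt: turns a raw example into an instruction."""
    pattern: str

    def apply(self, example: dict) -> str:
        # Fill the named placeholders with the example's fields.
        return self.pattern.format(**example)


@dataclass(frozen=True)
class Format:
    """Hypothetical model-specific wrapper around an already rendered prompt."""
    pattern: str

    def apply(self, prompt: str) -> str:
        return self.pattern.format(prompt=prompt)


def render(example: dict, template: Template, fmt: Format) -> str:
    """Compose the two independent components into one model input."""
    return fmt.apply(template.apply(example))


# One task prompt, reused unchanged with two different (made-up) model formats.
nli_template = Template(
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?"
)
plain_format = Format("{prompt}\nAnswer:")
chat_format = Format("<user>\n{prompt}\n<assistant>\n")

example = {"premise": "A dog is running.", "hypothesis": "An animal is moving."}
plain_input = render(example, nli_template, plain_format)
chat_input = render(example, nli_template, chat_format)

print(plain_input)
print(chat_input)
```

Because the template and the format are separate objects, swapping the model format leaves the task prompt untouched, which is the kind of reuse the paragraph above describes.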

https://github.com/IBM/unitxt/assets/23455264/baef9131-39d4-4164-90b2-05da52919fdf

🦄 Currently on Unitxt Catalog

NLP Tasks Dataset Cards Templates Formats Metrics

🦄 Run Unitxt Exploration Dashboard

To launch the Unitxt graphical user interface, first install Unitxt with the UI requirements:

pip install unitxt[ui]

Then launch the UI by running:

unitxt-explore
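
If it is unclear whether the install succeeded, a quick check from Python can help. This sketch uses only the standard library's importlib.metadata. The helper name installed_version is ours, not part of Unitxt, and it assumes the distribution is registered under the name "unitxt", as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumption: the package is installed under the distribution name "unitxt".
found = installed_version("unitxt")
if found is None:
    print("unitxt is not installed in this environment")
else:
    print(f"unitxt {found} is installed")
```

This confirms only that the base package is present; it does not verify that the optional UI dependencies were installed.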

🦄 Contributors

To contribute, please install Unitxt from source by running:

git clone git@github.com:IBM/unitxt.git
cd unitxt
pip install -e ".[dev]"
pre-commit install

🦄 Citation

If you use Unitxt in your research, please cite our paper:

@misc{unitxt,
      title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
      author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
      year={2024},
      eprint={2401.14019},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


This version: 1.7.9
