Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs.

Synthesize data for AI and add feedback on the fly!

Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!

Why use distilabel?

Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.

Take control of your data and models

Owning the data used to fine-tune your own LLMs is not easy, but Distilabel can help you get started. We integrate AI feedback from any LLM provider out there using one unified API.

Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.

Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

  • Community Meetup: listen in or present during one of our bi-weekly events.

  • Discord: get direct support from the community in #argilla-general and #argilla-help.

  • Roadmap: plans change, but we love to discuss them with our community, so feel encouraged to participate.

What do people build with Distilabel?

The Argilla community uses distilabel to create amazing datasets and models.

  • The 1M OpenHermesPreference is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to synthesize data on an immense scale.
  • Our distilabeled Intel Orca DPO dataset and the improved OpenHermes model show how we improve model performance by filtering out 50% of the original dataset through AI feedback.
  • The haiku DPO data outlines how anyone can create a dataset for a specific task and use the latest research papers to improve the quality of the dataset.
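
As a rough illustration of what filtering a preference dataset through AI feedback can look like, here is a minimal sketch in plain Python. The field names (`chosen_score`, `rejected_score`), the threshold, and the records are all hypothetical; this is not the procedure used to build the datasets listed above.

```python
def filter_by_feedback(pairs, min_chosen_score=8.0):
    """Keep pairs whose chosen response is rated highly and strictly
    better than the rejected one (hypothetical criteria)."""
    return [
        pair
        for pair in pairs
        if pair["chosen_score"] >= min_chosen_score
        and pair["chosen_score"] > pair["rejected_score"]
    ]


# Hypothetical judge scores for four preference pairs.
pairs = [
    {"id": "a", "chosen_score": 9.0, "rejected_score": 4.0},
    {"id": "b", "chosen_score": 6.0, "rejected_score": 5.0},
    {"id": "c", "chosen_score": 8.0, "rejected_score": 8.0},
    {"id": "d", "chosen_score": 8.5, "rejected_score": 7.0},
]

kept = filter_by_feedback(pairs)
print([pair["id"] for pair in kept])
```

In this toy data, pair `b` is dropped for a low score and pair `c` for being a tie, so half of the pairs remain.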

Installation

pip install distilabel --upgrade

Requires Python 3.9+

In addition, the following extras are available:

LLMs

  • anthropic: for using models available in Anthropic API via the AnthropicLLM integration.
  • cohere: for using models available in Cohere via the CohereLLM integration.
  • argilla: for exporting the generated datasets to Argilla.
  • groq: for using models available in Groq using groq Python client via the GroqLLM integration.
  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
  • hf-transformers: for using models available in transformers package via the TransformersLLM integration.
  • litellm: for using LiteLLM to call any LLM using OpenAI format via the LiteLLM integration.
  • llama-cpp: for using llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
  • mistralai: for using models available in Mistral AI API via the MistralAILLM integration.
  • ollama: for using Ollama and their available models via the OllamaLLM integration.
  • openai: for using OpenAI API models via the OpenAILLM integration, or the other integrations that rely on the OpenAI client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
  • vllm: for using the vLLM serving engine via the vLLM integration.
  • sentence-transformers: for generating sentence embeddings using sentence-transformers.

Structured generation

  • outlines: for using structured generation of LLMs with outlines.
  • instructor: for using structured generation of LLMs with Instructor.

Data processing

  • ray: for scaling and distributing a pipeline with Ray.
  • faiss-cpu and faiss-gpu: for indexing and searching sentence embeddings using faiss.
  • text-clustering: for using text clustering with UMAP and Scikit-learn.
  • minhash: for using minhash for duplicate detection with datasketch and nltk.
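
Extras can be combined in a single install using pip's comma-separated syntax, for example `pip install "distilabel[openai,argilla]" --upgrade`. To see which optional integrations are usable in the current environment, a minimal sketch like the one below may help. The extra-to-module mapping and the helper name are assumptions made for illustration, not something distilabel exposes.

```python
from importlib.util import find_spec

# Hypothetical mapping from a few extras to the top-level module each one
# is assumed to install; adjust it to the extras you actually use.
EXTRA_TO_MODULE = {
    "openai": "openai",
    "anthropic": "anthropic",
    "hf-transformers": "transformers",
    "ray": "ray",
}


def available_extras(mapping):
    """Return the extras whose module can be located without importing it."""
    return sorted(
        extra for extra, module in mapping.items() if find_spec(module) is not None
    )


if __name__ == "__main__":
    print(available_extras(EXTRA_TO_MODULE))
```

The output depends on what is installed, so none is shown here.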

Example

To run the following example you must install distilabel with the hf-inference-endpoints extra:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then run:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
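    # Load rows from the Hub and rename the `prompt` column to `instruction`,
    # the input column that TextGeneration expects.
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})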

    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
    )

    # Connect the steps: rows from load_dataset flow into text_generation.
    load_dataset >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")
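
If the `load_dataset >> text_generation` line looks unfamiliar, the toy sketch below shows how Python's `__rshift__` operator can be used to connect steps into a chain. `ToyStep` and `run_chain` are hypothetical names invented here, and the uppercase function merely stands in for an LLM call; this is not distilabel's implementation, only an illustration of the idea on a linear chain.

```python
class ToyStep:
    """Toy stand-in for a pipeline step (hypothetical, not distilabel's class)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.next_step = None

    def __rshift__(self, other):
        # `a >> b` records b as the step after a and returns b,
        # so that `a >> b >> c` builds a linear chain.
        self.next_step = other
        return other


def run_chain(first_step, rows):
    """Apply each step's function to every row, following the chain in order."""
    step = first_step
    while step is not None:
        rows = [step.fn(row) for row in rows]
        step = step.next_step
    return rows


# Mirrors the column rename in the example above: `prompt` becomes `instruction`.
load = ToyStep("load", lambda row: {"instruction": row["prompt"]})
# Stand-in for text generation: uppercases the instruction instead of calling an LLM.
generate = ToyStep(
    "generate", lambda row: {**row, "generation": row["instruction"].upper()}
)

load >> generate

result = run_chain(load, [{"prompt": "hello"}])
print(result)  # [{'instruction': 'hello', 'generation': 'HELLO'}]
```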

Badges

If you build something cool with distilabel, consider adding one of these badges to your dataset or model card.

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

Contribute

To contribute directly to distilabel, check our good first issues or open a new one.

Citation

@misc{distilabel-argilla-2024,
  author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
  title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/argilla-io/distilabel}}
}
