
Conveniently generating datasets with large language models.


A flexible open-source framework to generate datasets with large language models.

News

  • [10/23] We released the first version of this repository on PyPI. You can install it via pip install fabricator-ai.
  • [10/23] Our paper got accepted at EMNLP 2023. You can find the preprint here. You can find the experimental scripts under release v0.1.0.
  • [09/23] Support for gpt-3.5-turbo-instruct added in the new Haystack release!
  • [08/23] Added several experimental scripts to investigate the generation and annotation abilities of gpt-3.5-turbo on various downstream tasks, as well as the influence of few-shot examples on performance across those tasks.
  • [07/23] Refactored the major classes: you can now simply use our BasePrompt class to create your own customized prompts for every downstream task!
  • [07/23] Added dataset transformations for token classification to prompt LLMs with textual spans rather than with lists of tags.
  • [06/23] Initial version of fabricator supporting text classification and question answering tasks.

Overview

This repository:

  • is an easy-to-use open-source library for generating datasets with large language models. If you want to train a model for a specific domain, label distribution, or downstream task, you can use this framework to generate a dataset for it.
  • builds on deepset's Haystack and Hugging Face's Datasets libraries. It therefore supports a wide range of language models, and you can load and use the generated datasets for model training just as you would any other dataset from the Datasets library.
  • is highly flexible and offers various customization options, such as custom prompts, integration and sampling of few-shot examples, and annotation of unlabeled datasets.

Installation

Using conda:

git clone git@github.com:flairNLP/fabricator.git
cd fabricator
conda create -y -n fabricator python=3.10
conda activate fabricator
pip install fabricator-ai

To install in editable mode instead, run the following command from the root of the cloned repository:

pip install -e .
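
To confirm that the installation succeeded, you can query the installed distribution from Python. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and `fabricator-ai` is the distribution name from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("fabricator-ai"))
```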

Basic Concepts

This framework is based on the idea of using large language models to generate datasets for specific tasks. To do so, we need four basic modules: a dataset, a prompt, a language model and a generator:

  • Dataset: We use Hugging Face's Datasets library to load few-shot or unlabeled datasets and to store the generated or annotated datasets with its Dataset class. Once created, you can share the dataset with others via the Hub or use it for your model training.
  • Prompt: A prompt is the instruction given to the language model. It can be a simple sentence or a more complex template with placeholders. We provide an easy interface for custom dataset generation prompts in which you can specify label options for the LLM to choose from, provide few-shot examples to support the prompt, or annotate an unlabeled dataset in a specific way.
  • LLM: We use deepset's Haystack library as our LLM interface. Haystack supports a wide range of LLMs, including OpenAI models, all models from the Hugging Face model hub, and many more.
  • Generator: The generator is the core of this framework. It takes a dataset, a prompt, and an LLM and generates a dataset based on your specifications.
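
How the modules fit together can be sketched in plain Python. This is a simplified illustration that runs offline and is not fabricator's implementation: `SimplePrompt`, `echo_llm`, and `generate_rows` are hypothetical names, and the stub function stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SimplePrompt:
    """Hypothetical stand-in for a prompt: just a task description."""
    task_description: str

    def render(self) -> str:
        # The instruction, followed by the field the LLM should fill in.
        return f"{self.task_description}\ntext: "


def echo_llm(prompt_text: str) -> str:
    """Hypothetical stub LLM: returns a canned answer instead of calling an API."""
    return f"(generated for: {prompt_text.splitlines()[0]})"


def generate_rows(
    llm: Callable[[str], str],
    prompt: SimplePrompt,
    max_prompt_calls: int,
) -> List[Dict[str, str]]:
    """Hypothetical generator loop: one LLM call per generated row."""
    rows = []
    for _ in range(max_prompt_calls):
        rows.append({"text": llm(prompt.render())})
    return rows


rows = generate_rows(
    echo_llm,
    SimplePrompt("Generate a short movie review."),
    max_prompt_calls=3,
)
print(len(rows), rows[0]["text"])
```

In the real library, the rows would end up in a Datasets `Dataset` rather than a list of dictionaries, and the LLM call would go through Haystack.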

Examples

With our library, you can generate datasets for any task you want. Getting started is as simple as this:

Generate a dataset from scratch

import os
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator
from fabricator.prompts import BasePrompt

# Describe the task the LLM should perform.
prompt = BasePrompt(
    task_description="Generate a short movie review.",
)

# Haystack node wrapping the LLM; the API key is read from the environment.
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

# Generate the dataset, limited to 10 prompt calls.
generator = DatasetGenerator(prompt_node)
generated_dataset = generator.generate(
    prompt_template=prompt,
    max_prompt_calls=10,
)

# Share the result on the Hugging Face Hub.
generated_dataset.push_to_hub("your-first-generated-dataset")

Our tutorial shows how to create classification datasets with label options to choose from, how to include few-shot examples, and how to annotate unlabeled data with predefined categories.
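
To make these ideas concrete, here is a hedged, library-independent sketch of an annotation prompt that combines label options with few-shot examples sampled evenly per class. The function names, the prompt layout, and the example reviews are illustrative assumptions, not fabricator's API or its actual prompt format.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_fewshot_uniform(
    examples: Sequence[Dict[str, str]],
    label_column: str,
    per_class: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Pick up to `per_class` examples for every label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_column]].append(example)
    sampled = []
    for label in sorted(by_label):
        pool = by_label[label]
        sampled.extend(rng.sample(pool, min(per_class, len(pool))))
    return sampled


def build_annotation_prompt(
    task_description: str,
    label_options: Sequence[str],
    fewshot: Sequence[Dict[str, str]],
    text: str,
) -> str:
    """Fill label options into the task description and append few-shot examples."""
    lines = [task_description.format(", ".join(label_options)), ""]
    for example in fewshot:
        lines += [f"text: {example['text']}", f"label: {example['label']}", ""]
    # The unlabeled text comes last; the LLM is expected to complete the label.
    lines += [f"text: {text}", "label: "]
    return "\n".join(lines)


fewshot_pool = [
    {"text": "A moving, beautifully shot film.", "label": "positive"},
    {"text": "I loved every minute.", "label": "positive"},
    {"text": "Dull and far too long.", "label": "negative"},
]
fewshot = sample_fewshot_uniform(fewshot_pool, "label", per_class=1)
prompt_text = build_annotation_prompt(
    "Annotate the sentiment of the following movie review as one of: {}.",
    ["positive", "negative"],
    fewshot,
    "The plot made no sense.",
)
print(prompt_text)
```

Sampling the same number of examples per label keeps the few-shot block balanced, so that the prompt does not favor whichever label is most frequent in the pool.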

Citation

If you find this repository useful, please cite our work.

@misc{golde2023fabricator,
      title={Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs}, 
      author={Jonas Golde and Patrick Haller and Felix Hamborg and Julian Risch and Alan Akbik},
      year={2023},
      eprint={2309.09582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
