Generate textual and conversational datasets with LLMs.

The Distiller

The Distiller is a command-line interface and a Python library for running inference on Large Language Models to generate textual or multi-modal datasets.

Possible use cases may include:

  • Constructing textual corpora to train/fine-tune detectors for content written by AI.
  • Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
  • Automating a task that an LLM can handle over large numbers of input texts. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
  • Leveraging the APIs of very large LLMs to produce diverse texts for a specific task and then fine-tuning a smaller model on them.

This tool is distributed freely and doesn't imply any restrictions on the downstream use cases. However, you should make sure to follow the Terms of use of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.

Agora

The Distiller is brought to you by Agora, an all-new open source multi-modality AI research organization devoted to advancing Humanity!

Join our Discord and help work on this project or 30+ others!

Installation

pip install distiller

Most of the generation features rely on third-party APIs. Install their respective packages:

pip install openai cohere petals

Usage examples

Inference LLMs at scale

export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."

distiller texts \
    --prompt "If {country} was a planet in the Star Wars universe it would be called" \
    --backend "openai|text-davinci-003" \
    --backend "cohere|medium" \
    --temperature 0.9 \
    --option country Germany \
    --option country France \
    --max-length 50 \
    --num-samples 1 \
    --single-file

The command above should produce a dataset file with 4 texts. Each possible combination of options is used for each of the backend LLMs. Check out the CLI reference for more details. A dataset file looks like this:

[
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
    },
    {
        "sample_id": 0,
        "backend": "openai|text-davinci-003",
        "max_length": 50,
        "temperature": 0.9,
        "country": "France",
        "prompt": "If France was a planet in the Star Wars universe it would be called",
        "output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
    },
    {
        "sample_id": 0,
        "backend": "cohere|medium",
        "max_length": 50,
        "temperature": 0.9,
        "country": "Germany",
        "prompt": "If Germany was a planet in the Star Wars universe it would be called",
        "output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
    },
    ...
]
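
The count of 4 follows from pairing every backend with every option value. Below is a minimal sketch of that expansion, assuming placeholders such as {country} are filled the way Python's str.format would fill them. The helper expand_prompts is hypothetical and not part of the library; the tool's internals may differ.

```python
from itertools import product

def expand_prompts(prompt, backends, options):
    """Return one (backend, filled_prompt) pair per backend/option combination.

    `options` is a list of (name, value) pairs, mirroring repeated
    --option flags. Hypothetical helper for illustration only.
    """
    # Group values by option name, e.g. {"country": ["Germany", "France"]}.
    grouped = {}
    for name, value in options:
        grouped.setdefault(name, []).append(value)

    names = list(grouped)
    results = []
    for backend in backends:
        # product() yields every combination of one value per option name.
        for values in product(*(grouped[n] for n in names)):
            filled = prompt.format(**dict(zip(names, values)))
            results.append((backend, filled))
    return results

pairs = expand_prompts(
    "If {country} was a planet in the Star Wars universe it would be called",
    ["openai|text-davinci-003", "cohere|medium"],
    [("country", "Germany"), ("country", "France")],
)
print(len(pairs))  # 4
```

Additional temperatures or sample counts would multiply the total in the same way.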

Alternatively, you can use our TextsGenerator to produce texts in Python.

Generate conversations with the ChatGPT API

export OPENAI_API_KEY="..."
distiller conversations \
    --length 4 \
    --length 5 \
    --agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
    --agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
    --temperature 0.1 \
    --temperature 0.2 \
    --option n 2 \
    --option n 3 \
    --path dataset

The command above should produce a dataset directory with 16 conversations saved as JSON files. To save them all in a single file instead, pass --single-file. Why 16? Because num_samples dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options). Here that is 2 lengths × 2 temperatures × 2 values of n = 8 combinations, so 16 conversations corresponds to 2 samples per combination. A dataset item looks like this:

{
    "length": 5,
    "temperature": 0.1,
    "n": "2",
    "agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
    "agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
    "utterances": [
        [
            "agent1",
            "Hello! How can I assist you today?"
        ],
        [
            "agent2",
            "Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
        ],
        [
            "agent1",
            "Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
        ],
        [
            "agent2",
            "I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
        ],
        ...
    ]
}
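
Since each utterance is stored as a two-element [agent, text] list, an item can be turned into a readable transcript with a few lines of Python. This is a sketch that relies only on the structure shown above; format_transcript is a hypothetical helper, not a library function, and the item here is shortened for brevity.

```python
import json

def format_transcript(item):
    """Render the `utterances` list of a dataset item as plain-text lines."""
    lines = []
    for speaker, text in item["utterances"]:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)

# A shortened item with the same structure as the example above.
item = json.loads("""
{
    "length": 5,
    "temperature": 0.1,
    "n": "2",
    "utterances": [
        ["agent1", "Hello! How can I assist you today?"],
        ["agent2", "Hi! I'm interested in buying two pets."]
    ]
}
""")

print(format_transcript(item))
```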

You can also use our ConversationsGenerator to produce conversations with the gpt-3.5-turbo API programmatically.

from distiller import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter() # single_file=True

generator_config = ConversationsGeneratorConfig(agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
                                                agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
                                                num_samples=2,
                                                interruption="length",
                                                lengths=[4, 5],
                                                temperatures=[0.1, 0.2],
                                                options=[("n", "2"), ("n", "3")])

conversations_generator = ConversationsGenerator(generator_config)

for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)

Contributing

Still under active development.

Contributions will be highly appreciated. Currently these features are under development:

  • distiller conversations - Make two ChatGPT agents talk with one another and record the conversation history.
  • distiller texts - Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
  • distiller transformations - Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
  • Support more backend LLMs.

To set up a local development environment:

git clone https://github.com/radi-cho/distiller/
cd distiller
pip install -e .

CLI Reference

distiller [OPTIONS] COMMAND [ARGS]...

  Command line interface that generates datasets with LLMs.

Options:
  --help  Show this message and exit.

Commands:
  conversations  Produce conversations between two gpt-3.5-turbo agents...
  texts          Inference multiple LLMs at scale.

distiller texts [OPTIONS]

  Inference multiple LLMs at scale.

Options:
  -p, --prompt TEXT            Input prompt.  [required]
  -b, --backend TEXT           LLM APIs to use as backends. Use
                               "backend|model_name" notation. For example:
                               "openai|text-davinci-003".
  -l, --max-length INTEGER     Maximum number of tokens to generate for each
                               prompt.
  -t, --temperature FLOAT      Possible temperature values for the backend
                               language model.
  -n, --num-samples INTEGER    Number of conversations for each configuration.
  -o, --option <TEXT TEXT>...  Values for additional options denoted in your
                               prompts by {OPTION_NAME}.
  -f, --path PATH              Where to save the dataset. Either a file or a
                               directory (folder).
  -s, --single-file            Either save the whole dataset to a single file
                               or create multiple files.
  --help                       Show this message and exit.

  • You can specify multiple variants for the following options: --max-length, --temperature, --num-samples, --option. A dataset item will be generated for each possible combination of the supplied values.
  • Each --option provided must be formatted as follows: --option option_name "Some option value".
  • Currently supported backends: GPT-3 model variants by OpenAI, the language models by Cohere, BLOOM through the Petals API.

distiller conversations [OPTIONS]

  Produce conversations between two gpt-3.5-turbo agents with given roles.

Options:
  -k, --openai-api-key TEXT       OpenAI API key.
  -a, --agent1 TEXT               Agent role description.  [required]
  -b, --agent2 TEXT               Agent role description.  [required]
  -i, --interruption [length|end_phrase]
                                  Interruption mode.
  -e, --end-phrase TEXT           Interrupt after this phrase is outputted by
                                  one of the agents.
  -d, --end-agent [agent1|agent2|both]
                                  In which agent's messages to look for the
                                  end phrase.
  -l, --length INTEGER            Maximum number of utterances for each agent.
                                  A conversation sample will be generated for
                                  each length.
  -t, --temperature FLOAT         Possible temperature values for the backend
                                  language model.
  -n, --num-samples INTEGER       Number of conversations for each
                                  configuration.
  -o, --option <TEXT TEXT>...     Values for additional options denoted in
                                  your prompts by {OPTION_NAME}.
  -f, --path PATH                 Where to save the dataset. Either a file or
                                  a directory (folder).
  -s, --single-file               Either save the whole dataset to a single
                                  file or create multiple files.
  --help                          Show this message and exit.
  • The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.
  • You can specify either length (default) or end_phrase as an interruption strategy. When using end_phrase a conversation will be interrupted once the --end-phrase has appeared in the messages of the --end-agent (could be both). In this case, the lengths provided will be treated as maximum conversation lengths.
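
The two interruption strategies can be pictured as a single stopping check. The sketch below is an illustration of the behaviour described above, not the library's actual implementation; should_stop and its parameters are hypothetical names, and it assumes the end phrase is matched as a plain substring of the latest message.

```python
def should_stop(utterances, length, interruption="length",
                end_phrase=None, end_agent="both"):
    """Decide whether a conversation should end (illustrative logic only).

    `utterances` is a list of (agent, text) pairs; `length` is the
    maximum number of utterances per agent.
    """
    # The length limit always applies: with end_phrase it acts as a maximum.
    if len(utterances) >= 2 * length:
        return True
    if interruption == "end_phrase" and end_phrase and utterances:
        agent, text = utterances[-1]
        # Only inspect messages from the configured agent (or both).
        if end_agent in ("both", agent) and end_phrase in text:
            return True
    return False

history = [("agent1", "Hello!"), ("agent2", "Thanks, Goodbye")]
print(should_stop(history, length=4))  # False
print(should_stop(history, length=4, interruption="end_phrase",
                  end_phrase="Goodbye", end_agent="agent2"))  # True
```

With end_agent set to "agent1" the same history would not stop the conversation, because the phrase appears only in a message from agent2.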

Roadmap

  • Add a polymorphic Hugging Face function to use any model on Hugging Face

  • Add a CLI argument for selecting the Hugging Face model to use

  • Integrate tree of thoughts prompting

  • Integrate multi-modality models from hf

  • Integrate multi-modality conversation => store images in folder then agent2 looks up image address

  • Enable Multi-Modal Dataset Generation using generate, critique

