Generate textual and conversational datasets with LLMs.
The Distiller
The Distiller is a command-line interface and a Python library for running inference on Large Language Models to generate textual or multi-modal datasets.
Possible use cases may include:
- Constructing textual corpora to train/fine-tune detectors for content written by AI.
- Collecting datasets of LLM-produced conversations for research purposes, analysis of AI performance/impact/ethics, etc.
- Automating a task that an LLM can handle over large amounts of input text. For example, using GPT-3 to summarize 1000 paragraphs with a single CLI command.
- Leveraging the APIs of very large LLMs to produce diverse texts for a specific task, and then fine-tuning a smaller model on them.
This tool is distributed freely and imposes no restrictions on downstream use cases. However, make sure to follow the terms of use of the backend APIs (OpenAI, Cohere, Petals, etc.) in your specific context.
Agora
The Distiller is brought to you by Agora, an all-new open source multi-modality AI research organization devoted to advancing Humanity!
Join our Discord and help work on this project or 30+ others!
Installation
pip install distiller
Most of the generation features rely on third-party APIs. Install their respective packages:
pip install openai cohere petals
Usage examples
Inference LLMs at scale
export OPENAI_API_KEY="..."
export COHERE_API_KEY="..."
distiller texts \
--prompt "If {country} was a planet in the Star Wars universe it would be called" \
--backend "openai|text-davinci-003" \
--backend "cohere|medium" \
--temperature 0.9 \
--option country Germany \
--option country France \
--max-length 50 \
--num-samples 1 \
--single-file
The command above should produce a dataset file with 4 texts. Each possible combination of options is used for each of the backend LLMs. Check out the CLI reference for more details. A dataset file looks like this:
[
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " Euron. The planet would be home to a powerful and diverse species of aliens, known as the Eurons, that have evolved to a higher level of understanding and technological advancement compared to many of the other planets in the galaxy. The planet would be"
},
{
"sample_id": 0,
"backend": "openai|text-davinci-003",
"max_length": 50,
"temperature": 0.9,
"country": "France",
"prompt": "If France was a planet in the Star Wars universe it would be called",
"output": " The Empire of Liberty. It would be a peaceful, democratic planet with a strong sense of justice and equality. The planet would be home to many different species of aliens but the majority of its population would be humans. It would have a strong military and"
},
{
"sample_id": 0,
"backend": "cohere|medium",
"max_length": 50,
"temperature": 0.9,
"country": "Germany",
"prompt": "If Germany was a planet in the Star Wars universe it would be called",
"output": " the Hoth of the universe.\nAfter the Soviet invasion of Eastern Europe and the subsequent Western anti-Soviet sentiment, Germany's arms manufacturers went into hyperdrive and the country churned out guns at a frightening pace. By the early 1930"
},
...
]
Alternatively, you can use our TextsGenerator to produce texts in Python.
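The combination rule described for the CLI example above can be sketched in plain Python. This is only an illustration of the counting logic, not the library's implementation: the function name expand_configurations and its parameters are hypothetical, and it assumes that repeated values for the same option name are treated as alternatives.

```python
from itertools import product


def expand_configurations(prompt, backends, temperatures, options, num_samples=1):
    """Hypothetical helper: one config per backend/temperature/option combination."""
    # Group option values by name, e.g. [("country", "Germany"), ("country", "France")]
    # becomes {"country": ["Germany", "France"]}.
    grouped = {}
    for name, value in options:
        grouped.setdefault(name, []).append(value)

    names = list(grouped)
    configs = []
    # Cartesian product over backends, temperatures and every choice of option values.
    for backend, temperature, values in product(
        backends, temperatures, product(*grouped.values())
    ):
        chosen = dict(zip(names, values))
        for sample_id in range(num_samples):
            configs.append({
                "sample_id": sample_id,
                "backend": backend,
                "temperature": temperature,
                **chosen,
                # Fill the {OPTION_NAME} placeholders in the prompt.
                "prompt": prompt.format(**chosen),
            })
    return configs


configs = expand_configurations(
    prompt="If {country} was a planet in the Star Wars universe it would be called",
    backends=["openai|text-davinci-003", "cohere|medium"],
    temperatures=[0.9],
    options=[("country", "Germany"), ("country", "France")],
)
print(len(configs))  # 4
```

With two backends, one temperature and two countries, the sketch yields the same 4 items as the example dataset.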
Generate conversations with the ChatGPT API
export OPENAI_API_KEY="..."
distiller conversations \
--length 4 \
--length 5 \
--agent1 "You're a shop assistant in a pet store. Answer to customer questions politely." \
--agent2 "You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store." \
--temperature 0.1 \
--temperature 0.2 \
--option n 2 \
--option n 3 \
--path dataset
The command above should produce a dataset directory with 16 conversations saved as JSON files. Pass --single-file if you want all of them saved in a single file instead. Why 16? Because num_samples dialogues are generated for each possible combination of parameters (conversation length, LLM temperature, and custom prompt options): 2 lengths × 2 temperatures × 2 values of n give 8 combinations, so 16 conversations corresponds to 2 samples per combination. A dataset item looks like this:
{
"length": 5,
"temperature": 0.1,
"n": "2",
"agent1": "You're a shop assistant in a pet store. Answer to customer questions politely. When the whole conversation is over end with \"Goodbye\".",
"agent2": "You're a customer in a pet store. You should behave like a human. You want to buy 2 pets. Ask questions about the pets in the store. When the whole conversation is over end with \"Goodbye\".",
"utterances": [
[
"agent1",
"Hello! How can I assist you today?"
],
[
"agent2",
"Hi! I'm interested in buying two pets. Can you tell me what kind of pets you have available in the store?"
],
[
"agent1",
"Certainly! We have a variety of pets available, including dogs, cats, birds, fish, hamsters, guinea pigs, rabbits, and reptiles. Is there a specific type of pet you're interested in?"
],
[
"agent2",
"I'm not sure yet. Can you tell me more about the dogs and cats you have available? What breeds do you have?"
],
...
]
}
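As a rough sketch of how such an item could be consumed downstream, the snippet below parses a shortened copy of the item above with the standard json module and prints it as a transcript. The helper name format_transcript is hypothetical, and the inline JSON is abbreviated for illustration; in practice you would load one of the generated files.

```python
import json

# Abbreviated copy of a dataset item; real items contain more fields and utterances.
item_json = """
{
  "length": 5,
  "temperature": 0.1,
  "n": "2",
  "utterances": [
    ["agent1", "Hello! How can I assist you today?"],
    ["agent2", "Hi! I'm interested in buying two pets."]
  ]
}
"""


def format_transcript(item):
    """Hypothetical helper: render [speaker, text] pairs as 'speaker: text' lines."""
    lines = []
    for speaker, text in item["utterances"]:
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)


item = json.loads(item_json)
transcript = format_transcript(item)
print(transcript)
```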
You can also use our ConversationsGenerator to produce conversations with the gpt-3.5-turbo API programmatically.
from distiller import ConversationsGenerator, ConversationsGeneratorConfig, DatasetWriter

dataset_writer = DatasetWriter()  # single_file=True

# num_samples conversations are generated for each combination of
# length, temperature and option value.
generator_config = ConversationsGeneratorConfig(
    agent1="You're a shop assistant in a pet store. Answer to customer questions politely.",
    agent2="You're a customer in a pet store. You should behave like a human. You want to buy {n} pets. Ask questions about the pets in the store.",
    num_samples=2,
    interruption="length",
    lengths=[4, 5],
    temperatures=[0.1, 0.2],
    options=[("n", "2"), ("n", "3")],
)

conversations_generator = ConversationsGenerator(generator_config)

# Save each conversation as soon as it is generated.
for conversation in conversations_generator:
    dataset_writer.save_intermediate_result(conversation)
Contributing
Still under active development.
Contributions will be highly appreciated. Currently these features are under development:
- distiller conversations: Make two ChatGPT agents talk with one another and record the conversation history.
- distiller texts: Inference different LLMs with a given input prompt and generate multiple outputs by varying parameters.
- distiller transformations: Apply a list of transformations to a list of texts. For example, summarizing a list of texts for a {child | university student | PhD candidate} to understand.
- Support more backend LLMs.
To set up a local development environment:
git clone https://github.com/radi-cho/distiller/
cd distiller
pip install -e .
CLI Reference
distiller [OPTIONS] COMMAND [ARGS]...
Command line interface that generates datasets with LLMs.
Options:
--help Show this message and exit.
Commands:
conversations Produce conversations between two gpt-3.5-turbo agents...
texts Inference multiple LLMs at scale.
distiller texts [OPTIONS]
Inference multiple LLMs at scale.
Options:
-p, --prompt TEXT Input prompt. [required]
-b, --backend TEXT LLM APIs to use as backends. Use
"backend|model_name" notation. For example:
"openai|text-davinci-003".
-l, --max-length INTEGER Maximum number of tokens to generate for each
prompt.
-t, --temperature FLOAT Possible temperature values for the backend
language model.
-n, --num-samples INTEGER Number of conversations for each configuration.
-o, --option <TEXT TEXT>... Values for additional options denoted in your
prompts by {OPTION_NAME}.
-f, --path PATH Where to save the dataset. Either a file or a
directory (folder).
-s, --single-file Either save the whole dataset to a single file
or create multiple files.
--help Show this message and exit.
- You can specify multiple variants for the following options: --length, --temperature, --num-samples, --option. A dataset item will be generated for each possible combination of the supplied values.
- Each --option provided must be formatted as follows: --option option_name "Some option value".
- Currently supported backends: GPT-3 model variants by OpenAI, the language models by Cohere, BLOOM through the Petals API.
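The "backend|model_name" notation used by --backend can be split with a few lines of Python. This is a minimal sketch for illustration; parse_backend is a hypothetical helper, not a function exported by the library.

```python
def parse_backend(spec):
    """Hypothetical helper: split 'backend|model_name' into its two parts."""
    backend, separator, model_name = spec.partition("|")
    # Reject values without a separator or with an empty side.
    if not separator or not backend or not model_name:
        raise ValueError(f'Expected "backend|model_name", got {spec!r}')
    return backend, model_name


print(parse_backend("openai|text-davinci-003"))  # ('openai', 'text-davinci-003')
```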
distiller conversations [OPTIONS]
Produce conversations between two gpt-3.5-turbo agents with given roles.
Options:
-k, --openai-api-key TEXT OpenAI API key.
-a, --agent1 TEXT Agent role description. [required]
-b, --agent2 TEXT Agent role description. [required]
-i, --interruption [length|end_phrase]
Interruption mode.
-e, --end-phrase TEXT Interrupt after this phrase is outputted by
one of the agents.
-d, --end-agent [agent1|agent2|both]
In which agent's messages to look for the
end phrase.
-l, --length INTEGER Maximum number of utterances for each agent.
A conversation sample will be generated for
each length.
-t, --temperature FLOAT Possible temperature values for the backend
language model.
-n, --num-samples INTEGER Number of conversations for each
configuration.
-o, --option <TEXT TEXT>... Values for additional options denoted in
your prompts by {OPTION_NAME}.
-f, --path PATH Where to save the dataset. Either a file or
a directory (folder).
-s, --single-file Either save the whole dataset to a single
file or create multiple files.
--help Show this message and exit.
- The length parameter specifies how many utterances each agent should make. A length of 4 typically produces 8 utterances in total.
- You can specify either length (default) or end_phrase as an interruption strategy. When using end_phrase, a conversation will be interrupted once the --end-phrase has appeared in the messages of the --end-agent (could be both). In this case, the lengths provided will be treated as maximum conversation lengths.
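The two interruption strategies described above could be expressed as a single stopping check. The sketch below is an assumption about the logic, not the library's code: should_stop and its parameters are hypothetical, and it assumes that a length of N allows at most 2 × N utterances in total.

```python
def should_stop(utterances, interruption, max_length, end_phrase=None, end_agent="both"):
    """Hypothetical check; utterances is a list of (agent, text) pairs."""
    # The length limit applies in both modes: each agent speaks at most max_length times.
    if len(utterances) >= 2 * max_length:
        return True
    # In end_phrase mode, also stop once the watched agent has said the phrase.
    if interruption == "end_phrase" and end_phrase and utterances:
        agent, text = utterances[-1]
        if end_agent in ("both", agent) and end_phrase in text:
            return True
    return False


history = [("agent1", "Hello!"), ("agent2", "Thanks for your help. Goodbye")]
print(should_stop(history, "end_phrase", 4, end_phrase="Goodbye", end_agent="agent2"))  # True
print(should_stop(history, "length", 4))  # False
```

Checking only the latest utterance keeps the test cheap when it is run after every turn.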
Roadmap
- Add Hugging Face polymorphic function to use any model on HF
- Add CLI arg for Hugging Face model to utilize
- Integrate tree of thoughts prompting
- Integrate multi-modality models from HF
- Integrate multi-modality conversation => store images in folder then agent2 looks up image address
- Enable Multi-Modal Dataset Generation using generate, critique