A framework for optimizing prompts through multi-task evaluation and iterative improvement

Project description

Promptim

Promptim is an experimental prompt optimization library to help you systematically improve your AI systems.

Promptim automates the process of improving prompts on specific tasks. You provide an initial prompt, a dataset, and custom evaluators (and optional human feedback), and promptim runs an optimization loop to produce a refined prompt that aims to outperform the original.

For setup and usage details, see the Quick Start guide below.

Optimization

Quick start

Let's try prompt optimization on a simple task to generate tweets.

1. Install

First install the CLI.

pip install -U promptim

And make sure you have a valid LangSmith API Key in your environment. For the quick start task, we will use Anthropic's Claude model for our optimizer and for the target system.

LANGSMITH_API_KEY=CHANGEME
ANTHROPIC_API_KEY=CHANGEME

2. Create task

Next, create a task to optimize over. The CLI will walk you through the setup:

promptim create task ./my-tweet-task

Each task requires a few things. Provide each value when the CLI prompts for it.

  1. name: provide a useful name for the task (like "ticket classifier" or "report generator"). You may use the default here.
  2. prompt: this is an identifier in the LangSmith prompt hub. Use the following public prompt to start:
langchain-ai/tweet-generator-example-with-nothing:starter

Hit "Enter" to confirm cloning into your workspace (so that you can push optimized commits to it).

  3. dataset: this is the name (or public URL) for the dataset we are optimizing over. Optionally, it can have train/dev/test splits to report separate metrics throughout the training process.
https://smith.langchain.com/public/6ed521df-c0d8-42b7-a0db-48dd73a0c680/d

  4. description: this is a high-level description of the purpose for this prompt. The optimizer uses this to help focus its improvements.
Write informative tweets on any subject.

Once you've completed the template creation, you should have two files in the my-tweet-task directory:

└── my-tweet-task
    ├── config.json
    └── task.py

We can ignore the config.json file for now (we'll discuss that later). The last thing we need to do before training is create an evaluator.

3. Define evaluators

Next we need to quantify prompt performance on our task. What do "good" and "bad" look like? We do this using evaluators.

Open the evaluator stub written in my-tweet-task/task.py and find the line that assigns a score to a prediction:

    # Implement your evaluation logic here
    score = len(str(predicted.content)) < 180  # Replace with actual score

We are going to make this evaluator penalize outputs with hashtags. Update that line to be:

    score = int("#" not in result)

Next, update the evaluator name. We do this using the key field in the evaluator response.

    "key": "tweet_omits_hashtags",

To help the optimizer know the ideal behavior, we can add additional instructions in the comment field in the response.

Update the "comment" line to explicitly give pass/fail comments:

        "comment": "Pass: tweet omits hashtags" if score == 1 else "Fail: omit all hashtags from generated tweets",

And now we're ready to train! The final evaluator should look like:

from langchain_core.messages import AIMessage
from langsmith.schemas import Example, Run


def example_evaluator(run: Run, example: Example) -> dict:
    """An example evaluator. Larger numbers are better."""
    predicted: AIMessage = run.outputs["output"]

    result = str(predicted.content)
    score = int("#" not in result)
    return {
        "key": "tweet_omits_hashtags",
        "score": score,
        "comment": "Pass: tweet omits hashtags" if score == 1 else "Fail: omit all hashtags from generated tweets",
    }
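
If you'd like to sanity-check the evaluator before training, you can call it directly. The sketch below assumes the generated task.py exposes the evaluator through a list named evaluators (the variable the evaluators entry in config.json points to) and uses SimpleNamespace objects as hypothetical stand-ins for the real LangSmith Run and Example objects that promptim passes at train time:

# Export the evaluator list referenced by config.json, e.g. "./task.py:evaluators"
evaluators = [example_evaluator]

if __name__ == "__main__":
    # Quick local check with lightweight stand-ins (not real LangSmith objects)
    from types import SimpleNamespace

    fake_run = SimpleNamespace(outputs={"output": AIMessage(content="LLMs are neat. #AI")})
    fake_example = SimpleNamespace(inputs={"topic": "LLMs"}, outputs={})
    print(example_evaluator(fake_run, fake_example))
    # -> {'key': 'tweet_omits_hashtags', 'score': 0, 'comment': 'Fail: omit all hashtags from generated tweets'}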

4. Train

To start optimizing your prompt, run the train command:

promptim train --task ./my-tweet-task/config.json

You will see the progress in your terminal. Once it's completed, the training job will print out the final "optimized" prompt in the terminal, as well as a link to the commit in the prompt hub.

Explanation

Whenever you run promptim train, promptim first loads the prompt and dataset specified in your configuration. It then evaluates your prompt on the dev split (if present; otherwise the full dataset) using the evaluator(s) configured above. This gives us baseline metrics to compare against throughout the optimization process.

After computing a baseline, promptim begins optimizing the prompt by looping over minibatches of training examples. For each minibatch, promptim computes the metrics and then applies a metaprompt to suggest changes to the current prompt. It then applies that updated prompt to the next minibatch of training examples and repeats the process, continuing over the entire train split (if present; otherwise the full dataset).

After promptim has consumed the whole train split, it computes metrics again on the dev split. If the metrics show improvement (a higher average score), the updated prompt is retained for the next round. If the metrics are the same as or worse than the current best score, the prompt is discarded.

This process is repeated --epochs times before training terminates.
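
In pseudocode, the loop described above looks roughly like the sketch below. The helpers evaluate and propose_prompt (and the explicit split arguments) are illustrative assumptions standing in for promptim's internals, not its actual API:

from typing import Callable, Sequence

def optimize(
    prompt: str,
    train_split: Sequence[dict],
    dev_split: Sequence[dict],
    evaluate: Callable[[str, Sequence[dict]], float],             # average evaluator score
    propose_prompt: Callable[[str, Sequence[dict], float], str],  # metaprompt update step
    epochs: int = 3,
    batch_size: int = 10,
) -> str:
    # Baseline metrics on the dev split before any updates
    best_prompt, best_score = prompt, evaluate(prompt, dev_split)
    for _ in range(epochs):
        candidate = best_prompt
        for i in range(0, len(train_split), batch_size):
            minibatch = train_split[i : i + batch_size]
            score = evaluate(candidate, minibatch)
            # The metaprompt sees the current prompt and the scored minibatch,
            # then suggests a revised prompt to try on the next minibatch.
            candidate = propose_prompt(candidate, minibatch, score)
        # After a full pass over the train split, re-score on the dev split
        dev_score = evaluate(candidate, dev_split)
        if dev_score > best_score:  # keep only strict improvements
            best_prompt, best_score = candidate, dev_score
    return best_prompt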

How to:

Add human labels

To add human labeling using the annotation queue:

  1. Set up an annotation queue: When running the train command, use the --annotation-queue option to specify a queue name:

    promptim train --task ./my-tweet-task/config.json --annotation-queue my_queue
    
  2. During training, the system will pause after each batch and print out instructions on how to label the results. It will wait for human annotations.

  3. Access the annotation interface:

    • Open the LangSmith UI
    • Navigate to the specified queue (e.g., "my_queue")
    • Review and label as many examples as you'd like, adding notes and scores
  4. Resume:

    • Type 'c' in the terminal
    • The training loop will fetch your annotations and include them in the metaprompt's next optimization pass

This human-in-the-loop approach allows you to guide the prompt optimization process by providing direct feedback on the model's outputs.

Reference

CLI Arguments

The current CLI arguments are as follows. They are experimental and may change in the future:

Usage: promptim [OPTIONS] COMMAND [ARGS]...

  Optimize prompts for different tasks.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  create  Commands for creating new tasks and examples.
  train   Train and optimize prompts for different tasks.

create

Usage: promptim create [OPTIONS] COMMAND [ARGS]...

  Commands for creating new tasks and examples.

Options:
  --help  Show this message and exit.

Commands:
  example  Clone a pre-made tweet generation task
  task     Walkthrough to create a new task directory from your own prompt and dataset

promptim create task

Usage: promptim create task [OPTIONS] PATH

  Create a new task directory with config.json and task file for a custom
  prompt and dataset.

Options:
  --name TEXT         Name for the task.
  --prompt TEXT       Name of the prompt in LangSmith
  --description TEXT  Description of the task for the optimizer.
  --dataset TEXT      Name of the dataset in LangSmith
  --help              Show this message and exit.

train

Usage: promptim train [OPTIONS]

  Train and optimize prompts for different tasks.

Options:
  --task TEXT              Task to optimize. You can pick one off the shelf or
                           select a path to a config file. Example:
                           'examples/tweet_writer/config.json'
  --batch-size INTEGER     Batch size for optimization
  --train-size INTEGER     Training size for optimization
  --epochs INTEGER         Number of epochs for optimization
  --debug                  Enable debug mode
  --annotation-queue TEXT  The name of the annotation queue to use. Note: we
                           will delete the queue whenever you resume training
                           (on every batch).
  --no-commit              Do not commit the optimized prompt to the hub
  --help                   Show this message and exit.
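
For example, the following invocation (using only the flags documented above) runs a short optimization over the quick start task without pushing the result to the prompt hub:

promptim train --task ./my-tweet-task/config.json --epochs 2 --no-commit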

Configuration

The schema for your config.json file can be found in config-schema.json.

It contains the following arguments:

  • name (string, required): The name of your task.
  • dataset (string, required): The name of the dataset in LangSmith to be used for training and evaluation.
  • initial_prompt (object, required): Configuration for the initial prompt to be optimized.
    • identifier (string, optional): Identifier of a prompt in the LangSmith prompt hub to optimize. Do not provide if using a prompt string directly.
    • prompt_str (string, optional): A prompt template string to optimize directly (as in the example below). Provide either this or identifier, not both.
    • model_config (object, optional): Configuration for the model used in optimization.
    • which (integer, default: 0): Which message in the prompt to optimize.
  • description (string, optional): A detailed explanation of the task's objectives and constraints.
  • evaluator_descriptions (object, optional): A mapping of evaluator names to their descriptions.
  • optimizer (object, optional): Configuration for the optimization process.
    • model (object, required): Configuration for the model used in optimization, including model name and parameters.
  • evaluators (string, required): Path to the Python file and variable name containing the evaluator functions for the task. Example: ./task/evaluators.py:evaluators
  • system (string, optional): Path to the Python file defining the custom system for making predictions. If not provided, one will be constructed for you (containing just a prompt and LLM). Example: ./task/my_system.py:chain

Below is an example config.json file:

{
  "name": "Tweet Generator",
  "dataset": "tweet_dataset",
  "initial_prompt": {
    "prompt_str": "Write a tweet about {topic} in the style of {author}",
    "which": 0
  },
  "description": "Generate engaging tweets on various topics in the style of different authors",
  "evaluator_descriptions": {
    "engagement_score": "Measures the potential engagement of the tweet",
    "style_match": "Evaluates how well the tweet matches the specified author's style"
  },
  "evaluators": "./tweet_evaluators.py:evaluators",
  "optimizer": {
    "model": {
      "name": "gpt-3.5-turbo",
      "temperature": 0.7
    }
  }
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

promptim-0.0.3rc1.tar.gz (5.7 MB)

Uploaded Source

Built Distribution

promptim-0.0.3rc1-py3-none-any.whl (33.9 kB)

Uploaded Python 3

File details

Details for the file promptim-0.0.3rc1.tar.gz.

File metadata

  • Download URL: promptim-0.0.3rc1.tar.gz
  • Upload date:
  • Size: 5.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.29

File hashes

Hashes for promptim-0.0.3rc1.tar.gz
Algorithm Hash digest
SHA256 f810c83639d5fb3f95ca67b53a84520ced3f344f51b40bec0b0eca80a62a37a3
MD5 a32f89c7d065dd2efb133cccfa93ed4e
BLAKE2b-256 178b108d49f7d2afd52f05c7fe31e20850b85f5e7a1db37dcaca7883ff80f7cc


File details

Details for the file promptim-0.0.3rc1-py3-none-any.whl.

File metadata

File hashes

Hashes for promptim-0.0.3rc1-py3-none-any.whl
Algorithm Hash digest
SHA256 fc0f12e951eb9cc29312c59de01eeefa1c001587706555623ae158e6b9f55900
MD5 8f8fafd636aefc1bdf496d5fb85baedb
BLAKE2b-256 40854993001f08d4a5af98603cea897c6cf9a1f1979e70df58d8d418ae26e10c

