Language Model Evaluation Harness

Promptuna

promptuna evaluates and optimizes functions that use an LM to accomplish a goal.

Such functions (hereinafter "programs") contain more than the bare completion call: they can be surrounded by arbitrary code that prepares the prompt (pre-processing) and refines the model's output (post-processing), as is typically the case in real-world scenarios (you don't just "call a model" and return the raw output).
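As a minimal sketch of what such a program might look like: the function below wraps a completion call with pre- and post-processing. The names `fake_complete`, `PROMPT_TEMPLATE`, and `summarize_ticket` are illustrative assumptions, not part of promptuna's API, and the stub stands in for a real model call.

```python
from typing import Callable

# Hypothetical stand-in for a real completion call; not part of promptuna.
def fake_complete(prompt: str) -> str:
    return "  Summary: customer cannot log in.  \n"

# Illustrative prompt template; this is the part an optimizer would rewrite.
PROMPT_TEMPLATE = "Summarize the support ticket in one sentence.\n\nTicket: {ticket}"

def summarize_ticket(ticket: str, complete: Callable[[str], str] = fake_complete) -> str:
    # Pre-processing: collapse whitespace and truncate overly long input.
    cleaned = " ".join(ticket.split())[:2000]
    prompt = PROMPT_TEMPLATE.format(ticket=cleaned)

    # The bare completion call.
    raw = complete(prompt)

    # Post-processing: strip whitespace and a leading "Summary:" label.
    text = raw.strip()
    prefix = "Summary:"
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    return text

result = summarize_ticket("I   can't\nlog in since yesterday")
```

The value worth evaluating is what `summarize_ticket` returns, not the raw model output, which is why the whole function is the unit under test.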

In the refinement loop below, promptuna provides the primitives for defining the metrics that judge how well your program performs (step 3). It can then use those scores to drive automated improvements to the prompt template (step 4).

flowchart LR
    A[1. Make a program] --> B[2. Run the program]
    B --> C[3. Evaluate the program]
    C --> D[4. Improve the program]
    D --> B

The loop above maps directly onto the package layout:

| Step | Module | Role | Key API |
| --- | --- | --- | --- |
| 1. Make a program | promptuna.program | Wire what is under test | Program, Example, Experiment, LMConfig |
| 2. Run the program | promptuna.run | Execute a program on one dataset row | run_trial, Trial |
| 3. Evaluate the program | promptuna.evaluate | Score trials and run full experiments | Metric, run_experiment, RunResults, default_llm_judge |
| 4. Improve the program | promptuna.optimize | Search for a better prompt template | optimize, Step, OptimizationResult |

promptuna.report sits alongside evaluation and optimization: it renders RunResults and optimization trajectories as markdown (render_run, render_history).

See the getting started notebook for a full working example of this cycle end to end.

Optimization

Prompt-template search (OPRO-style) treats evaluation as multi-criteria: each candidate is scored on several normalized metrics, forming a quality vector in metric space. Before checkpoints are compared, that vector is collapsed by a fixed linear scalarization: the unweighted mean of per-metric means (RunResults.overall.mean). This is a compensatory aggregation, so gains on one metric can offset losses on another. The search is therefore single-objective in template space: it maximizes one scalar utility, keeps the best checkpoint seen so far, and does not explore a Pareto front over metrics. The proposer still receives per-metric breakdowns in the trajectory (render_history); only ranking and early stopping use the headline score.
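The scalarization described above can be illustrated with a small sketch. It assumes per-metric scores already normalized to [0, 1]; the function name `overall_mean`, the metric names, and the data are made up for illustration and do not reproduce promptuna's internals.

```python
from statistics import mean

def overall_mean(scores_by_metric: dict[str, list[float]]) -> float:
    """Unweighted mean of per-metric means (scores assumed to lie in [0, 1])."""
    per_metric = [mean(scores) for scores in scores_by_metric.values()]
    return mean(per_metric)

# Two hypothetical candidates, each scored on two dataset rows per metric.
candidate_a = {"accuracy": [0.9, 0.7], "brevity": [0.4, 0.4]}
candidate_b = {"accuracy": [0.6, 0.6], "brevity": [0.8, 0.6]}

score_a = overall_mean(candidate_a)  # accuracy 0.8, brevity 0.4 -> about 0.60
score_b = overall_mean(candidate_b)  # accuracy 0.6, brevity 0.7 -> about 0.65

# Neither candidate dominates the other (A wins on accuracy, B on brevity),
# yet the scalar ranking picks B: the brevity gain offsets the accuracy loss.
best = "b" if score_b > score_a else "a"
```

A Pareto-based search would keep both candidates as incomparable; the single headline score always forces a ranking.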

The optimizer uses the metrics to learn the representation of the data and the expectations of the task, then encodes that knowledge in the prompt template.

Inspiration

promptuna is a proud Frankenstein of DSPy, Ragas, OPRO and Optuna.

First and foremost, promptuna's value proposition is most similar to DSPy. The differences:

  • Programs. DSPy models a program as a composable graph of predictors (dspy.Module). promptuna treats a program as an ordinary Python function: arbitrary pre/post-processing around a completion call, without forcing signature/module abstractions.
  • Evaluation. DSPy passes a single metric callable to its optimizers, so multiple quality dimensions must be folded into that one function by hand. promptuna takes a list[Metric] instead: each metric has its own name, scale (Range, Ordinal, …), and scorer (programmatic or LLM judge). Results are then aggregated naively, as an unweighted mean, to collapse the metrics into the single optimization objective.
  • Optimization. DSPy offers several teleprompters. promptuna's simple optimizer is OPRO-style: it rewrites a free-form prompt template from a trajectory, using the same multi-metric evaluation harness at every step, keeping the full metric breakdown visible throughout the search.
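To make the list-of-metrics idea concrete, here is a rough sketch of named metrics that each carry their own scale and scorer and report a normalized score. `SketchMetric` is a hypothetical stand-in written for this example; promptuna's actual Metric, Range, and Ordinal types may be shaped quite differently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SketchMetric:
    """Hypothetical stand-in; promptuna's real Metric may look different."""
    name: str
    low: float   # lower bound of the raw scale
    high: float  # upper bound of the raw scale
    scorer: Callable[[str, str], float]  # (output, expected) -> raw score

    def normalized(self, output: str, expected: str) -> float:
        # Clip the raw score to the declared scale, then map it to [0, 1].
        raw = self.scorer(output, expected)
        clipped = min(max(raw, self.low), self.high)
        return (clipped - self.low) / (self.high - self.low)

# Two programmatic scorers on different raw scales.
exact_match = SketchMetric(
    "exact_match", 0, 1, lambda out, exp: float(out.strip() == exp.strip())
)
length_budget = SketchMetric(
    "length_budget", 0, 100, lambda out, exp: 100 - min(len(out), 100)
)

metrics = [exact_match, length_budget]

# Each metric keeps its own name, so the per-metric breakdown stays visible.
breakdown = {m.name: m.normalized("Paris", "Paris") for m in metrics}
```

Keeping the metrics separate until the final aggregation is what allows a per-metric breakdown to be reported alongside the headline score.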

Some ideas regarding evaluation metrics are taken from the seemingly already abandoned Ragas: named metrics where an LLM judge scores a trial against a rubric, with typed scales and optional rationales.

The optimization loop itself takes concepts from DeepMind's OPRO: at each step an LM proposer rewrites the prompt template from scratch using the full scored history of prior candidates.
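The shape of such a loop can be sketched as follows. This is an illustration under stated assumptions, not promptuna's implementation: `opro_style_search`, `toy_evaluate`, and `toy_propose` are hypothetical names, the proposer is a deterministic stub where a real one would be an LM reading the scored history, and early stopping is omitted.

```python
from typing import Callable

History = list[tuple[str, float]]  # (template, headline score) per checkpoint

def opro_style_search(
    seed_template: str,
    evaluate: Callable[[str], float],
    propose: Callable[[History], str],
    budget: int,
) -> tuple[str, float, History]:
    # Score the seed, then spend a fixed budget of proposal steps.
    history: History = [(seed_template, evaluate(seed_template))]
    for _ in range(budget):
        # The proposer sees the full scored history of prior candidates.
        candidate = propose(history)
        history.append((candidate, evaluate(candidate)))
    # Every checkpoint is archived; the best one seen is returned.
    best_template, best_score = max(history, key=lambda item: item[1])
    return best_template, best_score, history

# Toy stand-ins so the sketch runs without a model.
def toy_evaluate(template: str) -> float:
    keywords = ("concise", "cite", "json")
    return sum(k in template.lower() for k in keywords) / len(keywords)

def toy_propose(history: History) -> str:
    additions = ["Be concise.", "Cite sources.", "Answer in JSON."]
    return "Answer the question. " + " ".join(additions[: len(history)])

best_template, best_score, history = opro_style_search(
    "Answer the question.", toy_evaluate, toy_propose, budget=3
)
```

Because each candidate is written from the whole history rather than mutated from its predecessor, a bad step does not trap the search: the best checkpoint is still in the archive.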

The name of the package itself is a reference to the well-known Optuna, from which promptuna borrows the idea of a fixed-budget search over trials that archives every checkpoint and returns the best one seen.

License

MIT

Made with mold
