GEPA-powered prompt optimization for vowel eval spec generation

These details have not been verified by PyPI

Project links

Project description

vowel-optimization

GEPA-powered prompt optimization playground for vowel eval spec generation.

This repository contains a research tool for optimizing vowel's EVAL_SPEC_CONTEXT prompt using GEPA (Genetic Pareto). By iteratively generating evaluation specs and measuring their quality against ground-truth function implementations, it discovers improved prompts that produce better test cases.

Overview

vowel generates YAML evaluation specs from function signatures and descriptions. The quality of these specs depends heavily on the system prompt (EVAL_SPEC_CONTEXT) used during generation. This tool:

Evaluates the current prompt against a set of reference functions
Optimizes the prompt via GEPA's evolutionary search
Compares baseline vs optimized performance

Debug

Set LOGFIRE_ENABLED to true for enabling debugging and monitoring.

export LOGFIRE_ENABLED=true

echo "LOGFIRE_ENABLED=true" > .env

How It Works

EVAL_SPEC_CONTEXT (prompt)
         ↓
   Generate YAML evals
         ↓
   Run against ground truth
         ↓
     Score (pass rate)
         ↓
   GEPA: propose improvements
         ↓
     [repeat until convergence]

Each iteration:

Generates eval specs for reference functions (e.g., json_encode, slugify, levenshtein)
Executes specs against original implementations
Diagnoses failures (wrong expected values, invented raises, over-strict assertions)
Feeds structured failure reports to a proposer LLM to suggest prompt refinements

Installation

From source

git clone <this-repo>
cd vowel-optimization
pip install -e .

Dependencies

Python 3.11+
vowel
GEPA
pydantic-ai
logfire (for telemetry)

Quick Start

1. Evaluate Current Prompt

Test the default EVAL_SPEC_CONTEXT against all reference functions:

python -m vowel_optimization eval

Output:

============================================================
Evaluating [current] context (2453 chars)
============================================================

  json_encode... 88% (15/17)
  slugify... 92% (23/25)
  levenshtein... 100% (12/12)
  ...

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed

Failure breakdown:
  WRONG_EXPECTED: 8
  INVENTED_RAISES: 4
  FORMAT_MISMATCH: 2

2. Run Optimization

Use GEPA to improve the prompt over 50 metric evaluations:

python -m vowel_optimization optimize --max-calls 50

What happens:

GEPA starts with EVAL_SPEC_CONTEXT as seed
Evaluates on reference functions
Proposes improved versions via LLM reflection
Saves best candidate to optimized_context.txt

Options:

--max-calls N: Maximum GEPA iterations (default: 50)
--output path/to/file.txt: Save location for optimized prompt
--model MODEL: Override eval model (default: openrouter:google/gemini-3-flash-preview)
--proposer-model MODEL: Override proposer model
--seed-file path/to/seed.txt: Start from a previously optimized prompt instead of default

Example:

# Start from previous best, run 100 more iterations
python -m vowel_optimization optimize \
  --max-calls 100 \
  --seed-file optimized_context.txt \
  --output optimized_context_v2.txt

3. Compare Results

Compare prompts against each other or against the default:

# Compare one file vs default EVAL_SPEC_CONTEXT
python -m vowel_optimization compare optimized_context.txt

# Compare two specific files
python -m vowel_optimization compare optimized_context.txt optimized_context_v3.txt

# Compare multiple files
python -m vowel_optimization compare v1.txt v2.txt v3.txt

Output (1 file vs default):

1. Current EVAL_SPEC_CONTEXT:
============================================================
Average pass rate: 91%
...

2. optimized_context.txt:
============================================================
Average pass rate: 96%
...

============================================================
Comparison:
  EVAL_SPEC_CONTEXT: 91%
  optimized_context.txt: 96%
  Δ: +5%

Output (2+ files):

1. optimized_context.txt:
============================================================
Average pass rate: 85%
...

2. optimized_context_v3.txt:
============================================================
Average pass rate: 99%
...

============================================================
Comparison:
  optimized_context.txt: 85%
  optimized_context_v3.txt: 99%
  Δ: +14%

4. Evaluate Custom Prompts

Test any saved prompt file:

python -m vowel_optimization eval --context-file my_experiments/prompt_v3.txt

Changing Models

All models use the format provider:model-name. Default: openrouter:google/gemini-3-flash-preview

Eval Model

Controls the LLM that generates eval specs during optimization:

python -m vowel_optimization eval --model openrouter:anthropic/claude-3.5-sonnet

python -m vowel_optimization optimize --model openai:gpt-4o

Proposer Model

Controls the LLM that suggests prompt improvements (GEPA's reflection step):

python -m vowel_optimization optimize \
  --model openrouter:google/gemini-3-flash-preview \
  --proposer-model openrouter:anthropic/claude-3.5-sonnet

Tip: Use a stronger model for the proposer (e.g., Claude Sonnet) and a faster/cheaper model for eval generation (e.g., Gemini Flash).

Project Structure

vowel_optimization/
├── .git/
├── .gitignore
├── LICENSE
├── README.md
├── pyproject.toml
├── results/                   # Optimization outputs
└── src/
    └── vowel_optimization/
        ├── __init__.py
        ├── __main__.py        # CLI entry point
        ├── run_optimization.py # Main commands (eval, optimize, compare)
        ├── adapter.py         # GEPA adapter implementation
        ├── task.py            # Core: generate + score eval specs
        ├── functions.py       # Reference function wrappers
        └── definitions.py     # Ground truth implementations

Key Files

run_optimization.py: CLI commands and orchestration
adapter.py: Bridges GEPA's optimization loop with vowel's eval generation
task.py: Implements generate_and_score() — generates YAML, runs tests, diagnoses failures
definitions.py: Reference functions (json_encode, slugify, levenshtein, etc.)

How Optimization Works

GEPA Adapter

The VowelGEPAAdapter implements three key methods:

evaluate(): Generate eval specs for each function → run → return pass rates
make_reflective_dataset(): Convert failures into structured feedback (e.g., "Expected value wrong for case X")
propose_new_texts(): Call proposer LLM with failure diagnostics → get improved prompt

Failure Categories

task.py classifies failures into:

WRONG_EXPECTED: Generated expected value doesn't match actual output
INVENTED_RAISES: Spec expects exception but function returns normally
FORMAT_MISMATCH: String formatting details wrong (separators, whitespace)
OVER_STRICT_ASSERTION: Assertion too brittle (e.g., exact float equality)
BAD_INPUT: Invalid input in generated test case

These drive prompt refinements like:

"Add guidance to trace algorithm logic for expected values"
"Only test raises for code paths that actually throw"
"Use lenient type checking for bool/int compatibility"

Telemetry

Uses Logfire for observability. Logs:

Each evaluation run with scores and failure breakdown
GEPA optimization progress (candidates, scores, best context)
Individual function case results

Configure with LOGFIRE_TOKEN env var to enable cloud logging.

Adding Reference Functions

Edit definitions.py:

# definitions.py
def my_new_function(x: int) -> int:
    """Double the input."""
    return x * 2

# Add to FUNCTIONS dict at bottom
FUNCTIONS = {
    # ... existing functions ...
    "my_new_function": {
        "func": my_new_function,
        "description": "Doubles integer input",
    },
}

The optimizer will automatically include it in the next run.

Examples

Baseline Evaluation

$ python -m vowel_optimization eval

============================================================
Evaluating [current] context (2453 chars)
============================================================

  json_encode... 88% (15/17)
  slugify... 92% (23/25)
  levenshtein... 100% (12/12)

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed

Full Optimization Run

$ python -m vowel_optimization optimize --max-calls 30

Starting GEPA prompt optimization...
  Eval model: openrouter:google/gemini-3-flash-preview
  Proposer model: openrouter:google/gemini-3-flash-preview
  Max metric calls: 30
  Seed: default EVAL_SPEC_CONTEXT (2453 chars)
============================================================

[GEPA progress bar with candidate evaluation]

============================================================
Optimization Complete!
============================================================
Best validation score: 96.18%
Context length: 3127 chars
Saved to: optimized_context.txt

Iterative Refinement

# First round
python -m vowel_optimization optimize --max-calls 50 --output opt_v1.txt

# Second round starting from v1
python -m vowel_optimization optimize --max-calls 50 \
  --seed-file opt_v1.txt \
  --output opt_v2.txt

# Compare all versions
python -m vowel_optimization compare opt_v1.txt         # vs default
python -m vowel_optimization compare opt_v2.txt         # vs default
python -m vowel_optimization compare opt_v1.txt opt_v2.txt  # direct comparison

License

MIT

vowel: YAML-based evaluation framework
GEPA: Genetic Pareto for meta-optimization
pydantic-ai: Type-safe AI agent framework

Reference

Used @dmontagu's pydantic-ai-gepa-example as seed repository.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0

Feb 13, 2026

0.1.0

Feb 13, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vowel_optimization-0.2.0.tar.gz (24.2 kB view details)

Uploaded Feb 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vowel_optimization-0.2.0-py3-none-any.whl (23.3 kB view details)

Uploaded Feb 13, 2026 Python 3

File details

Details for the file vowel_optimization-0.2.0.tar.gz.

File metadata

Download URL: vowel_optimization-0.2.0.tar.gz
Upload date: Feb 13, 2026
Size: 24.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for vowel_optimization-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`90f37c5dd0d82091f8ae2e8dd1b27090841c47669273a4a2291adad75f92e22f`
MD5	`8503cebee86ebe68fe26bea62c5c8f71`
BLAKE2b-256	`7c815012bc1858ae9e7bb102611246126a48d7aee1d486cd413579eec9312119`

See more details on using hashes here.

File details

Details for the file vowel_optimization-0.2.0-py3-none-any.whl.

File metadata

Download URL: vowel_optimization-0.2.0-py3-none-any.whl
Upload date: Feb 13, 2026
Size: 23.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for vowel_optimization-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`768d39873fa49801414b0f5567bff2601c34a0be1ebd7988fe9e6cc528d78097`
MD5	`67fccdf1c5be960d451ab81c30d57172`
BLAKE2b-256	`3ae94d8ea0abcedefd48ff8a71e8a4ec978647e307ab09365e3442096f0650f6`

See more details on using hashes here.

vowel-optimization 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

vowel-optimization

Overview

Debug

How It Works

Installation

From source

Dependencies

Quick Start

1. Evaluate Current Prompt

2. Run Optimization

3. Compare Results

4. Evaluate Custom Prompts

Changing Models

Eval Model

Proposer Model

Project Structure

Key Files

How Optimization Works

GEPA Adapter

Failure Categories

Telemetry

Adding Reference Functions

Examples

Baseline Evaluation

Full Optimization Run

Iterative Refinement

License

Related

Reference

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes