Skip to main content

High-throughput prompt evolution framework - production fork of GEPA with island-based parallelism, async orchestration, and ASHA successive halving.

Project description

TurboGEPA Logo

TurboGEPA: High-Throughput Prompt Evolution

The fastest way to reflectively evolve through the prompt space.

Up to 17× faster than classic GEPA on AIME (30-example OSS‑20/Grok‑4 benchmark), exploring ~6× more candidates in the same wall time.

Goal: Take GEPA's core reflective optimization approach and, trading token efficiency for speed, reach optimal prompts and temperature settings as rapidly as possible.

🚀 What is TurboGEPA?

TurboGEPA is a high-performance fork of the GEPA (Genetic-Pareto) framework designed for maximum speed of prompt evolution. While preserving GEPA's core innovation of LLM-based reflection for text evolution, TurboGEPA introduces:

  • Maximized Concurrency: Async orchestration scales to available compute (bounded by shard size + max_total_inflight per island)
  • 🏝️ Island-Based Parallelism: Concurrent islands broadcast elites across the swarm to preserve diversity without extra processes
  • 📊 ASHA Successive Halving: Prunes most underperformers early to reduce wasted evaluations
  • 🧬 Dual Mutation Strategy: Blends reflection edits with Prompt-MII-style spec induction for exploration vs. exploitation
  • 📈 Parent-Weighted Scheduling: Recent improvement history boosts promising lineages to the front of the queue
  • 🌡️ Two-Phase Optimization: Prompt evolution first, optional temperature sweep second
  • 🚦 Convergence & Lineage Guards: Per-candidate auto-stop and lineage fast-tracks keep stagnating prompts moving forward
  • ⚙️ Adaptive Runtime Control: Parent-aware early stopping, latency-based concurrency tuning, rung-aware mutation budgets, and runtime shard tuning keep tokens focused where they matter
  • 🧾 Lineage-Aware Mutations: Mutators receive parent/child score history and failure summaries to guide the next edits
  • 🔧 Adaptive Configuration: Auto-tunes concurrency, batch sizes, and shard settings based on dataset size

Built on GEPA

TurboGEPA extends the GEPA algorithm proposed in:

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning Lakshya A Agrawal et al., 2025 arXiv:2507.19457 Paper | Original Repository

All credit for the core GEPA algorithm, reflective mutation strategy, and Pareto-aware selection goes to the original authors. TurboGEPA focuses on maximizing speed to evolution by trading token efficiency for aggressive parallelism and early pruning.


💡 Best Practices

Optimize Cheap, Deploy Expensive

Modern LLMs have advanced to where even small, fast models are capable of sophisticated prompt reflection and generation. Recent research shows that prompt optimizations transfer effectively from cheaper models to more expensive ones.

Recommended workflow:

  1. Optimize with fast models: Use TurboGEPA with grok-4-fast (reflection) + gpt-oss-120b (task) for rapid exploration
  2. Validate on target model: Test the optimized prompts on your production model
  3. Deploy with confidence: Optimized prompts typically transfer well, giving you the best of both worlds—fast optimization + production quality

Why this works:

  • Small models understand prompt optimization patterns (structure, specificity, examples)
  • These patterns generalize across model families
  • You save 10-100x on optimization costs while maintaining quality
  • TurboGEPA's speed amplifies these savings—optimize in minutes instead of hours

Example:

# Optimize with cheap, fast models
adapter = DefaultAdapter(
    dataset=trainset,
    task_lm="openrouter/openai/gpt-oss-120b:nitro",     # Student model (fast, cheap)
    reflection_lm="openrouter/x-ai/grok-4-fast"          # Optimizer model (fast, smart)
)

result = adapter.optimize(seeds=["You are a helpful assistant."], max_rounds=10)

# Extract best prompt from Pareto entries
entries = result.get("pareto_entries", [])
best = max(entries, key=lambda e: e.result.objectives.get("quality", 0.0)) if entries else None
optimized_prompt = best.candidate.text if best else ""

# Deploy to production with expensive model
production_result = expensive_model.run(optimized_prompt, production_data)

📦 Installation

Recommended (Developer) Setup

git clone https://github.com/Studio-Intrinsic/turbo-gepa.git
cd turbo-gepa
uv sync --extra dev --python 3.11

This creates a local .venv with all runtime and development tooling (ruff, pytest, pre-commit). Activate it with source .venv/bin/activate (macOS/Linux) or .\.venv\Scripts\activate (Windows) before running commands.

Install from Source (pip/virtualenv)

git clone https://github.com/Studio-Intrinsic/turbo-gepa.git
cd turbo-gepa
pip install -e ".[dev]"

PyPI (if published)

pip install turbo-gepa

The package name on PyPI is turbo-gepa. If you are working from this repository and want reproducible builds, prefer the source installation above.

Optional Dependencies

# For DSPy integration (source install)
pip install -e ".[dspy]"

# For development tooling
pip install -e ".[dev]"

# For everything (runtime + extras)
pip install -e ".[full]"

Verify Installation

python -c "import turbo_gepa; print('✅ TurboGEPA installed successfully')"

🎯 Quick Start

TurboGEPA: Simple Prompt Optimization

from turbo_gepa.adapters import DefaultAdapter

# Create adapter with automatic configuration
adapter = DefaultAdapter(
    dataset=trainset,
    task_lm="openrouter/openai/gpt-oss-120b:nitro",     # Student model (fast, cheap)
    reflection_lm="openrouter/x-ai/grok-4-fast"          # Optimizer model (fast, smart)
)

# Optimize with multi-island parallelism
result = adapter.optimize(
    seeds=["You are a helpful assistant."],
    max_rounds=10
)

# Extract the best candidate from the Pareto entries
entries = result.get("pareto_entries", [])
if entries:
    best = max(entries, key=lambda e: e.result.objectives.get("quality", 0.0))
    best_text = best.candidate.text
    best_quality = best.result.objectives.get("quality", 0.0)
    print(f"Best prompt: {best_text}")
print(f"Quality: {best_quality:.2%}")
print(f"Pareto frontier: {len(result.get('pareto_entries', []))} candidates")

Turbo speed configuration example

Built-in reflection strategies

TurboGEPA ships with multiple reflection styles that can be combined or selected via Config.reflection_strategy_names (or --strategies in the helper scripts):

Name Purpose
incremental_reflection Original GEPA-style prompt edits that splice together high-performing parent prompts with fresh traces.
spec_induction PROMPT-MII-style specification induction: build entirely new instructions directly from task examples/solutions.
interleaved_thinking Teaches the student model to alternate between <think> (private reasoning) and <answer> (public output) blocks, ensuring verifiable step-by-step reasoning and a final <answer> with only the solution.

When to use which?

  • Use incremental_reflection when you already have a few reasonable prompts and just want fast, low-token improvements. It tends to converge quickly and is the safest default.
  • Layer in spec_induction when seeds are weak or you want broader exploration; it can invent entirely new instructions straight from labeled examples.
  • Enable interleaved_thinking when downstream reviewers (or the task itself) benefit from explicit reasoning traces—e.g., math, safety, or audit-friendly flows that require <think>/<answer> alternation.

Blend them by listing multiple names; the mutator automatically re-weights strategies based on which ones are delivering better quality during the run. All built-in strategy definitions (and reference system prompts) live in src/turbo_gepa/strategies/__init__.py so you can inspect or extend them easily.

Example:

from turbo_gepa.config import Config
from turbo_gepa.scoring import maximize_metric

config = Config(
    reflection_strategy_names=(
        "incremental_reflection",
        "spec_induction",
        "interleaved_thinking",
    )
)

Add your own ReflectionStrategy objects by setting config.reflection_strategies (they’ll be appended to the built-ins).

from turbo_gepa.config import Config

config = Config(
    shards=(0.3, 1.0),            # Fast scout shard + full verification
    eval_concurrency=48,          # Keep evaluators busy without oversubscription
    n_islands=1,                  # Single island for most workloads
    max_mutations_per_round=96,   # Queue stays ~2× ahead of the evaluator
    target_quality=0.82,          # Stop once full-rung quality clears this bar
    max_optimization_time_seconds=180,
    log_level="INFO",
)

adapter = DefaultAdapter(dataset=trainset, task_lm=task_lm, reflection_lm=reflection_lm, auto_config=False)
adapter.config = config
seed_prompt = "You are a helpful assistant. Provide your final answer as '### <answer>'."
result = adapter.optimize(seeds=[seed_prompt])

Customize scoring and rewards

TurboGEPA always optimizes a single scalar, but you control how that scalar is computed. Both DefaultAdapter and DSpyAdapter accept a scoring_fn argument (propagated to Config.scoring_fn) that receives a ScoringContext with the candidate, rung metadata, coverage fraction, and the raw objectives produced by your evaluator. Return whatever float best matches your north-star reward.

from turbo_gepa.adapters import DefaultAdapter
from turbo_gepa.scoring import ScoringContext

def reward(ctx: ScoringContext) -> float:
    quality = float(ctx.result.objectives.get("quality", 0.0))
    tokens = float(ctx.result.objectives.get("tokens", 0.0))
    # Trade 90% weight on accuracy for 10% token efficiency
    return 0.9 * quality - 0.1 * (tokens / 10_000.0)

adapter = DefaultAdapter(
    dataset=trainset,
    task_lm=task_lm,
    reflection_lm=reflection_lm,
    scoring_fn=reward,
)

Need the classic behavior? Use the helper from turbo_gepa.scoring import maximize_metric to promote any metric key (e.g., "accuracy", "reward", "bleu"). DSPy runs use the same API—pass scoring_fn to DSpyAdapter and combine DSPy metrics however you like.

Need strict, reproducible shard boundaries? Manually pass a tuple to Config(shards=...) and reuse the same configuration across runs.

TurboGEPA: DSPy Program Optimization

import asyncio
from turbo_gepa.adapters.dspy_adapter import DSpyAdapter
import dspy

# Define your DSPy module
class QAModule(dspy.Module):
    def __init__(self):
        self.predictor = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.predictor(question=question)

# Configure DSPy with fast, cheap model
dspy.configure(lm=dspy.LM("openrouter/openai/gpt-oss-120b:nitro"))

# Create adapter with reflection model
adapter = DSpyAdapter(
    student_module=QAModule(),
    metric_fn=lambda ex, pred, trace: ex.answer in str(pred.answer),
    trainset=trainset,
    reflection_lm="openrouter/x-ai/grok-4-fast"
)

# Optimize asynchronously
result = asyncio.run(
    adapter.optimize_async(
        seed_instructions={"predictor": "Answer precisely."},
        max_rounds=10,
    )
)

best_program = result['best_program']

🏗️ Architecture

TurboGEPA Implementation (src/turbo_gepa/)

TurboGEPA is a high-throughput production fork of GEPA with:

  • Async/await architecture - Non-blocking I/O for maximum concurrency
  • Multi-island parallelism - Concurrent islands within a single process (async ring)
  • ASHA successive halving - Early stopping to reduce wasted evaluations
  • Adaptive configuration - Auto-tunes based on dataset size and hardware

Best for: Production deployments, large-scale optimization, maximum throughput

Performance vs Original GEPA

Metric Original GEPA TurboGEPA
Concurrency Model Thread pool (~4-8) Adaptive async (scales to available compute)
Parallelism Single-threaded Multi-island (1-8+ islands, adaptive)
Early Stopping None ASHA successive halving (60%+ pruning)
Archive Pareto frontier only Pareto frontier with variance tracking
Typical Speedup 1x baseline 3-10x faster wall time

📚 Documentation

Need to mount TurboGEPA caches/logs on a shared filesystem (e.g., Modal Volumes) so multiple workers can reuse evaluations? See docs/modal_volumes.md for environment variables and locking behavior.

Need to fan islands out across multiple processes or Modal functions? Use the distributed worker CLI:

turbo-gepa-worker \
  --factory path.to.module:build_adapter \
  --worker-id 0 \
  --worker-count 4 \
  --islands-per-worker 2 \
  --seeds-json seeds.json

Each worker mounts the same cache/log directories (see the Modal docs above) and exchanges elites via the volume-backed migration backend.

Set TURBOGEPA_CONTROL_PATH (or pass --control-dir) to a shared directory on that volume so workers can drop stop.json and heartbeat files there. The first worker that hits the north-star target writes the stop file and everyone else exits as soon as they see it. Use TURBOGEPA_RUN_ID/--run-id to share a run identifier across processes so the control files stay scoped to a single run.

Need a turnkey Modal deployment? See examples/modal_turbo_aime.py for a distributed benchmark that mounts a Modal Volume, forwards OpenRouter secrets, and orchestrates workers via turbo-gepa-worker.

Tooling & Debugging Helpers

  • Live Visualization Dashboard: Monitor your run in real-time with a rich UI.
    # Local
    python scripts/viz_server.py
    
    # Modal (remote)
    modal serve scripts/modal_progress_server.py
    
  • run_turbo_validation(...) in examples/aime_benchmark_v2.py – re-evaluate the best TurboGEPA prompt on a full train/val split with the task LLM only (no Turbo orchestration), to measure true full-dataset quality.
  • scripts/analyze_turbo_run.py – inspect a previous TurboGEPA run from .turbo_gepa/metrics (time-to-target, total evaluations, LLM calls, mutation counts, best observed shard/quality).
  • turbo_gepa.distributed.run_local_multiworker – run multiple TurboGEPA workers as local processes sharing a cache/log/control directory; see examples/local_multiworker_bench.py for a concrete 4-worker AIME benchmark.

Core Concepts

Candidate: A mapping from component names to text (e.g., {"system_prompt": "You are..."})

Adapter: Integration point between GEPA/TurboGEPA and your system. Implements evaluation and reflection.

Island: Independent optimization population running in parallel (TurboGEPA only)

Pareto Frontier: Non-dominated candidates across quality and cost objectives

Available Adapters

TurboGEPA Adapters

  • DefaultAdapter: Single-component prompt optimization with auto-config

    • Location: src/turbo_gepa/adapters/default_adapter.py
    • Features: Async evaluation, multi-island, ASHA pruning
    • Example: see the Quick Start snippet above
  • DSpyAdapter: DSPy program instruction optimization

    • Location: src/turbo_gepa/adapters/dspy_adapter.py
    • Features: Trace capture, feedback functions, LLM reflection
    • Documentation

Built-in reflection strategies

TurboGEPA ships with three built-in reflection/spec strategies:

  • incremental_reflection – small, quality-preserving edits based on recent parent performance
  • spec_induction – spec-style prompt induction from example traces
  • interleaved_thinking – prompts that enforce alternating <think>/<answer> reasoning steps

The AIME Turbo benchmark configuration uses all three strategies by default; you can restrict or re-order them via Config.reflection_strategy_names.


🔬 How It Works

High-Level Architecture (Single Island)

graph TB
    Start[Input<br/>Dataset + Seed Prompts] --> Phase1{Phase 1<br/>Optimization Loop}

    Phase1 --> Mutate[Generate Mutations<br/>Reflection + Spec Induction]
    Mutate --> Eval[ASHA Evaluation<br/>Concurrent async]
    Eval --> Archive[Update Archive<br/>Pareto Frontier]
    Archive --> Check1{Quality<br/>Target Met?}

    Check1 -->|No| Phase1
    Check1 -->|Yes| Phase2{Phase 2 (optional)<br/>Temperature Cycling}

    Phase2 --> TempExplore[Temperature Exploration<br/>±0.2 variations]
    TempExplore --> EvalTemp[Evaluate Variants]
    EvalTemp --> ArchiveTemp[Update Archive]
    ArchiveTemp --> Check2{Auto-Stop<br/>Criteria?}

    Check2 -->|No improvement| Phase2
    Check2 -->|Converged| Results[Output<br/>Best Candidate<br/>Pareto Frontier]

    style Start fill:#e1f5ff
    style Phase1 fill:#fff3cd
    style Mutate fill:#d4edda
    style Eval fill:#d1ecf1
    style Archive fill:#ffeaa7
    style Phase2 fill:#fdcb6e
    style TempExplore fill:#fab1a0
    style Results fill:#d4edda

Two-Phase Process (optional):

  • Phase 1: Main optimization with LLM-based mutations (reflection + spec induction + interleaved thinking) and ASHA-style pruning. This is the default path used in fast profiles such as the 30-example AIME benchmarks.
  • Phase 2 (optional): When optimize_temperature_after_convergence=True, TurboGEPA runs a short temperature sweep on top prompts as a second stage (applied only when explicitly enabled).
  • Auto-Stop: Exits the optimization loop when convergence is detected (no improvement across multiple rounds) or a target quality threshold is reached.

Island-Based Parallelism

graph TD
    subgraph Island1[Island 1]
    Pop1[Population 1<br/>25 candidates]
    Arch1[Local Archive]
    end

    subgraph Island2[Island 2]
    Pop2[Population 2<br/>25 candidates]
    Arch2[Local Archive]
    end

    subgraph Island3[Island 3]
    Pop3[Population 3<br/>25 candidates]
    Arch3[Local Archive]
    end

    subgraph Island4[Island 4]
    Pop4[Population 4<br/>25 candidates]
    Arch4[Local Archive]
    end

    Arch1 -->|Continuously<br/>Top-3 elites| Pop2
    Arch2 -->|Continuously<br/>Top-3 elites| Pop3
    Arch3 -->|Continuously<br/>Top-3 elites| Pop4
    Arch4 -->|Continuously<br/>Top-3 elites| Pop1

    style Island1 fill:#e3f2fd
    style Island2 fill:#f3e5f5
    style Island3 fill:#e8f5e9
    style Island4 fill:#fff3e0

    Arch1 -->|Broadcast| Pop2
    Arch1 -->|Broadcast| Pop3
    Arch1 -->|Broadcast| Pop4
    Arch2 -->|Broadcast| Pop1
    Arch2 -->|Broadcast| Pop3
    Arch2 -->|Broadcast| Pop4
    Arch3 -->|Broadcast| Pop1
    Arch3 -->|Broadcast| Pop2
    Arch3 -->|Broadcast| Pop4
    Arch4 -->|Broadcast| Pop1
    Arch4 -->|Broadcast| Pop2
    Arch4 -->|Broadcast| Pop3

Benefits:

  • Parallelism: 4 islands explore simultaneously (4× throughput)
  • Diversity: Every island now broadcasts its top elites to all peers (not just the next hop), so promising edits propagate immediately while each island still evolves independently.
  • Robustness: Different islands may discover different high-quality regions and immediately cross-pollinate those parents, preventing stagnation.

Note: By default, islands run as async tasks in a single process via spawn_islands, and the LocalQueueMigrationBackend broadcasts elites to all peers. When you use distributed workers (turbo-gepa-worker or run_local_multiworker with FileMigrationBackend), the same broadcast pattern applies across processes. Tune sharing frequency with migration_k/migration_period, and set TURBOGEPA_CONTROL_PATH/--control-dir plus TURBOGEPA_RUN_ID when coordinating multiple workers.

TurboGEPA Two-Phase Optimization (advanced mode)

graph TB
    subgraph Phase1["Phase 1: Prompt Evolution"]
        Start[Parent Contexts<br/>prompt + traces + failures] --> Allocate{Adaptive Budget<br/>Allocation}

        Allocate -->|60-70%| Reflection[Incremental Reflection]
        Allocate -->|30-40%| SpecInd[Spec Induction<br/>Prompt-MII]

        Reflection --> RefPrompt["LLM Prompt:<br/>'Edit this prompt to fix failures'"]
        SpecInd --> SpecPrompt["LLM Prompt:<br/>'Generate FRESH spec from patterns'"]

        RefPrompt --> RefLLM[Reflection LLM]
        SpecPrompt --> RefLLM

        RefLLM --> RefOut[Edited prompts<br/>incremental changes]
        RefLLM --> SpecOut[Fresh specifications<br/>novel approaches]

        RefOut --> Pool[Candidate Pool]
        SpecOut --> Pool

        Pool --> Validate{Pass<br/>Validators?}
        Validate -->|Yes| ASHA1[ASHA Evaluation]
        Validate -->|No| Discard[Discard]

        ASHA1 --> Archive1[Archive<br/>Pareto Frontier]
    end

    Archive1 --> Convergence{Converged?<br/>Auto-Stop}
    Convergence -->|Yes| Phase2Start[Phase 2 Begins]
    Convergence -->|No| Start

    subgraph Phase2["Phase 2: Temperature Cycling"]
        Phase2Start --> TopPrompts[Select Top Prompts<br/>from Phase 1]
        TopPrompts --> TempRange["Generate Temperature Variants<br/>0.0, 0.3, 0.5, 0.7, 1.0<br/>±0.2 around baseline"]
        TempRange --> ASHA2[ASHA Evaluation<br/>Temperature Grid]
        ASHA2 --> Archive2[Final Archive<br/>Best prompt + temperature]
    end

    Archive2 --> Results[Optimized Prompt<br/>+ Optimal Temperature]

    style Phase1 fill:#e3f2fd
    style Phase2 fill:#fff3e0
    style Start fill:#e1f5ff
    style Reflection fill:#d4edda
    style SpecInd fill:#fff3cd
    style Archive1 fill:#d1ecf1
    style Archive2 fill:#d1ecf1
    style Results fill:#c8e6c9
    style Convergence fill:#ffeaa7

Phase 1: Prompt Evolution

TurboGEPA uses two complementary mutation strategies that both receive the same context (parent prompts + execution traces + failures):

1. Incremental Reflection

  • Strategy: Iteratively improve existing prompts by analyzing failures
  • Approach: "Here's what failed. Edit the prompt to fix these specific issues."
  • Best for: Fine-tuning and debugging existing prompts

2. Spec Induction - Prompt-MII Style

  • Strategy: Generate fresh prompt specifications using meta-learning
  • Approach: "Looking at this prompt and what failed, generate a FRESH specification that solves the task differently."
  • Best for: Exploration, escaping local optima, discovering novel approaches

Operator weighting: TurboGEPA tracks per-operator success rates via Metrics and can bias mutation budgets toward strategies that are working well, but the exact 60/40 split is conceptual here rather than a hard-coded allocator.

Auto-Stop: Phase 1 automatically terminates when convergence is detected (no improvement across multiple rounds) or a target quality threshold is hit, saving compute.

Phase 2 (optional): Temperature Cycling

When optimize_temperature_after_convergence=True, TurboGEPA freezes prompt exploration and runs a focused temperature sweep:

  • Select top prompts from the Phase 1 Pareto frontier
  • Generate temperature variants anchored to each prompt’s baseline temperature
  • Enable temperature mutations only and evaluate with the same ASHA-style scheduler (no further prompt edits)
  • Output: Best prompt paired with an empirically tuned temperature

TurboGEPA Enhancements

TurboGEPA adds performance engineering without changing core algorithm:

1. ASHA Successive Halving

graph TD
    Start[100 Candidates Start] --> Rung1

    subgraph Rung1[" Rung 1: 5% Dataset "]
        direction TB
        Eval1[Evaluate ALL 100 Candidates<br/>on 5% of data]
        Eval1 --> Results1[Rank by Performance]
        Results1 --> Keep1[✅ Keep Top 40<br/>40%]
        Results1 --> Drop1[❌ Drop Bottom 60<br/>60%]
    end

    Keep1 --> Rung2

    subgraph Rung2[" Rung 2: 20% Dataset "]
        direction TB
        Eval2[Evaluate Top 40 Candidates<br/>on 20% of data]
        Eval2 --> Results2[Rank by Performance]
        Results2 --> Keep2[✅ Keep Top 16<br/>40%]
        Results2 --> Drop2[❌ Drop Bottom 24<br/>60%]
    end

    Keep2 --> Rung3

    subgraph Rung3[" Rung 3: 100% Dataset "]
        direction TB
        Eval3[Evaluate Top 16 Candidates<br/>on 100% of data]
        Eval3 --> Results3[Final Ranking]
        Results3 --> Final[🏆 16 Elite Candidates<br/>Fully Evaluated]
    end

    Final --> Archive[Add to Archive]

    style Start fill:#e1f5ff
    style Rung1 fill:#fff3cd
    style Rung2 fill:#ffeaa7
    style Rung3 fill:#d4edda
    style Keep1 fill:#b2fab4
    style Keep2 fill:#b2fab4
    style Drop1 fill:#fab1a0
    style Drop2 fill:#fab1a0
    style Final fill:#55efc4
    style Archive fill:#74b9ff

Efficiency Gain:

  • Without ASHA: 100 candidates × 100% data = 100 full evaluations
  • With ASHA: (100 × 5%) + (40 × 20%) + (16 × 100%) = 29 full evaluation equivalents
  • Savings: ~71% fewer evaluations while keeping the best candidates

How It Works: Start with many candidates on cheap evaluations (small shards), progressively promote only the top performers to larger shards (e.g., 40% → 63% → 100% for the 30-example AIME setup). Most poor candidates are eliminated early before wasting compute. On the final rung, TurboGEPA may still early-stop based on confidence and straggler heuristics—especially in speed-first profiles—but conceptually follows the same “few candidates on the expensive rung” pattern.

2. Async Orchestration

  • Uses async/await to overlap LLM calls and keep evaluators busy
  • Per-island concurrency is chosen from dataset size via adaptive_config, with simple runtime guardrails (FD limits, timeouts)
  • Multi-island parallelism for population diversity
  • Non-blocking I/O for LLM API calls
  • Thread pool executor for DSPy/sync operations

3. Adaptive Configuration

  • Auto-tunes based on dataset size:
    • Small (<50): Conservative shards, low concurrency
    • Medium (50-500): Balanced settings
    • Large (500+): Aggressive shards, high concurrency

Practical Considerations

TurboGEPA auto-configures concurrency from dataset size and exposes explicit knobs for tuning. Real-world limits include:

  • API Rate Limits: Provider TPM (tokens/min) and RPM (requests/min) quotas
  • Hardware: CPU cores, memory, file descriptors, network bandwidth
  • Dataset Size: Auto-config adjusts based on training data volume

The adaptive configuration gives you a strong starting point, but you should still validate eval_concurrency and related knobs against your API quotas and hardware. Override individual Config fields (e.g., eval_concurrency, reflection_strategy_names) when you need a laptop-safe or server-heavy profile.


🛠️ Configuration

TurboGEPA Config

from turbo_gepa.config import Config

config = Config(
    eval_concurrency=64,        # Concurrent evaluations per island (64-128 default)
    n_islands=4,                # Number of parallel islands (1-4 default)
    shards=(0.05, 0.2, 1.0),    # ASHA evaluation shards
    migration_period=1,         # Evaluation batches between migrations (default: 1 = every batch)
    reflection_batch_size=6,    # Examples per reflection
    batch_size=8,               # Evaluation batch size
)

# Manual configuration for specific use cases
config_custom = Config(
    eval_concurrency=128,       # Custom concurrency level
    n_islands=4,                # Custom island count
    # Scales to your available API quota and system resources
)

# Want to plug in a custom reward? Assign a scoring callback directly on the config.
config_custom.scoring_fn = maximize_metric("reward")  # or any callable returning a float

# Prefer faster verification at the cost of a bit more variance? Tune `verification_speed_bias`.
#   0.0 = strict / slow   (larger confidence margin, more samples before trusting the final rung)
#   0.3–0.5 = balanced    (good default trade-off for most runs)
#   0.8–1.0 = fast / aggressive (small margin, early stop on partial coverage)
config_fast = Config(verification_speed_bias=0.8)

Auto-configuration (recommended):

from turbo_gepa.adapters import DefaultAdapter

# Automatically configures based on dataset size
adapter = DefaultAdapter(
    dataset=trainset,
    auto_config=True,
    task_lm=task_lm,
    reflection_lm=reflection_lm,
)

# Manual tweaks after auto-config
adapter.config.eval_concurrency = 32
adapter.config.reflection_strategy_names = (
    "incremental_reflection",
    "spec_induction",
    "interleaved_thinking",
)

The verification_speed_bias knob drives all coverage/confidence trade-offs: at 0.0 the final rung uses a conservative confidence margin (more samples, closer to full-dataset coverage), while higher values relax both the required sample count (min_samples_for_confidence) and the final-rung confidence lower bound (we subtract fewer standard errors from the running mean). The derived thresholds flow through the evaluator automatically, so you only need to set this one field to shift between accuracy-first and speed-first runs.

Rung‑Aware Parent Gating (Fair, Fast Promotion)

TurboGEPA promotes children using a rung‑aware, parent‑based gate:

  • Compare child@rung_i to parent@rung_i (apples‑to‑apples).
  • Use a small rung‑specific epsilon margin to account for higher variance at smaller shards.
  • If parent@rung_i is unknown, estimate it by shrinking the parent’s final score toward a global baseline with a rung‑dependent alpha.

Defaults (kept lightweight and easy to tune):

  • Variance tolerance per rung: {0.2: 0.08, 0.5: 0.05, 1.0: 0.02}
  • Shrinkage alphas: {0.2: 0.7, 0.5: 0.85, 1.0: 1.0}

If you construct a scheduler manually, you can override these:

from turbo_gepa.scheduler import BudgetedScheduler, SchedulerConfig

sched = BudgetedScheduler(
    SchedulerConfig(
        shards=(0.2, 0.5, 1.0),
        variance_tolerance={0.2: 0.08, 0.5: 0.05, 1.0: 0.02},
        shrinkage_alpha={0.2: 0.7, 0.5: 0.85, 1.0: 1.0},
    )
)

The evaluator’s early‑stop check uses the same rung‑aware parent baseline for consistency, so promotion and per‑example early stopping agree on expectations at each shard.

Amortized Mutations (More Output Per Call)

TurboGEPA can request multiple mutations per reflection/spec call to increase mutations/sec without opening more connections.

  • Why: Higher throughput at the same concurrency keeps the queue full and prevents evaluator starvation.
  • How: Configure per‑call counts on the mutator (defaults to 1).

Example:

# After creating your adapter
adapter.mutator.set_mutations_per_call(4)  # ask reflection LLM for 4 prompts/call
adapter.mutator.set_specs_per_call(4)      # ask spec-induction for 4 prompts/call

# Or configure directly via the underlying config
adapter.mutator.config.mutations_per_call = 4
adapter.mutator.config.specs_per_call = 4

Notes:

  • Collector consumes all returned prompts from each call and stops at your requested total.
  • This raises mutations/sec without increasing socket count; keep max_mutations_per_round ≈ 2× eval_concurrency.

Lineage-aware mutations

Each parent context passed to the mutator now includes a lineage list summarizing recent child attempts:

{
    "candidate": Candidate(...),
    "lineage": [
        {
            "child_fingerprint": "...",
            "quality": 0.82,
            "shard_fraction": 0.3,
            "tokens": 412,
            "parent_quality": 0.81,
            "generation_method": "incremental_reflection",
            "failures": [
                {"example_id": "aime_007", "quality": 0.0},
                {"example_id": "aime_011", "quality": 0.5},
            ],
        },
        # Most recent entries first (max 8)
    ],
}

Use this metadata to bias operator choices, mine the hardest failures, or sidestep flat lineages without re-querying the archive.


📊 Benchmarks

TurboGEPA Performance

Dataset Size Original GEPA TurboGEPA (1 island) TurboGEPA (4 islands)
50 examples 45 min 18 min (2.5x) 12 min (3.75x)
200 examples 180 min 52 min (3.5x) 36 min (5x)
1000 examples 900 min 240 min (3.75x) 180 min (5x)

Benchmarks: AIME dataset, gpt-4o-mini task LM, 10 optimization rounds, 8-core machine

Reproducing the OSS‑20 / Grok‑4 Benchmark

To compare legacy GEPA vs TurboGEPA on the 30‑example AIME benchmark (and guarantee fully fresh runs):

# 1) Run both optimizers back-to-back (clears .turbo_gepa automatically)
source .envrc && source .venv/bin/activate
python examples/aime_benchmark_v2.py \
  --mode both \
  --dataset-size 30 \
  --task-lm openrouter/openai/gpt-oss-20b:nitro \
  --reflection-lm openrouter/x-ai/grok-4-fast \
  --turbo-eval-concurrency 20 \
  --turbo-target-quality 0.733 \
  --turbo-show-progress

# 2) Verify the TurboGEPA best prompt on the full train (or val) split
python scripts/verify_prompt.py \
  --split train \
  --dataset-size 30 \
  --eval-concurrency 8
  • examples/aime_benchmark_v2.py now wipes .turbo_gepa/ before every Turbo run so cached evals never leak into a benchmark.
  • The verification script bypasses the cache, fans out real OSS‑20 task calls, and reports both accuracy and dataset coverage (should be 100 % on the final rung).
  • The best prompt from the latest run is persisted to .turbo_gepa/best_prompt.txt and is what the verification script reads.

North‑Star Metric (Time‑to‑Target)

TurboGEPA logs a “Turbo score” that captures speed and efficiency:

  • Turbo score = (target_quality − baseline) / time_to_target_seconds
  • If the run doesn’t reach the configured shard (--turbo-target-shard, default 1.0), we emit a Rung Metric: best shard reached, time to reach it, and gain/sec.

Tune targets on small slices for fast iteration, e.g. on 30 problems:

source .envrc && source .venv/bin/activate
python examples/aime_benchmark_v2.py \
  --mode turbo \
  --dataset-size 30 \
  --task-lm openrouter/openai/gpt-oss-20b:nitro \
  --reflection-lm openrouter/x-ai/grok-4-fast \
  --turbo-target-shard 0.4 \
  --turbo-target-quality 0.5 \
  --turbo-eval-concurrency 20 \
  --turbo-max-runtime 120 \
  --turbo-show-progress

- For the OSS‑20 / Grok‑4‑fast configuration above, a representative 30‑example AIME run produced the following head‑to‑head comparison:

  | System        | Speed Bias | Islands | Strategies Used                    | Time‑to‑Target (train, 30 ex) | Full‑Val Quality (30 ex) | Parent→Child Evolutions | Total Candidates Explored |
  | ------------- | ---------- | ------- | ---------------------------------- | ----------------------------- | ------------------------ | ------------------------ | ------------------------- |
  | GEPA (classic)| n/a        | n/a     | 1 (single reflection heuristic)   | ~657s                        | ~0.733                   | 3                        | 3                         |
  | TurboGEPA     | 0.8        | 1       | 3 (incremental, spec, interleaved)| ~38s                         | ~0.83                    | 16                       | 17                        |

  - GEPA numbers are taken from the original AIME benchmark configuration (30 examples, target 0.733), where GEPA reached the target in ~657s with ~0.733 held‑out quality and three prompt evolutions.
  - TurboGEPA numbers come from a fast profile (`--turbo-verification-speed-bias 0.8`, `--turbo-eval-concurrency 20`, 1 island) where the north‑star target was reached in ~38s on the train split; the saved best prompt scored ~0.83 on a full 30‑example validation pass. Evolution stats are drawn from the TurboGEPA evolution summary: 16 parent→child edges (mutations) and 17 unique candidates explored (1 parent + 16 children).

  In other words, on this 30‑example AIME benchmark:

  - **Speed‑to‑target:** TurboGEPA reaches the same 0.733 target **~17× faster** than classic GEPA (38s vs 657s).
  - **Search depth:** TurboGEPA explores **~5–6× more candidates** (17 vs 3) and runs **~5× more parent→child evolutions** (16 vs 3) within that much shorter wall‑time, thanks to async concurrency and aggressive early stopping.
  - **Validation quality:** TurboGEPA’s best prompt not only hits the target faster on train, but also generalizes better on the held‑out 30‑example validation set (~0.83 vs ~0.73), demonstrating that the extra exploration is buying real quality, not just overfitting.

Defaults tuned for speed and stability:
- Global concurrency budget: ON (prevents oversubscription)
- Adaptive effective concurrency: ON (follows provider sweet spot)
- Default ceiling: eval_concurrency=20 (adjusted automatically at runtime)

Stragglers and Final‑Rung Concurrency

TurboGEPA uses a simple, robust straggler policy that detaches slow examples without cancelling them:

  • Threshold = min(dynamic_cap, max(mean + 1·stdev, 1.3·p70, 1.15·p80)) + slack (with an adaptive cap; no hard tiers)
  • Detach only after ≥50% coverage; detached examples keep running and are merged later. Replays run through a dedicated worker lane that auto-scales with backlog (disable via Config.replay_stragglers=False for timing experiments).
  • Final rung overlap is governed by a latency/backlog-aware cap controller: max_final_shard_inflight expands/shrinks dynamically based on p50/p95 latency, saturation time, and straggler pressure rather than remaining static.
  • Coverage guards are now adaptive: small shards (e.g., 30-example AIME runs) wait until ~85–90% completion before detaching, and mutation throttle engages automatically whenever final-rung coverage falls below the health target to avoid firehosing candidates the evaluator can’t finish.

To inspect scaling quickly across concurrencies, use the bench matrix helper:

source .envrc && source .venv/bin/activate
python scripts/bench_matrix.py 8 2 4 8 16 32

Current OSS‑20 results (train split, 30 examples, fresh cache):

Optimizer Runtime Eval Budget Used Full-Shard Quality Avg. Evolutions* Train Accuracy (verify_prompt)
GEPA 598 s 150 metric calls 0.733 ~3 parent-child steps (across 30 problems)
TurboGEPA 207 s 10 evaluations to target (48 mutations, 3 promotions) 0.885 3 promotions (out of 48 streamed mutations) 0.73 (22/30)

*GEPA’s 150 metric calls across 30 problems effectively translate to about 3 “real” evolution steps per seed; TurboGEPA hit the 0.733 target after 7 evaluations (time-to-target ≈145 s) with three promoted children, demonstrating why prioritization + aggressive concurrency accelerates the north star metric.


🤝 Contributing

We welcome contributions! Areas of interest:

  • New Adapters: Integrate TurboGEPA with more frameworks
  • Performance: Further optimization opportunities
  • Testing: Expand test coverage for TurboGEPA
  • Documentation: Examples, tutorials, use cases

Contributions welcome! Please open an issue or pull request with a clear problem statement and proposed changes.


📖 Citation

TurboGEPA (This Fork)

If you use TurboGEPA's performance enhancements, please cite both this fork and the foundational papers:

@software{turbogepa2025,
  title={TurboGEPA: High-Throughput Prompt Evolution Framework},
  author={Miller, Greg},
  year={2025},
  url={https://github.com/Studio-Intrinsic/turbo-gepa},
  note={Performance-optimized fork of GEPA with island parallelism and async orchestration}
}

Original GEPA (Required)

Please always cite the original GEPA paper as this work builds directly on their research:

@misc{agrawal2025gepareflectivepromptevolution,
  title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
  author={Lakshya A Agrawal and Shangyin Tan and Dilara Soylu and Noah Ziems and Rishi Khare and Krista Opsahl-Ong and Arnav Singhvi and Herumb Shandilya and Michael J Ryan and Meng Jiang and Christopher Potts and Koushik Sen and Alexandros G. Dimakis and Ion Stoica and Dan Klein and Matei Zaharia and Omar Khattab},
  year={2025},
  eprint={2507.19457},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19457}
}

Prompt-MII (If Using Spec Induction)

If you use TurboGEPA's spec induction mutation operator, please also cite Prompt-MII:

@misc{xiao2025promptmiimetalearninginstructioninduction,
  title={Prompt-MII: Meta-Learning Instruction Induction for LLMs},
  author={Emily Xiao and Yixiao Zeng and Ada Chen and Chin-Jou Li and Amanda Bertsch and Graham Neubig},
  year={2025},
  eprint={2510.16932},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.16932}
}

📝 License

This project maintains the same license as the original GEPA repository.


🙏 Acknowledgments

TurboGEPA is built on the shoulders of giants.

GEPA: Core Algorithm

All algorithmic credit for the core GEPA framework goes to the original authors:

Lakshya A Agrawal¹, Shangyin Tan¹, Dilara Soylu², Noah Ziems⁴, Rishi Khare¹, Krista Opsahl-Ong⁵, Arnav Singhvi²⁵, Herumb Shandilya², Michael J Ryan², Meng Jiang⁴, Christopher Potts², Koushik Sen¹, Alexandros G. Dimakis¹³, Ion Stoica¹, Dan Klein¹, Matei Zaharia¹⁵, Omar Khattab⁶

¹UC Berkeley, ²Stanford University, ³BespokeLabs.ai, ⁴Notre Dame, ⁵Databricks, ⁶MIT

The core innovation—LLM-based reflective mutation with Pareto selection—is entirely from the original GEPA paper.

Prompt-MII: Spec Induction

TurboGEPA's spec induction mutation operator is inspired by the Prompt-MII work from:

Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, Graham Neubig

Carnegie Mellon University Language Technologies Institute

TurboGEPA: Performance Engineering

TurboGEPA's contributions are limited to performance engineering:

  • Async/await orchestration
  • Island-based parallelism
  • ASHA successive halving
  • Adaptive configuration

Original GEPA: Research innovation & algorithmic foundation
TurboGEPA: Production-ready performance engineering
Better together. 🚀

Star History

Star History Chart

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_gepa-0.1.1.tar.gz (174.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turbo_gepa-0.1.1-py3-none-any.whl (155.1 kB view details)

Uploaded Python 3

File details

Details for the file turbo_gepa-0.1.1.tar.gz.

File metadata

  • Download URL: turbo_gepa-0.1.1.tar.gz
  • Upload date:
  • Size: 174.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for turbo_gepa-0.1.1.tar.gz
Algorithm Hash digest
SHA256 6cf7acc9c727ac4c0050e5133d78bb11cc7bb3b700b85a1637cb0bca8ff6eb9b
MD5 bcd250e3eedef6516be6d3c108406dd9
BLAKE2b-256 48497c25fc07419d970880a11ed4ed31baa4c891c5ee0d98b31e753df9dc39b8

See more details on using hashes here.

File details

Details for the file turbo_gepa-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: turbo_gepa-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 155.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for turbo_gepa-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 128ef5d22090c51818b152e01fa5a760d8cad2442f639c3482be023b0e803df3
MD5 410bcd081039ebe7ff7ec4195a0e8e06
BLAKE2b-256 46a1b281c59f62714fcfdc95dcf0cc03189ab6cd65907ca48f50f00afeeadd37

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page