ai-prompt-simulation

Prompt simulation and autonomous-agent effectiveness benchmarking framework

A professional Python framework for testing prompt strength, quality, and real-world autonomous-agent effectiveness.

This repository helps you answer practical questions before deploying prompts in production:

  • Is this prompt strong enough for autonomous execution?
  • Which quality dimensions are weak (clarity, specificity, robustness, consistency, efficiency)?
  • How does prompt A compare to prompt B under repeatable conditions?
  • Can I customize scoring, scenarios, and evaluator logic for my domain?

Why This Project Exists

Prompt quality is often judged subjectively. This project provides a repeatable simulation pipeline with transparent scoring, configurable weighting, and benchmark workflows that can be run from both the Python API and the CLI.

Core Capabilities

  • Deterministic simulation engine with seed-based runs and retries
  • Hybrid quality model:
    • Deterministic heuristics (clarity, specificity, robustness, consistency, efficiency)
    • Optional LLM-as-judge dimensions (reasoning, goal completion)
  • Benchmark mode for multi-case prompt suites
  • Side-by-side prompt comparison
  • Extensible plugin registry for custom evaluators and scenario factories
  • JSON report output for automation and CI pipelines

Architecture

High-level module map:

  • core: Typed schemas, config validation, report contracts
  • providers: LLM provider abstraction and deterministic mock provider
  • scoring: Dimension evaluators and weighted aggregation
  • engine: Prompt simulation and benchmark orchestration
  • plugins: Custom evaluator and scenario registration
  • api: Python-first public interface
  • cli: Terminal commands for automation and team workflows

For deeper detail, see docs/architecture.md.

Scoring Model

Base Dimensions (always available)

  • clarity: readability and structural guidance
  • specificity: explicit constraints and output requirements
  • robustness: edge-case and failure-handling guidance
  • consistency: output stability across runs
  • efficiency: verbosity and likely token/latency pressure
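
For intuition, a deterministic heuristic of this kind can be as simple as a length or keyword check over the prompt text. The sketch below is illustrative only, not the shipped efficiency heuristic; the token estimate and budget are assumptions:

def efficiency_heuristic(prompt: str) -> float:
    # Penalize verbosity: assume ~4 characters per token and dock
    # half a point per token once the prompt grows past a small budget.
    approx_tokens = len(prompt) / 4
    over_budget = max(0.0, approx_tokens - 150)
    return max(0.0, 100.0 - over_budget * 0.5)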

Optional Judge Dimensions

  • reasoning: quality of chain-of-thought style structure
  • goal_completion: likelihood that prompt drives task completion
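
Both judge dimensions are opt-in; they are switched on through the judge block of SimulationConfig, the same configuration used in the Quick Start below:

from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)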

Overall Score

The framework computes weighted components and a final score band:

  • production-ready: 80-100
  • good: 65-79
  • developing: 50-64
  • failing: 0-49

For formulas and rationale see docs/scoring.md.
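
As a minimal sketch of the aggregation idea, assuming a plain weighted average over 0-100 dimension scores (the actual weighting scheme is defined in docs/scoring.md and may differ):

def band_for(score: float) -> str:
    # Thresholds match the band table above.
    if score >= 80:
        return "production-ready"
    if score >= 65:
        return "good"
    if score >= 50:
        return "developing"
    return "failing"

def overall(dimensions: dict[str, float], weights: dict[str, float]) -> tuple[float, str]:
    # Weighted average of dimension scores, then band lookup.
    total = sum(weights.get(name, 0.0) for name in dimensions) or 1.0
    score = sum(s * weights.get(name, 0.0) for name, s in dimensions.items()) / total
    return score, band_for(score)

For example, overall({"clarity": 82, "specificity": 70}, {"clarity": 0.5, "specificity": 0.5}) yields (76.0, "good").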

Installation

Local development

git clone https://github.com/zaber-dev/ai-prompt-simulation.git
cd ai-prompt-simulation
python -m venv .venv
# Windows PowerShell
.venv\Scripts\Activate.ps1
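# macOS/Linux
source .venv/bin/activate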
pip install -e ".[dev]"

Verify

pytest

Optional Real LLM Providers

By default, the framework uses the deterministic mock provider for reproducible testing.

You can optionally use real model providers:

  • openai (default model: gpt-4o-mini)
  • gemini (default model: gemini-2.0-flash)

Set API keys via environment variables:

# Windows PowerShell
$env:OPENAI_API_KEY = "your-openai-key"
$env:GEMINI_API_KEY = "your-gemini-key"
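# macOS/Linux
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"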

Quick Start (Python API)

from ai_prompt_simulation.api.public import run_simulation
from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    runs=4,
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)

result = run_simulation(
    "You are an autonomous planning agent. Output JSON with fields plan, risks, and next_action. "
    "Include one fallback if required data is missing.",
    case_id="quickstart-1",
    config=config,
    provider_name="openai",
    model="gpt-4o-mini",
)

print(result.report.summary.overall_score, result.report.summary.band)
for d in result.report.dimensions:
    print(d.name, d.score)

Quick Start (CLI)

Simulate one prompt

prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider openai \
  --model gpt-4o-mini \
  --runs 4 \
  --config configs/default.yaml \
  --output out/sim_result.json

Use Gemini instead:

prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider gemini \
  --model gemini-2.0-flash

Run benchmark suite

prompt-sim benchmark \
  --name "core-suite" \
  --cases-file examples/benchmark_cases.yaml \
  --config configs/default.yaml \
  --output out/benchmark_result.json

Compare two prompts

prompt-sim compare \
  --prompt-a "Summarize this issue." \
  --prompt-b "Summarize in exactly 3 bullets, include assumptions, output JSON."

Explain a saved score

prompt-sim explain-score --result-file out/sim_result.json
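
Reports are plain JSON, so they diff and parse cleanly in CI. The exact schema is defined by the typed report contracts in core; the shape below is an illustrative assumption based only on the fields exercised in the Quick Start (summary.overall_score, summary.band, and dimensions with name, score, rationale, evidence):

{
  "summary": {"overall_score": 74.5, "band": "good"},
  "dimensions": [
    {
      "name": "clarity",
      "score": 82.0,
      "rationale": "Clear structure and explicit output format",
      "evidence": {}
    }
  ]
}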

Validate config

prompt-sim validate-config --config configs/default.yaml

Customization

Register custom evaluator

from ai_prompt_simulation.core.models import DimensionScore
from ai_prompt_simulation.engine.simulator import PromptSimulator

def domain_evaluator(prompt, outputs, _config, _provider):
    # Count markers that signal readiness for autonomous execution.
    hits = sum(k in prompt.lower() for k in ["goal", "constraints", "fallback", "verify"])
    return DimensionScore(
        name="autonomy_readiness",
        score=min(100.0, 30 + hits * 15),  # 30 base, +15 per marker, capped at 100
        rationale="Domain-specific autonomous readiness score",
        evidence={"marker_hits": hits},
    )

sim = PromptSimulator()
sim.register_evaluator("autonomy_readiness", domain_evaluator)

See examples/custom_evaluator.py.
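
Register custom scenario factory

The plugin registry also covers scenario factories (see plugins in the module map). The hook name used below, register_scenario_factory, is a hypothetical stand-in and the scenario shape is assumed; check docs/architecture.md for the actual registration API:

from ai_prompt_simulation.engine.simulator import PromptSimulator

def flaky_tool_scenario(case_id: str) -> dict:
    # Hypothetical factory: produce a scenario where tool calls
    # intermittently fail, to stress the robustness dimension.
    return {
        "case_id": case_id,
        "failure_rate": 0.25,
        "description": "Tool responses fail 25% of the time",
    }

sim = PromptSimulator()
# register_scenario_factory is a hypothetical hook name, not confirmed by the docs
sim.register_scenario_factory("flaky_tool", flaky_tool_scenario)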

Input File Format

Benchmark case files (.yaml or .json) must be a list of prompt cases:

- id: case-1
  task: qa
  prompt: |
    You are an autonomous support agent.
    Answer in exactly 3 bullet points.
  variables:
    locale: en-US
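
The same case expressed in JSON (a direct translation of the YAML above):

[
  {
    "id": "case-1",
    "task": "qa",
    "prompt": "You are an autonomous support agent.\nAnswer in exactly 3 bullet points.",
    "variables": {"locale": "en-US"}
  }
]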

Project Structure

.
|-- configs/
|-- docs/
|-- examples/
|-- src/ai_prompt_simulation/
|   |-- api/
|   |-- cli/
|   |-- core/
|   |-- engine/
|   |-- plugins/
|   |-- providers/
|   `-- scoring/
|-- tests/
|-- LEARN.md
|-- LICENSE.md
`-- README.md

Documentation Index

  • LEARN.md: progressive learning path and usage curriculum
  • docs/architecture.md: design and extension points
  • docs/scoring.md: scoring methodology and formulas
  • docs/testing.md: testing and validation practices
  • CONTRIBUTING.md: contribution standards and workflow
  • SECURITY.md: vulnerability disclosure policy

Quality Standards

  • Typed Pydantic contracts for all major data flows
  • Deterministic mock provider for reproducible test runs
  • CI-ready test, lint, and type-check configuration
  • Structured JSON reports for automation and traceability

Versioning and Releases

  • Versioning follows semantic versioning (MAJOR.MINOR.PATCH)
  • Initial target release: 0.1.0 (alpha)
  • Release notes are tracked in CHANGELOG.md

Contributing

Contributions are welcome. See CONTRIBUTING.md for branch naming, tests, and review requirements.

License

This project is licensed under MIT. See LICENSE.md.
