ai-prompt-simulation
A professional Python framework for testing prompt strength, quality, and real-world autonomous-agent effectiveness.
This repository helps you answer practical questions before deploying prompts in production:
- Is this prompt strong enough for autonomous execution?
- Which quality dimensions are weak (clarity, specificity, robustness, consistency, efficiency)?
- How does prompt A compare to prompt B under repeatable conditions?
- Can I customize scoring, scenarios, and evaluator logic for my domain?
Why This Project Exists
Prompt quality is often judged subjectively. This project provides a repeatable simulation pipeline with transparent scoring, configurable weighting, and benchmark workflows that can be run from both the Python API and the CLI.
Core Capabilities
- Deterministic simulation engine with seed-based runs and retries (see the sketch after this list)
- Hybrid quality model:
  - Deterministic heuristics (clarity, specificity, robustness, consistency, efficiency)
  - Optional LLM-as-judge dimensions (reasoning, goal completion)
- Benchmark mode for multi-case prompt suites
- Side-by-side prompt comparison
- Extensible plugin registry for custom evaluators and scenario factories
- JSON report output for automation and CI pipelines
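To make the seed-based determinism concrete: a deterministic mock provider can derive a stable pseudo-response from the prompt, seed, and run index, so identical inputs always reproduce identical outputs. The sketch below illustrates the general technique only; it is not the framework's actual provider code.
import hashlib

# Illustration of seed-based determinism (not the framework's provider):
# identical (prompt, seed, run_index) inputs always yield identical output.
def mock_response(prompt: str, seed: int, run_index: int) -> str:
    digest = hashlib.sha256(f"{seed}:{run_index}:{prompt}".encode()).hexdigest()
    return f"mock-output-{digest[:12]}"

assert mock_response("plan the task", 7, 0) == mock_response("plan the task", 7, 0)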
Architecture
High-level module map:
- core: Typed schemas, config validation, report contracts
- providers: LLM provider abstraction and deterministic mock provider
- scoring: Dimension evaluators and weighted aggregation
- engine: Prompt simulation and benchmark orchestration
- plugins: Custom evaluator and scenario registration
- api: Python-first public interface
- cli: Terminal commands for automation and team workflows
For deeper details see docs/architecture.md.
Scoring Model
Base Dimensions (always available)
- clarity: readability and structural guidance
- specificity: explicit constraints and output requirements
- robustness: edge-case and failure-handling guidance
- consistency: output stability across runs
- efficiency: verbosity and likely token/latency pressure
Optional Judge Dimensions
- reasoning: quality of chain-of-thought style structure
- goal_completion: likelihood that the prompt drives task completion
Overall Score
The framework computes weighted components and a final score band:
- production-ready: 80-100
- good: 65-79
- developing: 50-64
- failing: 0-49
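As a rough illustration, here is a minimal sketch assuming a simple weighted mean over dimension scores mapped to the bands above; it is not the framework's exact formula.
# Sketch only: weighted mean of dimension scores mapped to score bands.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

def band(score: float) -> str:
    if score >= 80:
        return "production-ready"
    if score >= 65:
        return "good"
    if score >= 50:
        return "developing"
    return "failing"

print(band(overall_score({"clarity": 82.0, "specificity": 70.0},
                         {"clarity": 0.5, "specificity": 0.5})))  # good (76.0)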
For formulas and rationale see docs/scoring.md.
Installation
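From PyPI
The package is published on PyPI, so the released build can be installed directly:
pip install ai-prompt-simulation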
Local development
git clone https://github.com/zaber-dev/ai-prompt-simulation.git
cd ai-prompt-simulation
python -m venv .venv
# Windows PowerShell
.venv\Scripts\Activate.ps1
# macOS / Linux
source .venv/bin/activate
pip install -e ".[dev]"
Verify
pytest
Optional Real LLM Providers
By default, the framework uses the deterministic mock provider for reproducible testing.
You can optionally use real model providers:
- openai (default model: gpt-4o-mini)
- gemini (default model: gemini-2.0-flash)
Set API keys via environment variables:
# Windows PowerShell
$env:OPENAI_API_KEY = "your-openai-key"
$env:GEMINI_API_KEY = "your-gemini-key"
# macOS / Linux
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"
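In scripts it can help to fail fast when a required key is missing rather than erroring mid-run. A small guard like the following is a sketch, independent of the framework's own error handling:
import os

# Fail fast if a real provider's key is absent (sketch only).
required = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}
missing = [env for env in required.values() if not os.environ.get(env)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)} (the mock provider needs none)")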
Quick Start (Python API)
from ai_prompt_simulation.api.public import run_simulation
from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    runs=4,
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)

result = run_simulation(
    "You are an autonomous planning agent. Output JSON with fields plan, risks, and next_action. "
    "Include one fallback if required data is missing.",
    case_id="quickstart-1",
    config=config,
    provider_name="openai",
    model="gpt-4o-mini",
)

print(result.report.summary.overall_score, result.report.summary.band)
for d in result.report.dimensions:
    print(d.name, d.score)
Quick Start (CLI)
Simulate one prompt
prompt-sim simulate \
--prompt "You must output JSON with keys action and status. Include one fallback." \
--provider openai \
--model gpt-4o-mini \
--runs 4 \
--config configs/default.yaml \
--output out/sim_result.json
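Because --output writes a structured JSON report, the result plugs into CI gates. A hedged post-processing sketch follows; the key names are an assumption that mirrors the attributes used in the Python Quick Start (summary.overall_score, summary.band, dimensions[].name/score):
import json

# Assumed report shape: keys mirror the Python API attributes above.
with open("out/sim_result.json") as f:
    report = json.load(f)

summary = report["summary"]
print(summary["overall_score"], summary["band"])
if summary["band"] == "failing":
    raise SystemExit("prompt did not pass the quality gate")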
Use Gemini instead:
prompt-sim simulate \
--prompt "You must output JSON with keys action and status. Include one fallback." \
--provider gemini \
--model gemini-2.0-flash
Run benchmark suite
prompt-sim benchmark \
--name "core-suite" \
--cases-file examples/benchmark_cases.yaml \
--config configs/default.yaml \
--output out/benchmark_result.json
Compare two prompts
prompt-sim compare \
--prompt-a "Summarize this issue." \
--prompt-b "Summarize in exactly 3 bullets, include assumptions, output JSON."
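The same comparison can be scripted through the Python API by running both prompts under one config, as in this sketch (it reuses run_simulation and SimulationConfig from the Quick Start; no dedicated comparison helper is assumed):
from ai_prompt_simulation.api.public import run_simulation
from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(runs=4)
a = run_simulation("Summarize this issue.", case_id="cmp-a", config=config)
b = run_simulation(
    "Summarize in exactly 3 bullets, include assumptions, output JSON.",
    case_id="cmp-b",
    config=config,
)
print("score delta (B - A):", b.report.summary.overall_score - a.report.summary.overall_score)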
Explain a saved score
prompt-sim explain-score --result-file out/sim_result.json
Validate config
prompt-sim validate-config --config configs/default.yaml
Customization
Register custom evaluator
from ai_prompt_simulation.core.models import DimensionScore
from ai_prompt_simulation.engine.simulator import PromptSimulator

def domain_evaluator(prompt, outputs, _config, _provider):
    # Count autonomy markers present in the prompt text.
    hits = sum(k in prompt.lower() for k in ["goal", "constraints", "fallback", "verify"])
    return DimensionScore(
        name="autonomy_readiness",
        score=min(100.0, 30 + hits * 15),  # base 30, +15 per marker, capped at 100
        rationale="Domain-specific autonomous readiness score",
        evidence={"marker_hits": hits},
    )

sim = PromptSimulator()
sim.register_evaluator("autonomy_readiness", domain_evaluator)
See examples/custom_evaluator.py.
Input File Format
Benchmark case files (.yaml or .json) must be a list of prompt cases:
- id: case-1
  task: qa
  prompt: |
    You are an autonomous support agent.
    Answer in exactly 3 bullet points.
  variables:
    locale: en-US
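Before a long benchmark run, a quick structural check can catch malformed case files early. A minimal sketch using PyYAML; the required-field set here is an assumption based on the example above, and the framework's own loaders enforce the full schema:
import yaml  # PyYAML

with open("examples/benchmark_cases.yaml") as f:
    cases = yaml.safe_load(f)

assert isinstance(cases, list), "case file must be a list of prompt cases"
for case in cases:
    missing = {"id", "prompt"} - case.keys()
    assert not missing, f"case {case.get('id', '?')} is missing: {missing}"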
Project Structure
.
|-- configs/
|-- docs/
|-- examples/
|-- src/ai_prompt_simulation/
| |-- api/
| |-- cli/
| |-- core/
| |-- engine/
| |-- plugins/
| |-- providers/
| `-- scoring/
|-- tests/
|-- LEARN.md
|-- LICENSE.md
`-- README.md
Documentation Index
- LEARN.md: progressive learning path and usage curriculum
- docs/architecture.md: design and extension points
- docs/scoring.md: scoring methodology and formulas
- docs/testing.md: testing and validation practices
- CONTRIBUTING.md: contribution standards and workflow
- SECURITY.md: vulnerability disclosure policy
Quality Standards
- Typed Pydantic contracts for all major data flows
- Deterministic mock provider for reproducible test runs
- CI-ready test, lint, and type-check configuration
- Structured JSON reports for automation and traceability
Versioning and Releases
- Versioning follows semantic versioning (MAJOR.MINOR.PATCH)
- Initial target release: 0.1.0 (alpha)
- Release notes are tracked in CHANGELOG.md
Contributing
Contributions are welcome. See CONTRIBUTING.md for branch naming, tests, and review requirements.
License
This project is licensed under MIT. See LICENSE.md.