ai-prompt-simulation

Prompt simulation and autonomous-agent effectiveness benchmarking framework

A professional Python framework for testing prompt strength, quality, and real-world autonomous-agent effectiveness.

This repository helps you answer practical questions before deploying prompts in production:

  • Is this prompt strong enough for autonomous execution?
  • Which quality dimensions are weak (clarity, specificity, robustness, consistency, efficiency)?
  • How does prompt A compare to prompt B under repeatable conditions?
  • Can I customize scoring, scenarios, and evaluator logic for my domain?

Why This Project Exists

Prompt quality is often judged subjectively. This project provides a repeatable simulation pipeline with transparent scoring, configurable weighting, and benchmark workflows that can be run from both the Python API and the CLI.

Core Capabilities

  • Deterministic simulation engine with seed-based runs and retries
  • Hybrid quality model:
    • Deterministic heuristics (clarity, specificity, robustness, consistency, efficiency)
    • Optional LLM-as-judge dimensions (reasoning, goal completion)
  • Benchmark mode for multi-case prompt suites
  • Side-by-side prompt comparison
  • Extensible plugin registry for custom evaluators and scenario factories
  • JSON report output for automation and CI pipelines

Architecture

High-level module map:

  • core: Typed schemas, config validation, report contracts
  • providers: LLM provider abstraction and deterministic mock provider
  • scoring: Dimension evaluators and weighted aggregation
  • engine: Prompt simulation and benchmark orchestration
  • plugins: Custom evaluator and scenario registration
  • api: Python-first public interface
  • cli: Terminal commands for automation and team workflows

For deeper detail, see docs/architecture.md.

Scoring Model

Base Dimensions (always available)

  • clarity: readability and structural guidance
  • specificity: explicit constraints and output requirements
  • robustness: edge-case and failure-handling guidance
  • consistency: output stability across runs
  • efficiency: verbosity and likely token/latency pressure
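
For intuition, a deterministic heuristic of this kind can be as simple as a length or keyword check over the prompt text. The sketch below is illustrative only, not the shipped efficiency heuristic; the token estimate and budget are assumptions:

def efficiency_heuristic(prompt: str) -> float:
    # Penalize verbosity: assume ~4 characters per token and dock
    # half a point per token once the prompt grows past a small budget.
    approx_tokens = len(prompt) / 4
    over_budget = max(0.0, approx_tokens - 150)
    return max(0.0, 100.0 - over_budget * 0.5)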

Optional Judge Dimensions

  • reasoning: quality of chain-of-thought style structure
  • goal_completion: likelihood that prompt drives task completion
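
Both judge dimensions are opt-in; they are switched on through the judge block of SimulationConfig, the same configuration used in the Quick Start below:

from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)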

Overall Score

The framework computes weighted components and a final score band:

  • production-ready: 80-100
  • good: 65-79
  • developing: 50-64
  • failing: 0-49

For formulas and rationale see docs/scoring.md.
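
As a minimal sketch of the aggregation idea, assuming a plain weighted average over 0-100 dimension scores (the actual weighting scheme is defined in docs/scoring.md and may differ):

def band_for(score: float) -> str:
    # Thresholds match the band table above.
    if score >= 80:
        return "production-ready"
    if score >= 65:
        return "good"
    if score >= 50:
        return "developing"
    return "failing"

def overall(dimensions: dict[str, float], weights: dict[str, float]) -> tuple[float, str]:
    # Weighted average of dimension scores, then band lookup.
    total = sum(weights.get(name, 0.0) for name in dimensions) or 1.0
    score = sum(s * weights.get(name, 0.0) for name, s in dimensions.items()) / total
    return score, band_for(score)

For example, overall({"clarity": 82, "specificity": 70}, {"clarity": 0.5, "specificity": 0.5}) yields (76.0, "good").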

Installation

Local development

git clone https://github.com/zaber-dev/ai-prompt-simulation.git
cd ai-prompt-simulation
python -m venv .venv
# Windows PowerShell
.venv\Scripts\Activate.ps1
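# macOS/Linux
source .venv/bin/activate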
pip install -e ".[dev]"

Verify

pytest

Optional Real LLM Providers

By default, the framework uses the deterministic mock provider for reproducible testing.

You can optionally use real model providers:

  • openai (default model: gpt-4o-mini)
  • gemini (default model: gemini-2.0-flash)

Set API keys via environment variables:

# Windows PowerShell
$env:OPENAI_API_KEY = "your-openai-key"
$env:GEMINI_API_KEY = "your-gemini-key"
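# macOS/Linux
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"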

Quick Start (Python API)

from ai_prompt_simulation.api.public import run_simulation
from ai_prompt_simulation.core.models import SimulationConfig

config = SimulationConfig(
    runs=4,
    judge={
        "enabled": True,
        "reasoning_weight": 0.1,
        "goal_completion_weight": 0.1,
    },
)

result = run_simulation(
    "You are an autonomous planning agent. Output JSON with fields plan, risks, and next_action. "
    "Include one fallback if required data is missing.",
    case_id="quickstart-1",
    config=config,
    provider_name="openai",
    model="gpt-4o-mini",
)

print(result.report.summary.overall_score, result.report.summary.band)
for d in result.report.dimensions:
    print(d.name, d.score)

Quick Start (CLI)

Simulate one prompt

prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider openai \
  --model gpt-4o-mini \
  --runs 4 \
  --config configs/default.yaml \
  --output out/sim_result.json

Use Gemini instead:

prompt-sim simulate \
  --prompt "You must output JSON with keys action and status. Include one fallback." \
  --provider gemini \
  --model gemini-2.0-flash

Run benchmark suite

prompt-sim benchmark \
  --name "core-suite" \
  --cases-file examples/benchmark_cases.yaml \
  --config configs/default.yaml \
  --output out/benchmark_result.json

Compare two prompts

prompt-sim compare \
  --prompt-a "Summarize this issue." \
  --prompt-b "Summarize in exactly 3 bullets, include assumptions, output JSON."

Explain a saved score

prompt-sim explain-score --result-file out/sim_result.json
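
Reports are plain JSON, so they diff and parse cleanly in CI. The exact schema is defined by the typed report contracts in core; the shape below is an illustrative assumption based only on the fields exercised in the Quick Start (summary.overall_score, summary.band, and dimensions with name, score, rationale, evidence):

{
  "summary": {"overall_score": 74.5, "band": "good"},
  "dimensions": [
    {
      "name": "clarity",
      "score": 82.0,
      "rationale": "Clear structure and explicit output format",
      "evidence": {}
    }
  ]
}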

Validate config

prompt-sim validate-config --config configs/default.yaml

Customization

Register custom evaluator

from ai_prompt_simulation.core.models import DimensionScore
from ai_prompt_simulation.engine.simulator import PromptSimulator

def domain_evaluator(prompt, outputs, _config, _provider):
    # Count markers that signal readiness for autonomous execution.
    hits = sum(k in prompt.lower() for k in ["goal", "constraints", "fallback", "verify"])
    return DimensionScore(
        name="autonomy_readiness",
        score=min(100.0, 30 + hits * 15),  # 30 base, +15 per marker, capped at 100
        rationale="Domain-specific autonomous readiness score",
        evidence={"marker_hits": hits},
    )

sim = PromptSimulator()
sim.register_evaluator("autonomy_readiness", domain_evaluator)

See examples/custom_evaluator.py.
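
Register custom scenario factory

The plugin registry also covers scenario factories (see plugins in the module map). The hook name used below, register_scenario_factory, is a hypothetical stand-in and the scenario shape is assumed; check docs/architecture.md for the actual registration API:

from ai_prompt_simulation.engine.simulator import PromptSimulator

def flaky_tool_scenario(case_id: str) -> dict:
    # Hypothetical factory: produce a scenario where tool calls
    # intermittently fail, to stress the robustness dimension.
    return {
        "case_id": case_id,
        "failure_rate": 0.25,
        "description": "Tool responses fail 25% of the time",
    }

sim = PromptSimulator()
# register_scenario_factory is a hypothetical hook name, not confirmed by the docs
sim.register_scenario_factory("flaky_tool", flaky_tool_scenario)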

Input File Format

Benchmark case files (.yaml or .json) must be a list of prompt cases:

- id: case-1
  task: qa
  prompt: |
    You are an autonomous support agent.
    Answer in exactly 3 bullet points.
  variables:
    locale: en-US
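
The same case expressed in JSON (a direct translation of the YAML above):

[
  {
    "id": "case-1",
    "task": "qa",
    "prompt": "You are an autonomous support agent.\nAnswer in exactly 3 bullet points.",
    "variables": {"locale": "en-US"}
  }
]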

Project Structure

.
|-- configs/
|-- docs/
|-- examples/
|-- src/ai_prompt_simulation/
|   |-- api/
|   |-- cli/
|   |-- core/
|   |-- engine/
|   |-- plugins/
|   |-- providers/
|   `-- scoring/
|-- tests/
|-- LEARN.md
|-- LICENSE.md
`-- README.md

Documentation Index

  • LEARN.md: progressive learning path and usage curriculum
  • docs/architecture.md: design and extension points
  • docs/scoring.md: scoring methodology and formulas
  • docs/testing.md: testing and validation practices
  • CONTRIBUTING.md: contribution standards and workflow
  • SECURITY.md: vulnerability disclosure policy

Quality Standards

  • Typed Pydantic contracts for all major data flows
  • Deterministic mock provider for reproducible test runs
  • CI-ready test, lint, and type-check configuration
  • Structured JSON reports for automation and traceability

Versioning and Releases

  • Versioning follows semantic versioning (MAJOR.MINOR.PATCH)
  • Initial target release: 0.1.0 (alpha)
  • Release notes are tracked in CHANGELOG.md

Contributing

Contributions are welcome. See CONTRIBUTING.md for branch naming, tests, and review requirements.

License

This project is licensed under MIT. See LICENSE.md.
