
PromptDev


Python-native prompt evaluation tool using PydanticAI

PromptDev is a modern prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers. Built on PydanticAI, it combines type safety with powerful evaluation capabilities.

[!WARNING]

PromptDev is in preview and is not ready for production use.

We're working hard to make it stable and feature-complete, but until then, expect bugs, missing features, and fatal errors.

Features

  • 🔒 Type Safe - Full Pydantic validation for inputs, outputs, and configurations
  • 🤖 PydanticAI Integration - Native support for PydanticAI agents and evaluation framework
  • 📊 Multi-Provider Testing - Test across OpenAI, Together.ai, Ollama, Bedrock, and more
  • ⚡ Performance Optimized - File-based caching with TTL for faster repeated evaluations
  • 📈 Rich Reporting - Beautiful console output with detailed failure analysis and provider comparisons
  • 🧪 Promptfoo Compatible - Works with (some) existing promptfoo YAML configs and datasets
  • 🎯 Comprehensive Assertions - Built-in evaluators plus custom Python assertion support

Quick Start

Installation

From PyPI

pip install promptdev

From Source

git clone https://github.com/artefactop/promptdev.git
cd promptdev
pip install -e .

For Development

git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run promptdev --help

Basic Usage

# Run evaluation
promptdev eval examples/calendar_event_summary.yaml

# Override provider  
promptdev eval examples/calendar_event_summary.yaml --provider pydantic-ai:openai

# Disable caching for a run
promptdev eval examples/calendar_event_summary.yaml --no-cache

# Export results
promptdev eval examples/calendar_event_summary.yaml --output json
promptdev eval examples/calendar_event_summary.yaml --output html

# Validate configuration
promptdev validate examples/calendar_event_summary.yaml

# Cache management
promptdev cache stats
promptdev cache clear

Assertion Types

PromptDev supports a comprehensive set of evaluators for different testing scenarios:

Core PydanticAI Evaluators

| Type | Description | Example Usage |
|-------------|-------------------------------|----------------------------------------------|
| exact | Exact string/value matching | type: exact |
| is_instance | Type checking | type: is_instance, value: "str" |
| llm_judge | LLM-based semantic evaluation | type: llm_judge, rubric: "Evaluate accuracy" |

PromptDev Custom Evaluators

| Type | Description | Example Usage |
|-------------|--------------------------|-----------------------------------------|
| json_schema | JSON schema validation | type: json_schema, value: {schema} |
| python | Custom Python assertions | type: python, value: "./assert.py" |
| contains | Substring matching | type: contains, value: "expected text" |

Promptfoo Compatibility Evaluators

| Type | Description | Example Usage |
|---------------|--------------------------------------------------------|-----------------------------------------|
| contains-json | (Deprecated) JSON schema validation (use json_schema) | type: contains-json, value: {schema} |
| llm-rubric | (Deprecated) LLM evaluation (use llm_judge) | type: llm-rubric, value: "rubric text" |
| g-eval | (Deprecated) G-Eval methodology (use llm_judge) | type: g-eval, value: "criteria" |
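
For intuition, a json_schema assertion behaves much like parsing the model output and validating it against the schema. The snippet below is a rough standalone equivalent using the third-party jsonschema package (PromptDev's internal implementation may differ):

import json

import jsonschema

schema = {
    "type": "object",
    "required": ["name", "event_type", "out_of_office"],
    "properties": {
        "name": {"type": "string"},
        "event_type": {"type": "string"},
        "out_of_office": {"type": "boolean"},
    },
}

output = '{"name": "Team sync", "event_type": "meeting", "out_of_office": false}'
try:
    # Fails if the output is not valid JSON or does not match the schema
    jsonschema.validate(json.loads(output), schema)
    print("json_schema assertion would pass")
except (json.JSONDecodeError, jsonschema.ValidationError) as exc:
    print(f"json_schema assertion would fail: {exc}")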

Promptfoo Compatibility

PromptDev maintains compatibility with promptfoo configurations to ease migration:

  • YAML configs - Most promptfoo YAML configs work with minimal changes
  • JSONL datasets - Existing test datasets are fully supported
  • Python assertions - Custom get_assert functions work without modification
  • JSON schemas - Schema validation uses the same format

Migration Notes:

  • Use json_schema instead of contains-json for new projects
  • Use llm_judge instead of llm-rubric or g-eval for better performance
  • Provider IDs use pydantic-ai: prefix (e.g., pydantic-ai:openai)
  • Model names follow PydanticAI format (e.g., openai:gpt-4)

Configuration

PromptDev uses YAML configuration files compatible with promptfoo format:

description: "Calendar event summary evaluation"

prompts:
  - file://./prompts/calendar_event_summary.yaml

providers:
  - id: "pydantic-ai:openai"
    model: "openai:gpt-4"
    config:
      temperature: 0.0
  - id: "pydantic-ai:ollama"
    model: "ollama:llama3.2:3b"

tests:
  - file: "./datasets/calendar_events.jsonl"

default_test:
  assert:
    - type: "json_schema"
      value:
        type: "object"
        required: ["name", "event_type", "out_of_office"]
        properties:
          name: {type: "string"}
          event_type: {type: "string"}
          out_of_office: {type: "boolean"}
    - type: "python" 
      value: "./assertions/calendar_assert.py"
    - type: "llm_judge"
      rubric: "Evaluate if the output correctly extracts calendar event information"
      model: "openai:gpt-4"

Advanced Features

PydanticAI Evals Integration

PromptDev leverages PydanticAI's pydantic_evals system for robust, type-safe evaluations:

  • LLMJudge Evaluator: Advanced semantic evaluation using LLMs with customizable rubrics (see the sketch after this list)
  • Type-safe Evaluation: Built on Pydantic's validation framework for reliable results
  • Schema Resolution: Comprehensive $ref resolution for assertion templates and schemas
  • Error Collection: Structured error reporting with detailed context and stack traces
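
As a rough illustration of the underlying pydantic_evals API that PromptDev drives from YAML, here is a hand-written sketch with a stubbed task (not PromptDev code; the judge also needs credentials for its model to actually run):

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

# One test case: raw event text in, JSON summary out.
dataset = Dataset(
    cases=[
        Case(
            name="team_sync",
            inputs="Team sync tomorrow at 10am; Alice is out of office.",
        )
    ],
    evaluators=[
        LLMJudge(
            rubric="Evaluate if the output correctly extracts calendar event information",
        )
    ],
)

async def summarize(text: str) -> str:
    # Stand-in for a real PydanticAI agent call.
    return '{"name": "Team sync", "event_type": "meeting", "out_of_office": true}'

report = dataset.evaluate_sync(summarize)
report.print(include_input=True, include_output=True)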

Custom Python Assertions

Create powerful custom evaluators:

# examples/assertions/calendar_assert.py
import json


def get_assert():
    def assert_expected(output, context):
        try:
            # Parse JSON from the LLM output
            data = json.loads(output)

            # Expected values come from the test variables
            expected_name = context['vars']['expected_name']
            expected_event_type = context['vars']['expected_event_type']

            # Detailed field-by-field validation
            checks = [
                ('Name', data.get('name'), expected_name,
                 data.get('name') == expected_name),
                ('Event Type', data.get('event_type'), expected_event_type,
                 data.get('event_type', '').lower() == expected_event_type.lower()),
            ]
            details = [
                {'field': field, 'actual': actual, 'expected': expected, 'passed': passed}
                for field, actual, expected, passed in checks
            ]
            score = sum(1 for _, _, _, passed in checks if passed)
            total_fields = len(checks)

            return {
                'pass': score == total_fields,
                'score': score / total_fields,
                'reason': ('All fields validated successfully'
                           if score == total_fields
                           else f'Field validation results: {total_fields - score} failed checks'),
                'details': details,
            }

        except Exception as e:
            return {
                'pass': False,
                'score': 0.0,
                'reason': f'JSON parsing failed: {e}',
                'details': [],
            }

    return assert_expected
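
To sanity-check an assertion before wiring it into a config, you can call it directly. The snippet below uses made-up output and variables and simply mimics how the get_assert function is invoked:

assert_fn = get_assert()
result = assert_fn(
    '{"name": "Team sync", "event_type": "meeting"}',
    {'vars': {'expected_name': 'Team sync', 'expected_event_type': 'meeting'}},
)
print(result['pass'], result['score'], result['reason'])
# True 1.0 All fields validated successfully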

Caching System

PromptDev includes a high-performance file-based cache:

  • Automatic Caching: Caches agent outputs based on model, prompt, and inputs (sketched after this list)
  • TTL Support: Configurable time-to-live for cache entries
  • Thread-Safe: Concurrent evaluation support with atomic file operations
  • Cache Management: CLI commands for stats and cleanup
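
As a mental model, the cache keys each result by hashing everything that influences the agent's output and discards entries older than the TTL. The sketch below is a simplified illustration, not PromptDev's actual implementation; the directory name, file layout, and key derivation are assumptions:

import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path(".promptdev_cache")  # hypothetical location
TTL_SECONDS = 3600

def cache_key(model: str, prompt: str, inputs: dict) -> str:
    # Stable hash over everything that influences the agent's output
    payload = json.dumps({"model": model, "prompt": prompt, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_cached(model: str, prompt: str, inputs: dict):
    path = CACHE_DIR / f"{cache_key(model, prompt, inputs)}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if time.time() - entry["created_at"] > TTL_SECONDS:
        path.unlink()  # expired: evict and treat as a miss
        return None
    return entry["output"]

def set_cached(model: str, prompt: str, inputs: dict, output: str) -> None:
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt, inputs)}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"created_at": time.time(), "output": output}))
    tmp.replace(path)  # atomic rename keeps concurrent readers consistent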

Rich Reporting

Comprehensive evaluation reports include:

  • Provider Comparison: Side-by-side performance across multiple providers
  • Detailed Failure Analysis: Field-level breakdowns for failed assertions
  • Hierarchical Test Display: Tree view of failures organized by provider
  • Performance Metrics: Pass rates, scores, and timing information
  • Error Summary: Collected evaluation errors with full context

Development

# Setup development environment
uv sync --extra dev

# Run tests
uv run pytest

# Lint code
uv run ruff check promptdev/

# Format code
uv run ruff format promptdev/

# Type checking
uv run mypy promptdev/

Roadmap

  • Core evaluation engine with PydanticAI integration
  • Multi-provider support for major AI platforms
  • YAML configuration loading with promptfoo compatibility
  • Comprehensive assertion types (JSON schema, Python, LLM-based)
  • File-based caching system with TTL support
  • Rich console reporting with failure analysis
  • Simple file disk cache
  • Deeper integration with PydanticAI instead of reinventing the wheel
  • Testing
  • Concurrent execution for faster large-scale evaluations
  • Code cleanup
  • Testing pensero promptfoo files
  • Add support for running multiple test_cases
  • CI/CD integration helpers with change detection
  • Red team security testing capabilities
  • Turso persistence for evaluation history and analytics
  • Performance benchmarking and regression detection
  • Distributed evaluation across multiple machines

Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Install development dependencies: uv sync
  4. Make your changes and add tests
  5. Run tests: uv run pytest
  6. Commit your changes: git commit -m 'Add amazing feature'
  7. Push to the branch: git push origin feature/amazing-feature
  8. Open a Pull Request

Development Setup

git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run pytest  # Run tests
uv run promptdev --help  # Test CLI

Code Style

We use ruff for code formatting and linting, and pytest for testing. Please ensure your code follows these standards:

uv run ruff check .       # Lint code
uv run ruff format .      # Format code
uv run pytest           # Run tests

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on PydanticAI for type-safe AI agent development
  • Inspired by promptfoo for evaluation concepts
  • Uses Rich for beautiful console output
