# PromptDev

Python-native prompt evaluation tool using PydanticAI
PromptDev is a modern prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers. Built on PydanticAI, it combines type safety with powerful evaluation capabilities.
> [!WARNING]
> PromptDev is in preview and is not ready for production use.
> We're working hard to make it stable and feature-complete, but until then, expect to encounter bugs, missing features, and fatal errors.
## Features
- 🔒 Type Safe - Full Pydantic validation for inputs, outputs, and configurations
- 🤖 PydanticAI Integration - Native support for PydanticAI agents and evaluation framework
- 📊 Multi-Provider Testing - Test across OpenAI, Together.ai, Ollama, Bedrock, and more
- ⚡ Performance Optimized - File-based caching with TTL for faster repeated evaluations
- 📈 Rich Reporting - Beautiful console output with detailed failure analysis and provider comparisons
- 🧪 Promptfoo Compatible - Works with (some) existing promptfoo YAML configs and datasets
- 🎯 Comprehensive Assertions - Built-in evaluators plus custom Python assertion support
## Quick Start

### Installation

From PyPI (when available):

```shell
pip install promptdev
```

From source:

```shell
git clone https://github.com/artefactop/promptdev.git
cd promptdev
pip install -e .
```

For development:

```shell
git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run promptdev --help
```
### Basic Usage

```shell
# Run an evaluation
promptdev eval examples/calendar_event_summary.yaml

# Override the provider
promptdev eval examples/calendar_event_summary.yaml --provider pydantic-ai:openai

# Disable caching for a run
promptdev eval examples/calendar_event_summary.yaml --no-cache

# Export results
promptdev eval examples/calendar_event_summary.yaml --output json
promptdev eval examples/calendar_event_summary.yaml --output html

# Validate a configuration
promptdev validate examples/calendar_event_summary.yaml

# Cache management
promptdev cache stats
promptdev cache clear
```
## Assertion Types

PromptDev supports a comprehensive set of evaluators for different testing scenarios:
| Type | Status | Description | Example Usage |
|---|---|---|---|
| **Core PydanticAI Evaluators** | | | |
| `exact` | ✅ | Exact string/value matching | `type: exact` |
| `is_instance` | ✅ | Type checking | `type: is_instance, value: "str"` |
| `llm_judge` | ✅ | LLM-based semantic evaluation | `type: llm_judge, rubric: "Evaluate accuracy"` |
| **PromptDev Custom Evaluators** | | | |
| `json_schema` | ✅ | JSON schema validation | `type: json_schema, value: {schema}` |
| `python` | ✅ | Custom Python assertions | `type: python, value: "./assert.py"` |
| `contains` | ✅ | Substring matching | `type: contains, value: "expected text"` |
| **Promptfoo Compatibility** | | | |
| `contains-json` | ✅ (deprecated) | JSON schema validation (use `json_schema`) | `type: contains-json, value: {schema}` |
| `llm-rubric` | ✅ (deprecated) | LLM evaluation (use `llm_judge`) | `type: llm-rubric, value: "rubric text"` |
| `g-eval` | ✅ (deprecated) | G-Eval methodology (use `llm_judge`) | `type: g-eval, value: "criteria"` |
## Promptfoo Compatibility

PromptDev maintains compatibility with promptfoo configurations to ease migration:

- **YAML configs** - Most promptfoo YAML configs work with minimal changes
- **JSONL datasets** - Existing test datasets are fully supported
- **Python assertions** - Custom `get_assert` functions work without modification
- **JSON schemas** - Schema validation uses the same format
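As an illustration, a promptfoo-style JSONL dataset row carries the test variables that assertions read back out of `context['vars']`. The field names below are hypothetical, chosen to match the variables used in the custom assertion example later in this document:

```jsonl
{"vars": {"event_text": "Team offsite, Friday 10am", "expected_name": "Team offsite", "expected_event_type": "meeting"}}
```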
**Migration Notes:**

- Use `json_schema` instead of `contains-json` for new projects
- Use `llm_judge` instead of `llm-rubric` or `g-eval` for better performance
- Provider IDs use the `pydantic-ai:` prefix (e.g., `pydantic-ai:openai`)
- Model names follow the PydanticAI format (e.g., `openai:gpt-4`)
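For example, migrating a promptfoo assertion block per the notes above might look like this (a minimal sketch; surrounding config keys are unchanged):

```yaml
# Before (promptfoo)
assert:
  - type: contains-json
    value: {type: object, required: [name]}
  - type: llm-rubric
    value: "Output is accurate"

# After (PromptDev)
assert:
  - type: json_schema
    value: {type: object, required: [name]}
  - type: llm_judge
    rubric: "Output is accurate"
```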
## Configuration

PromptDev uses YAML configuration files compatible with the promptfoo format:

```yaml
description: "Calendar event summary evaluation"

prompts:
  - file://./prompts/calendar_event_summary.yaml

providers:
  - id: "pydantic-ai:openai"
    model: "openai:gpt-4"
    config:
      temperature: 0.0
  - id: "pydantic-ai:ollama"
    model: "ollama:llama3.2:3b"

tests:
  - file: "./datasets/calendar_events.jsonl"

default_test:
  assert:
    - type: "json_schema"
      value:
        type: "object"
        required: ["name", "event_type", "out_of_office"]
        properties:
          name: {type: "string"}
          event_type: {type: "string"}
          out_of_office: {type: "boolean"}
    - type: "python"
      value: "./assertions/calendar_assert.py"
    - type: "llm_judge"
      rubric: "Evaluate if the output correctly extracts calendar event information"
      model: "openai:gpt-4"
```
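To make concrete what the `json_schema` assertion checks, here is a stdlib-only sketch of the same validation. This is not PromptDev's implementation; it hand-rolls the small subset of JSON Schema used in the config (required keys plus primitive types) rather than using a full validator:

```python
import json

# The schema from the config above (subset: required keys + primitive types).
SCHEMA = {
    "type": "object",
    "required": ["name", "event_type", "out_of_office"],
    "properties": {
        "name": {"type": "string"},
        "event_type": {"type": "string"},
        "out_of_office": {"type": "boolean"},
    },
}

TYPE_MAP = {"string": str, "boolean": bool, "object": dict}


def check_output(raw: str, schema: dict = SCHEMA) -> tuple[bool, list[str]]:
    """Return (passed, errors) for a model's raw JSON output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e}"]
    if not isinstance(data, dict):
        return False, ["expected a JSON object"]
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], TYPE_MAP[spec["type"]]):
            errors.append(f"field {key!r} should be {spec['type']}")
    return not errors, errors


ok, errs = check_output('{"name": "Standup", "event_type": "meeting", "out_of_office": false}')
print(ok, errs)  # → True []
```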
## Advanced Features

### PydanticAI Evals Integration

PromptDev leverages PydanticAI's `pydantic_evals` system for robust, type-safe evaluations:

- **LLMJudge Evaluator**: Advanced semantic evaluation using LLMs with customizable rubrics
- **Type-safe Evaluation**: Built on Pydantic's validation framework for reliable results
- **Schema Resolution**: Comprehensive `$ref` resolution for assertion templates and schemas
- **Error Collection**: Structured error reporting with detailed context and stack traces
### Custom Python Assertions

Create powerful custom evaluators in plain Python:

```python
# examples/assertions/calendar_assert.py
import json


def get_assert():
    def assert_expected(output, context):
        try:
            # Parse JSON from the LLM output
            data = json.loads(output)

            # Expected values come from the test case's variables
            expected_name = context['vars']['expected_name']
            expected_event_type = context['vars']['expected_event_type']

            # Detailed field-by-field validation
            details = []
            score = 0
            total_fields = 2

            # Validate name (exact match)
            name_ok = data.get('name') == expected_name
            details.append({'field': 'Name', 'actual': data.get('name'),
                            'expected': expected_name, 'passed': name_ok})
            score += name_ok

            # Validate event type (case-insensitive match)
            type_ok = (data.get('event_type') or '').lower() == expected_event_type.lower()
            details.append({'field': 'Event Type', 'actual': data.get('event_type'),
                            'expected': expected_event_type, 'passed': type_ok})
            score += type_ok

            failed = total_fields - score
            return {
                'pass': score == total_fields,
                'score': score / total_fields,
                'reason': f'Field validation results: {failed} failed checks' if failed
                          else 'All fields validated successfully',
                'details': details,
            }
        except Exception as e:
            return {
                'pass': False,
                'score': 0.0,
                'reason': f'JSON parsing failed: {e}',
                'details': [],
            }

    return assert_expected
```
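A quick way to exercise such an assertion outside PromptDev is to call the function returned by `get_assert` directly with a fake output and context. The `context['vars']` shape follows the promptfoo convention used above; the stand-in assertion and the runner invocation here are illustrative, not PromptDev's internal API:

```python
import json


def get_assert():
    """Minimal stand-in with the same contract as calendar_assert.py."""
    def assert_expected(output, context):
        data = json.loads(output)
        ok = data.get('name') == context['vars']['expected_name']
        return {'pass': ok, 'score': float(ok), 'reason': 'name check', 'details': []}
    return assert_expected


# How a runner invokes it: one call per test case.
context = {'vars': {'expected_name': 'Standup', 'expected_event_type': 'meeting'}}
result = get_assert()('{"name": "Standup", "event_type": "Meeting"}', context)
print(result['pass'], result['score'])  # → True 1.0
```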
### Caching System
PromptDev includes a high-performance file-based cache:
- Automatic Caching: Caches agent outputs based on model, prompt, and inputs
- TTL Support: Configurable time-to-live for cache entries
- Thread-Safe: Concurrent evaluation support with atomic file operations
- Cache Management: CLI commands for stats and cleanup
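The approach described above can be sketched in a few lines. This is a minimal illustration of a file-based TTL cache, not PromptDev's actual implementation; the key derivation and on-disk layout are assumptions:

```python
import hashlib
import json
import os
import tempfile
import time


class FileCache:
    """Minimal file-based cache with TTL, keyed by model + prompt + inputs."""

    def __init__(self, cache_dir: str, ttl_seconds: float = 3600):
        self.cache_dir = cache_dir
        self.ttl = ttl_seconds
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, model: str, prompt: str, inputs: dict) -> str:
        # Stable key: hash the canonical JSON of everything that affects the output.
        key = json.dumps({"model": model, "prompt": prompt, "inputs": inputs},
                         sort_keys=True)
        digest = hashlib.sha256(key.encode()).hexdigest()
        return os.path.join(self.cache_dir, f"{digest}.json")

    def get(self, model: str, prompt: str, inputs: dict):
        try:
            with open(self._path(model, prompt, inputs)) as f:
                entry = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return None
        if time.time() - entry["ts"] > self.ttl:
            return None  # expired
        return entry["output"]

    def set(self, model: str, prompt: str, inputs: dict, output: str) -> None:
        # Write to a temp file, then rename: atomic on POSIX, safe under concurrency.
        entry = {"ts": time.time(), "output": output}
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(entry, f)
        os.replace(tmp, self._path(model, prompt, inputs))


cache = FileCache(tempfile.mkdtemp(), ttl_seconds=60)
cache.set("openai:gpt-4", "Summarize", {"text": "hi"}, '{"name": "x"}')
print(cache.get("openai:gpt-4", "Summarize", {"text": "hi"}))
```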
### Rich Reporting
Comprehensive evaluation reports include:
- Provider Comparison: Side-by-side performance across multiple providers
- Detailed Failure Analysis: Field-level breakdowns for failed assertions
- Hierarchical Test Display: Tree view of failures organized by provider
- Performance Metrics: Pass rates, scores, and timing information
- Error Summary: Collected evaluation errors with full context
## Development

```shell
# Set up the development environment
uv sync --extra dev

# Run tests
uv run pytest

# Lint and format code
uv run ruff check promptdev/
uv run ruff format promptdev/

# Type checking
uv run mypy promptdev/
```
## Roadmap

- Core evaluation engine with PydanticAI integration
- Multi-provider support for major AI platforms
- YAML configuration loading with promptfoo compatibility
- Comprehensive assertion types (JSON schema, Python, LLM-based)
- File-based caching system with TTL support
- Rich console reporting with failure analysis
- Simple file-based disk cache
- Deeper integration with PydanticAI (avoid reinventing the wheel)
- Test coverage
- Concurrent execution for faster large-scale evaluations
- Code cleanup
- Testing pensero promptfoo files
- Support for running multiple test cases
- CI/CD integration helpers with change detection
- Red-team security testing capabilities
- Turso persistence for evaluation history and analytics
- Performance benchmarking and regression detection
- Distributed evaluation across multiple machines
## Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Install development dependencies: `uv sync`
4. Make your changes and add tests
5. Run tests: `uv run pytest`
6. Commit your changes: `git commit -m 'Add amazing feature'`
7. Push to the branch: `git push origin feature/amazing-feature`
8. Open a Pull Request
### Development Setup

```shell
git clone https://github.com/artefactop/promptdev.git
cd promptdev
uv sync
uv run pytest            # Run tests
uv run promptdev --help  # Test the CLI
```
### Code Style

We use ruff for code formatting and linting, and pytest for testing. Please ensure your code follows these standards:

```shell
uv run ruff check .   # Lint code
uv run ruff format .  # Format code
uv run pytest         # Run tests
```
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Built on PydanticAI for type-safe AI agent development
- Inspired by promptfoo for evaluation concepts
- Uses Rich for beautiful console output