# PraisonAI Bench Python Evaluator Plugin

A comprehensive Python code evaluation plugin for PraisonAI Bench.
## Overview

The Python Evaluator Plugin enables PraisonAI Bench to evaluate Python code through a comprehensive multi-stage assessment:

- ✅ **Syntax Validation** (30 points) - AST-based Python syntax checking
- ✅ **Code Execution** (40 points) - Safe subprocess execution with timeout protection
- ✅ **Output Comparison** (30 points) - Fuzzy matching against expected results

**Total Score:** 0-100 | **Pass Threshold:** ≥70
## Quick Start

### Installation

#### Using uv (Recommended)

```bash
# Clone or download the plugin
cd praisonaibench-python

# Install with uv
uv pip install -e .
```

#### Using pip

```bash
# Install from directory
cd praisonaibench-python
pip install -e .
```

#### Verify Installation

```bash
# Check that the plugin is registered
python -c "from praisonaibench_python import PythonEvaluator; print('Plugin loaded successfully!')"
```
### Configuration

Create a `.env` file (or copy from `.env.example`):

```bash
# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here

# Default model
DEFAULT_MODEL=gpt-4o-mini

# Execution timeout (seconds)
PYTHON_EXECUTION_TIMEOUT=5
```
### Basic Usage

Create a test suite file `tests.yaml`:

```yaml
tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"

  - name: "calculate_factorial"
    language: "python"
    prompt: "Write a Python function that calculates the factorial of 5"
    expected: "120"
```

Run the benchmarks:

```bash
praisonaibench --suite tests.yaml --model gpt-4o-mini
```
## Evaluation System

### Scoring Breakdown

The evaluator uses a three-stage assessment system:

| Stage | Points | Description |
|---|---|---|
| Syntax Validation | 30 | AST parsing, import detection |
| Code Execution | 40 | Safe subprocess execution, error capture |
| Output Comparison | 30 | Fuzzy matching with expected output |
| **Total** | **100** | Combined score |

**Pass Threshold:** 70/100 points
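The first stage, AST-based syntax validation with import detection, can be sketched with the standard `ast` module. This is an illustrative sketch of the idea, not the plugin's actual implementation:

```python
import ast

def check_syntax(code: str) -> tuple[bool, list[str]]:
    """Validate Python syntax via ast.parse and collect imported module names."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False, []
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import math, sys` yields one alias per module
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from os import path` records the source module
            imports.append(node.module)
    return True, imports
```

Because `ast.parse` never executes the code, this stage is safe to run on untrusted model output.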
### Scoring Examples

**Example 1: Perfect Score (100/100)**

```
# Code: print("Hello World")
# Expected: "Hello World"

✅ Syntax:    30 points (valid Python)
✅ Execution: 40 points (runs successfully)
✅ Output:    30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED
```

**Example 2: Partial Score (70/100)**

```
# Code: print("Hello")
# Expected: "Hello World"

✅ Syntax:    30 points (valid Python)
✅ Execution: 40 points (runs successfully)
⚠️ Output:     0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED
```

**Example 3: Failure (30/100)**

```
# Code: print(undefined_variable)
# Expected: "Hello World"

✅ Syntax:    30 points (valid syntax)
❌ Execution:  0 points (NameError)
❌ Output:     0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED
```
## Usage Guide

### Python API

```python
from praisonaibench_python import PythonEvaluator

# Create evaluator
evaluator = PythonEvaluator(timeout=5)

# Evaluate code
result = evaluator.evaluate(
    code='print("Hello World")',
    test_name="hello_test",
    prompt="Write Python code that prints Hello World",
    expected="Hello World"
)

# Check results
print(f"Score: {result['score']}/100")
print(f"Passed: {result['passed']}")

# View feedback
for item in result['feedback']:
    print(f"{item['level']}: {item['message']}")

# Access details
print(f"Output: {result['details']['output']}")
print(f"Score breakdown: {result['details']['score_breakdown']}")
```
### Test Suite Format

#### Simple Test

```yaml
tests:
  - name: "basic_math"
    language: "python"
    prompt: "Calculate 15 * 23 and print the result"
    expected: "345"
```

#### Advanced Test

```yaml
tests:
  - name: "fibonacci"
    language: "python"
    prompt: |
      Write a Python function that calculates the nth Fibonacci number.
      Calculate and print the 10th Fibonacci number.
    expected: "55"
```

#### Test Without Expected Output

```yaml
tests:
  - name: "creative_code"
    language: "python"
    prompt: "Write a Python class for a simple calculator"
    # No expected field - evaluation is based on syntax and execution only
```
### Command Line Interface

```bash
# Run a single test suite
praisonaibench --suite examples/simple_tests.yaml --model gpt-4o-mini

# Run with a specific model
praisonaibench --suite examples/advanced_tests.yaml --model gpt-4o

# Run with custom configuration
praisonaibench --suite tests.yaml --config custom_config.yaml
```
## Features

### Security Features

- ✅ **Subprocess Isolation** - Code runs in a separate process
- ✅ **Timeout Protection** - Configurable execution timeout (default: 5s)
- ✅ **Resource Limits** - Prevents infinite loops and resource exhaustion
- ✅ **Error Handling** - Graceful handling of all error types
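The isolation and timeout behavior above can be sketched with the standard library; this is an illustrative sketch of the execution stage, not the plugin's exact code:

```python
import os
import subprocess
import sys
import tempfile

def run_code(code: str, timeout: int = 5) -> dict:
    """Execute Python code in a separate process, enforcing a wall-clock timeout."""
    # Write the code to a temporary file so the child process runs it in isolation
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,  # raises TimeoutExpired if exceeded
        )
        return {
            "executed": proc.returncode == 0,
            "output": proc.stdout.strip(),
            "error": proc.stderr.strip(),
        }
    except subprocess.TimeoutExpired:
        return {"executed": False, "output": "",
                "error": f"Execution timed out after {timeout}s"}
    finally:
        os.unlink(path)
```

A crashed or hung child process cannot take down the benchmark run: errors surface in `stderr`, and a timeout kills the subprocess and is reported as a failed execution.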
### Code Extraction

The evaluator automatically extracts code from various response formats:

- Markdown Python code blocks (fenced with ```` ```python ````)
- Generic code blocks (fenced with ```` ``` ````)
- Raw, unfenced code (e.g. `print('Hello')`)
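A minimal sketch of this extraction logic, trying the most specific format first (illustrative only; the plugin's real extractor may handle more cases):

```python
import re

def extract_code(text: str) -> str:
    """Pull Python source out of an LLM response.

    Tries a ```python fence first, then any generic fence,
    then falls back to treating the whole response as raw code.
    """
    for pattern in (r"```python\s*\n(.*?)```", r"```\s*\n(.*?)```"):
        match = re.search(pattern, text, re.DOTALL)
        if match:
            return match.group(1).strip()
    return text.strip()
```

Ordering matters: checking the `python`-tagged fence before the generic fence avoids accidentally capturing the language tag as part of the code.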
### Output Comparison
Smart fuzzy matching algorithm:
- **Exact match**: 30/30 points
- **High similarity** (>80%): 25-29 points
- **Medium similarity** (50-80%): 15-24 points
- **Low similarity** (<50%): 0-14 points
Features:
- Case-insensitive comparison
- Whitespace normalisation
- Substring matching (e.g., "345" in "The answer is 345")
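The tiers and normalization rules above could be implemented roughly as follows, using `difflib` for the similarity ratio. The exact thresholds and interpolation within each band are assumptions for illustration:

```python
import difflib

def score_output(actual: str, expected: str) -> int:
    """Map output similarity onto the 0-30 point comparison band."""
    # Case-insensitive comparison with whitespace normalization
    a = " ".join(actual.lower().split())
    e = " ".join(expected.lower().split())
    # Exact match or substring match (e.g. "345" in "The answer is 345")
    if a == e or e in a:
        return 30
    ratio = difflib.SequenceMatcher(None, a, e).ratio()
    if ratio > 0.8:                                   # high similarity: 25-29
        return 25 + round((ratio - 0.8) / 0.2 * 4)
    if ratio >= 0.5:                                  # medium similarity: 15-24
        return 15 + round((ratio - 0.5) / 0.3 * 9)
    return round(ratio / 0.5 * 14)                    # low similarity: 0-14
```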
### Detailed Feedback

```python
{
    "score": 85,
    "passed": True,
    "feedback": [
        {"level": "success", "message": "✅ Valid Python syntax"},
        {"level": "info", "message": "📦 Imports: math, sys"},
        {"level": "success", "message": "✅ Code executed successfully"},
        {"level": "info", "message": "📤 Output: Hello World"},
        {"level": "warning", "message": "⚠️ Output partially matches expected"}
    ],
    "details": {
        "extracted_code": "print('Hello World')",
        "executed": True,
        "output": "Hello World",
        "similarity": 0.95,
        "score_breakdown": {
            "syntax": 30,
            "execution": 40,
            "output_match": 28
        }
    }
}
```
## Examples

### Example 1: Hello World

```yaml
tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"
```

### Example 2: Factorial Function

```yaml
tests:
  - name: "factorial"
    language: "python"
    prompt: |
      Write a Python function that calculates the factorial of a number.
      Calculate factorial(5) and print the result.
    expected: "120"
```

### Example 3: List Operations

```yaml
tests:
  - name: "list_sum"
    language: "python"
    prompt: |
      Create a list [1, 2, 3, 4, 5], calculate the sum, and print it.
    expected: "15"
```

More examples are available in:

- `examples/simple_tests.yaml` - Basic Python tests
- `examples/advanced_tests.yaml` - Complex Python challenges
- `examples/algorithm_tests.yaml` - Algorithm implementations
## Testing

### Run Unit Tests

```bash
# Install development dependencies
uv pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_evaluator.py -v

# Run with coverage
pytest tests/ --cov=praisonaibench_python --cov-report=html
```
### Test Coverage

The plugin includes comprehensive tests:

- ✅ **Unit Tests** (`tests/test_evaluator.py`)
  - Code extraction
  - Syntax validation
  - Code execution
  - Output comparison
  - Error handling
  - Timeout protection
- ✅ **Integration Tests** (`tests/test_integration.py`)
  - Plugin interface compatibility
  - Multiple test scenarios
  - Concurrent evaluations
  - Large output handling
  - Import support
## Configuration

### Environment Variables

```bash
# Required
OPENAI_API_KEY=your_api_key_here

# Optional
DEFAULT_MODEL=gpt-4o-mini
PYTHON_EXECUTION_TIMEOUT=5
PYTHON_EXECUTABLE=/path/to/python  # Leave empty for the system default
```
### Programmatic Configuration

```python
from praisonaibench_python import PythonEvaluator

# Custom timeout
evaluator = PythonEvaluator(timeout=10)

# Custom Python executable
evaluator = PythonEvaluator(
    timeout=5,
    python_executable="/usr/bin/python3.11"
)
```
## Architecture

### Plugin Structure

```
praisonaibench-python/
├── src/praisonaibench_python/
│   ├── __init__.py          # Plugin exports
│   ├── evaluator.py         # Main evaluator class
│   └── version.py           # Version info
├── tests/
│   ├── test_evaluator.py    # Unit tests
│   └── test_integration.py  # Integration tests
├── examples/
│   ├── simple_tests.yaml
│   ├── advanced_tests.yaml
│   └── algorithm_tests.yaml
├── pyproject.toml           # Project configuration
├── .env                     # Configuration
└── README.md                # This file
```
### Class Hierarchy

```
BaseEvaluator (from praisonaibench)
└── PythonEvaluator
    ├── get_language() → 'python'
    ├── get_file_extension() → 'py'
    └── evaluate(code, test_name, prompt, expected) → dict
```
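A custom evaluator following this hierarchy might look like the sketch below. The `BaseEvaluator` stand-in is defined inline only so the example is self-contained; in practice you would import the real base class from `praisonaibench`, whose exact interface may differ:

```python
from abc import ABC, abstractmethod

# Stand-in for praisonaibench's BaseEvaluator (assumed shape, for illustration)
class BaseEvaluator(ABC):
    @abstractmethod
    def get_language(self) -> str: ...

    @abstractmethod
    def get_file_extension(self) -> str: ...

    @abstractmethod
    def evaluate(self, code, test_name, prompt=None, expected=None) -> dict: ...

class MyPythonEvaluator(BaseEvaluator):
    """Minimal subclass showing the three methods an evaluator implements."""

    def get_language(self) -> str:
        return "python"

    def get_file_extension(self) -> str:
        return "py"

    def evaluate(self, code, test_name, prompt=None, expected=None) -> dict:
        # A real evaluator would run the syntax, execution, and
        # output-comparison stages here and tally the score.
        return {"score": 0, "passed": False, "feedback": [], "details": {}}
```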
## Contributing

Contributions are welcome! Here's how:

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes
4. Run tests: `pytest tests/ -v`
5. Format code: `black src/ tests/`
6. Submit a pull request
### Development Setup

```bash
# Clone the repository
git clone https://github.com/YourUsername/praisonaibench-python
cd praisonaibench-python

# Install in development mode
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/
```
## License

MIT License - see the LICENSE file for details.
## Links

- PraisonAI Bench - Main project
- Plugin System Documentation
- Issue Tracker

## Support

- **Issues:** GitHub Issues
- **Documentation:** PraisonAI Bench Docs
- **Community:** Join the discussion on GitHub
## Acknowledgements

Built with ❤️ for the PraisonAI Bench community.

Special thanks to:

- PraisonAI - For the amazing benchmarking framework
- Contributors and testers
- The Python community

Ready to benchmark Python code generation? Install the plugin and start testing!