
PraisonAI Bench Python Evaluator Plugin

๐Ÿ A comprehensive Python code evaluation plugin for PraisonAI Bench

Python 3.8+ | License: MIT | Code style: black

🎯 Overview

The Python Evaluator Plugin enables PraisonAI Bench to evaluate Python code through comprehensive multi-stage assessment:

  • ✅ Syntax Validation (30 points) - AST-based Python syntax checking (sketched below)
  • ✅ Code Execution (40 points) - Safe subprocess execution with timeout protection
  • ✅ Output Comparison (30 points) - Fuzzy matching against expected results

Total Score: 0-100 | Pass Threshold: ≥70
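
As an illustration of the first stage, here is a minimal sketch of AST-based syntax checking (a simplification, not the plugin's actual implementation):

import ast

def check_syntax(code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

assert check_syntax('print("Hello World")')
assert not check_syntax('print(')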

🚀 Quick Start

Installation

Using uv (Recommended)

# Clone or download the plugin
cd praisonaibench-python

# Install with uv
uv pip install -e .

Using pip

# Install from directory
cd praisonaibench-python
pip install -e .

Verify Installation

# Check that the plugin is registered
python -c "from praisonaibench_python import PythonEvaluator; print('Plugin loaded successfully!')"

Configuration

Create a .env file (or copy from .env.example):

# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here

# Default model
DEFAULT_MODEL=gpt-4o-mini

# Execution timeout (seconds)
PYTHON_EXECUTION_TIMEOUT=5
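
These are plain environment variables; a hedged sketch of how a runner might read them (assuming python-dotenv loads the .env file, which is not necessarily what PraisonAI Bench itself uses):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # copy .env entries into the process environment
timeout = int(os.environ.get("PYTHON_EXECUTION_TIMEOUT", "5"))
model = os.environ.get("DEFAULT_MODEL", "gpt-4o-mini")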

Basic Usage

Create a test suite file tests.yaml:

tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"
  
  - name: "calculate_factorial"
    language: "python"
    prompt: "Write a Python function that calculates factorial of 5"
    expected: "120"

Run the benchmarks:

praisonaibench --suite tests.yaml --model gpt-4o-mini

📊 Evaluation System

Scoring Breakdown

The evaluator uses a three-stage assessment system:

Stage               Points   Description
Syntax Validation   30       AST parsing, import detection
Code Execution      40       Safe subprocess execution, error capture
Output Comparison   30       Fuzzy matching with expected output
Total               100      Combined score

Pass Threshold: 70/100 points
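
The total is simply the sum of the three stage scores, compared against the threshold:

# Illustrative only: stage scores as reported in the result's score_breakdown
score_breakdown = {"syntax": 30, "execution": 40, "output_match": 30}
total = sum(score_breakdown.values())  # 100
passed = total >= 70                   # True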

Scoring Examples

Example 1: Perfect Score (100/100)

# Code: print("Hello World")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED

Example 2: Partial Score (70/100)

# Code: print("Hello")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
⚠️ Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED

Example 3: Failure (30/100)

# Code: print(undefined_variable)
# Expected: "Hello World"

✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (NameError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED

📖 Usage Guide

Python API

from praisonaibench_python import PythonEvaluator

# Create evaluator
evaluator = PythonEvaluator(timeout=5)

# Evaluate code
result = evaluator.evaluate(
    code='print("Hello World")',
    test_name="hello_test",
    prompt="Write Python code that prints Hello World",
    expected="Hello World"
)

# Check results
print(f"Score: {result['score']}/100")
print(f"Passed: {result['passed']}")

# View feedback
for item in result['feedback']:
    print(f"{item['level']}: {item['message']}")

# Access details
print(f"Output: {result['details']['output']}")
print(f"Score breakdown: {result['details']['score_breakdown']}")

Test Suite Format

Simple Test

tests:
  - name: "basic_math"
    language: "python"
    prompt: "Calculate 15 * 23 and print the result"
    expected: "345"

Advanced Test

tests:
  - name: "fibonacci"
    language: "python"
    prompt: |
      Write a Python function that calculates the nth Fibonacci number.
      Calculate and print the 10th Fibonacci number.
    expected: "55"

Test Without Expected Output

tests:
  - name: "creative_code"
    language: "python"
    prompt: "Write a Python class for a simple calculator"
    # No expected field - evaluation based on syntax and execution only

Command Line Interface

# Run single test suite
praisonaibench --suite examples/simple_tests.yaml --model gpt-4o-mini

# Run with specific model
praisonaibench --suite examples/advanced_tests.yaml --model gpt-4o

# Run with custom configuration
praisonaibench --suite tests.yaml --config custom_config.yaml

🎨 Features

Security Features

  • ✅ Subprocess Isolation - Code runs in a separate process (see the sketch after this list)
  • ✅ Timeout Protection - Configurable execution timeout (default: 5s)
  • ✅ Resource Limits - Prevents infinite loops and resource exhaustion
  • ✅ Error Handling - Graceful handling of all error types
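
To see how subprocess isolation and timeout protection fit together, here is a minimal sketch (illustrative; the plugin's internals may differ):

import os
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout: float = 5.0):
    """Execute code in a separate Python process with a hard timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", f"Timed out after {timeout}s"
    finally:
        os.unlink(path)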

Code Extraction

Automatically extracts code from various formats:

# Supports markdown code blocks
"""
```python
print("Hello")
```
"""

# Supports generic code blocks
"""
```
print("Hello")
```
"""

# Supports raw code
"print('Hello')"

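A simple regex is enough to handle all three formats; a hedged sketch (the plugin's actual extractor may be more robust):

import re

def extract_code(response: str) -> str:
    """Prefer a fenced code block; otherwise treat the whole response as raw code."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()
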
Output Comparison

Smart fuzzy matching algorithm (approximated in the sketch after the feature list):

  • Exact match: 30/30 points
  • High similarity (>80%): 25-29 points
  • Medium similarity (50-80%): 15-24 points
  • Low similarity (<50%): 0-14 points

Features:

  • Case-insensitive comparison
  • Whitespace normalisation
  • Substring matching (e.g., "345" in "The answer is 345")

Detailed Feedback

A full evaluation result looks like this:

{
  "score": 85,
  "passed": True,
  "feedback": [
    {"level": "success", "message": "✅ Valid Python syntax"},
    {"level": "info", "message": "📦 Imports: math, sys"},
    {"level": "success", "message": "✅ Code executed successfully"},
    {"level": "info", "message": "📤 Output: Hello World"},
    {"level": "warning", "message": "⚠️ Output partially matches expected"}
  ],
  "details": {
    "extracted_code": "print('Hello World')",
    "executed": True,
    "output": "Hello World",
    "similarity": 0.95,
    "score_breakdown": {
      "syntax": 30,
      "execution": 40,
      "output_match": 28
    }
  }
}

📚 Examples

Example 1: Hello World

tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"

Example 2: Factorial Function

tests:
  - name: "factorial"
    language: "python"
    prompt: |
      Write a Python function that calculates the factorial of a number.
      Calculate factorial(5) and print the result.
    expected: "120"

Example 3: List Operations

tests:
  - name: "list_sum"
    language: "python"
    prompt: |
      Create a list [1, 2, 3, 4, 5], calculate the sum, and print it.
    expected: "15"

More examples available in:

  • examples/simple_tests.yaml - Basic Python tests
  • examples/advanced_tests.yaml - Complex Python challenges
  • examples/algorithm_tests.yaml - Algorithm implementations

🧪 Testing

Run Unit Tests

# Install development dependencies
uv pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_evaluator.py -v

# Run with coverage
pytest tests/ --cov=praisonaibench_python --cov-report=html

Test Coverage

The plugin includes comprehensive tests:

  • ✅ Unit Tests (tests/test_evaluator.py)

    • Code extraction
    • Syntax validation
    • Code execution
    • Output comparison
    • Error handling
    • Timeout protection
  • ✅ Integration Tests (tests/test_integration.py)

    • Plugin interface compatibility
    • Multiple test scenarios
    • Concurrent evaluations
    • Large output handling
    • Import support

🔧 Configuration

Environment Variables

# Required
OPENAI_API_KEY=your_api_key_here

# Optional
DEFAULT_MODEL=gpt-4o-mini
PYTHON_EXECUTION_TIMEOUT=5
PYTHON_EXECUTABLE=/path/to/python  # Leave empty for system default

Programmatic Configuration

from praisonaibench_python import PythonEvaluator

# Custom timeout
evaluator = PythonEvaluator(timeout=10)

# Custom Python executable
evaluator = PythonEvaluator(
    timeout=5,
    python_executable="/usr/bin/python3.11"
)

๐Ÿ—๏ธ Architecture

Plugin Structure

praisonaibench-python/
├── src/praisonaibench_python/
│   ├── __init__.py          # Plugin exports
│   ├── evaluator.py         # Main evaluator class
│   └── version.py           # Version info
├── tests/
│   ├── test_evaluator.py    # Unit tests
│   └── test_integration.py  # Integration tests
├── examples/
│   ├── simple_tests.yaml
│   ├── advanced_tests.yaml
│   └── algorithm_tests.yaml
├── pyproject.toml           # Project configuration
├── .env                     # Configuration
└── README.md                # This file

Class Hierarchy

BaseEvaluator (from praisonaibench)
    └── PythonEvaluator
        ├── get_language() → 'python'
        ├── get_file_extension() → 'py'
        └── evaluate(code, test_name, prompt, expected) → dict
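
This interface makes it straightforward to add evaluators for other languages. A hypothetical skeleton, assuming BaseEvaluator is importable from praisonaibench and exposes exactly the methods shown above:

from praisonaibench import BaseEvaluator  # import path assumed from the hierarchy above

class MyEvaluator(BaseEvaluator):
    """Hypothetical evaluator skeleton illustrating the plugin interface."""

    def get_language(self) -> str:
        return "mylang"  # hypothetical language name

    def get_file_extension(self) -> str:
        return "ml"      # hypothetical file extension

    def evaluate(self, code, test_name, prompt, expected=None) -> dict:
        # A real evaluator would run staged checks and build feedback entries.
        score = 100 if code.strip() else 0
        return {"score": score, "passed": score >= 70, "feedback": [], "details": {}}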

๐Ÿค Contributing

Contributions are welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Run tests: pytest tests/ -v
  5. Format code: black src/ tests/
  6. Submit a pull request

Development Setup

# Clone repository
git clone https://github.com/YourUsername/praisonaibench-python
cd praisonaibench-python

# Install in development mode
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/

📄 License

MIT License - see LICENSE file for details.


🎉 Acknowledgements

Built with ❤️ for the PraisonAI Bench community.

Special thanks to:

  • PraisonAI - For the amazing benchmarking framework
  • Contributors and testers
  • The Python community

Ready to benchmark Python code generation? Install the plugin and start testing! 🚀
