
PraisonAI Bench Python Evaluator Plugin

๐Ÿ A comprehensive Python code evaluation plugin for PraisonAI Bench

Python 3.8+ | License: MIT | Code style: black

🎯 Overview

The Python Evaluator Plugin enables PraisonAI Bench to evaluate Python code through comprehensive multi-stage assessment:

  • ✅ Syntax Validation (30 points) - AST-based Python syntax checking (sketched below)
  • ✅ Code Execution (40 points) - Safe subprocess execution with timeout protection
  • ✅ Output Comparison (30 points) - Fuzzy matching against expected results

Total Score: 0-100 | Pass Threshold: ≥70
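
As an illustration of the first stage, here is a minimal sketch of AST-based syntax checking (a simplification, not the plugin's actual implementation):

import ast

def check_syntax(code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

assert check_syntax('print("Hello World")')
assert not check_syntax('print(')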

🚀 Quick Start

Installation

Using uv (Recommended)

# Clone or download the plugin
cd praisonaibench-python

# Install with uv
uv pip install -e .

Using pip

# Install from directory
cd praisonaibench-python
pip install -e .

Verify Installation

# Check that the plugin is registered
python -c "from praisonaibench_python import PythonEvaluator; print('Plugin loaded successfully!')"

Configuration

Create a .env file (or copy from .env.example):

# OpenAI API Key for LLM-based benchmarking
OPENAI_API_KEY=your_api_key_here

# Default model
DEFAULT_MODEL=gpt-4o-mini

# Execution timeout (seconds)
PYTHON_EXECUTION_TIMEOUT=5
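
These are plain environment variables; a hedged sketch of how a runner might read them (assuming python-dotenv loads the .env file, which is not necessarily what PraisonAI Bench itself uses):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # copy .env entries into the process environment
timeout = int(os.environ.get("PYTHON_EXECUTION_TIMEOUT", "5"))
model = os.environ.get("DEFAULT_MODEL", "gpt-4o-mini")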

Basic Usage

Create a test suite file tests.yaml:

tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"
  
  - name: "calculate_factorial"
    language: "python"
    prompt: "Write a Python function that calculates factorial of 5"
    expected: "120"

Run the benchmarks:

praisonaibench --suite tests.yaml --model gpt-4o-mini

📊 Evaluation System

Scoring Breakdown

The evaluator uses a three-stage assessment system:

Stage               Points   Description
Syntax Validation   30       AST parsing, import detection
Code Execution      40       Safe subprocess execution, error capture
Output Comparison   30       Fuzzy matching with expected output
Total               100      Combined score

Pass Threshold: 70/100 points
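
The total is simply the sum of the three stage scores, compared against the threshold:

# Illustrative only: stage scores as reported in the result's score_breakdown
score_breakdown = {"syntax": 30, "execution": 40, "output_match": 30}
total = sum(score_breakdown.values())  # 100
passed = total >= 70                   # True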

Scoring Examples

Example 1: Perfect Score (100/100)

# Code: print("Hello World")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
✅ Output: 30 points (exact match)
─────────────────────────
Total: 100/100 ✅ PASSED

Example 2: Partial Score (70/100)

# Code: print("Hello")
# Expected: "Hello World"

✅ Syntax: 30 points (valid Python)
✅ Execution: 40 points (runs successfully)
⚠️ Output: 0 points (different output)
─────────────────────────
Total: 70/100 ✅ PASSED

Example 3: Failure (30/100)

# Code: print(undefined_variable)
# Expected: "Hello World"

✅ Syntax: 30 points (valid syntax)
❌ Execution: 0 points (NameError)
❌ Output: 0 points (didn't execute)
─────────────────────────
Total: 30/100 ❌ FAILED

📖 Usage Guide

Python API

from praisonaibench_python import PythonEvaluator

# Create evaluator
evaluator = PythonEvaluator(timeout=5)

# Evaluate code
result = evaluator.evaluate(
    code='print("Hello World")',
    test_name="hello_test",
    prompt="Write Python code that prints Hello World",
    expected="Hello World"
)

# Check results
print(f"Score: {result['score']}/100")
print(f"Passed: {result['passed']}")

# View feedback
for item in result['feedback']:
    print(f"{item['level']}: {item['message']}")

# Access details
print(f"Output: {result['details']['output']}")
print(f"Score breakdown: {result['details']['score_breakdown']}")

Test Suite Format

Simple Test

tests:
  - name: "basic_math"
    language: "python"
    prompt: "Calculate 15 * 23 and print the result"
    expected: "345"

Advanced Test

tests:
  - name: "fibonacci"
    language: "python"
    prompt: |
      Write a Python function that calculates the nth Fibonacci number.
      Calculate and print the 10th Fibonacci number.
    expected: "55"

Test Without Expected Output

tests:
  - name: "creative_code"
    language: "python"
    prompt: "Write a Python class for a simple calculator"
    # No expected field - evaluation based on syntax and execution only

Command Line Interface

# Run single test suite
praisonaibench --suite examples/simple_tests.yaml --model gpt-4o-mini

# Run with specific model
praisonaibench --suite examples/advanced_tests.yaml --model gpt-4o

# Run with custom configuration
praisonaibench --suite tests.yaml --config custom_config.yaml

🎨 Features

Security Features

  • ✅ Subprocess Isolation - Code runs in a separate process (see the sketch after this list)
  • ✅ Timeout Protection - Configurable execution timeout (default: 5s)
  • ✅ Resource Limits - Prevents infinite loops and resource exhaustion
  • ✅ Error Handling - Graceful handling of all error types
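
To see how subprocess isolation and timeout protection fit together, here is a minimal sketch (illustrative; the plugin's internals may differ):

import os
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout: float = 5.0):
    """Execute code in a separate Python process with a hard timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", f"Timed out after {timeout}s"
    finally:
        os.unlink(path)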

Code Extraction

Automatically extracts code from various formats:

# Supports markdown code blocks
"""
```python
print("Hello")
```
"""

# Supports generic code blocks
"""
```
print("Hello")
```
"""

# Supports raw code
"print('Hello')"

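A simple regex is enough to handle all three formats; a hedged sketch (the plugin's actual extractor may be more robust):

import re

def extract_code(response: str) -> str:
    """Prefer a fenced code block; otherwise treat the whole response as raw code."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response.strip()
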
Output Comparison

Smart fuzzy matching algorithm (approximated in the sketch after the feature list):

  • Exact match: 30/30 points
  • High similarity (>80%): 25-29 points
  • Medium similarity (50-80%): 15-24 points
  • Low similarity (<50%): 0-14 points

Features:

  • Case-insensitive comparison
  • Whitespace normalisation
  • Substring matching (e.g., "345" in "The answer is 345")

Detailed Feedback

A full evaluation result looks like this:

{
  "score": 85,
  "passed": True,
  "feedback": [
    {"level": "success", "message": "✅ Valid Python syntax"},
    {"level": "info", "message": "📦 Imports: math, sys"},
    {"level": "success", "message": "✅ Code executed successfully"},
    {"level": "info", "message": "📤 Output: Hello World"},
    {"level": "warning", "message": "⚠️ Output partially matches expected"}
  ],
  "details": {
    "extracted_code": "print('Hello World')",
    "executed": True,
    "output": "Hello World",
    "similarity": 0.95,
    "score_breakdown": {
      "syntax": 30,
      "execution": 40,
      "output_match": 28
    }
  }
}

📚 Examples

Example 1: Hello World

tests:
  - name: "hello_world"
    language: "python"
    prompt: "Write Python code that prints 'Hello World'"
    expected: "Hello World"

Example 2: Factorial Function

tests:
  - name: "factorial"
    language: "python"
    prompt: |
      Write a Python function that calculates the factorial of a number.
      Calculate factorial(5) and print the result.
    expected: "120"

Example 3: List Operations

tests:
  - name: "list_sum"
    language: "python"
    prompt: |
      Create a list [1, 2, 3, 4, 5], calculate the sum, and print it.
    expected: "15"

More examples available in:

  • examples/simple_tests.yaml - Basic Python tests
  • examples/advanced_tests.yaml - Complex Python challenges
  • examples/algorithm_tests.yaml - Algorithm implementations

🧪 Testing

Run Unit Tests

# Install development dependencies
uv pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_evaluator.py -v

# Run with coverage
pytest tests/ --cov=praisonaibench_python --cov-report=html

Test Coverage

The plugin includes comprehensive tests:

  • ✅ Unit Tests (tests/test_evaluator.py)

    • Code extraction
    • Syntax validation
    • Code execution
    • Output comparison
    • Error handling
    • Timeout protection
  • ✅ Integration Tests (tests/test_integration.py)

    • Plugin interface compatibility
    • Multiple test scenarios
    • Concurrent evaluations
    • Large output handling
    • Import support

🔧 Configuration

Environment Variables

# Required
OPENAI_API_KEY=your_api_key_here

# Optional
DEFAULT_MODEL=gpt-4o-mini
PYTHON_EXECUTION_TIMEOUT=5
PYTHON_EXECUTABLE=/path/to/python  # Leave empty for system default

Programmatic Configuration

from praisonaibench_python import PythonEvaluator

# Custom timeout
evaluator = PythonEvaluator(timeout=10)

# Custom Python executable
evaluator = PythonEvaluator(
    timeout=5,
    python_executable="/usr/bin/python3.11"
)

๐Ÿ—๏ธ Architecture

Plugin Structure

praisonaibench-python/
├── src/praisonaibench_python/
│   ├── __init__.py          # Plugin exports
│   ├── evaluator.py         # Main evaluator class
│   └── version.py           # Version info
├── tests/
│   ├── test_evaluator.py    # Unit tests
│   └── test_integration.py  # Integration tests
├── examples/
│   ├── simple_tests.yaml
│   ├── advanced_tests.yaml
│   └── algorithm_tests.yaml
├── pyproject.toml           # Project configuration
├── .env                     # Configuration
└── README.md                # This file

Class Hierarchy

BaseEvaluator (from praisonaibench)
    └── PythonEvaluator
        ├── get_language() → 'python'
        ├── get_file_extension() → 'py'
        └── evaluate(code, test_name, prompt, expected) → dict
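
This interface makes it straightforward to add evaluators for other languages. A hypothetical skeleton, assuming BaseEvaluator is importable from praisonaibench and exposes exactly the methods shown above:

from praisonaibench import BaseEvaluator  # import path assumed from the hierarchy above

class MyEvaluator(BaseEvaluator):
    """Hypothetical evaluator skeleton illustrating the plugin interface."""

    def get_language(self) -> str:
        return "mylang"  # hypothetical language name

    def get_file_extension(self) -> str:
        return "ml"      # hypothetical file extension

    def evaluate(self, code, test_name, prompt, expected=None) -> dict:
        # A real evaluator would run staged checks and build feedback entries.
        score = 100 if code.strip() else 0
        return {"score": score, "passed": score >= 70, "feedback": [], "details": {}}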

๐Ÿค Contributing

Contributions are welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes
  4. Run tests: pytest tests/ -v
  5. Format code: black src/ tests/
  6. Submit a pull request

Development Setup

# Clone repository
git clone https://github.com/YourUsername/praisonaibench-python
cd praisonaibench-python

# Install in development mode
uv pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black src/ tests/

📄 License

MIT License - see LICENSE file for details.


🎉 Acknowledgements

Built with ❤️ for the PraisonAI Bench community.

Special thanks to:

  • PraisonAI - For the amazing benchmarking framework
  • Contributors and testers
  • The Python community

Ready to benchmark Python code generation? Install the plugin and start testing! 🚀
