# PraisonAI Bench

🚀 A simple, powerful LLM benchmarking tool built with PraisonAI Agents
Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.
## 🎯 Testing Modes

| Feature | Single Test | Test Suite (YAML) |
|---|---|---|
| 📝 Description | Run one prompt | Run multiple tests from a YAML file |
| 🔧 Command | `praisonaibench --test "prompt"` | `praisonaibench --suite tests.yaml` |
| 📊 Evaluation | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| 🎨 HTML Extraction | ✅ Auto-extracted | ✅ Auto-extracted |
| 📄 Output | Single JSON result | Batch JSON results |
| 🖼️ Screenshots | ✅ Generated | ✅ Generated |
| ⚡ Console Errors | ✅ Detected | ✅ Detected |
| 🤖 LLM Judge | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| 🔄 Retry Logic | ✅ 3 attempts | ✅ 3 attempts |
| ⚡ Parallel Execution | N/A | ✅ `--concurrent N` |
| 💰 Cost Tracking | ✅ Token & cost per test | ✅ Cumulative cost summary |
| 📊 Export Formats | JSON | JSON, CSV |
| 📈 HTML Reports | N/A | ✅ `--report` |
| 🎯 Use Case | Quick testing | Comprehensive benchmarking |
## 📊 What's Included in Evaluation?

Our research-backed hybrid evaluation system provides comprehensive quality assessment:

| Component | What It Does | Score Weight |
|---|---|---|
| 📋 HTML Validation | Static structure validation, DOCTYPE, required tags | 15% |
| 🌐 Functional | Browser rendering, console errors, render time | 40% |
| 🎯 Expected Result | Objective comparison (optional, for factual tasks) | 20%* |
| 🎨 Quality (LLM) | Code quality, completeness, best practices | 25% |
| 📊 Overall | Combined score (0-100) with pass/fail (≥70) | 100% |

*When the `expected` field is not provided, weights are automatically normalized (HTML: 18.75%, Functional: 50%, LLM: 31.25%).
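For illustration, the renormalization above amounts to dropping the expected-result weight and rescaling the rest so they still sum to 100%. A minimal sketch (illustrative only, not the package's internal code):

```python
# Sketch of the weight renormalization described above (illustrative only,
# not the package's internal implementation).
BASE_WEIGHTS = {"html": 0.15, "functional": 0.40, "expected": 0.20, "llm": 0.25}

def effective_weights(has_expected: bool) -> dict:
    """Drop 'expected' when no expected value is given, then rescale
    the remaining weights so they sum to 1.0."""
    weights = dict(BASE_WEIGHTS)
    if not has_expected:
        weights.pop("expected")
        total = sum(weights.values())  # 0.80
        weights = {k: v / total for k, v in weights.items()}
    return weights

print(effective_weights(False))
# {'html': 0.1875, 'functional': 0.5, 'llm': 0.3125} -- i.e. 18.75% / 50% / 31.25%
```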
**Example Output (with expected):**

```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED
```
**Example Output (without expected):**

```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
```
## ✨ Key Features

- 🎯 **Any LLM Model** - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
- 🤖 **Single Agent Design** - Your prompt becomes the instruction (no complex configs)
- 💾 **Auto HTML Extraction** - Automatically saves HTML code from responses
- 📁 **Smart Organization** - Model-specific output folders (`output/gpt-4o/`, `output/xai/grok-code-fast-1/`)
- 🗂️ **Flexible Testing** - Run single tests, full suites, or filter specific tests
- ⚡ **Parallel Execution** - Run tests concurrently with `--concurrent N` for faster benchmarking
- 💰 **Cost & Token Tracking** - Automatic token usage and cost calculation for all supported models
- 📊 **Multiple Export Formats** - Export results as JSON or CSV for easy analysis
- 📈 **HTML Dashboard Reports** - Beautiful visual reports with interactive charts using `--report`
- 🛠️ **Modern Tooling** - Built with `pyproject.toml` and the `uv` package manager
- 📋 **Comprehensive Results** - Complete metrics with timing, success rates, costs, and metadata
- 🔌 **Plugin System** - Extensible evaluators for any language (Python, TypeScript, Go, etc.) via plugins
## 🚀 Quick Start

### 1. Install from PyPI (Recommended)

```bash
pip install praisonaibench
```

### 2. Install with uv

```bash
# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .
```

### 3. Alternative: Install with pip (from source)

```bash
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .
```

### 4. Set Your API Keys

```bash
# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key
```

### 5. Run Your First Benchmark

```python
from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file",
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)
```
## 📁 Project Structure

```
praisonaibench/
├── pyproject.toml              # Modern Python packaging
├── src/praisonaibench/         # Source code
│   ├── __init__.py             # Main imports
│   ├── bench.py                # Core benchmarking engine
│   ├── agent.py                # LLM agent wrapper
│   ├── cli.py                  # Command line interface
│   └── version.py              # Version info
├── examples/                   # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                     # Generated results
    ├── gpt-4o/                 # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json
```
## 💻 CLI Usage

### Basic Commands

```bash
# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229
```

### Test Suites

```bash
# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1

# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
```
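Conceptually, `--concurrent N` amounts to running tests in a small worker pool. A rough sketch using the Python API from Quick Start (this assumes `run_single_test` is safe to call from worker threads; it is not the CLI's actual implementation):

```python
# Rough sketch of concurrent execution using the documented Python API.
# Assumes bench.run_single_test() can be called from worker threads.
from concurrent.futures import ThreadPoolExecutor

from praisonaibench import Bench

bench = Bench()
prompts = ["What is 2+2?", "Write a haiku", "Explain DNS in one paragraph"]

with ThreadPoolExecutor(max_workers=3) as pool:  # like --concurrent 3
    results = list(pool.map(bench.run_single_test, prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result['status'])
```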
### Cross-Model Comparison

```bash
# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1
```
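The same comparison can be scripted with the Python API shown in Quick Start. A sketch (the result fields follow the JSON results format documented below):

```python
# Scripting a cross-model comparison with the documented Python API.
from praisonaibench import Bench

bench = Bench()
models = ["gpt-4o", "gpt-3.5-turbo", "xai/grok-code-fast-1"]

for model in models:
    result = bench.run_single_test("Write a poem", model=model)
    print(f"{model}: {result['status']} in {result['execution_time']:.2f}s")
```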
### Extract HTML from Results

```bash
# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes the JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json
```
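Under the hood, extraction amounts to scanning each result's `response` for HTML and writing it to a model-specific folder. A simplified sketch, assuming HTML arrives in fenced ```` ```html ```` blocks (the `extract_html` helper below is hypothetical, not the CLI's code):

```python
# Simplified sketch of what --extract does; not the CLI's actual code.
import json
import re
from pathlib import Path

def extract_html(results_path: str, out_dir: str = "output") -> None:
    results = json.loads(Path(results_path).read_text())
    for entry in results:
        response = entry["response"]
        match = re.search(r"```html\s*(.*?)```", response, re.DOTALL)
        body = match.group(1) if match else None
        if body is None and response.lstrip().startswith("<!DOCTYPE"):
            body = response  # raw HTML response without a code fence
        if body:
            target = Path(out_dir) / entry["model"] / f"{entry['test_name']}.html"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(body)

extract_html("output/benchmark_results_20250829_160426.json")
```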
### HTML Generation Examples

```bash
# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html
```
### Cost & Token Tracking

Automatically track token usage and costs for all LLM API calls:

```bash
# Run tests with automatic cost tracking
praisonaibench --suite tests.yaml --model gpt-4o
```

Output includes per-test costs:

```
💰 Cost: $0.002400 (1250 tokens)
```

The summary shows total costs:

```
📊 Summary:
  Total tests: 4
  Success rate: 100.0%
  Average time: 8.42s

💰 Cost Summary:
  Total tokens: 5,420
  Total cost: $0.0124
  By model:
    gpt-4o: $0.0124 (5,420 tokens)
```

Supported models include accurate pricing for:

- OpenAI (GPT-4o, GPT-4, GPT-3.5, O1)
- Anthropic (Claude 3 family)
- Google (Gemini 1.5 family)
- XAI (Grok models)
- Groq (optimised models)

Token usage is extracted from API responses when available, or estimated from text length. Costs are calculated using official provider pricing (updated December 2024).
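The cost arithmetic itself is straightforward: token counts divided by 1,000, times a per-1K price for each direction. A sketch with placeholder prices (the package's actual pricing table is not reproduced here):

```python
# Illustrative cost arithmetic; the per-1K-token prices below are
# placeholders, not the package's pricing table.
PRICES_PER_1K = {"gpt-4o": {"input": 0.0025, "output": 0.0100}}  # USD, hypothetical

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

print(f"${estimate_cost('gpt-4o', 1000, 250):.6f}")  # -> $0.005000
```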
### CSV Export

Export benchmark results to CSV for spreadsheet analysis:

```bash
# Export to CSV format
praisonaibench --suite tests.yaml --format csv
# Results saved to: output/csv/benchmark_results_20241211_123456.csv
```

The CSV includes:

- Test names and status
- Model information
- Execution times
- Token usage (input/output/total)
- Costs per test
- Evaluation scores
- Prompts and response lengths
- Error messages (if any)

Perfect for:

- Spreadsheet analysis in Excel/Google Sheets
- Data visualization tools
- Statistical analysis
- Sharing results with non-technical stakeholders
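For example, a few lines of pandas gets you per-model averages. The column names below (`model`, `execution_time`, `cost`) are assumptions based on the field list above, not a documented schema:

```python
# Quick analysis of an exported CSV. Column names ("model",
# "execution_time", "cost") are assumptions, not a documented schema.
import pandas as pd

df = pd.read_csv("output/csv/benchmark_results_20241211_123456.csv")
print(df.groupby("model")[["execution_time", "cost"]].mean())
```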
### HTML Dashboard Reports

Generate beautiful interactive reports with comprehensive visualizations inspired by the React UI:

```bash
# Generate enhanced HTML report after running tests
praisonaibench --suite tests.yaml --report

# Generate report from existing results (without re-running tests)
praisonaibench --report-from output/json/benchmark_results_20241211_123456.json

# Compare multiple test results
praisonaibench --compare result1.json result2.json result3.json

# Reports saved to: output/reports/
```

**The enhanced report includes:**

**📊 Dashboard Tab:**
- Summary cards with key metrics (tests, models, success rate, average time, cost, tokens)
- Interactive charts:
  - Status distribution (success/failure)
  - Execution time by model
  - Evaluation scores (radar chart)
  - Errors & warnings

**🏆 Leaderboard Tab:**
- Model rankings with multiple criteria:
  - Overall Score (default)
  - Functional Score
  - Quality Score
  - Pass Rate
  - Speed (fastest first)
- Top 3 models highlighted with medals
- Detailed metrics per model (functional, quality, pass rate, time)
- Click a criterion to re-rank dynamically

**⚖️ Comparison Tab:**
- Detailed side-by-side model comparison
- Comprehensive metrics table:
  - Overall score, functional score, quality score
  - Pass rate with color coding
  - Average execution time
  - Total errors and warnings count
- Full model names and stats

**📋 Results Tab:**
- Complete test results table
- Individual test status, scores, time, tokens, cost
- Sortable columns
- Status indicators

**Features:**

- 🎨 Modern UI with gradient headers and smooth transitions
- 📱 Fully responsive design
- ⚡ Fast and lightweight (no external dependencies)
- 📑 Tab navigation for organized viewing
- 📊 Chart.js-powered visualizations
- 🎯 Based on the praisonaibench-ui React application
- 💾 Standalone HTML - works offline
- 📧 Easy to share via email or host on the web

**Comparison Reports:** Multi-run comparison shows:

- Side-by-side success rates
- Performance trends
- Cost and token usage evolution
- Model improvements over time

Open any generated HTML file in any modern browser!
## 📝 Test Suite Format

### Basic Test Suite (tests.yaml)

```yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"        # Optional: for objective comparison

  - name: "python_test"
    language: "python"     # Use plugin evaluator
    prompt: "Write Python factorial function"
    expected: "120"

  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task

  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```

**Using the `expected` field:**

- ✅ Use for: factual questions, math problems, code output, deterministic tasks
- ❌ Skip for: creative tasks, open-ended questions, visual/interactive content
- When provided: adds 20% objective scoring based on similarity (one possible method is sketched below)
- When omitted: weights automatically normalize (no penalty)
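One plausible way such a similarity score could be computed is plain string matching, e.g. with `difflib`; the package does not document its exact method, so treat this as a sketch:

```python
# Sketch of an expected-answer similarity score (0-100). The package
# does not document its exact method; difflib is one plausible choice.
from difflib import SequenceMatcher

def similarity_score(response: str, expected: str) -> float:
    return SequenceMatcher(None, response.strip(), expected.strip()).ratio() * 100

print(round(similarity_score("345", "345")))  # 100
```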
### Advanced Test Suite with Full Config Support

```yaml
# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"

  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"
```
### Three.js HTML Generation Suite

```yaml
# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.

  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.

  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.

  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.
```
## 🔧 Configuration

### Basic Configuration (config.yaml)

```yaml
# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60
```
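For reference, a `max_retries: 3` policy like the one above boils down to something like the following (an illustrative sketch, not the package's internal code):

```python
# Illustrative sketch of a max_retries-style policy; not the
# package's internal code.
import time

def with_retries(call, max_retries: int = 3, delay: float = 1.0):
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except Exception as err:         # broad on purpose for the sketch
            last_error = err
            time.sleep(delay * attempt)  # simple linear backoff
    raise last_error
```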
### Supported Models

PraisonAI Bench supports any LiteLLM-compatible model:

```yaml
# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1

# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo   # with OPENAI_API_BASE set
```
## 📊 Results & Output

### Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

```
output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
```
### JSON Results Format

```json
[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
```
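Because the format is plain JSON, results are easy to post-process; for example, the summary numbers shown below can be recomputed directly from the documented fields:

```python
# Recompute summary statistics from the documented JSON result fields.
import json

with open("output/benchmark_results_20250829_160426.json") as f:
    results = json.load(f)

total = len(results)
successes = sum(r["status"] == "success" for r in results)
avg_time = sum(r["execution_time"] for r in results) / total

print(f"Total tests: {total}")
print(f"Success rate: {successes / total:.1%}")
print(f"Average time: {avg_time:.2f}s")
```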
### Summary Statistics

```
📊 Summary:
  Total tests: 4
  Success rate: 100.0%
  Average time: 12.34s
  Results saved to: output/benchmark_results_20250829_160426.json
```
## 🎯 Advanced Features

### 🌐 Universal Model Support

- Works with any LiteLLM-compatible model
- No hardcoded model restrictions
- Automatic API key detection

### 💾 Smart HTML Handling

- Auto-detects HTML in multiple formats:
  - Markdown-wrapped HTML (```` ```html ... ``` ```` blocks)
  - Truncated HTML blocks (incomplete responses)
  - Raw HTML content (direct HTML responses)
- Extracts and saves as `.html` files automatically
- Organizes by model in separate folders
- Extract HTML from existing benchmark results with `--extract`
- Perfect for Three.js, React, or any web development benchmarks

### 🗂️ Flexible Test Management

- Run entire suites or filter specific tests
- Override models per test or globally
- Cross-model comparisons with detailed metrics

### ⚡ Modern Development

- Built with `pyproject.toml` (no legacy `setup.py`)
- Optimized for the `uv` package manager
- Fast dependency resolution and installation

### 🏗️ Simple Architecture

- **Single Agent Design** - Your prompt becomes the instruction
- **No Complex Configs** - Just write your test prompts
- **Minimal Dependencies** - Only what you need
## 🔌 Plugin System

Extensible evaluators for any language or task - create a plugin in one file.

### Create Plugin (One File)

```python
from praisonaibench import BaseEvaluator

class MyEvaluator(BaseEvaluator):
    def get_language(self):
        return 'mylang'  # e.g., 'python', 'typescript', 'go'

    def evaluate(self, code, test_name, prompt, expected=None):
        return {
            'score': 85,       # 0-100
            'passed': True,    # score >= 70
            'feedback': [{'level': 'success', 'message': '✅ Works!'}],
            'details': {}
        }
```

**Setup (`pyproject.toml`):**

```toml
[project]
name = "praisonaibench-mylang"
version = "0.1.0"
dependencies = ["praisonaibench>=0.1.0"]

[project.entry-points."praisonaibench.evaluators"]
mylang = "my_evaluator:MyEvaluator"
```

**Install:** `pip install -e .` or `uv pip install -e .`
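Auto-discovery via entry points plausibly works along these lines (a sketch using `importlib.metadata`; the package's actual plugin loader is not shown here):

```python
# Sketch of entry-point auto-discovery (Python 3.10+ importlib.metadata);
# not the package's actual plugin loader.
from importlib.metadata import entry_points

def discover_evaluators() -> dict:
    evaluators = {}
    for ep in entry_points(group="praisonaibench.evaluators"):
        cls = ep.load()              # e.g. my_evaluator:MyEvaluator
        evaluators[ep.name] = cls()  # keyed by name, e.g. "mylang"
    return evaluators
```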
### Use Plugin

```yaml
# tests.yaml
tests:
  - name: "python_test"
    language: "python"   # Auto-discovered
    prompt: "Write Python hello world"
    expected: "Hello World"
```

**Run:** `praisonaibench --suite tests.yaml`

### Features

- ✅ One file (~50 lines) per plugin
- ✅ Auto-discovery - no config needed
- ✅ Backwards compatible - HTML evaluation unchanged
- ✅ Language detection - auto-detects from code blocks or an explicit `language` field
- ✅ Any task - programming languages, text summarization, translation, etc.

**Example:** `examples/plugins/python_evaluator.py`
## 📚 Use Cases

### Web Development Benchmarking

```bash
# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o
```

### Code Generation Comparison

```bash
# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1
```

### Creative Content Testing

```bash
# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro
```
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Install dependencies: `uv sync`
4. Make your changes
5. Run tests: `uv run pytest`
6. Submit a pull request

## 📄 License

MIT License - see LICENSE file for details.

---

Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity! 🚀