
PraisonAI Bench

🚀 A simple, powerful LLM benchmarking tool built with PraisonAI Agents

Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.

🎯 Testing Modes

| Feature | Single Test | Test Suite (YAML) |
| --- | --- | --- |
| 📝 Description | Run one prompt | Run multiple tests from YAML file |
| 🔧 Command | praisonaibench --test "prompt" | praisonaibench --suite tests.yaml |
| 📊 Evaluation | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| 🎨 HTML Extraction | ✅ Auto-extracted | ✅ Auto-extracted |
| 📁 Output | Single JSON result | Batch JSON results |
| 🖼️ Screenshots | ✅ Generated | ✅ Generated |
| ⚡ Console Errors | ✅ Detected | ✅ Detected |
| 🤖 LLM Judge | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| 🔄 Retry Logic | ✅ 3 attempts | ✅ 3 attempts |
| ⚡ Parallel Execution | N/A | ✅ --concurrent N |
| 💰 Cost Tracking | ✅ Token & cost per test | ✅ Cumulative cost summary |
| 📊 Export Formats | JSON | JSON, CSV |
| 📈 HTML Reports | N/A | ✅ --report |
| 🎯 Use Case | Quick testing | Comprehensive benchmarking |

๐Ÿ” What's Included in Evaluation?

Our research-backed hybrid evaluation system provides comprehensive quality assessment:

| Component | What It Does | Score Weight |
| --- | --- | --- |
| 📝 HTML Validation | Static structure validation, DOCTYPE, required tags | 15% |
| 🌐 Functional | Browser rendering, console errors, render time | 40% |
| 🎯 Expected Result | Objective comparison (optional, for factual tasks) | 20%* |
| 🎨 Quality (LLM) | Code quality, completeness, best practices | 25% |
| 📊 Overall | Combined score (0-100) with pass/fail (≥70) | 100% |

*When the expected field is not provided, weights are automatically normalized (HTML: 18.75%, Functional: 50%, LLM: 31.25%)

Example Output (with expected):

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED

Example Output (without expected):

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
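
The overall score is a weighted sum of the component scores, with the weights renormalized over whichever components are present. A minimal sketch of that arithmetic (a hypothetical helper for illustration, not the package's actual API):

# Component weights from the evaluation table above
WEIGHTS = {"html": 0.15, "functional": 0.40, "expected": 0.20, "llm": 0.25}

def overall_score(scores):
    """Weighted average over the components present in `scores`."""
    total = sum(WEIGHTS[k] for k in scores)  # 0.80 when "expected" is absent
    return sum(scores[k] * WEIGHTS[k] / total for k in scores)

# With expected: 90*0.15 + 85*0.40 + 95*0.20 + 80*0.25 = 86.5 (~87)
print(overall_score({"html": 90, "functional": 85, "expected": 95, "llm": 80}))

# Without expected, weights become 18.75% / 50% / 31.25%
print(overall_score({"html": 90, "functional": 85, "llm": 80}))  # 84.375 (~85)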

✨ Key Features

  • 🎯 Any LLM Model - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
  • 🔄 Single Agent Design - Your prompt becomes the instruction (no complex configs)
  • 💾 Auto HTML Extraction - Automatically saves HTML code from responses
  • 📁 Smart Organization - Model-specific output folders (output/gpt-4o/, output/xai/grok-code-fast-1/)
  • 🎛️ Flexible Testing - Run single tests, full suites, or filter specific tests
  • ⚡ Parallel Execution - Run tests concurrently with --concurrent N for faster benchmarking
  • 💰 Cost & Token Tracking - Automatic token usage and cost calculation for all supported models
  • 📊 Multiple Export Formats - Export results as JSON or CSV for easy analysis
  • 📈 HTML Dashboard Reports - Beautiful visual reports with interactive charts using --report
  • 🛠️ Modern Tooling - Built with pyproject.toml and the uv package manager
  • 📋 Comprehensive Results - Complete metrics with timing, success rates, costs, and metadata

🚀 Quick Start

1. Install from PyPI (Recommended)

pip install praisonaibench


2. Install with uv

# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .

3. Alternative: Install with pip (from source)

git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .

4. Set Your API Keys

# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key

5. Run Your First Benchmark

from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file", 
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)

๐Ÿ“ Project Structure

praisonaibench/
├── pyproject.toml           # Modern Python packaging
├── src/praisonaibench/      # Source code
│   ├── __init__.py          # Main imports
│   ├── bench.py             # Core benchmarking engine
│   ├── agent.py             # LLM agent wrapper
│   ├── cli.py               # Command line interface
│   └── version.py           # Version info
├── examples/                # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                  # Generated results
    ├── gpt-4o/              # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json

💻 CLI Usage

Basic Commands

# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229

Test Suites

# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1

# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
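
Conceptually, --concurrent N runs the suite's tests in a worker pool of size N. If you drive the Python API directly, a rough equivalent using the standard library (a sketch only; it assumes a shared Bench instance can safely be called from multiple threads):

from concurrent.futures import ThreadPoolExecutor

from praisonaibench import Bench

bench = Bench()
prompts = ["What is 2+2?", "Write a haiku", "Explain HTTP in one line"]

# Run up to 3 tests at once, mirroring --concurrent 3
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(bench.run_single_test, prompts))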

Cross-Model Comparison

# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1

Extract HTML from Results

# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json

HTML Generation Examples

# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html

Cost & Token Tracking

Automatically track token usage and costs for all LLM API calls:

# Run tests with automatic cost tracking
praisonaibench --suite tests.yaml --model gpt-4o

# Output includes per-test costs:
💰 Cost: $0.002400 (1250 tokens)

# Summary shows total costs:
📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 8.42s

💰 Cost Summary:
   Total tokens: 5,420
   Total cost: $0.0124

   By model:
     gpt-4o: $0.0124 (5,420 tokens)

Accurate pricing data is included for:

  • OpenAI (GPT-4o, GPT-4, GPT-3.5, O1)
  • Anthropic (Claude 3 family)
  • Google (Gemini 1.5 family)
  • XAI (Grok models)
  • Groq (optimised models)

Token usage is extracted from API responses when available, or estimated from text length. Costs are calculated using official provider pricing (updated December 2024).
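
The cost arithmetic itself is simple: tokens multiplied by the per-token rate in each direction. A minimal sketch with illustrative numbers (the prices below are placeholders, not the package's actual pricing table):

# USD per 1M tokens -- placeholder values, check current provider pricing
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def estimate_cost(model, input_tokens, output_tokens):
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

print(f"${estimate_cost('gpt-4o', 1000, 250):.6f}")  # $0.005000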

CSV Export

Export benchmark results to CSV for spreadsheet analysis:

# Export to CSV format
praisonaibench --suite tests.yaml --format csv

# Results saved to: output/csv/benchmark_results_20241211_123456.csv

CSV includes:

  • Test names and status
  • Model information
  • Execution times
  • Token usage (input/output/total)
  • Costs per test
  • Evaluation scores
  • Prompts and response lengths
  • Error messages (if any)

Perfect for:

  • Spreadsheet analysis in Excel/Google Sheets
  • Data visualization tools
  • Statistical analysis
  • Sharing results with non-technical stakeholders
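
Because the export is a flat CSV, it loads directly into pandas for quick analysis (the column names here are assumptions; check the header of your exported file):

import pandas as pd

df = pd.read_csv("output/csv/benchmark_results_20241211_123456.csv")

# Average execution time and cost per model (assumed column names)
print(df.groupby("model")[["execution_time", "cost"]].mean())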

HTML Dashboard Reports

Generate beautiful interactive reports with comprehensive visualizations inspired by the React UI:

# Generate enhanced HTML report after running tests
praisonaibench --suite tests.yaml --report

# Generate report from existing results (without re-running tests)
praisonaibench --report-from output/json/benchmark_results_20241211_123456.json

# Compare multiple test results
praisonaibench --compare result1.json result2.json result3.json

# Reports saved to: output/reports/

Enhanced Report Includes:

📊 Dashboard Tab:

  • Summary cards with key metrics (tests, models, success rate, avg time, cost, tokens)
  • Interactive charts:
    • Status distribution (success/failure)
    • Execution time by model
    • Evaluation scores (radar chart)
    • Errors & warnings

๐Ÿ† Leaderboard Tab:

  • Model rankings with multiple criteria:
    • Overall Score (default)
    • Functional Score
    • Quality Score
    • Pass Rate
    • Speed (fastest first)
  • Top 3 models highlighted with medals
  • Detailed metrics per model (functional, quality, pass rate, time)
  • Click criteria to re-rank dynamically

โš–๏ธ Comparison Tab:

  • Detailed side-by-side model comparison
  • Comprehensive metrics table:
    • Overall score, functional score, quality score
    • Pass rate with color coding
    • Average execution time
    • Total errors and warnings count
  • Full model names and stats

📋 Results Tab:

  • Complete test results table
  • Individual test status, scores, time, tokens, cost
  • Sortable columns
  • Status indicators

Features:

  • 🎨 Modern UI with gradient headers and smooth transitions
  • 📱 Fully responsive design
  • ⚡ Fast and lightweight (no external dependencies)
  • 🔄 Tab navigation for organized viewing
  • 📊 Chart.js powered visualizations
  • 🎯 Based on the praisonaibench-ui React application
  • 💾 Standalone HTML - works offline
  • 📧 Easy to share via email or host on the web

Comparison Reports: Multi-run comparison shows:

  • Side-by-side success rates
  • Performance trends
  • Cost and token usage evolution
  • Model improvements over time

Open any generated HTML file in any modern browser!

📋 Test Suite Format

Basic Test Suite (tests.yaml)

tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"  # Optional: for objective comparison
  
  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task
  
  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"

Using the expected field:

  • ✅ Use for: Factual questions, math problems, code output, deterministic tasks
  • ❌ Skip for: Creative tasks, open-ended questions, visual/interactive content
  • When provided: Adds 20% objective scoring based on similarity
  • When omitted: Weights automatically normalize (no penalty)
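
For intuition, the objective comparison can be as simple as string similarity. A sketch using difflib (the package's exact scoring method may differ):

from difflib import SequenceMatcher

def similarity_score(response, expected):
    """Rough 0-100 similarity between response and expected answer."""
    return SequenceMatcher(None, response.strip(), expected.strip()).ratio() * 100

print(similarity_score("345", "345"))  # 100.0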

Advanced Test Suite with Full Config Support

# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"
  
  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"

Three.js HTML Generation Suite

# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.
    
  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.
      
  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.
      
  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.

🔧 Configuration

Basic Configuration (config.yaml)

# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60

Supported Models

PraisonAI Bench supports any LiteLLM-compatible model:

# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1

# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo  # with OPENAI_API_BASE set

📊 Results & Output

Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json

JSON Results Format

[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
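
Since results are plain JSON, post-processing needs nothing beyond the standard library. For example, recomputing the headline summary from a saved results file:

import json

with open("output/benchmark_results_20250829_160426.json") as f:
    results = json.load(f)

passed = sum(r["status"] == "success" for r in results)
avg_time = sum(r["execution_time"] for r in results) / len(results)
print(f"{passed}/{len(results)} succeeded, avg {avg_time:.2f}s")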

Summary Statistics

📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 12.34s
Results saved to: output/benchmark_results_20250829_160426.json

🎯 Advanced Features

🔄 Universal Model Support

  • Works with any LiteLLM-compatible model
  • No hardcoded model restrictions
  • Automatic API key detection

💾 Smart HTML Handling

  • Auto-detects HTML in multiple formats (see the sketch after this list):
    • Markdown-wrapped HTML (fenced ```html blocks)
    • Truncated HTML blocks (incomplete responses)
    • Raw HTML content (direct HTML responses)
  • Extracts and saves as .html files automatically
  • Organizes by model in separate folders
  • Extract HTML from existing benchmark results with --extract
  • Perfect for Three.js, React, or any web development benchmarks
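
A simplified sketch of what that detection can look like (illustrative only; the real extractor also handles truncated and partial blocks):

import re

def extract_html(response):
    # Case 1: markdown-fenced HTML block
    match = re.search(r"```html\s*(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Case 2: raw HTML response
    if response.lstrip().lower().startswith("<!doctype html"):
        return response.strip()
    return None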

๐ŸŽ›๏ธ Flexible Test Management

  • Run entire suites or filter specific tests
  • Override models per test or globally
  • Cross-model comparisons with detailed metrics

⚡ Modern Development

  • Built with pyproject.toml (no legacy setup.py)
  • Optimized for uv package manager
  • Fast dependency resolution and installation

๐Ÿ—๏ธ Simple Architecture

  • Single Agent Design - Your prompt becomes the instruction
  • No Complex Configs - Just write your test prompts
  • Minimal Dependencies - Only what you need

🚀 Use Cases

Web Development Benchmarking

# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o

Code Generation Comparison

# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1

Creative Content Testing

# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Install dependencies: uv sync
  4. Make your changes
  5. Run tests: uv run pytest
  6. Submit a pull request

📄 License

MIT License - see LICENSE file for details.


Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity! 🚀
