
Simple LLM Benchmarking Tool using PraisonAI Agents


PraisonAI Bench

🚀 A simple, powerful LLM benchmarking tool built with PraisonAI Agents

Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.

🎯 Testing Modes

| Feature | Single Test | Test Suite (YAML) |
|---------|-------------|-------------------|
| 📝 Description | Run one prompt | Run multiple tests from a YAML file |
| 🔧 Command | praisonaibench --test "prompt" | praisonaibench --suite tests.yaml |
| 📊 Evaluation | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| 🎨 HTML Extraction | ✅ Auto-extracted | ✅ Auto-extracted |
| 📁 Output | Single JSON result | Batch JSON results |
| 🖼️ Screenshots | ✅ Generated | ✅ Generated |
| ⚡ Console Errors | ✅ Detected | ✅ Detected |
| 🤖 LLM Judge | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| 🔄 Retry Logic | ✅ 3 attempts | ✅ 3 attempts |
| 📈 Use Case | Quick testing | Comprehensive benchmarking |

๐Ÿ” What's Included in Evaluation?

Our research-backed hybrid evaluation system provides comprehensive quality assessment:

Component What It Does Score Weight
๐Ÿ“ HTML Validation Static structure validation, DOCTYPE, required tags 15%
๐ŸŒ Functional Browser rendering, console errors, render time 40%
๐ŸŽฏ Expected Result Objective comparison (optional, for factual tasks) 20%*
๐ŸŽจ Quality (LLM) Code quality, completeness, best practices 25%
๐Ÿ“Š Overall Combined score (0-100) with pass/fail (โ‰ฅ70) 100%

*When the expected field is not provided, the weights are automatically normalized (HTML: 18.75%, Functional: 50%, LLM: 31.25%)

Example Output (with expected):

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED

Example Output (without expected):

HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
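
The weight normalization behind these scores is simple rescaling: drop the missing component and divide the remaining weights by their sum. A minimal Python sketch (the function name is illustrative, not part of the package API):

```python
def normalize_weights(weights, omit=()):
    """Drop omitted components and rescale the rest so they sum to 1."""
    kept = {name: w for name, w in weights.items() if name not in omit}
    total = sum(kept.values())
    return {name: w / total for name, w in kept.items()}

# Base weights from the evaluation table: HTML 15%, Functional 40%,
# Expected 20%, LLM Quality 25%.
base = {"html": 0.15, "functional": 0.40, "expected": 0.20, "llm": 0.25}

# Without an expected field, the remaining 80% rescales to 100%.
print({k: round(v, 4) for k, v in normalize_weights(base, omit=("expected",)).items()})
# → {'html': 0.1875, 'functional': 0.5, 'llm': 0.3125}
```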

✨ Key Features

  • 🎯 Any LLM Model - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
  • 🔄 Single Agent Design - Your prompt becomes the instruction (no complex configs)
  • 💾 Auto HTML Extraction - Automatically saves HTML code from responses
  • 📁 Smart Organization - Model-specific output folders (output/gpt-4o/, output/xai/grok-code-fast-1/)
  • 🎛️ Flexible Testing - Run single tests, full suites, or filter specific tests
  • ⚡ Modern Tooling - Built with pyproject.toml and the uv package manager
  • 📊 Comprehensive Results - JSON metrics with timing, success rates, and metadata
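
The model-specific folder layout follows directly from the model identifier: LiteLLM-style names use / as a provider separator, so each segment becomes a nested directory. A sketch of that mapping (the helper name is illustrative):

```python
from pathlib import Path

def output_dir_for(model: str, root: str = "output") -> Path:
    """Map a model identifier onto its output folder.

    Plain names stay flat; provider-prefixed names such as
    "xai/grok-code-fast-1" become nested provider/model folders.
    """
    return Path(root).joinpath(*model.split("/"))

print(output_dir_for("gpt-4o").as_posix())                # output/gpt-4o
print(output_dir_for("xai/grok-code-fast-1").as_posix())  # output/xai/grok-code-fast-1
```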

🚀 Quick Start

1. Install from PyPI (Recommended)

pip install praisonaibench


2. Install with uv

# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .

3. Alternative: Install with pip (from source)

git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .

4. Set Your API Keys

# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key

5. Run Your First Benchmark

from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file", 
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)

๐Ÿ“ Project Structure

praisonaibench/
โ”œโ”€โ”€ pyproject.toml           # Modern Python packaging
โ”œโ”€โ”€ src/praisonaibench/      # Source code
โ”‚   โ”œโ”€โ”€ __init__.py          # Main imports
โ”‚   โ”œโ”€โ”€ bench.py             # Core benchmarking engine
โ”‚   โ”œโ”€โ”€ agent.py             # LLM agent wrapper
โ”‚   โ”œโ”€โ”€ cli.py               # Command line interface
โ”‚   โ””โ”€โ”€ version.py           # Version info
โ”œโ”€โ”€ examples/                # Example configurations
โ”‚   โ”œโ”€โ”€ threejs_simulation_suite.yaml
โ”‚   โ””โ”€โ”€ config_example.yaml
โ””โ”€โ”€ output/                  # Generated results
    โ”œโ”€โ”€ gpt-4o/             # Model-specific HTML files
    โ”œโ”€โ”€ xai/grok-code-fast-1/
    โ””โ”€โ”€ benchmark_results_*.json

💻 CLI Usage

Basic Commands

# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229

Test Suites

# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1

Cross-Model Comparison

# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1

Extract HTML from Results

# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes the JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json

HTML Generation Examples

# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html

📋 Test Suite Format

Basic Test Suite (tests.yaml)

tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"  # Optional: for objective comparison
  
  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task
  
  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"

Using the expected field:

  • ✅ Use for: Factual questions, math problems, code output, deterministic tasks
  • ❌ Skip for: Creative tasks, open-ended questions, visual/interactive content
  • When provided: Adds 20% objective scoring based on similarity
  • When omitted: Weights automatically normalize (no penalty)
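
The README does not specify which similarity measure backs the objective score; purely as an illustration, a ratio such as difflib's SequenceMatcher shows how an expected comparison could be turned into a 0-100 score:

```python
from difflib import SequenceMatcher

def expected_score(response: str, expected: str) -> float:
    """Score how closely a response matches the expected answer (0-100)."""
    return SequenceMatcher(None, response.strip(), expected.strip()).ratio() * 100

print(round(expected_score("345", "345")))  # exact match scores 100
print(expected_score("The answer is 345", "345") > 0)  # partial credit for a near miss
```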

Advanced Test Suite with Full Config Support

# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"
  
  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"

Three.js HTML Generation Suite

# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.
    
  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.
      
  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.
      
  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.

🔧 Configuration

Basic Configuration (config.yaml)

# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60
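
The max_retries setting corresponds to a standard retry loop around each model call. A minimal stdlib sketch of that behaviour (names are illustrative, and a real implementation would catch the provider's specific error types rather than bare Exception):

```python
import time

def run_with_retries(fn, max_retries=3, delay=1.0):
    """Call fn(), retrying on failure up to max_retries attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay * attempt)  # simple linear backoff
    raise last_error

calls = {"count": 0}
def flaky():
    """Fails twice, then succeeds - simulates a transient API error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

print(run_with_retries(flaky, delay=0.01))  # → ok (succeeds on the third attempt)
```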

Supported Models

PraisonAI Bench supports any LiteLLM-compatible model:

# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1

# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo  # with OPENAI_API_BASE set

📊 Results & Output

Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
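
The extraction step amounts to locating HTML in the response text. A simplified sketch covering the fenced-block and raw-HTML cases (the actual tool also recovers truncated blocks):

```python
import re

def extract_html(response: str):
    """Return HTML found in a response, or None if there is none."""
    # Case 1: markdown-fenced HTML block.
    fenced = re.search(r"```html\s*(.*?)```", response, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # Case 2: raw HTML returned directly.
    if response.lstrip().lower().startswith("<!doctype html"):
        return response.strip()
    return None

reply = "Here is the file:\n```html\n<!DOCTYPE html><html></html>\n```"
print(extract_html(reply))  # → <!DOCTYPE html><html></html>
```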

JSON Results Format

[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
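
The summary statistics can be recomputed from any results file; a sketch assuming only the record fields shown above (the function name is illustrative):

```python
def summarize(results):
    """Aggregate per-test records from a benchmark results list."""
    total = len(results)
    successes = sum(1 for r in results if r["status"] == "success")
    avg_time = sum(r["execution_time"] for r in results) / total
    return {
        "total_tests": total,
        "success_rate": 100.0 * successes / total,
        "average_time": round(avg_time, 2),
    }

results = [
    {"test_name": "rotating_cube_simulation", "status": "success", "execution_time": 8.24},
    {"test_name": "particle_system", "status": "success", "execution_time": 16.44},
]
print(summarize(results))
# → {'total_tests': 2, 'success_rate': 100.0, 'average_time': 12.34}
```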

Summary Statistics

📊 Summary:
   Total tests: 4
   Success rate: 100.0%
   Average time: 12.34s
Results saved to: output/benchmark_results_20250829_160426.json

🎯 Advanced Features

🔄 Universal Model Support

  • Works with any LiteLLM-compatible model
  • No hardcoded model restrictions
  • Automatic API key detection

💾 Smart HTML Handling

  • Auto-detects HTML in multiple formats:
    • Markdown-wrapped HTML (```html ... ``` code fences)
    • Truncated HTML blocks (incomplete responses)
    • Raw HTML content (direct HTML responses)
  • Extracts and saves as .html files automatically
  • Organizes by model in separate folders
  • Extract HTML from existing benchmark results with --extract
  • Perfect for Three.js, React, or any web development benchmarks

๐ŸŽ›๏ธ Flexible Test Management

  • Run entire suites or filter specific tests
  • Override models per test or globally
  • Cross-model comparisons with detailed metrics

⚡ Modern Development

  • Built with pyproject.toml (no legacy setup.py)
  • Optimized for uv package manager
  • Fast dependency resolution and installation

๐Ÿ—๏ธ Simple Architecture

  • Single Agent Design - Your prompt becomes the instruction
  • No Complex Configs - Just write your test prompts
  • Minimal Dependencies - Only what you need

🚀 Use Cases

Web Development Benchmarking

# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o

Code Generation Comparison

# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1

Creative Content Testing

# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Install dependencies: uv sync
  4. Make your changes
  5. Run tests: uv run pytest
  6. Submit a pull request

📄 License

MIT License - see LICENSE file for details.


Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity! 🚀

Download files

Source Distribution

praisonaibench-0.0.7.tar.gz (23.9 kB)

Built Distribution

praisonaibench-0.0.7-py3-none-any.whl (24.1 kB)

File details

Details for the file praisonaibench-0.0.7.tar.gz.

File metadata

  • Download URL: praisonaibench-0.0.7.tar.gz
  • Size: 23.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.22

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | be2ad981dd9f6ad651c179cd500c9a248f56a914baa0fdbe6842ab908f37f560 |
| MD5 | 594291349722d93af4686f801bae811b |
| BLAKE2b-256 | c06a08776ccf54e77785fec91d77752fbfbcde97a4e3f7fb1fe94fd4a9b06a2e |

File details

Details for the file praisonaibench-0.0.7-py3-none-any.whl.

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 379d147efa07aa459da74c53c6ca3180b3ef7186f41b1b362771b0c3fc09d6a7 |
| MD5 | 9a81e705db15228128309fad317b992f |
| BLAKE2b-256 | 50261ef957b5504c3e42a1ca4cdccd624c4efb778465d573a8e1735f462e040e |
