# PraisonAI Bench

🚀 A simple, powerful LLM benchmarking tool built with PraisonAI Agents
Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.
## 🎯 Testing Modes

| Feature | Single Test | Test Suite (YAML) |
|---|---|---|
| 📝 Description | Run one prompt | Run multiple tests from a YAML file |
| 🔧 Command | `praisonaibench --test "prompt"` | `praisonaibench --suite tests.yaml` |
| 📊 Evaluation | ✅ Enabled (Browser + LLM Judge) | ✅ Enabled (Browser + LLM Judge) |
| 🎨 HTML Extraction | ✅ Auto-extracted | ✅ Auto-extracted |
| 📄 Output | Single JSON result | Batch JSON results |
| 🖼️ Screenshots | ✅ Generated | ✅ Generated |
| ⚡ Console Errors | ✅ Detected | ✅ Detected |
| 🤖 LLM Judge | ✅ gpt-5.1 quality scoring | ✅ gpt-5.1 quality scoring |
| 🔄 Retry Logic | ✅ 3 attempts | ✅ 3 attempts |
| ⚡ Parallel Execution | N/A | ✅ `--concurrent N` |
| 💰 Cost Tracking | ✅ Token & cost per test | ✅ Cumulative cost summary |
| 📊 Export Formats | JSON | JSON, CSV |
| 📈 HTML Reports | N/A | ✅ `--report` |
| 🎯 Use Case | Quick testing | Comprehensive benchmarking |
## 📊 What's Included in Evaluation?

Our research-backed hybrid evaluation system provides comprehensive quality assessment:

| Component | What It Does | Score Weight |
|---|---|---|
| 📋 HTML Validation | Static structure validation, DOCTYPE, required tags | 15% |
| 🌐 Functional | Browser rendering, console errors, render time | 40% |
| 🎯 Expected Result | Objective comparison (optional, for factual tasks) | 20%* |
| 🎨 Quality (LLM) | Code quality, completeness, best practices | 25% |
| 📊 Overall | Combined score (0-100) with pass/fail (≥70) | 100% |

*When the `expected` field is not provided, weights are automatically normalized (HTML: 18.75%, Functional: 50%, LLM: 31.25%).
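For illustration, the renormalization above amounts to dropping the expected-result weight and rescaling the rest so they still sum to 100%. A minimal sketch (illustrative only, not the package's internal code):

```python
# Sketch of the weight renormalization described above (illustrative only,
# not the package's internal implementation).
BASE_WEIGHTS = {"html": 0.15, "functional": 0.40, "expected": 0.20, "llm": 0.25}

def effective_weights(has_expected: bool) -> dict:
    """Drop 'expected' when no expected value is given, then rescale
    the remaining weights so they sum to 1.0."""
    weights = dict(BASE_WEIGHTS)
    if not has_expected:
        weights.pop("expected")
        total = sum(weights.values())  # 0.80
        weights = {k: v / total for k, v in weights.items()}
    return weights

print(effective_weights(False))
# {'html': 0.1875, 'functional': 0.5, 'llm': 0.3125} -- i.e. 18.75% / 50% / 31.25%
```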
**Example Output (with expected):**

```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: 95/100 (95% similarity with expected result)
Quality: 80/100 (good structure, minor issues)
Overall: 87/100 ✅ PASSED
```
**Example Output (without expected):**

```
HTML Validation: 90/100 ✅ Valid structure
Functional: 85/100 (renders ✅, 1 error, <1s)
Expected: N/A (not provided)
Quality: 80/100 (good structure, minor issues)
Overall: 85/100 ✅ PASSED
```
## ✨ Key Features

- 🎯 **Any LLM Model** - OpenAI, Anthropic, Google, XAI, local models via LiteLLM
- 🤖 **Single Agent Design** - Your prompt becomes the instruction (no complex configs)
- 💾 **Auto HTML Extraction** - Automatically saves HTML code from responses
- 📁 **Smart Organization** - Model-specific output folders (`output/gpt-4o/`, `output/xai/grok-code-fast-1/`)
- 🗂️ **Flexible Testing** - Run single tests, full suites, or filter specific tests
- ⚡ **Parallel Execution** - Run tests concurrently with `--concurrent N` for faster benchmarking
- 💰 **Cost & Token Tracking** - Automatic token usage and cost calculation for all supported models
- 📊 **Multiple Export Formats** - Export results as JSON or CSV for easy analysis
- 📈 **HTML Dashboard Reports** - Beautiful visual reports with interactive charts using `--report`
- 🛠️ **Modern Tooling** - Built with `pyproject.toml` and the `uv` package manager
- 📋 **Comprehensive Results** - Complete metrics with timing, success rates, costs, and metadata
- 🔌 **Plugin System** - Extensible evaluators for any language (Python, TypeScript, Go, etc.) via plugins
## 🚀 Quick Start

### 1. Install from PyPI (Recommended)

```bash
pip install praisonaibench
```

### 2. Install with uv

```bash
# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .
```

### 3. Alternative: Install with pip (from source)

```bash
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench
pip install -e .
```

### 4. Set Your API Keys

```bash
# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key
```

### 5. Run Your First Benchmark

```python
from praisonaibench import Bench

# Create benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file",
    model="xai/grok-code-fast-1"
)

# Get summary
summary = bench.get_summary()
print(summary)
```
## 📁 Project Structure

```
praisonaibench/
├── pyproject.toml              # Modern Python packaging
├── src/praisonaibench/         # Source code
│   ├── __init__.py             # Main imports
│   ├── bench.py                # Core benchmarking engine
│   ├── agent.py                # LLM agent wrapper
│   ├── cli.py                  # Command line interface
│   └── version.py              # Version info
├── examples/                   # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                     # Generated results
    ├── gpt-4o/                 # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json
```
## 💻 CLI Usage

### Basic Commands

```bash
# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229
```

### Test Suites

```bash
# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1

# Run tests in parallel (3 concurrent workers)
praisonaibench --suite tests.yaml --concurrent 3
```
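Conceptually, `--concurrent N` amounts to running tests in a small worker pool. A rough sketch using the Python API from Quick Start (this assumes `run_single_test` is safe to call from worker threads; it is not the CLI's actual implementation):

```python
# Rough sketch of concurrent execution using the documented Python API.
# Assumes bench.run_single_test() can be called from worker threads.
from concurrent.futures import ThreadPoolExecutor

from praisonaibench import Bench

bench = Bench()
prompts = ["What is 2+2?", "Write a haiku", "Explain DNS in one paragraph"]

with ThreadPoolExecutor(max_workers=3) as pool:  # like --concurrent 3
    results = list(pool.map(bench.run_single_test, prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result['status'])
```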
### Cross-Model Comparison

```bash
# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1
```
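The same comparison can be scripted with the Python API shown in Quick Start. A sketch (the result fields follow the JSON results format documented below):

```python
# Scripting a cross-model comparison with the documented Python API.
from praisonaibench import Bench

bench = Bench()
models = ["gpt-4o", "gpt-3.5-turbo", "xai/grok-code-fast-1"]

for model in models:
    result = bench.run_single_test("Write a poem", model=model)
    print(f"{model}: {result['status']} in {result['execution_time']:.2f}s")
```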
### Extract HTML from Results

```bash
# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes the JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json
```
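Under the hood, extraction amounts to scanning each result's `response` for HTML and writing it to a model-specific folder. A simplified sketch, assuming HTML arrives in fenced ```` ```html ```` blocks (the `extract_html` helper below is hypothetical, not the CLI's code):

```python
# Simplified sketch of what --extract does; not the CLI's actual code.
import json
import re
from pathlib import Path

def extract_html(results_path: str, out_dir: str = "output") -> None:
    results = json.loads(Path(results_path).read_text())
    for entry in results:
        response = entry["response"]
        match = re.search(r"```html\s*(.*?)```", response, re.DOTALL)
        body = match.group(1) if match else None
        if body is None and response.lstrip().startswith("<!DOCTYPE"):
            body = response  # raw HTML response without a code fence
        if body:
            target = Path(out_dir) / entry["model"] / f"{entry['test_name']}.html"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(body)

extract_html("output/benchmark_results_20250829_160426.json")
```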
### HTML Generation Examples

```bash
# Generate Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html
```
### Cost & Token Tracking

Automatically track token usage and costs for all LLM API calls:

```bash
# Run tests with automatic cost tracking
praisonaibench --suite tests.yaml --model gpt-4o
```

Output includes per-test costs:

```
💰 Cost: $0.002400 (1250 tokens)
```

The summary shows total costs:

```
📊 Summary:
  Total tests: 4
  Success rate: 100.0%
  Average time: 8.42s

💰 Cost Summary:
  Total tokens: 5,420
  Total cost: $0.0124
  By model:
    gpt-4o: $0.0124 (5,420 tokens)
```

Supported models include accurate pricing for:

- OpenAI (GPT-4o, GPT-4, GPT-3.5, O1)
- Anthropic (Claude 3 family)
- Google (Gemini 1.5 family)
- XAI (Grok models)
- Groq (optimised models)

Token usage is extracted from API responses when available, or estimated from text length. Costs are calculated using official provider pricing (updated December 2024).
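The cost arithmetic itself is straightforward: token counts divided by 1,000, times a per-1K price for each direction. A sketch with placeholder prices (the package's actual pricing table is not reproduced here):

```python
# Illustrative cost arithmetic; the per-1K-token prices below are
# placeholders, not the package's pricing table.
PRICES_PER_1K = {"gpt-4o": {"input": 0.0025, "output": 0.0100}}  # USD, hypothetical

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

print(f"${estimate_cost('gpt-4o', 1000, 250):.6f}")  # -> $0.005000
```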
### CSV Export

Export benchmark results to CSV for spreadsheet analysis:

```bash
# Export to CSV format
praisonaibench --suite tests.yaml --format csv
# Results saved to: output/csv/benchmark_results_20241211_123456.csv
```

The CSV includes:

- Test names and status
- Model information
- Execution times
- Token usage (input/output/total)
- Costs per test
- Evaluation scores
- Prompts and response lengths
- Error messages (if any)

Perfect for:

- Spreadsheet analysis in Excel/Google Sheets
- Data visualization tools
- Statistical analysis
- Sharing results with non-technical stakeholders
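For example, a few lines of pandas gets you per-model averages. The column names below (`model`, `execution_time`, `cost`) are assumptions based on the field list above, not a documented schema:

```python
# Quick analysis of an exported CSV. Column names ("model",
# "execution_time", "cost") are assumptions, not a documented schema.
import pandas as pd

df = pd.read_csv("output/csv/benchmark_results_20241211_123456.csv")
print(df.groupby("model")[["execution_time", "cost"]].mean())
```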
### HTML Dashboard Reports

Generate beautiful interactive reports with comprehensive visualizations inspired by the React UI:

```bash
# Generate enhanced HTML report after running tests
praisonaibench --suite tests.yaml --report

# Generate report from existing results (without re-running tests)
praisonaibench --report-from output/json/benchmark_results_20241211_123456.json

# Compare multiple test results
praisonaibench --compare result1.json result2.json result3.json

# Reports saved to: output/reports/
```

**The enhanced report includes:**

**📊 Dashboard Tab:**
- Summary cards with key metrics (tests, models, success rate, average time, cost, tokens)
- Interactive charts:
  - Status distribution (success/failure)
  - Execution time by model
  - Evaluation scores (radar chart)
  - Errors & warnings

**🏆 Leaderboard Tab:**
- Model rankings with multiple criteria:
  - Overall Score (default)
  - Functional Score
  - Quality Score
  - Pass Rate
  - Speed (fastest first)
- Top 3 models highlighted with medals
- Detailed metrics per model (functional, quality, pass rate, time)
- Click a criterion to re-rank dynamically

**⚖️ Comparison Tab:**
- Detailed side-by-side model comparison
- Comprehensive metrics table:
  - Overall score, functional score, quality score
  - Pass rate with color coding
  - Average execution time
  - Total errors and warnings count
- Full model names and stats

**📋 Results Tab:**
- Complete test results table
- Individual test status, scores, time, tokens, cost
- Sortable columns
- Status indicators

**Features:**

- 🎨 Modern UI with gradient headers and smooth transitions
- 📱 Fully responsive design
- ⚡ Fast and lightweight (no external dependencies)
- 📑 Tab navigation for organized viewing
- 📊 Chart.js-powered visualizations
- 🎯 Based on the praisonaibench-ui React application
- 💾 Standalone HTML - works offline
- 📧 Easy to share via email or host on the web

**Comparison Reports:** Multi-run comparison shows:

- Side-by-side success rates
- Performance trends
- Cost and token usage evolution
- Model improvements over time

Open any generated HTML file in any modern browser!
## 📝 Test Suite Format

### Basic Test Suite (tests.yaml)

```yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"
    expected: "345"        # Optional: for objective comparison

  - name: "python_test"
    language: "python"     # Use plugin evaluator
    prompt: "Write Python factorial function"
    expected: "120"

  - name: "creative_test"
    prompt: "Write a short story about a robot"
    # No expected field - subjective task

  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```

**Using the `expected` field:**

- ✅ Use for: factual questions, math problems, code output, deterministic tasks
- ❌ Skip for: creative tasks, open-ended questions, visual/interactive content
- When provided: adds 20% objective scoring based on similarity (one possible method is sketched below)
- When omitted: weights automatically normalize (no penalty)
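One plausible way such a similarity score could be computed is plain string matching, e.g. with `difflib`; the package does not document its exact method, so treat this as a sketch:

```python
# Sketch of an expected-answer similarity score (0-100). The package
# does not document its exact method; difflib is one plausible choice.
from difflib import SequenceMatcher

def similarity_score(response: str, expected: str) -> float:
    return SequenceMatcher(None, response.strip(), expected.strip()).ratio() * 100

print(round(similarity_score("345", "345")))  # 100
```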
### Advanced Test Suite with Full Config Support

```yaml
# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"

  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"
```
### Three.js HTML Generation Suite

```yaml
# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.

  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.

  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.

  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.
```
## 🔧 Configuration

### Basic Configuration (config.yaml)

```yaml
# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60
```
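For reference, a `max_retries: 3` policy like the one above boils down to something like the following (an illustrative sketch, not the package's internal code):

```python
# Illustrative sketch of a max_retries-style policy; not the
# package's internal code.
import time

def with_retries(call, max_retries: int = 3, delay: float = 1.0):
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return call()
        except Exception as err:         # broad on purpose for the sketch
            last_error = err
            time.sleep(delay * attempt)  # simple linear backoff
    raise last_error
```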
### Supported Models

PraisonAI Bench supports any LiteLLM-compatible model:

```yaml
# OpenAI Models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic Models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google Models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI Models
- xai/grok-beta
- xai/grok-code-fast-1

# Local Models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo   # with OPENAI_API_BASE set
```
## 📊 Results & Output

### Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

```
output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
```
### JSON Results Format

```json
[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
```
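Because the format is plain JSON, results are easy to post-process; for example, the summary numbers shown below can be recomputed directly from the documented fields:

```python
# Recompute summary statistics from the documented JSON result fields.
import json

with open("output/benchmark_results_20250829_160426.json") as f:
    results = json.load(f)

total = len(results)
successes = sum(r["status"] == "success" for r in results)
avg_time = sum(r["execution_time"] for r in results) / total

print(f"Total tests: {total}")
print(f"Success rate: {successes / total:.1%}")
print(f"Average time: {avg_time:.2f}s")
```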
### Summary Statistics

```
📊 Summary:
  Total tests: 4
  Success rate: 100.0%
  Average time: 12.34s
  Results saved to: output/benchmark_results_20250829_160426.json
```
## 🎯 Advanced Features

### 🌐 Universal Model Support

- Works with any LiteLLM-compatible model
- No hardcoded model restrictions
- Automatic API key detection

### 💾 Smart HTML Handling

- Auto-detects HTML in multiple formats:
  - Markdown-wrapped HTML (```` ```html ... ``` ```` blocks)
  - Truncated HTML blocks (incomplete responses)
  - Raw HTML content (direct HTML responses)
- Extracts and saves as `.html` files automatically
- Organizes by model in separate folders
- Extract HTML from existing benchmark results with `--extract`
- Perfect for Three.js, React, or any web development benchmarks

### 🗂️ Flexible Test Management

- Run entire suites or filter specific tests
- Override models per test or globally
- Cross-model comparisons with detailed metrics

### ⚡ Modern Development

- Built with `pyproject.toml` (no legacy `setup.py`)
- Optimized for the `uv` package manager
- Fast dependency resolution and installation

### 🏗️ Simple Architecture

- **Single Agent Design** - Your prompt becomes the instruction
- **No Complex Configs** - Just write your test prompts
- **Minimal Dependencies** - Only what you need
## 🔌 Plugin System

Extensible evaluators for any language or task - create a plugin in one file.

### Create Plugin (One File)

```python
from praisonaibench import BaseEvaluator

class MyEvaluator(BaseEvaluator):
    def get_language(self):
        return 'mylang'  # e.g., 'python', 'typescript', 'go'

    def evaluate(self, code, test_name, prompt, expected=None):
        return {
            'score': 85,       # 0-100
            'passed': True,    # score >= 70
            'feedback': [{'level': 'success', 'message': '✅ Works!'}],
            'details': {}
        }
```

**Setup (`pyproject.toml`):**

```toml
[project]
name = "praisonaibench-mylang"
version = "0.1.0"
dependencies = ["praisonaibench>=0.1.0"]

[project.entry-points."praisonaibench.evaluators"]
mylang = "my_evaluator:MyEvaluator"
```

**Install:** `pip install -e .` or `uv pip install -e .`
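Auto-discovery via entry points plausibly works along these lines (a sketch using `importlib.metadata`; the package's actual plugin loader is not shown here):

```python
# Sketch of entry-point auto-discovery (Python 3.10+ importlib.metadata);
# not the package's actual plugin loader.
from importlib.metadata import entry_points

def discover_evaluators() -> dict:
    evaluators = {}
    for ep in entry_points(group="praisonaibench.evaluators"):
        cls = ep.load()              # e.g. my_evaluator:MyEvaluator
        evaluators[ep.name] = cls()  # keyed by name, e.g. "mylang"
    return evaluators
```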
### Use Plugin

```yaml
# tests.yaml
tests:
  - name: "python_test"
    language: "python"   # Auto-discovered
    prompt: "Write Python hello world"
    expected: "Hello World"
```

**Run:** `praisonaibench --suite tests.yaml`

### Features

- ✅ One file (~50 lines) per plugin
- ✅ Auto-discovery - no config needed
- ✅ Backwards compatible - HTML evaluation unchanged
- ✅ Language detection - auto-detects from code blocks or an explicit `language` field
- ✅ Any task - programming languages, text summarization, translation, etc.

**Example:** `examples/plugins/python_evaluator.py`
## 📚 Use Cases

### Web Development Benchmarking

```bash
# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o
```

### Code Generation Comparison

```bash
# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1
```

### Creative Content Testing

```bash
# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro
```
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Install dependencies: `uv sync`
4. Make your changes
5. Run tests: `uv run pytest`
6. Submit a pull request

## 📄 License

MIT License - see LICENSE file for details.

---

Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity! 🚀