# PraisonAI Bench

> A simple, powerful LLM benchmarking tool built with PraisonAI Agents

Benchmark any LiteLLM-compatible model with automatic HTML extraction, model-specific output organization, and flexible test suite management.
## Key Features

- **Any LLM Model** - OpenAI, Anthropic, Google, XAI, and local models via LiteLLM
- **Single Agent Design** - Your prompt becomes the instruction (no complex configs)
- **Auto HTML Extraction** - Automatically saves HTML code from responses
- **Smart Organization** - Model-specific output folders (`output/gpt-4o/`, `output/xai/grok-code-fast-1/`)
- **Flexible Testing** - Run single tests, full suites, or filter specific tests
- **Modern Tooling** - Built with `pyproject.toml` and the `uv` package manager
- **Comprehensive Results** - JSON metrics with timing, success rates, and metadata
## Quick Start

### 1. Install with uv (Recommended)

```bash
# Clone the repository
git clone https://github.com/MervinPraison/praisonaibench
cd praisonaibench

# Install with uv
uv sync

# Or install in development mode
uv pip install -e .
```

### 2. Alternative: Install with pip

```bash
pip install -e .
```
### 3. Set Your API Keys

```bash
# OpenAI
export OPENAI_API_KEY=your_openai_key

# XAI (Grok)
export XAI_API_KEY=your_xai_key

# Anthropic
export ANTHROPIC_API_KEY=your_anthropic_key

# Google
export GOOGLE_API_KEY=your_google_key
```
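Before a long run, it can be worth checking which keys are actually visible to the process. This is a small standalone sketch (the `missing_keys` helper is not part of the package); only the providers you target need a key.

```python
import os

def missing_keys(env: dict, required: list[str]) -> list[str]:
    """Return the provider key names from `required` that are unset or empty."""
    return [k for k in required if not env.get(k)]

# Only the providers you actually benchmark need a key.
providers = ["OPENAI_API_KEY", "XAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
print(missing_keys(dict(os.environ), providers))
```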
### 4. Run Your First Benchmark

```python
from praisonaibench import Bench

# Create a benchmark suite
bench = Bench()

# Run a simple test
result = bench.run_single_test("What is 2+2?")
print(result['response'])

# Run with a specific model
result = bench.run_single_test(
    "Create a rotating cube HTML file",
    model="xai/grok-code-fast-1"
)

# Get a summary
summary = bench.get_summary()
print(summary)
```
## Project Structure

```
praisonaibench/
├── pyproject.toml              # Modern Python packaging
├── src/praisonaibench/         # Source code
│   ├── __init__.py             # Main imports
│   ├── bench.py                # Core benchmarking engine
│   ├── agent.py                # LLM agent wrapper
│   ├── cli.py                  # Command line interface
│   └── version.py              # Version info
├── examples/                   # Example configurations
│   ├── threejs_simulation_suite.yaml
│   └── config_example.yaml
└── output/                     # Generated results
    ├── gpt-4o/                 # Model-specific HTML files
    ├── xai/grok-code-fast-1/
    └── benchmark_results_*.json
```
## CLI Usage

### Basic Commands

```bash
# Single test with default model
praisonaibench --test "Explain quantum computing"

# Single test with specific model
praisonaibench --test "Write a poem" --model gpt-4o

# Use any LiteLLM-compatible model
praisonaibench --test "Create HTML" --model xai/grok-code-fast-1
praisonaibench --test "Write code" --model gemini/gemini-1.5-flash-8b
praisonaibench --test "Analyze data" --model claude-3-sonnet-20240229
```

### Test Suites

```bash
# Run entire test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml

# Run specific test from suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --test-name "rotating_cube_simulation"

# Run suite with specific model (overrides individual test models)
praisonaibench --suite tests.yaml --model xai/grok-code-fast-1
```

### Cross-Model Comparison

```bash
# Compare across multiple models
praisonaibench --cross-model "Write a poem" --models gpt-4o,gpt-3.5-turbo,xai/grok-code-fast-1
```

### Extract HTML from Results

```bash
# Extract HTML from existing benchmark results
praisonaibench --extract output/benchmark_results_20250829_160426.json
# → Processes the JSON file and saves any HTML content to .html files

# Works with any benchmark results JSON file
praisonaibench --extract my_results.json
```
### HTML Generation Examples

```bash
# Generate a Three.js simulation (auto-saves HTML)
praisonaibench --test "Create a rotating cube HTML with Three.js" --model gpt-4o
# → Saves to: output/gpt-4o/test_cube.html

# Run the Three.js test suite
praisonaibench --suite examples/threejs_simulation_suite.yaml --model xai/grok-code-fast-1
# → Saves to: output/xai/grok-code-fast-1/rotating_cube_simulation.html
```
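The per-model folder layout follows directly from the model name, since provider prefixes like `xai/` contain a path separator. A minimal sketch of that mapping (the `output_path` function is illustrative, not the package's API):

```python
from pathlib import Path

def output_path(model: str, test_name: str, root: str = "output") -> Path:
    """Build the output file path for one test's extracted HTML.

    Model names with a provider prefix (e.g. `xai/grok-code-fast-1`)
    naturally become nested folders under the output root.
    """
    return Path(root) / model / f"{test_name}.html"

print(output_path("gpt-4o", "test_cube"))
# → output/gpt-4o/test_cube.html
print(output_path("xai/grok-code-fast-1", "rotating_cube_simulation"))
# → output/xai/grok-code-fast-1/rotating_cube_simulation.html
```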
## Test Suite Format

### Basic Test Suite (tests.yaml)

```yaml
tests:
  - name: "math_test"
    prompt: "What is 15 * 23?"

  - name: "creative_test"
    prompt: "Write a short story about a robot"

  - name: "model_specific_test"
    prompt: "Explain quantum physics"
    model: "gpt-4o"
```

### Advanced Test Suite with Full Config Support

```yaml
# Global LLM configuration (applies to all tests)
config:
  max_tokens: 4000
  temperature: 0.7
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0
  # Any LiteLLM-compatible parameter is supported!

tests:
  - name: "creative_writing"
    prompt: "Write a detailed sci-fi story"
    model: "gpt-4o"

  - name: "code_generation"
    prompt: "Create a Python web scraper"
    model: "xai/grok-code-fast-1"
```

### Three.js HTML Generation Suite

```yaml
# examples/threejs_simulation_suite.yaml
tests:
  - name: "rotating_cube_simulation"
    prompt: |
      Create a complete HTML file with Three.js that displays a rotating 3D cube.
      The cube should have different colored faces, rotate continuously, and include proper lighting.
      The HTML file should be self-contained with Three.js loaded from CDN.
      Include camera controls for user interaction.
      Save the output as 'rotating_cube.html'.

  - name: "particle_system"
    prompt: |
      Create an HTML file with Three.js showing an animated particle system.
      Include 1000+ particles with random colors, movement, and physics.
      Add mouse interaction to influence particle behavior.

  - name: "terrain_simulation"
    prompt: |
      Create a Three.js HTML file with a procedurally generated terrain landscape.
      Include realistic textures, lighting, and a first-person camera.
      Add fog effects and animated elements.

  - name: "solar_system"
    prompt: |
      Create a Three.js solar system simulation in HTML.
      Include the sun, planets with realistic orbits, textures, and lighting.
      Add controls to speed up/slow down time.
```
## Configuration

### Basic Configuration (config.yaml)

```yaml
# Default model (can be overridden per test)
default_model: "gpt-4o"

# Output settings
output_format: "json"
save_results: true
output_dir: "output"

# Performance settings
max_retries: 3
timeout: 60
```

### Supported Models

PraisonAI Bench supports any LiteLLM-compatible model:

```yaml
# OpenAI models
- gpt-4o
- gpt-4o-mini
- gpt-3.5-turbo

# Anthropic models
- claude-3-opus-20240229
- claude-3-sonnet-20240229
- claude-3-haiku-20240307

# Google models
- gemini/gemini-1.5-pro
- gemini/gemini-1.5-flash
- gemini/gemini-1.5-flash-8b

# XAI models
- xai/grok-beta
- xai/grok-code-fast-1

# Local models (via LM Studio, Ollama, etc.)
- ollama/llama2
- openai/gpt-3.5-turbo  # with OPENAI_API_BASE set
```
## Results & Output

### Automatic HTML Extraction

When LLM responses contain HTML code blocks, they're automatically extracted and saved:

```
output/
├── gpt-4o/
│   ├── rotating_cube_simulation.html
│   └── particle_system.html
├── xai/
│   └── grok-code-fast-1/
│       ├── terrain_simulation.html
│       └── solar_system.html
└── benchmark_results_20250829_160426.json
```

### JSON Results Format

```json
[
  {
    "test_name": "rotating_cube_simulation",
    "prompt": "Create a complete HTML file with Three.js...",
    "response": "<!DOCTYPE html>\n<html>\n...",
    "model": "xai/grok-code-fast-1",
    "agent_name": "BenchAgent",
    "execution_time": 8.24,
    "status": "success",
    "timestamp": "2025-08-29 16:04:26"
  }
]
```
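Because the results file is plain JSON, the summary numbers (total tests, success rate, average time) can be recomputed with a few lines of standard-library Python. The sample records here are abridged from the schema above, and the second record's values are made up for illustration.

```python
import json

results_json = """
[
  {"test_name": "rotating_cube_simulation", "model": "xai/grok-code-fast-1",
   "execution_time": 8.24, "status": "success"},
  {"test_name": "particle_system", "model": "xai/grok-code-fast-1",
   "execution_time": 16.44, "status": "success"}
]
"""

results = json.loads(results_json)
total = len(results)
successes = sum(1 for r in results if r["status"] == "success")
avg_time = sum(r["execution_time"] for r in results) / total

print(f"Total tests: {total}")               # Total tests: 2
print(f"Success rate: {successes / total:.1%}")  # Success rate: 100.0%
print(f"Average time: {avg_time:.2f}s")      # Average time: 12.34s
```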
### Summary Statistics

```
Summary:
  Total tests: 4
  Success rate: 100.0%
  Average time: 12.34s
  Results saved to: output/benchmark_results_20250829_160426.json
```
## Advanced Features

### Universal Model Support

- Works with any LiteLLM-compatible model
- No hardcoded model restrictions
- Automatic API key detection

### Smart HTML Handling

- Auto-detects HTML in multiple formats:
  - Markdown-wrapped HTML (fenced `` ```html `` blocks)
  - Truncated HTML blocks (incomplete responses)
  - Raw HTML content (direct HTML responses)
- Extracts and saves content as `.html` files automatically
- Organizes output by model in separate folders
- Extracts HTML from existing benchmark results with `--extract`
- Perfect for Three.js, React, or any web development benchmarks

### Flexible Test Management

- Run entire suites or filter specific tests
- Override models per test or globally
- Cross-model comparisons with detailed metrics

### Modern Development

- Built with `pyproject.toml` (no legacy `setup.py`)
- Optimized for the `uv` package manager
- Fast dependency resolution and installation

### Simple Architecture

- **Single Agent Design** - Your prompt becomes the instruction
- **No Complex Configs** - Just write your test prompts
- **Minimal Dependencies** - Only what you need
## Use Cases

### Web Development Benchmarking

```bash
# Test HTML/CSS/JS generation across models
praisonaibench --suite web_dev_suite.yaml --model gpt-4o
```

### Code Generation Comparison

```bash
# Compare coding abilities
praisonaibench --cross-model "Write a Python web scraper" --models gpt-4o,claude-3-sonnet-20240229,xai/grok-code-fast-1
```

### Creative Content Testing

```bash
# Test creative writing
praisonaibench --test "Write a sci-fi short story" --model gemini/gemini-1.5-pro
```
## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Install dependencies: `uv sync`
4. Make your changes
5. Run tests: `uv run pytest`
6. Submit a pull request
## License

MIT License - see the LICENSE file for details.

Perfect for developers who need powerful, flexible LLM benchmarking with zero complexity!