HeLLMholtz LLM Suite
A comprehensive Python package for unified LLM access, benchmarking, evaluation, and reporting. Built on top of aisuite with specialized support for Helmholtz Blablador models.
Features
- Unified Client: Single interface for OpenAI, Google, Anthropic, Ollama, and Helmholtz Blablador models
- Centralized Configuration: Environment-based configuration for all your projects
- Advanced Benchmarking: Compare model performance across temperatures, replications, and prompt categories
- LLM-as-a-Judge Evaluation: Automated evaluation with comprehensive statistical analysis
- Interactive Reports: HTML reports with Chart.js visualizations and Markdown summaries
- Flexible Prompt System: Support for both simple text files and structured JSON prompt collections
- Model Monitoring: Track Blablador model availability and configuration consistency
- LM Evaluation Harness: Integration with EleutherAI's comprehensive evaluation suite
- LiteLLM Proxy: Built-in proxy server for model routing and load balancing
- Throughput Testing: Performance benchmarking for high-throughput scenarios
- Model Discovery: Dynamic model listing and availability checking (19+ Blablador models currently available)
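The unified client addresses every backend with a `provider:model` identifier (e.g. `openai:gpt-4o`). A minimal sketch of how such an identifier can be split into its two parts — the function name `split_model_id` is illustrative, not part of the package API:

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a 'provider:model' identifier into (provider, model)."""
    provider, sep, model = model_id.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"Expected 'provider:model', got {model_id!r}")
    return provider, model

print(split_model_id("openai:gpt-4o"))  # ('openai', 'gpt-4o')
```

Note that `partition` splits only at the first colon, so model names that themselves contain colons remain intact.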
Installation
Basic Installation
pip install hellmholtz
Development Installation
For development with all optional dependencies:
git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz
pip install -e ".[eval,proxy]"
Poetry Installation
poetry install --with eval,proxy
Configuration
- Copy the example environment file:
cp .env.example .env
- Configure your API keys in .env:
# OpenAI
OPENAI_API_KEY=your_openai_key
# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key
# Google
GOOGLE_API_KEY=your_google_key
# Helmholtz Blablador
BLABLADOR_API_KEY=your_blablador_key
BLABLADOR_API_BASE=https://your-blablador-instance.com
# Optional: Default models
AISUITE_DEFAULT_MODELS='{"openai": "gpt-4o", "anthropic": "claude-3-haiku"}'
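The AISUITE_DEFAULT_MODELS value is a JSON object mapping provider names to default model IDs, so it can be parsed with the standard library alone. A sketch, assuming the variable is set as in the example above:

```python
import json
import os

# Fall back to an empty mapping when the variable is unset.
raw = os.environ.get("AISUITE_DEFAULT_MODELS", "{}")
defaults = json.loads(raw)

# With the example value above, defaults.get("openai") would be "gpt-4o".
print(defaults)
```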
Usage
Python API
Basic Chat Interface
from hellmholtz.client import chat
# Simple chat
response = chat("openai:gpt-4o", "Hello, how are you?")
print(response)
# With conversation history
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
]
response = chat("anthropic:claude-3-sonnet", messages)
Benchmarking
from hellmholtz.benchmark import run_benchmarks
from hellmholtz.core.prompts import load_prompts
# Load prompts from JSON file
prompts = load_prompts("prompts.json", category="reasoning")
# Run benchmarks
results = run_benchmarks(
models=["openai:gpt-4o", "anthropic:claude-3-haiku", "blablador:gpt-4o"],
prompts=prompts,
temperatures=[0.1, 0.7, 1.0],
replications=3
)
# Analyze results
from hellmholtz.evaluation_analysis import EvaluationAnalyzer
analyzer = EvaluationAnalyzer()
analysis = analyzer.analyze_evaluation_results("results/benchmark_latest.json")
analyzer.print_analysis_summary(analysis)
Command Line Interface
HeLLMholtz provides a comprehensive CLI for all operations:
Chat Interface
# Simple chat
hellm chat --model openai:gpt-4o "Explain the theory of relativity"
# Interactive mode
hellm chat --model anthropic:claude-3-sonnet --interactive
# With system prompt
hellm chat --model blablador:gpt-4o --system "You are a coding assistant" "Write a Python function to calculate fibonacci numbers"
Benchmarking
# Basic benchmark
hellm bench --models openai:gpt-4o,anthropic:claude-3-haiku --prompts-file prompts.txt
# Advanced benchmark with evaluation
hellm bench \
--models openai:gpt-4o,blablador:gpt-4o \
--prompts-file prompts.json \
--prompts-category reasoning \
--temperatures 0.1,0.7,1.0 \
--replications 3 \
--evaluate-with openai:gpt-4o \
--results-dir results/
# Throughput testing
hellm bench-throughput \
--model openai:gpt-4o \
--requests 100 \
--concurrency 10 \
--prompt "Write a short story about AI"
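The throughput command above issues many requests while capping how many are in flight at once. A minimal sketch of that pattern with `asyncio` — `fake_request` is a stand-in for a real API call and is not part of the package:

```python
import asyncio
import time

async def fake_request(prompt: str) -> str:
    """Stand-in for a real LLM call; sleeps briefly instead of hitting an API."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def run_throughput(n_requests: int, concurrency: int) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def one(i: int) -> str:
        async with sem:  # at most `concurrency` requests in flight
            return await fake_request(f"request {i}")

    start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(n_requests)))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed  # requests per second

rps = asyncio.run(run_throughput(n_requests=20, concurrency=5))
print(f"throughput: {rps:.1f} req/s")
```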
Evaluation and Analysis
# Analyze benchmark results
hellm analyze results/benchmark_latest.json --html-report analysis_report.html
# Generate reports
hellm report --results-file results/benchmark_latest.json --output report.md
Model Management
# List available Blablador models
hellm models
# Monitor model availability and test accessibility
hellm monitor --test-accessibility
# Check model configuration consistency
hellm monitor --check-config
Weekly Automated Benchmarking
The repository includes a GitHub Actions workflow that automatically runs benchmarks weekly and updates reports:
- Scheduled: Runs every Sunday at 00:00 UTC
- Model Discovery: Automatically fetches latest Blablador models
- Performance Charts: Generates visual charts comparing model performance
- Multiple Formats: Creates HTML, Markdown, and PNG chart reports
- Auto-commit: Updates reports in the repository for public viewing
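The Sunday 00:00 UTC schedule corresponds to a standard GitHub Actions cron trigger. A sketch of the relevant workflow fragment (the actual publish/benchmark workflow in the repository may differ):

```yaml
on:
  schedule:
    # Every Sunday at 00:00 UTC
    - cron: "0 0 * * 0"
  workflow_dispatch: {}  # also allow manual runs
```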
To enable automated benchmarking:
- Set repository secrets for API keys:
  - BLABLADOR_API_KEY: Your Blablador API key
  - BLABLADOR_API_BASE: Blablador API base URL (optional)
- The workflow will then automatically:
  - Run benchmarks on selected models
  - Generate performance reports
  - Create visual charts
  - Commit updated reports to the repository
Reports are available in the reports/ directory and include:
- weekly_benchmark_report.html: Interactive HTML report
- weekly_benchmark_report.md: Markdown summary
- weekly_benchmark_chart.png: Performance visualization
Advanced Features
# Run LM Evaluation Harness
hellm lm-eval \
--model openai:gpt-4o \
--tasks hellaswag,winogrande \
--limit 100
# Start LiteLLM proxy server
hellm proxy \
--config litellm_config.yaml \
--port 8000
Project Structure
hellmholtz/
├── cli.py # Command-line interface
├── client.py # Unified LLM client
├── monitoring.py # Model availability monitoring
├── evaluation_analysis.py # Statistical analysis and reporting
├── export.py # Result export utilities
├── core/
│ ├── config.py # Configuration management
│ └── prompts.py # Prompt loading and validation
├── benchmark/
│ ├── runner.py # Benchmark execution
│ ├── evaluator.py # LLM-as-a-Judge evaluation
│ └── prompts.py # Benchmark-specific prompts
├── providers/
│ ├── blablador_provider.py # Custom Blablador provider
│ ├── blablador_config.py # Blablador model configuration
│ ├── blablador.py # Blablador utilities
│ └── __init__.py
├── reporting/
│ ├── html.py # HTML report generation
│ ├── markdown.py # Markdown report generation
│ ├── stats.py # Statistical calculations
│ ├── utils.py # Reporting utilities
│ └── templates/ # HTML templates
└── integrations/
├── lm_eval.py # LM Evaluation Harness integration
└── litellm.py # LiteLLM proxy integration
Prompt System
HeLLMholtz supports two prompt formats:
Simple Text Format (prompts.txt)
What is the capital of France?
Explain quantum computing in simple terms.
Write a Python function to reverse a string.
Structured JSON Format (prompts.json)
[
{
"id": "capital-france",
"category": "knowledge",
"description": "Test basic geographical knowledge",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"expected_output": "Paris"
},
{
"id": "quantum-explanation",
"category": "reasoning",
"description": "Test ability to explain complex concepts simply",
"messages": [
{
"role": "user",
"content": "Explain quantum computing in simple terms."
}
]
}
]
Evaluation System
The LLM-as-a-Judge evaluation system provides:
- Automated Scoring: AI-powered evaluation of response quality
- Statistical Analysis: Comprehensive metrics and distributions
- Model Rankings: Performance comparisons across all dimensions
- Interactive Reports: Web-based visualizations of results
- Detailed Critiques: Specific feedback for each response
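The ranking statistics shown below (mean score ± standard deviation per model) can be reproduced from raw judge scores with the standard library. A sketch with made-up scores rather than real evaluation output:

```python
from statistics import mean, stdev

# Hypothetical judge scores (1-10) per model, grouped from raw evaluations.
scores = {
    "openai:gpt-4o": [9, 8, 9, 8, 9],
    "blablador:gpt-4o": [8, 7, 9, 7, 8],
}

# Rank models by mean score, reporting mean ± sample standard deviation.
ranking = sorted(
    ((mean(v), stdev(v), model) for model, v in scores.items()),
    reverse=True,
)
for avg, sd, model in ranking:
    print(f"{model}: {avg:.1f}/10 (±{sd:.1f})")
```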
Example Analysis Output
EVALUATION ANALYSIS RESULTS
══════════════════════════════════════════════════════════════
OVERVIEW
• Total Evaluations: 150
• Models Tested: 3
• Prompts Tested: 5
• Success Rate: 94.7%
🏆 MODEL RANKINGS
1. openai:gpt-4o - Avg Score: 8.7/10 (±0.8)
2. anthropic:claude-3-opus - Avg Score: 8.4/10 (±0.9)
3. blablador:gpt-4o - Avg Score: 7.9/10 (±1.1)
DETAILED METRICS
• Response Quality: 8.3/10 average
• Relevance: 8.6/10 average
• Accuracy: 9.1/10 average
• Creativity: 7.8/10 average
Latest Benchmark Results
Recent results from the automated weekly workflow, which benchmarks Blablador models:
Model Performance Overview
| Model | Success Rate | Avg Latency | Avg Rating (1-10) | Rating Std Dev |
|---|---|---|---|---|
| GPT-OSS-120b | 100.0% | 5.35s | 8.5 | ±2.38 |
| Ministral-3-14B-Instruct-2512 | 100.0% | 9.55s | 7.5 | ±3.70 |
Overall Statistics:
- Total Evaluations: 8 across 4 different prompts
- Models Tested: 2 Blablador models
- Overall Success Rate: 100.0%
- Average Rating: 8.0/10
- Average Latency: 7.45s
Key Findings
- Top Performer: GPT-OSS-120b with highest rating (8.5/10) and fastest response time (5.35s)
- Most Consistent: GPT-OSS-120b with lower rating variation (±2.38)
- Performance Gap: 1.0 point difference between best and worst performing models
- Model Availability: Both tested models are fully operational with 100% success rates
Evaluation Details
- Prompt Categories: Reasoning, coding, and creative writing tasks
- Temperature Testing: Multiple temperature settings (0.1, 0.7, 1.0) for response variation
- LLM-as-a-Judge: Automated evaluation with detailed critiques and statistical analysis
- Rating Distribution: GPT-OSS-120b received mostly 9-10 ratings, Ministral-3-14B showed more variation
Reports and Visualizations
- Interactive HTML Report - Comprehensive evaluation analysis with charts
- Markdown Summary - Detailed performance metrics
- Performance Chart - Visual model comparison
- Basic Report - Simple performance overview
Reports are automatically updated and include LLM-as-a-Judge evaluation with detailed statistical analysis and model rankings.
Development
Setup Development Environment
# Clone repository
git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz
# Install with development dependencies
poetry install --with dev
# Install pre-commit hooks
poetry run pre-commit install
Running Tests
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=hellmholtz --cov-report=html
# Run specific test categories
poetry run pytest -m "slow" # Slow integration tests
poetry run pytest -m "network" # Tests requiring network access
poetry run pytest -m "model" # Tests using actual models
Code Quality
# Lint code
poetry run ruff check .
# Format code
poetry run ruff format .
# Type checking
poetry run mypy src/
# Security scanning
poetry run bandit -r src/
Building Documentation
# Generate API documentation
poetry run sphinx-build docs/ docs/_build/
# Serve documentation locally
poetry run sphinx-serve docs/_build/
Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Run the full test suite: poetry run pytest
- Ensure code quality: poetry run ruff check . && poetry run mypy src/
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built on top of aisuite for unified LLM access
- LLM evaluation powered by EleutherAI's LM Evaluation Harness
- Proxy functionality via LiteLLM
- Special thanks to the Helmholtz Association for Blablador model access
Support
- Documentation: https://hellmholtz.readthedocs.io/
- Issue Tracker: https://github.com/JonasHeinickeBio/HeLLMholtz/issues
- Discussions: https://github.com/JonasHeinickeBio/HeLLMholtz/discussions
Made with love for the scientific computing community