
Project description

HeLLMholtz LLM Suite

A comprehensive Python package for unified LLM access, benchmarking, evaluation, and reporting. Built on top of aisuite with specialized support for Helmholtz Blablador models.

Features

  • Unified Client: Single interface for OpenAI, Google, Anthropic, Ollama, and Helmholtz Blablador models
  • Centralized Configuration: Environment-based configuration for all your projects
  • Advanced Benchmarking: Compare model performance across temperatures, replications, and prompt categories
  • LLM-as-a-Judge Evaluation: Automated evaluation with comprehensive statistical analysis
  • Interactive Reports: HTML reports with Chart.js visualizations and Markdown summaries
  • Flexible Prompt System: Support for both simple text files and structured JSON prompt collections
  • Model Monitoring: Track Blablador model availability and configuration consistency
  • LM Evaluation Harness: Integration with EleutherAI's comprehensive evaluation suite
  • LiteLLM Proxy: Built-in proxy server for model routing and load balancing
  • Throughput Testing: Performance benchmarking for high-throughput scenarios
  • Model Discovery: Dynamic model listing and availability checking (19+ BLABLADOR models currently available)

Installation

Basic Installation

pip install hellmholtz

Development Installation

For development with all optional dependencies:

git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz
pip install -e ".[eval,proxy]"

Poetry Installation

poetry install --with eval,proxy

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Configure your API keys in .env:
# OpenAI
OPENAI_API_KEY=your_openai_key

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key

# Google
GOOGLE_API_KEY=your_google_key

# Helmholtz Blablador
BLABLADOR_API_KEY=your_blablador_key
BLABLADOR_API_BASE=https://your-blablador-instance.com

# Optional: Default models
AISUITE_DEFAULT_MODELS='{"openai": "gpt-4o", "anthropic": "claude-3-haiku"}'
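
The package reads these settings from the environment at runtime. As a minimal illustration of that contract (not the actual loader in hellmholtz.core.config, and assuming python-dotenv is available), the same variables can be read by hand:

# Illustrative only: hellmholtz.core.config performs the real loading.
import json
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pick up the .env file from the working directory

blablador_key = os.environ["BLABLADOR_API_KEY"]
blablador_base = os.environ.get("BLABLADOR_API_BASE")  # optional override
default_models = json.loads(os.environ.get("AISUITE_DEFAULT_MODELS", "{}"))

print(default_models.get("openai"))  # e.g. "gpt-4o"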

Usage

Python API

Basic Chat Interface

from hellmholtz.client import chat

# Simple chat
response = chat("openai:gpt-4o", "Hello, how are you?")
print(response)

# With conversation history
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
response = chat("anthropic:claude-3-sonnet", messages)

Benchmarking

from hellmholtz.benchmark import run_benchmarks
from hellmholtz.core.prompts import load_prompts

# Load prompts from JSON file
prompts = load_prompts("prompts.json", category="reasoning")

# Run benchmarks
results = run_benchmarks(
    models=["openai:gpt-4o", "anthropic:claude-3-haiku", "blablador:gpt-4o"],
    prompts=prompts,
    temperatures=[0.1, 0.7, 1.0],
    replications=3
)

# Analyze results
from hellmholtz.evaluation_analysis import EvaluationAnalyzer
analyzer = EvaluationAnalyzer()
analysis = analyzer.analyze_evaluation_results("results/benchmark_latest.json")
analyzer.print_analysis_summary(analysis)

Command Line Interface

HeLLMholtz provides a comprehensive CLI for all operations:

Chat Interface

# Simple chat
hellm chat --model openai:gpt-4o "Explain the theory of relativity"

# Interactive mode
hellm chat --model anthropic:claude-3-sonnet --interactive

# With system prompt
hellm chat --model blablador:gpt-4o --system "You are a coding assistant" "Write a Python function to calculate fibonacci numbers"

Benchmarking

# Basic benchmark
hellm bench --models openai:gpt-4o,anthropic:claude-3-haiku --prompts-file prompts.txt

# Advanced benchmark with evaluation
hellm bench \
  --models openai:gpt-4o,blablador:gpt-4o \
  --prompts-file prompts.json \
  --prompts-category reasoning \
  --temperatures 0.1,0.7,1.0 \
  --replications 3 \
  --evaluate-with openai:gpt-4o \
  --results-dir results/

# Throughput testing
hellm bench-throughput \
  --model openai:gpt-4o \
  --requests 100 \
  --concurrency 10 \
  --prompt "Write a short story about AI"

Evaluation and Analysis

# Analyze benchmark results
hellm analyze results/benchmark_latest.json --html-report analysis_report.html

# Generate reports
hellm report --results-file results/benchmark_latest.json --output report.md

Model Management

# List available Blablador models
hellm models

# Monitor model availability and test accessibility
hellm monitor --test-accessibility

# Check model configuration consistency
hellm monitor --check-config

Weekly Automated Benchmarking

The repository includes a GitHub Actions workflow that automatically runs benchmarks weekly and updates reports (a trimmed-down sketch of such a workflow is shown at the end of this section):

  • Scheduled: Runs every Sunday at 00:00 UTC
  • Model Discovery: Automatically fetches latest Blablador models
  • Performance Charts: Generates visual charts comparing model performance
  • Multiple Formats: Creates HTML, Markdown, and PNG chart reports
  • Auto-commit: Updates reports in the repository for public viewing

To enable automated benchmarking:

  1. Set repository secrets for API keys:

    • BLABLADOR_API_KEY: Your Blablador API key
    • BLABLADOR_API_BASE: Blablador API base URL (optional)
  2. The workflow will automatically:

    • Run benchmarks on selected models
    • Generate performance reports
    • Create visual charts
    • Commit updated reports to the repository

Reports are available in the reports/ directory and include:

  • weekly_benchmark_report.html: Interactive HTML report
  • weekly_benchmark_report.md: Markdown summary
  • weekly_benchmark_chart.png: Performance visualization
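
For orientation, here is a trimmed-down sketch of what such a scheduled workflow might look like. The file name, model IDs, and individual steps are assumptions for illustration; the actual workflow in the repository may differ:

# .github/workflows/weekly-benchmark.yml -- illustrative sketch, not the repository's actual workflow
name: Weekly Benchmark
on:
  schedule:
    - cron: "0 0 * * 0"     # every Sunday at 00:00 UTC
  workflow_dispatch: {}     # allow manual runs

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install hellmholtz
      - name: Run weekly benchmark
        env:
          BLABLADOR_API_KEY: ${{ secrets.BLABLADOR_API_KEY }}
          BLABLADOR_API_BASE: ${{ secrets.BLABLADOR_API_BASE }}
        run: |
          # model IDs below are placeholders
          hellm bench --models blablador:gpt-oss-120b --prompts-file prompts.json --results-dir results/
      - name: Commit updated reports
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add reports/ && git commit -m "Update weekly benchmark reports" || true
          git push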

Advanced Features

# Run LM Evaluation Harness
hellm lm-eval \
  --model openai:gpt-4o \
  --tasks hellaswag,winogrande \
  --limit 100

# Start LiteLLM proxy server
hellm proxy \
  --config litellm_config.yaml \
  --port 8000
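
The proxy expects a standard LiteLLM configuration file. A minimal example of what litellm_config.yaml might contain (model names and routing are placeholders, not the package's shipped defaults):

# litellm_config.yaml -- minimal illustrative LiteLLM proxy config
model_list:
  - model_name: gpt-4o                  # alias that clients request
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: blablador-default       # route an alias to a Blablador deployment
    litellm_params:
      model: openai/gpt-oss-120b        # served through an OpenAI-compatible endpoint
      api_base: os.environ/BLABLADOR_API_BASE
      api_key: os.environ/BLABLADOR_API_KEY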

Project Structure

hellmholtz/
├── cli.py                 # Command-line interface
├── client.py              # Unified LLM client
├── monitoring.py          # Model availability monitoring
├── evaluation_analysis.py # Statistical analysis and reporting
├── export.py              # Result export utilities
├── core/
│   ├── config.py          # Configuration management
│   └── prompts.py         # Prompt loading and validation
├── benchmark/
│   ├── runner.py          # Benchmark execution
│   ├── evaluator.py       # LLM-as-a-Judge evaluation
│   └── prompts.py         # Benchmark-specific prompts
├── providers/
│   ├── blablador_provider.py # Custom Blablador provider
│   ├── blablador_config.py   # Blablador model configuration
│   ├── blablador.py          # Blablador utilities
│   └── __init__.py
├── reporting/
│   ├── html.py            # HTML report generation
│   ├── markdown.py        # Markdown report generation
│   ├── stats.py           # Statistical calculations
│   ├── utils.py           # Reporting utilities
│   └── templates/         # HTML templates
└── integrations/
    ├── lm_eval.py         # LM Evaluation Harness integration
    └── litellm.py         # LiteLLM proxy integration

Prompt System

HeLLMholtz supports two prompt formats:

Simple Text Format (prompts.txt)

What is the capital of France?
Explain quantum computing in simple terms.
Write a Python function to reverse a string.

Structured JSON Format (prompts.json)

[
  {
    "id": "capital-france",
    "category": "knowledge",
    "description": "Test basic geographical knowledge",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "expected_output": "Paris"
  },
  {
    "id": "quantum-explanation",
    "category": "reasoning",
    "description": "Test ability to explain complex concepts simply",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms."
      }
    ]
  }
]
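
Both formats are consumed through the load_prompts helper shown in the Benchmarking example above; category filtering applies to the JSON format. A short usage sketch (whether the plain-text loader takes exactly these arguments is an assumption):

from hellmholtz.core.prompts import load_prompts

# Plain-text file: one prompt per line
simple_prompts = load_prompts("prompts.txt")

# Structured JSON file, optionally filtered by category
reasoning_prompts = load_prompts("prompts.json", category="reasoning")

for prompt in reasoning_prompts:
    print(prompt)  # exact prompt object shape depends on the loader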

Evaluation System

The LLM-as-a-Judge evaluation system provides:

  • Automated Scoring: AI-powered evaluation of response quality
  • Statistical Analysis: Comprehensive metrics and distributions
  • Model Rankings: Performance comparisons across all dimensions
  • Interactive Reports: Web-based visualizations of results
  • Detailed Critiques: Specific feedback for each response
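
Conceptually, the judge is just another model call that scores a candidate response against its prompt. Below is a simplified sketch using the chat() helper from above; the real evaluator lives in hellmholtz.benchmark.evaluator, and its judging prompt and scoring schema may differ:

import json

from hellmholtz.client import chat

def judge(prompt: str, answer: str, judge_model: str = "openai:gpt-4o") -> dict:
    """Toy LLM-as-a-Judge call: returns a score and a short critique."""
    messages = [
        {"role": "system", "content": (
            "You are a strict evaluator. Reply with JSON only: "
            '{"score": <1-10>, "critique": "<one sentence>"}'
        )},
        {"role": "user", "content": f"Prompt:\n{prompt}\n\nCandidate answer:\n{answer}"},
    ]
    raw = chat(judge_model, messages)
    return json.loads(raw)  # in practice, validate and repair the JSON

verdict = judge("What is the capital of France?", "Paris is the capital of France.")
print(verdict["score"], verdict["critique"])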

Example Analysis Output

[Monitor] EVALUATION ANALYSIS RESULTS
══════════════════════════════════════════════════════════════

 OVERVIEW
• Total Evaluations: 150
• Models Tested: 3
• Prompts Tested: 5
• Success Rate: 94.7%

🏆 MODEL RANKINGS
1. openai:gpt-4o        - Avg Score: 8.7/10 (±0.8)
2. anthropic:claude-3-opus - Avg Score: 8.4/10 (±0.9)
3. blablador:gpt-4o     - Avg Score: 7.9/10 (±1.1)

 DETAILED METRICS
• Response Quality: 8.3/10 average
• Relevance: 8.6/10 average
• Accuracy: 9.1/10 average
• Creativity: 7.8/10 average

Latest Benchmark Results

Recent benchmarking results from the automated weekly workflow testing BLABLADOR models:

Model Performance Overview

Model                           Success Rate   Avg Latency   Avg Rating (1-10)   Rating Std Dev
GPT-OSS-120b                    100.0%         5.35s         8.5                 ±2.38
Ministral-3-14B-Instruct-2512   100.0%         9.55s         7.5                 ±3.70

Overall Statistics:

  • Total Evaluations: 8 across 4 different prompts
  • Models Tested: 2 BLABLADOR models
  • Overall Success Rate: 100.0%
  • Average Rating: 8.0/10
  • Average Latency: 7.45s

Key Findings

  • Top Performer: GPT-OSS-120b with highest rating (8.5/10) and fastest response time (5.35s)
  • Most Consistent: GPT-OSS-120b with lower rating variation (±2.38)
  • Performance Gap: 1.0 point difference between best and worst performing models
  • Model Availability: Both tested models are fully operational with 100% success rates

Evaluation Details

  • Prompt Categories: Reasoning, coding, and creative writing tasks
  • Temperature Testing: Multiple temperature settings (0.1, 0.7, 1.0) for response variation
  • LLM-as-a-Judge: Automated evaluation with detailed critiques and statistical analysis
  • Rating Distribution: GPT-OSS-120b received mostly 9-10 ratings, Ministral-3-14B showed more variation

Reports and Visualizations

Reports are automatically updated and include LLM-as-a-Judge evaluation with detailed statistical analysis and model rankings.

Development

Setup Development Environment

# Clone repository
git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz

# Install with development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

Running Tests

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=hellmholtz --cov-report=html

# Run specific test categories
poetry run pytest -m "slow"        # Slow integration tests
poetry run pytest -m "network"     # Tests requiring network access
poetry run pytest -m "model"       # Tests using actual models
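
For these -m selections to work, the markers have to be registered with pytest. A minimal sketch of one way to do that in a conftest.py (the project may instead declare them in pyproject.toml, and the marker descriptions are assumptions):

# conftest.py -- illustrative marker registration
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: slow integration tests")
    config.addinivalue_line("markers", "network: tests requiring network access")
    config.addinivalue_line("markers", "model: tests using actual models")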

Code Quality

# Lint code
poetry run ruff check .

# Format code
poetry run ruff format .

# Type checking
poetry run mypy src/

# Security scanning
poetry run bandit -r src/

Building Documentation

# Generate API documentation
poetry run sphinx-build docs/ docs/_build/

# Serve documentation locally
poetry run sphinx-serve docs/_build/

Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Run the full test suite: poetry run pytest
  5. Ensure code quality: poetry run ruff check . && poetry run mypy src/
  6. Commit your changes: git commit -m 'Add amazing feature'
  7. Push to the branch: git push origin feature/amazing-feature
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with love for the scientific computing community

Project details


Download files

Download the file for your platform.

Source Distribution

hellmholtz-0.3.0.tar.gz (62.0 kB)

Uploaded Source

Built Distribution

hellmholtz-0.3.0-py3-none-any.whl (72.6 kB)

Uploaded Python 3

File details

Details for the file hellmholtz-0.3.0.tar.gz.

File metadata

  • Download URL: hellmholtz-0.3.0.tar.gz
  • Upload date:
  • Size: 62.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hellmholtz-0.3.0.tar.gz
Algorithm     Hash digest
SHA256        6f39df98dc69ee5eb3260b906297eb7efcccb5aad7a83a2dce2288a4bfbe7b2e
MD5           ded16ac9c95d390c6d7a7b76c697deda
BLAKE2b-256   d6b7cb1fe4032f74dd6bfc75f1dd8c249f861d9be1479c570c44f04f93397be2

Provenance

The following attestation bundles were made for hellmholtz-0.3.0.tar.gz:

Publisher: publish.yml on JonasHeinickeBio/HeLLMholtz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file hellmholtz-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: hellmholtz-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 72.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hellmholtz-0.3.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        be58a3e906234b0aa4b869413d1768b2cf5e80731925d25daa9f25b25b24a5dd
MD5           3b6a8a6c29f39ff691270077984a887d
BLAKE2b-256   fea4c1ebb667eb349579b6fe43f88060600b37ab715dd90f8b9f17e57990c0d4

Provenance

The following attestation bundles were made for hellmholtz-0.3.0-py3-none-any.whl:

Publisher: publish.yml on JonasHeinickeBio/HeLLMholtz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
