
Project description

HeLLMholtz LLM Suite

A comprehensive Python package for unified LLM access, benchmarking, evaluation, and reporting. Built on top of aisuite with specialized support for Helmholtz Blablador models.

Features

  • Unified Client: Single interface for OpenAI, Google, Anthropic, Ollama, and Helmholtz Blablador models
  • Centralized Configuration: Environment-based configuration for all your projects
  • Advanced Benchmarking: Compare model performance across temperatures, replications, and prompt categories
  • LLM-as-a-Judge Evaluation: Automated evaluation with comprehensive statistical analysis
  • Interactive Reports: HTML reports with Chart.js visualizations and Markdown summaries
  • Flexible Prompt System: Support for both simple text files and structured JSON prompt collections
  • Model Monitoring: Track Blablador model availability and configuration consistency
  • LM Evaluation Harness: Integration with EleutherAI's comprehensive evaluation suite
  • LiteLLM Proxy: Built-in proxy server for model routing and load balancing
  • Throughput Testing: Performance benchmarking for high-throughput scenarios
  • Model Discovery: Dynamic model listing and availability checking (19+ BLABLADOR models currently available)

Installation

Basic Installation

pip install hellmholtz

Development Installation

For development with all optional dependencies:

git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz
pip install -e ".[eval,proxy]"

Poetry Installation

poetry install --with eval,proxy

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Configure your API keys in .env:
# OpenAI
OPENAI_API_KEY=your_openai_key

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key

# Google
GOOGLE_API_KEY=your_google_key

# Helmholtz Blablador
BLABLADOR_API_KEY=your_blablador_key
BLABLADOR_API_BASE=https://your-blablador-instance.com

# Optional: Default models
AISUITE_DEFAULT_MODELS='{"openai": "gpt-4o", "anthropic": "claude-3-haiku"}'
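
The package reads these settings from the environment at runtime. As a minimal illustration of that contract (not the actual loader in hellmholtz.core.config, and assuming python-dotenv is available), the same variables can be read by hand:

# Illustrative only: hellmholtz.core.config performs the real loading.
import json
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pick up the .env file from the working directory

blablador_key = os.environ["BLABLADOR_API_KEY"]
blablador_base = os.environ.get("BLABLADOR_API_BASE")  # optional override
default_models = json.loads(os.environ.get("AISUITE_DEFAULT_MODELS", "{}"))

print(default_models.get("openai"))  # e.g. "gpt-4o"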

Usage

Python API

Basic Chat Interface

from hellmholtz.client import chat

# Simple chat
response = chat("openai:gpt-4o", "Hello, how are you?")
print(response)

# With conversation history
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
response = chat("anthropic:claude-3-sonnet", messages)

Benchmarking

from hellmholtz.benchmark import run_benchmarks
from hellmholtz.core.prompts import load_prompts

# Load prompts from JSON file
prompts = load_prompts("prompts.json", category="reasoning")

# Run benchmarks
results = run_benchmarks(
    models=["openai:gpt-4o", "anthropic:claude-3-haiku", "blablador:gpt-4o"],
    prompts=prompts,
    temperatures=[0.1, 0.7, 1.0],
    replications=3
)

# Analyze results
from hellmholtz.evaluation_analysis import EvaluationAnalyzer
analyzer = EvaluationAnalyzer()
analysis = analyzer.analyze_evaluation_results("results/benchmark_latest.json")
analyzer.print_analysis_summary(analysis)

Command Line Interface

HeLLMholtz provides a comprehensive CLI for all operations:

Chat Interface

# Simple chat
hellm chat --model openai:gpt-4o "Explain the theory of relativity"

# Interactive mode
hellm chat --model anthropic:claude-3-sonnet --interactive

# With system prompt
hellm chat --model blablador:gpt-4o --system "You are a coding assistant" "Write a Python function to calculate fibonacci numbers"

Benchmarking

# Basic benchmark
hellm bench --models openai:gpt-4o,anthropic:claude-3-haiku --prompts-file prompts.txt

# Advanced benchmark with evaluation
hellm bench \
  --models openai:gpt-4o,blablador:gpt-4o \
  --prompts-file prompts.json \
  --prompts-category reasoning \
  --temperatures 0.1,0.7,1.0 \
  --replications 3 \
  --evaluate-with openai:gpt-4o \
  --results-dir results/

# Throughput testing
hellm bench-throughput \
  --model openai:gpt-4o \
  --requests 100 \
  --concurrency 10 \
  --prompt "Write a short story about AI"

Evaluation and Analysis

# Analyze benchmark results
hellm analyze results/benchmark_latest.json --html-report analysis_report.html

# Generate reports
hellm report --results-file results/benchmark_latest.json --output report.md

Model Management

# List available Blablador models
hellm models

# Monitor model availability and test accessibility
hellm monitor --test-accessibility

# Check model configuration consistency
hellm monitor --check-config

Weekly Automated Benchmarking

The repository includes a GitHub Actions workflow that automatically runs benchmarks weekly and updates reports (a trimmed-down sketch of such a workflow is shown at the end of this section):

  • Scheduled: Runs every Sunday at 00:00 UTC
  • Model Discovery: Automatically fetches latest Blablador models
  • Performance Charts: Generates visual charts comparing model performance
  • Multiple Formats: Creates HTML, Markdown, and PNG chart reports
  • Auto-commit: Updates reports in the repository for public viewing

To enable automated benchmarking:

  1. Set repository secrets for API keys:

    • BLABLADOR_API_KEY: Your Blablador API key
    • BLABLADOR_API_BASE: Blablador API base URL (optional)
  2. The workflow will automatically:

    • Run benchmarks on selected models
    • Generate performance reports
    • Create visual charts
    • Commit updated reports to the repository

Reports are available in the reports/ directory and include:

  • weekly_benchmark_report.html: Interactive HTML report
  • weekly_benchmark_report.md: Markdown summary
  • weekly_benchmark_chart.png: Performance visualization
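
For orientation, here is a trimmed-down sketch of what such a scheduled workflow might look like. The file name, model IDs, and individual steps are assumptions for illustration; the actual workflow in the repository may differ:

# .github/workflows/weekly-benchmark.yml -- illustrative sketch, not the repository's actual workflow
name: Weekly Benchmark
on:
  schedule:
    - cron: "0 0 * * 0"     # every Sunday at 00:00 UTC
  workflow_dispatch: {}     # allow manual runs

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install hellmholtz
      - name: Run weekly benchmark
        env:
          BLABLADOR_API_KEY: ${{ secrets.BLABLADOR_API_KEY }}
          BLABLADOR_API_BASE: ${{ secrets.BLABLADOR_API_BASE }}
        run: |
          # model IDs below are placeholders
          hellm bench --models blablador:gpt-oss-120b --prompts-file prompts.json --results-dir results/
      - name: Commit updated reports
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add reports/ && git commit -m "Update weekly benchmark reports" || true
          git push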

Advanced Features

# Run LM Evaluation Harness
hellm lm-eval \
  --model openai:gpt-4o \
  --tasks hellaswag,winogrande \
  --limit 100

# Start LiteLLM proxy server
hellm proxy \
  --config litellm_config.yaml \
  --port 8000
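
The proxy expects a standard LiteLLM configuration file. A minimal example of what litellm_config.yaml might contain (model names and routing are placeholders, not the package's shipped defaults):

# litellm_config.yaml -- minimal illustrative LiteLLM proxy config
model_list:
  - model_name: gpt-4o                  # alias that clients request
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: blablador-default       # route an alias to a Blablador deployment
    litellm_params:
      model: openai/gpt-oss-120b        # served through an OpenAI-compatible endpoint
      api_base: os.environ/BLABLADOR_API_BASE
      api_key: os.environ/BLABLADOR_API_KEY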

Project Structure

hellmholtz/
├── cli.py                 # Command-line interface
├── client.py              # Unified LLM client
├── monitoring.py          # Model availability monitoring
├── evaluation_analysis.py # Statistical analysis and reporting
├── export.py              # Result export utilities
├── core/
│   ├── config.py          # Configuration management
│   └── prompts.py         # Prompt loading and validation
├── benchmark/
│   ├── runner.py          # Benchmark execution
│   ├── evaluator.py       # LLM-as-a-Judge evaluation
│   └── prompts.py         # Benchmark-specific prompts
├── providers/
│   ├── blablador_provider.py # Custom Blablador provider
│   ├── blablador_config.py   # Blablador model configuration
│   ├── blablador.py          # Blablador utilities
│   └── __init__.py
├── reporting/
│   ├── html.py            # HTML report generation
│   ├── markdown.py        # Markdown report generation
│   ├── stats.py           # Statistical calculations
│   ├── utils.py           # Reporting utilities
│   └── templates/         # HTML templates
└── integrations/
    ├── lm_eval.py         # LM Evaluation Harness integration
    └── litellm.py         # LiteLLM proxy integration

Prompt System

HeLLMholtz supports two prompt formats:

Simple Text Format (prompts.txt)

What is the capital of France?
Explain quantum computing in simple terms.
Write a Python function to reverse a string.

Structured JSON Format (prompts.json)

[
  {
    "id": "capital-france",
    "category": "knowledge",
    "description": "Test basic geographical knowledge",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "expected_output": "Paris"
  },
  {
    "id": "quantum-explanation",
    "category": "reasoning",
    "description": "Test ability to explain complex concepts simply",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms."
      }
    ]
  }
]
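
Both formats are consumed through the load_prompts helper shown in the Benchmarking example above; category filtering applies to the JSON format. A short usage sketch (whether the plain-text loader takes exactly these arguments is an assumption):

from hellmholtz.core.prompts import load_prompts

# Plain-text file: one prompt per line
simple_prompts = load_prompts("prompts.txt")

# Structured JSON file, optionally filtered by category
reasoning_prompts = load_prompts("prompts.json", category="reasoning")

for prompt in reasoning_prompts:
    print(prompt)  # exact prompt object shape depends on the loader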

Evaluation System

The LLM-as-a-Judge evaluation system provides:

  • Automated Scoring: AI-powered evaluation of response quality
  • Statistical Analysis: Comprehensive metrics and distributions
  • Model Rankings: Performance comparisons across all dimensions
  • Interactive Reports: Web-based visualizations of results
  • Detailed Critiques: Specific feedback for each response
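
Conceptually, the judge is just another model call that scores a candidate response against its prompt. Below is a simplified sketch using the chat() helper from above; the real evaluator lives in hellmholtz.benchmark.evaluator, and its judging prompt and scoring schema may differ:

import json

from hellmholtz.client import chat

def judge(prompt: str, answer: str, judge_model: str = "openai:gpt-4o") -> dict:
    """Toy LLM-as-a-Judge call: returns a score and a short critique."""
    messages = [
        {"role": "system", "content": (
            "You are a strict evaluator. Reply with JSON only: "
            '{"score": <1-10>, "critique": "<one sentence>"}'
        )},
        {"role": "user", "content": f"Prompt:\n{prompt}\n\nCandidate answer:\n{answer}"},
    ]
    raw = chat(judge_model, messages)
    return json.loads(raw)  # in practice, validate and repair the JSON

verdict = judge("What is the capital of France?", "Paris is the capital of France.")
print(verdict["score"], verdict["critique"])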

Example Analysis Output

[Monitor] EVALUATION ANALYSIS RESULTS
══════════════════════════════════════════════════════════════

 OVERVIEW
• Total Evaluations: 150
• Models Tested: 3
• Prompts Tested: 5
• Success Rate: 94.7%

🏆 MODEL RANKINGS
1. openai:gpt-4o        - Avg Score: 8.7/10 (±0.8)
2. anthropic:claude-3-opus - Avg Score: 8.4/10 (±0.9)
3. blablador:gpt-4o     - Avg Score: 7.9/10 (±1.1)

 DETAILED METRICS
• Response Quality: 8.3/10 average
• Relevance: 8.6/10 average
• Accuracy: 9.1/10 average
• Creativity: 7.8/10 average

Latest Benchmark Results

Recent benchmarking results from the automated weekly workflow testing BLABLADOR models:

Model Performance Overview

Model                           Success Rate   Avg Latency   Avg Rating (1-10)   Rating Std Dev
GPT-OSS-120b                    100.0%         5.35s         8.5                 ±2.38
Ministral-3-14B-Instruct-2512   100.0%         9.55s         7.5                 ±3.70

Overall Statistics:

  • Total Evaluations: 8 across 4 different prompts
  • Models Tested: 2 BLABLADOR models
  • Overall Success Rate: 100.0%
  • Average Rating: 8.0/10
  • Average Latency: 7.45s

Key Findings

  • Top Performer: GPT-OSS-120b with highest rating (8.5/10) and fastest response time (5.35s)
  • Most Consistent: GPT-OSS-120b with lower rating variation (±2.38)
  • Performance Gap: 1.0 point difference between best and worst performing models
  • Model Availability: Both tested models are fully operational with 100% success rates

Evaluation Details

  • Prompt Categories: Reasoning, coding, and creative writing tasks
  • Temperature Testing: Multiple temperature settings (0.1, 0.7, 1.0) for response variation
  • LLM-as-a-Judge: Automated evaluation with detailed critiques and statistical analysis
  • Rating Distribution: GPT-OSS-120b received mostly 9-10 ratings, Ministral-3-14B showed more variation

Reports and Visualizations

Reports are automatically updated and include LLM-as-a-Judge evaluation with detailed statistical analysis and model rankings.

Development

Setup Development Environment

# Clone repository
git clone https://github.com/JonasHeinickeBio/HeLLMholtz.git
cd HeLLMholtz

# Install with development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

Running Tests

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=hellmholtz --cov-report=html

# Run specific test categories
poetry run pytest -m "slow"        # Slow integration tests
poetry run pytest -m "network"     # Tests requiring network access
poetry run pytest -m "model"       # Tests using actual models
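
For these -m selections to work, the markers have to be registered with pytest. A minimal sketch of one way to do that in a conftest.py (the project may instead declare them in pyproject.toml, and the marker descriptions are assumptions):

# conftest.py -- illustrative marker registration
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: slow integration tests")
    config.addinivalue_line("markers", "network: tests requiring network access")
    config.addinivalue_line("markers", "model: tests using actual models")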

Code Quality

# Lint code
poetry run ruff check .

# Format code
poetry run ruff format .

# Type checking
poetry run mypy src/

# Security scanning
poetry run bandit -r src/

Building Documentation

# Generate API documentation
poetry run sphinx-build docs/ docs/_build/

# Serve documentation locally
poetry run sphinx-serve docs/_build/

Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Run the full test suite: poetry run pytest
  5. Ensure code quality: poetry run ruff check . && poetry run mypy src/
  6. Commit your changes: git commit -m 'Add amazing feature'
  7. Push to the branch: git push origin feature/amazing-feature
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with love for the scientific computing community

Project details


Download files

Download the file for your platform.

Source Distribution

hellmholtz-0.3.0.tar.gz (62.0 kB)

Uploaded Source

Built Distribution

hellmholtz-0.3.0-py3-none-any.whl (72.6 kB)

Uploaded Python 3

File details

Details for the file hellmholtz-0.3.0.tar.gz.

File metadata

  • Download URL: hellmholtz-0.3.0.tar.gz
  • Upload date:
  • Size: 62.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hellmholtz-0.3.0.tar.gz
Algorithm     Hash digest
SHA256        6f39df98dc69ee5eb3260b906297eb7efcccb5aad7a83a2dce2288a4bfbe7b2e
MD5           ded16ac9c95d390c6d7a7b76c697deda
BLAKE2b-256   d6b7cb1fe4032f74dd6bfc75f1dd8c249f861d9be1479c570c44f04f93397be2

Provenance

The following attestation bundles were made for hellmholtz-0.3.0.tar.gz:

Publisher: publish.yml on JonasHeinickeBio/HeLLMholtz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file hellmholtz-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: hellmholtz-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 72.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hellmholtz-0.3.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        be58a3e906234b0aa4b869413d1768b2cf5e80731925d25daa9f25b25b24a5dd
MD5           3b6a8a6c29f39ff691270077984a887d
BLAKE2b-256   fea4c1ebb667eb349579b6fe43f88060600b37ab715dd90f8b9f17e57990c0d4

Provenance

The following attestation bundles were made for hellmholtz-0.3.0-py3-none-any.whl:

Publisher: publish.yml on JonasHeinickeBio/HeLLMholtz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
