AI Energy Benchmarks
Modular benchmarking framework for AI energy measurements - POC Phase
A modular benchmarking framework for measuring AI model energy consumption and carbon emissions across different inference backends.
This release has tested support for the PyTorch backend and initial support for the vLLM backend. Some features outlined here may still be in development, so please contact the maintainers if you have questions.
Overview
AI Energy Benchmarks provides a flexible, backend-agnostic framework for measuring the energy footprint of AI models during inference. The framework supports multiple backends and integrates with CodeCarbon for accurate emissions tracking.
Key Features:
- Multiple Backends: PyTorch for model comparison, vLLM for production deployment testing
- Energy Tracking: Integrated CodeCarbon metrics for energy consumption and CO₂ emissions
- Flexible Configuration: YAML-based configuration following Hydra/OmegaConf patterns
- Dataset Integration: Built-in support for HuggingFace datasets
- Reasoning Format Support: Automatic detection and formatting for reasoning-capable models (gpt-oss, DeepSeek, SmolLM, Qwen, etc.)
- Multi-GPU Support: Comprehensive multi-GPU support for large models
- Modular Design: Easy to extend with new backends, metrics, or reporters
- Docker Support: Containerized deployment for reproducible benchmarks
Understanding Backends
The framework is built around a backend-agnostic architecture with two primary backends, each serving different use cases:
PyTorch Backend: Model Comparison & Research
Purpose: Direct model inference for comparing different models head-to-head
Key Characteristics:
- ✅ Direct model loading from HuggingFace or local paths
- ✅ Full control over model configuration (quantization, device mapping, etc.)
- ✅ Multi-GPU support with automatic model sharding
- ✅ Measures raw model performance without serving overhead
- ✅ Ideal for controlled experiments
Best For:
- Comparing energy efficiency of different models (e.g., GPT-2 vs Llama vs Mistral)
- Testing model variants (quantized, pruned, distilled models)
- Research and development workflows
- Evaluating model optimizations
- Multi-model head-to-head comparisons
Example Use Case:
# Compare energy efficiency of a small model (Phi-2) vs large model (Llama-2-70B)
./run_benchmark.sh configs/pytorch_test.yaml # Uses microsoft/phi-2 (2.7B)
./run_benchmark.sh configs/pytorch_multigpu.yaml # Uses meta-llama/Llama-2-70b-hf
vLLM Backend: Production Deployment Testing
Purpose: Connect to existing vLLM serving infrastructure to measure production workloads
Key Characteristics:
- ✅ Connects to running vLLM servers via HTTP
- ✅ Measures real production serving patterns
- ✅ Includes serving infrastructure overhead
- ✅ Tests production-like configurations
- ✅ No model loading required (uses existing server)
Best For:
- Benchmarking production vLLM deployments
- Measuring serving infrastructure efficiency
- Testing production workload patterns
- Optimizing deployment configurations
- Production capacity planning
Example Use Case:
# Start vLLM server (production config)
vllm serve openai/gpt-oss-120b --port 8000
# Benchmark the deployment
./run_benchmark.sh configs/gpt_oss_120b.yaml
Choosing the Right Backend
| Use Case | Backend | Why |
|---|---|---|
| Compare GPT-4 vs Llama 3 energy efficiency | PyTorch | Direct model comparison in controlled environment |
| Measure production vLLM deployment | vLLM | Real-world serving metrics with infrastructure overhead |
| Test quantized vs full-precision models | PyTorch | Need control over model loading and configuration |
| Benchmark serving infrastructure | vLLM | Production-like conditions and serving patterns |
| Multi-model evaluation (5+ models) | PyTorch | Easy model switching without server restarts |
| Production optimization and tuning | vLLM | Actual deployment metrics and configurations |
| Research paper experiments | PyTorch | Reproducible, controlled benchmarking |
| Capacity planning for production | vLLM | Real-world throughput and latency patterns |
Important Note on Comparisons:
- Results between PyTorch and vLLM backends are not directly comparable
- PyTorch measures raw model performance
- vLLM includes serving infrastructure overhead (batching, scheduling, HTTP, etc.)
- Always use the same backend for fair model comparisons
Quick Start
Prerequisites
System Requirements:
- Python 3.10 or higher
- NVIDIA GPU with CUDA support (for GPU benchmarks)
- Docker (optional, for containerized deployment)
- 8GB+ RAM
- 50GB+ disk space for models
Software Dependencies:
- For vLLM backend: vLLM server
- For PyTorch backend: PyTorch and transformers
- CodeCarbon (for emissions tracking)
Installation
Option 1: Standard Installation
# Clone or navigate to the repository
cd ai_energy_benchmarks
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Basic installation (vLLM backend only)
pip install -e .
# With PyTorch backend support
pip install -e ".[pytorch]"
# With all dependencies (development + testing)
pip install -e ".[all]"
# Verify installation
python -c "import ai_energy_benchmarks; print('Installation successful!')"
Option 2: Install from PyPI (Production/Docker)
For production deployments or Docker images, install directly from PyPI:
# Basic installation (vLLM backend only)
pip install ai_energy_benchmarks
# With PyTorch backend support
pip install ai_energy_benchmarks[pytorch]
# With all dependencies (development + testing)
pip install ai_energy_benchmarks[all]
Your First Benchmark
Choose your path based on your use case:
Option A: PyTorch Backend (Model Comparison)
No server setup required - direct model inference:
# Run benchmark with PyTorch backend
./run_benchmark.sh configs/pytorch_test.yaml
# View results
cat results/pytorch_test_results.csv
What happened:
- Downloaded microsoft/phi-2 model from HuggingFace (2.7B parameters)
- Ran inference on 3 test prompts from AIEnergyScore/text_generation dataset
- Measured energy consumption and emissions (disabled in test config)
- Saved results to CSV
Option B: vLLM Backend (Production Deployment)
Requires running vLLM server first:
# Terminal 1: Start vLLM server
vllm serve openai/gpt-oss-120b \
--port 8000 \
--gpu-memory-utilization 0.9
# Wait for "Application startup complete" message
# Terminal 2: Run benchmark
./run_benchmark.sh configs/gpt_oss_120b.yaml
# View results
cat results/gpt_oss_120b_results.csv
What happened:
- Connected to running vLLM server
- Sent prompts via HTTP API
- Measured end-to-end serving performance
- Tracked energy and emissions
Understanding Your Results
Results are saved in CSV format with metrics like:
timestamp,name,backend,model,total_prompts,successful_prompts,energy_wh,emissions_g_co2eq,avg_latency_s
2025-10-27T12:00:00,pytorch_backend_test,pytorch,microsoft/phi-2,3,3,0.15,0.04,1.23
Key metrics:
- energy_wh: Energy consumed in watt-hours
- emissions_g_co2eq: CO₂ emissions in grams
- total_prompts: Number of prompts processed
- avg_latency_s: Average response time
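For quick post-processing, the results CSV can be loaded with pandas. A minimal sketch (assuming the column names shown above) that derives energy per prompt:

import pandas as pd

# Load the results written by the CSV reporter
df = pd.read_csv("results/pytorch_test_results.csv")

# Derive per-prompt energy and print a compact summary
df["wh_per_prompt"] = df["energy_wh"] / df["total_prompts"]
print(df[["model", "energy_wh", "wh_per_prompt", "avg_latency_s"]])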
Usage Modes
The framework supports multiple ways to run benchmarks:
1. Shell Script Mode (Recommended)
Simplest way to run benchmarks:
# Run with default config (gpt_oss_120b.yaml - requires vLLM server)
./run_benchmark.sh
# Run with specific config
./run_benchmark.sh configs/pytorch_test.yaml
# Run with custom config path
./run_benchmark.sh /path/to/my/config.yaml
2. Python API Mode
Programmatic access for integration:
from ai_energy_benchmarks.runner import run_benchmark_from_config
# Basic usage
results = run_benchmark_from_config('configs/pytorch_test.yaml')
print(f"Energy consumed: {results['summary']['total_energy_wh']} Wh")
# With configuration overrides
overrides = {
'scenario': {
'num_samples': 20 # Override num_samples
},
'backend': {
'model': 'gpt2-medium' # Override model
}
}
results = run_benchmark_from_config('configs/base.yaml', overrides=overrides)
3. Docker Compose Mode
For containerized deployments:
Standard Compose (with integrated Ollama server):
# Set environment variables
export AI_MODEL=llama3.2
export GPU_MODEL=h100
# Run benchmark
docker compose up
# View results
cat benchmark_output/results.csv
POC Compose (with external vLLM server):
# Start vLLM server on host first
vllm serve openai/gpt-oss-120b --port 8000
# Set environment
export VLLM_ENDPOINT=http://host.docker.internal:8000/v1
export CONFIG_FILE=configs/gpt_oss_120b.yaml
# Run benchmark
docker compose -f docker-compose.poc.yml up
# View results
cat results/gpt_oss_120b_results.csv
4. Docker Run Mode
Direct Docker container execution:
# Build image
docker build -t ai-energy-benchmark .
# Run benchmark
docker run --gpus all \
-v $(pwd)/configs:/app/configs:ro \
-v $(pwd)/results:/app/results \
-v $(pwd)/emissions:/app/emissions \
--network host \
ai-energy-benchmark \
./run_benchmark.sh configs/pytorch_test.yaml
Configuration
Benchmarks are configured using YAML files. The framework follows a Hydra/OmegaConf-inspired configuration pattern.
Configuration Structure
Complete example showing all sections:
name: my_benchmark
backend:
type: pytorch # or vllm
# ... backend-specific settings
scenario:
dataset_name: AIEnergyScore/text_generation
num_samples: 100
# ... scenario settings
metrics:
type: codecarbon
enabled: true
# ... metrics settings
reporter:
type: csv
output_file: "./results/results.csv"
output_dir: ./benchmark_output
Backend Configuration
PyTorch Backend - For Model Comparison
When to use: Comparing models, research, development, controlled experiments
Single GPU Configuration:
backend:
type: pytorch
model: gpt2 # HuggingFace model name or local path
device: cuda
device_ids: [0] # Use GPU 0
task: text-generation
Supported Models:
- The goal is to support most popular models on Hugging Face.
- Small models: gpt2, gpt2-medium, facebook/opt-125m
- Medium models: facebook/opt-1.3b, EleutherAI/gpt-neo-1.3B
- Large models: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1
- Very large models (multi-GPU): meta-llama/Llama-2-70b-hf, tiiuae/falcon-180B
Multi-GPU Configuration (for large models):
backend:
type: pytorch
model: meta-llama/Llama-2-70b-hf
device: cuda
device_ids: [0, 1, 2, 3] # Use 4 GPUs
device_map: auto # Automatically distribute model across GPUs
torch_dtype: auto # Auto-select optimal dtype (float16/bfloat16)
# Optional: Limit memory per GPU to prevent OOM
max_memory:
0: "20GB"
1: "20GB"
2: "20GB"
3: "20GB"
Device Map Strategies:
| Strategy | Description | Best For |
|---|---|---|
| auto | Automatically balance layers across GPUs | Recommended - works for most models |
| balanced | Evenly distribute layers | Models with uniform layer sizes |
| balanced_low_0 | Balance across GPUs, minimize GPU 0 | When GPU 0 runs other processes |
| sequential | Fill GPUs sequentially (0 first, then 1, etc.) | Debugging or specific hardware configs |
Advanced PyTorch Options:
backend:
type: pytorch
model: meta-llama/Llama-2-13b-hf
device: cuda
device_ids: [0, 1]
# Model loading options
torch_dtype: float16 # or bfloat16, float32
load_in_8bit: false # Enable 8-bit quantization
load_in_4bit: false # Enable 4-bit quantization
trust_remote_code: false # Allow custom model code
# Memory management
device_map: auto
max_memory:
0: "24GB"
1: "24GB"
# Performance tuning
use_cache: true # Enable KV cache
pad_token_id: 0 # Set padding token
Use Cases:
- ✅ Compare energy efficiency of different model sizes
- ✅ Test quantized vs full-precision models
- ✅ Evaluate model variants (base vs instruction-tuned)
- ✅ Research experiments with controlled variables
- ✅ Multi-model benchmarking
vLLM Backend - For Production Deployments
When to use: Production benchmarks, serving infrastructure testing, deployment analysis
Configuration:
backend:
type: vllm
endpoint: "http://localhost:8000/v1"
model: openai/gpt-oss-120b # Must match vLLM server model
vLLM Server Setup:
# Basic vLLM server
vllm serve openai/gpt-oss-120b --port 8000
# Production-like configuration
vllm serve nvidia/Llama-3.3-70B-Instruct-FP8 \
--port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 256 \
--dtype float16
# With specific GPU devices
vllm serve meta-llama/Llama-2-70b-hf \
--port 8000 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
Docker Network Configuration:
When benchmarking from Docker container to host vLLM server:
backend:
type: vllm
endpoint: "http://host.docker.internal:8000/v1" # Docker → host
model: openai/gpt-oss-120b
Use Cases:
- ✅ Benchmark production vLLM deployments
- ✅ Measure serving infrastructure efficiency
- ✅ Test production workload patterns
- ✅ Optimize vLLM configuration parameters
- ✅ Capacity planning for production
Important Notes:
- vLLM server must be running before benchmark starts
- Model name in config must match the server's loaded model
- Endpoint must be accessible from benchmark environment
- Results include serving overhead (batching, scheduling, HTTP)
GenAI-Perf Load Profiling (ai-energy-profile CLI)
The ai-energy-profile CLI provides a streamlined interface for running load profiles using NVIDIA's genai-perf tool against vLLM or OpenAI-compatible endpoints.
Installation:
# Basic installation
pip install -e .
# With profiling dependencies (pandas for result formatting)
pip install -e ".[profiling]"
# Or from PyPI
pip install ai_energy_benchmarks[profiling]
Basic Usage:
# Run a light load test (20 requests)
ai-energy-profile --profile light --model my-model
# Run against a custom endpoint
ai-energy-profile --profile moderate --model my-model --endpoint http://my-server:8000/v1
# Run with reproducible inputs using a seed
ai-energy-profile --profile heavy --model my-model --seed 42
Available Profiles:
| Profile | Requests | Concurrency | Description |
|---|---|---|---|
| light | 20 | 2 | Light load - 10-20% GPU utilization |
| moderate | 40 | 4 | Moderate load - 40-50% GPU utilization |
| heavy | 80 | 8 | Heavy load - 70-80% GPU utilization |
| stress | 240 | 24 | Stress test - 90-100% GPU utilization |
| multiphase | 78 | varies | Multi-phase workload with variability |
| pattern | varies | varies | Multi-phase pattern test |
| power_test | varies | varies | Extended phases for power measurement |
Authentication (--api-key):
For authenticated API endpoints, use the --api-key flag:
# Connect to an authenticated API endpoint
ai-energy-profile --profile light \
--model meta-llama/Llama-3.3-70B-Instruct \
--endpoint https://api.neuralwatt.com/v1 \
--api-key YOUR_API_KEY
The API key is passed as a Bearer token in the Authorization header.
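The endpoints are OpenAI-compatible, so credentials can be verified independently of the profiler. A small sketch using requests (the endpoint, model, and key are placeholders, not part of the framework):

import requests

resp = requests.post(
    "https://api.neuralwatt.com/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # same value as --api-key
    json={
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])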
Endpoint Types (--endpoint-type):
Different APIs support different endpoint types:
| Endpoint Type | API Path | Use For |
|---|---|---|
| chat (default) | /v1/chat/completions | OpenAI-compatible chat APIs |
| completions | /v1/completions | Legacy completions APIs |
# Use chat completions endpoint (default)
ai-energy-profile --profile light --model my-model --endpoint-type chat
# Use legacy completions endpoint
ai-energy-profile --profile light --model my-model --endpoint-type completions
All CLI Options:
ai-energy-profile --help
Options:
--profile {light,moderate,heavy,stress,multiphase,pattern,power_test}
Load profile to use (default: moderate)
--endpoint ENDPOINT API endpoint URL (default: http://localhost:8000/v1)
--model MODEL Model name (required)
--output-dir DIR Output directory for results (default: ./benchmark_output)
--seed SEED Random seed for reproducible inputs and outputs
--api-key API_KEY API key for authenticated endpoints (Bearer token)
--endpoint-type {chat,completions}
Endpoint type (default: chat)
--run-id-suffix SUFFIX
Suffix to append to episode RunId for differentiation
--prompts-file FILE Path to custom prompts file
Example: Testing Against Remote API:
# Test against NeuralWatt API with authentication
ai-energy-profile --profile light \
--model Qwen/Qwen3-Coder-480B-A35B-Instruct \
--endpoint https://api.neuralwatt.com/v1 \
--api-key sk-your-api-key-here \
--seed 42
Example: Multi-Phase Workload:
# Run multi-phase profile (light → moderate → stress)
ai-energy-profile --profile multiphase \
--model my-model \
--endpoint http://localhost:8000/v1
Scenario Configuration
Controls the benchmark workload and generation parameters:
scenario:
# Dataset configuration
dataset_name: AIEnergyScore/text_generation # HuggingFace dataset
text_column_name: text # Column containing prompts
num_samples: 100 # Number of prompts to process
truncation: true # Truncate long prompts
# Input configuration
input_shapes:
batch_size: 1 # Batch size for inference
# Generation parameters
generate_kwargs:
max_new_tokens: 100 # Maximum tokens to generate
min_new_tokens: 50 # Minimum tokens to generate
temperature: 0.7 # Sampling temperature
top_p: 0.9 # Nucleus sampling threshold
top_k: 50 # Top-k sampling
do_sample: true # Enable sampling (vs greedy)
Common Datasets:
- AIEnergyScore/text_generation - General text generation prompts
- openai/gsm8k - Math reasoning tasks
- tatsu-lab/alpaca - Instruction following
- Your custom dataset on HuggingFace
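Datasets are pulled through the HuggingFace datasets library, so you can preview what a scenario will feed the model before running a full benchmark. A short sketch (the split name is an assumption and may differ per dataset):

from datasets import load_dataset

ds = load_dataset("AIEnergyScore/text_generation", split="train")
prompts = ds["text"][:5]  # matches text_column_name in the scenario config
for p in prompts:
    print(p[:80], "...")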
Workload Profiles:
Light workload (testing):
scenario:
num_samples: 10
generate_kwargs:
max_new_tokens: 50
Medium workload:
scenario:
num_samples: 100
generate_kwargs:
max_new_tokens: 100
Heavy workload (production-like):
scenario:
num_samples: 1000
generate_kwargs:
max_new_tokens: 200
Reasoning Parameters
The framework includes a unified reasoning format system that automatically detects and formats prompts for reasoning-capable models.
Supported Model Families
| Model Family | Format Type | Configuration | Example Models |
|---|---|---|---|
| gpt-oss (OpenAI) | Harmony | reasoning_effort: high/medium/low | openai/gpt-oss-20b, openai/gpt-oss-120b |
| SmolLM3 (HuggingFace) | System Prompt | enable_thinking: true | HuggingFaceTB/SmolLM3-3B |
| DeepSeek-R1 | Prefix + Parameter | enable_thinking: true, thinking_budget: 1000 | deepseek-ai/DeepSeek-R1 |
| Qwen (Alibaba) | Parameter | enable_thinking: true | Qwen/Qwen2.5-72B-Instruct |
| Hunyuan (Tencent) | System Prompt | enable_thinking: true | tencent/Hunyuan-1.8B-Instruct |
| Nemotron (NVIDIA) | System Prompt | disable_thinking: true to disable | nvidia/Nemotron-* (default enabled) |
| EXAONE (LG) | Parameter | enable_thinking: true | LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct |
| Phi (Microsoft) | Parameter | enable_thinking: true | microsoft/phi-* |
| Gemma (Google) | Parameter | enable_thinking: true | google/gemma-* |
Reasoning Configuration Examples
gpt-oss Models (Harmony Format):
backend:
type: vllm
endpoint: "http://localhost:8000/v1"
model: openai/gpt-oss-20b
scenario:
reasoning_params:
reasoning_effort: high # Options: low, medium, high
SmolLM3 (System Prompt):
backend:
type: pytorch
model: HuggingFaceTB/SmolLM3-3B
device: cuda
scenario:
reasoning_params:
enable_thinking: true
DeepSeek-R1 (Prefix + Parameter):
backend:
type: pytorch
model: deepseek-ai/DeepSeek-R1
device: cuda
scenario:
reasoning_params:
enable_thinking: true
thinking_budget: 1000 # Token budget for reasoning
Qwen (Parameter-based):
backend:
type: pytorch
model: Qwen/Qwen2.5-72B-Instruct
device: cuda
device_ids: [0, 1, 2, 3]
scenario:
reasoning_params:
enable_thinking: true
How Reasoning Formats Work
- Automatic Detection: The FormatterRegistry detects the model type from the model name
- Format Selection: The appropriate formatter is selected from ai_energy_benchmarks/config/reasoning_formats.yaml
- Prompt Formatting: The formatter modifies the prompt and/or generation parameters
- Backward Compatibility: The legacy use_harmony parameter still works (deprecated)
Works with both PyTorch and vLLM backends!
Adding New Reasoning Models
To add support for a new reasoning model, simply update reasoning_formats.yaml:
families:
new-model-family:
patterns:
- "company/new-model"
- "company/new-model-v2"
type: system_prompt # or harmony, parameter, prefix
enable_flag: "/reason"
disable_flag: "/no_reason"
default_enabled: false
description: "New reasoning model using /reason flags"
No code changes required! The system automatically picks up the configuration.
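Conceptually, the registry only has to match the model name against the patterns listed for each family. A hedged sketch of that idea (an illustration, not the framework's actual lookup code):

import fnmatch
import yaml

def find_family(model_name: str, formats_path: str = "reasoning_formats.yaml"):
    """Return the first family whose patterns match the given model name."""
    with open(formats_path) as f:
        families = yaml.safe_load(f)["families"]
    for family, spec in families.items():
        for pattern in spec["patterns"]:
            # Entries may be exact prefixes or wildcards, e.g. "nvidia/Nemotron-*"
            if fnmatch.fnmatch(model_name, pattern) or model_name.startswith(pattern):
                return family
    return None

print(find_family("company/new-model-v2"))  # -> "new-model-family"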
Metrics Configuration
Controls energy and performance metrics collection via CodeCarbon:
metrics:
type: codecarbon
enabled: true
project_name: "my_benchmark"
output_dir: "./emissions"
country_iso_code: "USA"
region: null # or specific region like "california"
Supported Carbon Regions:
# United States
country_iso_code: "USA"
region: null # US average
# or region: "california", "texas", "new_york", etc.
# Europe
country_iso_code: "FRA" # France
country_iso_code: "DEU" # Germany
country_iso_code: "GBR" # United Kingdom
# Other regions
country_iso_code: "CAN" # Canada
country_iso_code: "CHN" # China
country_iso_code: "IND" # India
See CodeCarbon documentation for full list.
Metrics Collected:
- Energy consumption (kWh)
- CO₂ emissions (kg CO₂eq)
- GPU power draw (W)
- CPU power draw (W)
- RAM power draw (W)
- Carbon intensity of electricity grid (g CO₂/kWh)
Reporter Configuration
Controls how results are output:
reporter:
type: csv # Currently only CSV supported
output_file: "./results/benchmark_results.csv"
CSV Output Columns:
- timestamp - ISO 8601 timestamp
- name - Benchmark name
- backend - Backend type (pytorch/vllm)
- model - Model name
- total_prompts - Total prompts processed
- successful_prompts - Successfully processed prompts
- failed_prompts - Failed prompts
- energy_wh - Energy consumed (Wh)
- emissions_g_co2eq - CO₂ emissions (g)
- avg_latency_s - Average latency (seconds)
- throughput_prompts_per_sec - Throughput
- gpu_stats_* - Per-GPU metrics (PyTorch backend only)
Environment Variable Overrides
You can use environment variables in config files:
backend:
endpoint: "${VLLM_ENDPOINT:-http://localhost:8000/v1}"
model: "${MODEL_NAME:-openai/gpt-oss-120b}"
scenario:
num_samples: "${NUM_SAMPLES:-100}"
Then set environment variables:
export VLLM_ENDPOINT=http://my-server:8000/v1
export MODEL_NAME=meta-llama/Llama-3-70b
export NUM_SAMPLES=500
./run_benchmark.sh configs/example.yaml
Or inline:
VLLM_ENDPOINT=http://localhost:8001/v1 ./run_benchmark.sh configs/example.yaml
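The ${VAR:-default} syntax follows shell-style defaults: the environment value is used if set, otherwise the fallback after :-. A minimal sketch of how such placeholders can be expanded before parsing the YAML (an illustration of the pattern, not necessarily the framework's implementation):

import os
import re

_VAR = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(text: str) -> str:
    """Replace ${VAR:-default} with the environment value or the default."""
    return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(2) or ""), text)

print(expand_env("endpoint: ${VLLM_ENDPOINT:-http://localhost:8000/v1}"))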
Common Workflows
Model Comparison Workflow (PyTorch Backend)
Compare energy efficiency of different models:
# Step 1: Create configs for each model
# configs/compare_phi2.yaml
name: phi2_comparison
backend:
type: pytorch
model: microsoft/phi-2
device: cuda
device_ids: [0]
scenario:
num_samples: 100
# configs/compare_llama7b.yaml
name: llama7b_comparison
backend:
type: pytorch
model: meta-llama/Llama-2-7b-hf
device: cuda
device_ids: [0]
scenario:
num_samples: 100
# Step 2: Run benchmarks
./run_benchmark.sh configs/compare_phi2.yaml
./run_benchmark.sh configs/compare_llama7b.yaml
# Step 3: Compare results
python -c "
import pandas as pd
phi2 = pd.read_csv('results/phi2_results.csv')
llama = pd.read_csv('results/llama7b_results.csv')
print('Phi-2 (2.7B) Energy:', phi2['energy_wh'].iloc[0], 'Wh')
print('Llama-7B Energy:', llama['energy_wh'].iloc[0], 'Wh')
"
Multi-model comparison script:
# Compare multiple models in one go
for model in "microsoft/phi-2" "HuggingFaceTB/SmolLM3-3B" "meta-llama/Llama-2-7b-hf"; do
echo "Benchmarking $model..."
BENCHMARK_MODEL=$model ./run_benchmark.sh configs/pytorch_test.yaml
done
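The same comparison can also be driven from Python using the override mechanism shown earlier, which avoids keeping one YAML file per model. A sketch based on the API and result keys documented in this README:

from ai_energy_benchmarks.runner import run_benchmark_from_config

models = ["microsoft/phi-2", "HuggingFaceTB/SmolLM3-3B", "meta-llama/Llama-2-7b-hf"]
for model in models:
    results = run_benchmark_from_config(
        "configs/pytorch_test.yaml",
        overrides={"backend": {"model": model}, "scenario": {"num_samples": 100}},
    )
    summary = results["summary"]
    print(f"{model}: {summary['total_energy_wh']:.3f} Wh, "
          f"{summary['avg_latency_s']:.2f} s/prompt")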
Understanding Results
Output Files
Benchmarks generate several output files:
results/
  benchmark_results.csv      # Main results file
emissions/
  emissions.csv              # CodeCarbon emissions tracking
  emissions_TIMESTAMP.csv    # Per-run emissions
benchmark_output/
  benchmark.log              # Execution logs
  debug_info.json            # Debug information
Project Structure
ai_energy_benchmarks/
├── ai_energy_benchmarks/ # Main package
│ ├── backends/ # Inference backend implementations
│ │ ├── base.py # Abstract backend base class
│ │ ├── vllm.py # vLLM backend
│ │ └── pytorch.py # PyTorch backend
│ ├── formatters/ # Reasoning format handlers
│ │ ├── base.py # Abstract formatter base
│ │ ├── harmony.py # Harmony formatter (gpt-oss)
│ │ ├── system_prompt.py # System prompt formatter
│ │ ├── parameter.py # Parameter-based formatter
│ │ ├── prefix.py # Prefix/suffix formatter
│ │ └── registry.py # Formatter registry
│ ├── config/ # Configuration files
│ │ ├── parser.py # Config parsing
│ │ └── reasoning_formats.yaml # Model format registry
│ ├── datasets/ # Dataset loaders
│ │ └── loader.py # HuggingFace dataset integration
│ ├── metrics/ # Metrics collectors
│ │ └── codecarbon.py # CodeCarbon integration
│ ├── reporters/ # Result reporters
│ │ └── csv_reporter.py # CSV output
│ ├── utils/ # Utility functions
│ │ ├── gpu.py # GPU utilities
│ │ └── logging.py # Logging setup
│ └── runner.py # Main benchmark runner
├── configs/ # Example configurations
│ ├── gpt_oss_120b.yaml # vLLM backend example
│ ├── pytorch_test.yaml # PyTorch single GPU
│ ├── pytorch_multigpu.yaml # PyTorch multi-GPU
│ └── pytorch_validation.yaml # Validation config
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── test_formatters.py # Formatter tests
├── results/ # Benchmark results output
├── emissions/ # CodeCarbon emissions data
├── ai_helpers/ # Development and testing scripts
├── run_benchmark.sh # Main runner script
├── build_wheel.sh # Wheel building script
├── docker-compose.yml # Standard Docker Compose
├── docker-compose.poc.yml # POC Docker Compose
├── Dockerfile # Standard Dockerfile
├── Dockerfile.poc # POC Dockerfile
├── setup.py # Package setup
├── pyproject.toml # Project metadata
├── requirements.txt # Dependencies
└── README.md # This file
Key Modules:
- backends/: Backend implementations (add new backends here)
- formatters/: Reasoning format handlers (extensible via config)
- config/: Configuration parsing and reasoning format registry
- datasets/: Dataset loading and preprocessing
- metrics/: Metrics collection (CodeCarbon, custom metrics)
- reporters/: Results output (CSV, JSON, etc.)
- runner.py: Main orchestration logic
Development
Setting Up Development Environment
# Clone repository
cd ai_energy_benchmarks
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install in development mode with all dependencies
pip install -e ".[all]"
# Install pre-commit hooks (optional but recommended)
pip install pre-commit
pre-commit install
Development Dependencies:
- pytest: Testing framework
- pytest-cov: Coverage reporting
- ruff: Linting
- mypy: Type checking
- black: Code formatting
- pre-commit: Git hooks
Running Tests
All Tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run with coverage
pytest --cov=ai_energy_benchmarks --cov-report=html
# Open coverage report
open htmlcov/index.html # macOS
xdg-open htmlcov/index.html # Linux
Specific Test Categories
# Unit tests only
pytest tests/unit/
# Integration tests only
pytest tests/integration/
# Specific test file
pytest tests/unit/test_vllm_backend.py
# Specific test function
pytest tests/unit/test_vllm_backend.py::TestVLLMBackend::test_initialization
# Tests matching pattern
pytest -k "test_reasoning"
Test Markers
# Run only fast tests (skip slow integration tests)
pytest -m "not integration"
# Run only integration tests
pytest -m integration
# Run with specific markers
pytest -m "pytorch"
pytest -m "vllm"
Debugging Tests
# Show print statements
pytest -s
# Show full traceback
pytest --tb=long
# Drop into debugger on failure
pytest --pdb
# Stop on first failure
pytest -x
Code Quality
The project uses multiple tools to ensure code quality:
Linting with Ruff
# Check all code
ruff check ai_energy_benchmarks/
# Check specific files
ruff check ai_energy_benchmarks/backends/
# Auto-fix issues
ruff check --fix ai_energy_benchmarks/
# Show all violations
ruff check --show-fixes ai_energy_benchmarks/
Type Checking with MyPy
# Type check all code
mypy ai_energy_benchmarks/
# Type check specific module
mypy ai_energy_benchmarks/backends/
# Strict mode
mypy --strict ai_energy_benchmarks/
Code Formatting with Ruff
# Check formatting
ruff format --check ai_energy_benchmarks/
# Format code
ruff format ai_energy_benchmarks/
# Format specific files
ruff format ai_energy_benchmarks/backends/pytorch.py
Pre-commit Hooks
Run all checks before committing:
# Install hooks
pre-commit install
# Run manually on all files
pre-commit run --all-files
# Run on specific files
pre-commit run --files ai_energy_benchmarks/backends/pytorch.py
Pre-commit checks:
- Ruff linting
- Ruff formatting
- MyPy type checking
- Trailing whitespace removal
- End-of-file fixer
- YAML validation
Building for Distribution
Build Wheel
# Build wheel
./build_wheel.sh
# Output: dist/ai_energy_benchmarks-VERSION-py3-none-any.whl
# Install wheel
pip install dist/ai_energy_benchmarks-*.whl
# Install with optional dependencies
pip install 'dist/ai_energy_benchmarks-*.whl[pytorch]'
pip install 'dist/ai_energy_benchmarks-*.whl[all]'
Build Docker Images
# Standard image
docker build -t ai-energy-benchmark:latest .
# POC image
docker build -f Dockerfile.poc -t ai-energy-benchmark:poc .
# Multi-platform build
docker buildx build --platform linux/amd64,linux/arm64 -t ai-energy-benchmark:latest .
Development Workflow
1. Create feature branch
git checkout -b feature/my-feature
2. Make changes and test
# Make changes
vim ai_energy_benchmarks/backends/new_backend.py
# Run tests
pytest tests/
# Check code quality
ruff check ai_energy_benchmarks/
mypy ai_energy_benchmarks/
3. Format and lint
ruff format ai_energy_benchmarks/
ruff check --fix ai_energy_benchmarks/
4. Commit changes
git add .
git commit -m "Add new backend"
# Pre-commit hooks run automatically
5. Push and create PR
git push origin feature/my-feature
Docker Deployment
Building Images
Standard Dockerfile
# Build image
docker build -t ai-energy-benchmark:latest .
# Build with specific tag
docker build -t ai-energy-benchmark:v1.0.0 .
# Build with build args
docker build \
--build-arg PYTHON_VERSION=3.11 \
-t ai-energy-benchmark:py311 .
POC Dockerfile
# Build POC image (lighter weight)
docker build -f Dockerfile.poc -t ai-energy-benchmark:poc .
Docker Run Commands
Basic Docker Run
docker run --gpus all \
-v $(pwd)/configs:/app/configs:ro \
-v $(pwd)/results:/app/results \
-v $(pwd)/emissions:/app/emissions \
ai-energy-benchmark:latest \
./run_benchmark.sh configs/pytorch_test.yaml
Docker Run with Network Access
For vLLM backend connecting to host:
docker run --gpus all \
--network host \
-v $(pwd)/configs:/app/configs:ro \
-v $(pwd)/results:/app/results \
-e VLLM_ENDPOINT=http://localhost:8000/v1 \
ai-energy-benchmark:latest \
./run_benchmark.sh configs/vllm_config.yaml
Docker Run with Environment Variables
docker run --gpus all \
-e BENCHMARK_BACKEND=pytorch \
-e BENCHMARK_MODEL=gpt2 \
-e NUM_SAMPLES=50 \
-v $(pwd)/results:/app/results \
ai-energy-benchmark:latest
Interactive Docker Session
docker run --gpus all -it \
-v $(pwd):/workspace \
ai-energy-benchmark:latest \
/bin/bash
# Inside container
cd /workspace
python -c "from ai_energy_benchmarks.runner import run_benchmark_from_config; run_benchmark_from_config('configs/test.yaml')"
Docker Volume Mounting
Read-only configs:
-v $(pwd)/configs:/app/configs:ro
Writable results:
-v $(pwd)/results:/app/results
-v $(pwd)/emissions:/app/emissions
-v $(pwd)/benchmark_output:/app/benchmark_output
Mount entire directory:
-v $(pwd):/workspace
Docker GPU Access
All GPUs:
--gpus all
Specific GPUs:
--gpus '"device=0,1"' # GPUs 0 and 1
--gpus '"device=2"' # GPU 2 only
GPU memory limits:
--gpus 'all,capabilities=compute,utility' \
--memory="32g" \
--memory-swap="32g"
Extending the Framework
Adding New Backends
To add a new backend (e.g., TensorRT-LLM, MLX):
- Create a backend class in ai_energy_benchmarks/backends/:
# ai_energy_benchmarks/backends/tensorrt.py
from typing import Dict, Any, List
from .base import Backend
class TensorRTBackend(Backend):
"""TensorRT-LLM backend for optimized inference."""
def __init__(
self,
model: str,
device: str = "cuda",
device_ids: List[int] = None,
**kwargs
):
super().__init__()
self.model = model
self.device = device
self.device_ids = device_ids or [0]
# Initialize TensorRT engine
def validate_environment(self) -> bool:
"""Validate TensorRT is available."""
try:
import tensorrt_llm
return True
except ImportError:
return False
def load_model(self):
"""Load TensorRT engine."""
# Implementation here
pass
def run_inference(
self,
prompt: str,
reasoning_params: Dict[str, Any] = None,
**generate_kwargs
) -> Dict[str, Any]:
"""Run inference with TensorRT."""
# Implementation here
pass
def cleanup(self):
"""Clean up resources."""
pass
- Register the backend in ai_energy_benchmarks/runner.py:
from .backends.tensorrt import TensorRTBackend
BACKEND_REGISTRY = {
'vllm': VLLMBackend,
'pytorch': PyTorchBackend,
'tensorrt': TensorRTBackend, # Add here
}
- Use new backend in config:
backend:
type: tensorrt
model: meta-llama/Llama-2-7b-hf
device: cuda
Adding New Reasoning Formats
To add support for new reasoning models:
- Update ai_energy_benchmarks/config/reasoning_formats.yaml:
families:
new-model-family:
patterns:
- "company/new-model"
- "company/new-model-v2"
type: system_prompt # or harmony, parameter, prefix
enable_flag: "/think"
disable_flag: "/no_think"
default_enabled: false
system_prompt_template: "You are a helpful assistant. Use {flag} to enable reasoning."
description: "New reasoning model"
- No code changes needed! The formatter registry automatically picks up the config.
- Test the new format:
backend:
type: pytorch
model: company/new-model
scenario:
reasoning_params:
enable_thinking: true
Adding New Metrics Collectors
To add custom metrics (e.g., network traffic, disk I/O):
- Create a metrics class in ai_energy_benchmarks/metrics/:
# ai_energy_benchmarks/metrics/network.py
from typing import Dict, Any
class NetworkMetricsCollector:
"""Collect network traffic metrics."""
def __init__(self, interface: str = "eth0"):
self.interface = interface
self.start_bytes = 0
self.end_bytes = 0
def start(self):
"""Start collecting metrics."""
self.start_bytes = self._get_bytes_transferred()
def stop(self) -> Dict[str, Any]:
"""Stop and return metrics."""
self.end_bytes = self._get_bytes_transferred()
return {
'network_bytes_transferred': self.end_bytes - self.start_bytes,
'interface': self.interface
}
def _get_bytes_transferred(self) -> int:
"""Get bytes transferred on interface."""
# Implementation here
pass
- Integrate in the runner (modify runner.py):
from .metrics.network import NetworkMetricsCollector
# In BenchmarkRunner.run():
network_metrics = NetworkMetricsCollector()
network_metrics.start()
# ... run benchmark ...
metrics.update(network_metrics.stop())
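The _get_bytes_transferred helper is left as a stub above. On Linux, one possible implementation reads the per-interface counters from /proc/net/dev (a sketch under that assumption; other platforms need a different source):

def _get_bytes_transferred(self) -> int:
    """Sum RX + TX byte counters for self.interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(f"{self.interface}:"):
                fields = line.split(":", 1)[1].split()
                # Field 0 is received bytes, field 8 is transmitted bytes
                return int(fields[0]) + int(fields[8])
    return 0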
Adding New Reporters
To add output formats (e.g., JSON, database):
- Create a reporter class in ai_energy_benchmarks/reporters/:
# ai_energy_benchmarks/reporters/json_reporter.py
import json
from typing import Dict, Any
from pathlib import Path
class JSONReporter:
"""Report results in JSON format."""
def __init__(self, output_file: str):
self.output_file = Path(output_file)
self.output_file.parent.mkdir(parents=True, exist_ok=True)
def report(self, results: Dict[str, Any]):
"""Write results to JSON file."""
with open(self.output_file, 'w') as f:
json.dump(results, f, indent=2)
- Register reporter in config parser:
REPORTER_REGISTRY = {
'csv': CSVReporter,
'json': JSONReporter, # Add here
}
- Use in config:
reporter:
type: json
output_file: "./results/benchmark_results.json"
Troubleshooting
Backend-Specific Issues
PyTorch Backend Issues
Problem: GPU Out of Memory (OOM)
RuntimeError: CUDA out of memory. Tried to allocate XXX MiB
Solutions:
- Reduce batch size:
scenario:
  input_shapes:
    batch_size: 1  # Minimum
- Use multi-GPU:
backend:
  device_ids: [0, 1, 2, 3]
  device_map: auto
- Set max memory per GPU:
backend:
  max_memory:
    0: "20GB"
    1: "20GB"
- Use quantization:
backend:
  load_in_8bit: true  # or load_in_4bit: true
- Reduce sequence length:
scenario:
  generate_kwargs:
    max_new_tokens: 50  # Reduce from 100+
Problem: Multi-GPU Not Working
ValueError: Model too large for single GPU
Solutions:
- Check device_ids:
nvidia-smi  # Verify GPU availability
- Verify device_map:
backend:
  device_ids: [0, 1]
  device_map: auto  # Must be set for multi-GPU
- Install accelerate:
pip install accelerate
Problem: Model Loading Errors
OSError: model not found
Solutions:
- Check model name:
# Valid examples
gpt2
facebook/opt-1.3b
meta-llama/Llama-2-7b-hf
- Check HuggingFace access:
huggingface-cli login
- Verify model exists:
python -c "from transformers import AutoModel; AutoModel.from_pretrained('gpt2')"
Problem: CUDA Errors
RuntimeError: CUDA error: device-side assert triggered
Solutions:
- Check CUDA installation:
nvidia-smi
nvcc --version
python -c "import torch; print(torch.cuda.is_available())"
- Update PyTorch:
pip install --upgrade torch torchvision torchaudio
- Clear GPU memory:
# Kill processes using the GPU
nvidia-smi
kill -9 <PID>
vLLM Backend Issues
Problem: vLLM Connection Errors
Backend validation failed: Could not connect to vLLM endpoint
Solutions:
- Verify vLLM server is running:
curl http://localhost:8000/health  # Expected: {"status": "ok"}
- Check endpoint in config:
backend:
  endpoint: "http://localhost:8000/v1"  # Must include /v1
- Test with curl:
curl http://localhost:8000/v1/models
- Check firewall:
sudo ufw allow 8000
Problem: Server Not Responding
Timeout waiting for vLLM server
Solutions:
- Check server logs:
# In the vLLM server terminal, look for errors or warnings
- Increase timeout:
export VLLM_TIMEOUT=300  # 5 minutes
- Restart vLLM:
pkill -f vllm
vllm serve MODEL --port 8000
Problem: Model Mismatch
Model name in config does not match server
Solutions:
- Check server model:
curl http://localhost:8000/v1/models
- Update config to match:
backend:
  model: openai/gpt-oss-120b  # Must match server
Problem: Docker to Host Connection
Cannot connect to host vLLM server from Docker
Solutions:
- Use host.docker.internal:
backend:
  endpoint: "http://host.docker.internal:8000/v1"
- Or use host network:
docker run --network host ...
- Or get host IP:
# On Linux
ip addr show docker0 | grep inet
# Use the host IP in the config
endpoint: "http://172.17.0.1:8000/v1"
Common Issues
Problem: Dataset Download Fails
ConnectionError: Could not download dataset
Solutions:
- Check internet connection
- Set HuggingFace cache:
export HF_HOME=/path/to/cache
export HF_DATASETS_CACHE=/path/to/cache/datasets
- Pre-download dataset:
from datasets import load_dataset
load_dataset("AIEnergyScore/text_generation")
- Use local dataset:
scenario:
  dataset_name: /path/to/local/dataset
Problem: Import Errors
ModuleNotFoundError: No module named 'ai_energy_benchmarks'
Solutions:
- Install package:
pip install -e .
- Verify installation:
python -c "import ai_energy_benchmarks"
- Check Python path:
python -c "import sys; print(sys.path)"
Problem: Permission Errors
PermissionError: [Errno 13] Permission denied: 'results/'
Solutions:
- Create directories:
mkdir -p results emissions benchmark_output
- Fix permissions:
chmod 755 results emissions benchmark_output
- Use a different output path:
output_dir: /tmp/benchmark_output
Problem: CodeCarbon Installation
ImportError: codecarbon not installed
Solutions:
- Install codecarbon:
pip install codecarbon
- Or disable metrics:
metrics:
  enabled: false
Debug Mode
Enable verbose logging:
# Set log level
export LOG_LEVEL=DEBUG
# Run benchmark
./run_benchmark.sh configs/test.yaml
Or in Python:
import logging
logging.basicConfig(level=logging.DEBUG)
from ai_energy_benchmarks.runner import run_benchmark_from_config
results = run_benchmark_from_config('config.yaml')
Inspect outputs:
# View benchmark logs
cat benchmark_output/benchmark.log
# View emissions data
cat emissions/emissions.csv
# View results
cat results/benchmark_results.csv
Docker-Specific Issues
Problem: GPU Not Accessible in Docker
RuntimeError: CUDA not available
Solutions:
- Install nvidia-container-toolkit:
# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
- Test GPU access:
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
- Check Docker version:
docker --version  # Should be 19.03+
Problem: Volume Permission Errors
Permission denied: '/app/results'
Solutions:
- Fix permissions on host:
sudo chown -R $USER:$USER results/ emissions/
- Run with user:
docker run --user $(id -u):$(id -g) ...
Problem: Network Configuration
Cannot resolve host.docker.internal
Solutions:
- Use host network (Linux):
docker run --network host ...
- Add host entry (Linux):
docker run --add-host host.docker.internal:host-gateway ...
- Use bridge network with host IP:
docker run -e VLLM_ENDPOINT=http://172.17.0.1:8000/v1 ...
Best Practices
General Best Practices
- Start Small for Testing
scenario:
  num_samples: 5  # Test with a small dataset first
- Set an Accurate Carbon Region
metrics:
  country_iso_code: "USA"
  region: "california"  # More accurate emissions
- Organize Output Directories
results/
  2025-10-27/
    model_a/
    model_b/
  2025-10-28/
    ...
- Version Control Configs
git add configs/
git commit -m "Add benchmark config for Model X"
# But exclude results
echo "results/" >> .gitignore
echo "emissions/" >> .gitignore
- Document Configurations
# configs/production.yaml
name: production_benchmark
# This config tests a production workload with 1000 prompts
# Expected runtime: 30 minutes
# Expected energy: ~50 Wh
scenario:
  num_samples: 1000
Backend-Specific Best Practices
PyTorch Backend
- Use Multi-GPU for Large Models
# Models > 13B parameters
backend:
  device_ids: [0, 1, 2, 3]
  device_map: auto
- Set max_memory to Prevent OOM
backend:
  max_memory:
    0: "22GB"  # Leave a 2GB buffer on a 24GB GPU
    1: "22GB"
- Choose an Appropriate device_map
# Default: auto (recommended)
device_map: auto
# For specific use cases:
device_map: balanced        # Even distribution
device_map: balanced_low_0  # Minimize GPU 0
- Monitor Per-GPU Metrics
# Check GPU balance after the benchmark
cat results/results.csv | grep gpu_stats
# Look for:
# - Similar utilization across GPUs
# - Similar memory usage
# - No GPU at 100% while others idle
- Use Quantization for Memory Constraints
# 8-bit quantization (good balance)
backend:
  load_in_8bit: true
# 4-bit quantization (max memory savings)
backend:
  load_in_4bit: true
vLLM Backend
- Always Start the Server Before the Benchmark
# Terminal 1
vllm serve MODEL --port 8000
# Wait for "Application startup complete"
# Terminal 2
./run_benchmark.sh config.yaml
- Match the Model Name to the Server
# Server
vllm serve openai/gpt-oss-120b
# Config
backend:
  model: openai/gpt-oss-120b  # MUST MATCH
- Use a Production-Like vLLM Config
vllm serve MODEL \
  --tensor-parallel-size 4 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9 \
  --dtype float16
- Docker to Host Communication
# When the benchmark runs in Docker and the server is on the host
backend:
  endpoint: "http://host.docker.internal:8000/v1"
- Test Server Health First
# Before running the benchmark
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
Multi-GPU Best Practices
- Check GPU Topology
nvidia-smi topo -m
# Use GPUs with the faster interconnect
- Balance Memory Usage
backend:
  max_memory:
    0: "20GB"
    1: "20GB"
    2: "20GB"
    3: "20GB"
- Monitor During the Benchmark
watch -n 1 nvidia-smi
# Check for:
# - Balanced utilization
# - No thermal throttling
# - Expected power draw
- Verify the Model Fits
# Estimate model size
model_params = 70e9     # 70B parameters
bytes_per_param = 2     # float16
gb_needed = (model_params * bytes_per_param) / 1e9
print(f"Need ~{gb_needed}GB across GPUs")
Benchmarking Best Practices
- Warm-up Runs
# The first run may be slower (model loading, compilation)
# Run twice and use the second result
scenario:
  num_samples: 100
- Control for Variables
# Keep these constant for a fair comparison:
scenario:
  num_samples: 100  # Same across runs
  generate_kwargs:
    max_new_tokens: 100  # Same across runs
    temperature: 0.7     # Same across runs
- Use the Same Backend for Comparisons
# ✅ Good: Compare PyTorch to PyTorch
./run_benchmark.sh configs/pytorch_test.yaml
./run_benchmark.sh configs/pytorch_multigpu.yaml
# ❌ Bad: Compare PyTorch to vLLM
./run_benchmark.sh configs/pytorch_test.yaml
./run_benchmark.sh configs/gpt_oss_120b.yaml
- Document the Environment (see the sketch below)
# Save environment details with the results
# GPU model, driver version, CUDA version
# PyTorch/vLLM version
# System load, temperature
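One lightweight way to capture that environment next to the results is a small script run just before the benchmark (a sketch using standard PyTorch introspection; extend with whatever your experiments need):

import json
import platform
import torch

env = {
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda_available": torch.cuda.is_available(),
    "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
}
with open("results/environment.json", "w") as f:
    json.dump(env, f, indent=2)
print(json.dumps(env, indent=2))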
Reference
Configuration Schema
Complete YAML schema reference:
# Required fields
name: string # Benchmark name
# Backend configuration (required)
backend:
type: string # "pytorch" or "vllm"
# Common fields
model: string # Model name or path
device: string # "cuda" or "cpu" (optional, default: "cuda")
device_ids: list[int] # GPU IDs (optional, default: [0])
# PyTorch-specific
torch_dtype: string # "auto", "float16", "bfloat16", "float32" (optional)
device_map: string # "auto", "balanced", etc. (optional)
max_memory: dict # Per-GPU memory limits (optional)
load_in_8bit: bool # Enable 8-bit quantization (optional)
load_in_4bit: bool # Enable 4-bit quantization (optional)
trust_remote_code: bool # Allow custom code (optional)
# vLLM-specific
endpoint: string # vLLM server endpoint (required for vLLM)
# Scenario configuration (required)
scenario:
dataset_name: string # HuggingFace dataset or path
text_column_name: string # Column with prompts (optional, default: "text")
num_samples: int # Number of prompts to process
truncation: bool # Truncate long prompts (optional, default: true)
# Input configuration
input_shapes:
batch_size: int # Batch size (optional, default: 1)
# Generation parameters
generate_kwargs:
max_new_tokens: int # Max tokens to generate (optional, default: 100)
min_new_tokens: int # Min tokens to generate (optional)
temperature: float # Sampling temperature (optional, default: 1.0)
top_p: float # Nucleus sampling (optional)
top_k: int # Top-k sampling (optional)
do_sample: bool # Enable sampling (optional, default: false)
# Reasoning parameters (optional)
reasoning_params:
reasoning_effort: string # "low", "medium", "high" (for Harmony)
enable_thinking: bool # Enable reasoning (for other models)
thinking_budget: int # Token budget (for DeepSeek)
# Metrics configuration (optional)
metrics:
type: string # "codecarbon" (default)
enabled: bool # Enable metrics (optional, default: true)
project_name: string # Project name (optional)
output_dir: string # Output directory (optional, default: "./emissions")
country_iso_code: string # Country code (optional, default: "USA")
region: string # Specific region (optional)
# Reporter configuration (optional)
reporter:
type: string # "csv" (default)
output_file: string # Output file path (optional)
# Output directory (optional)
output_dir: string # Base output directory (default: "./benchmark_output")
API Reference
BenchmarkRunner
Main benchmark orchestration class.
from ai_energy_benchmarks.runner import BenchmarkRunner
from ai_energy_benchmarks.config.parser import BenchmarkConfig
# Create config
config = BenchmarkConfig()
config.name = "my_benchmark"
config.backend.type = "pytorch"
config.backend.model = "gpt2"
config.scenario.num_samples = 10
# Create runner
runner = BenchmarkRunner(config)
# Run benchmark
results = runner.run()
# Results structure
results = {
'summary': {
'name': str,
'backend': str,
'model': str,
'total_prompts': int,
'successful_prompts': int,
'failed_prompts': int,
'total_energy_wh': float,
'total_emissions_g_co2eq': float,
'avg_latency_s': float,
'throughput_prompts_per_sec': float
},
'per_prompt_results': [...],
'gpu_stats': {...} # PyTorch only
}
run_benchmark_from_config
Helper function to run from config file.
from ai_energy_benchmarks.runner import run_benchmark_from_config
# Basic usage
results = run_benchmark_from_config('configs/test.yaml')
# With overrides
overrides = {
'scenario': {'num_samples': 20},
'backend': {'model': 'gpt2-medium'}
}
results = run_benchmark_from_config('configs/test.yaml', overrides=overrides)
ConfigParser
Configuration parsing utilities.
from ai_energy_benchmarks.config.parser import ConfigParser
# Load config
config = ConfigParser.load_config('configs/test.yaml')
# Load with overrides
overrides = {'scenario': {'num_samples': 20}}
config = ConfigParser.load_config_with_overrides('configs/test.yaml', overrides)
# Validate config
is_valid = ConfigParser.validate_config(config)
Backend Classes
PyTorchBackend:
from ai_energy_benchmarks.backends.pytorch import PyTorchBackend
backend = PyTorchBackend(
model="gpt2",
device="cuda",
device_ids=[0],
torch_dtype="float16"
)
backend.validate_environment()
backend.load_model()
result = backend.run_inference("Hello world")
backend.cleanup()
VLLMBackend:
from ai_energy_benchmarks.backends.vllm import VLLMBackend
backend = VLLMBackend(
endpoint="http://localhost:8000/v1",
model="openai/gpt-oss-120b"
)
backend.validate_environment()
result = backend.run_inference("Hello world")
CLI Reference
run_benchmark.sh
./run_benchmark.sh [CONFIG_FILE]
# Default config
./run_benchmark.sh
# Specific config
./run_benchmark.sh configs/pytorch_test.yaml
# Custom path
./run_benchmark.sh /path/to/config.yaml
Environment Variables
# Backend configuration
BENCHMARK_BACKEND=pytorch|vllm
BENCHMARK_MODEL=model_name
VLLM_ENDPOINT=http://localhost:8000/v1
# Scenario configuration
NUM_SAMPLES=100
MAX_NEW_TOKENS=100
# Metrics configuration
COUNTRY_ISO_CODE=USA
REGION=california
# Output configuration
OUTPUT_DIR=/path/to/output
RESULTS_FILE=/path/to/results.csv
# Debugging
LOG_LEVEL=DEBUG|INFO|WARNING|ERROR
Contributing
We welcome contributions! Here's how to get involved:
How to Contribute
1. Fork the repository
git clone https://github.com/yourusername/ai_energy_benchmarks.git
cd ai_energy_benchmarks
2. Create feature branch
git checkout -b feature/my-feature
3. Set up development environment
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"
pre-commit install
4. Make changes
- Write code
- Add tests
- Update documentation
5. Run tests and checks
pytest
ruff check ai_energy_benchmarks/
mypy ai_energy_benchmarks/
ruff format ai_energy_benchmarks/
6. Commit changes
git add .
git commit -m "Add feature X"
# Pre-commit hooks run automatically
7. Push and create PR
git push origin feature/my-feature
# Create a pull request on GitHub
Code Standards
- Python: PEP 8 style guide
- Type hints: Use type hints for all functions
- Docstrings: Google-style docstrings
- Tests: Write tests for new features
- Formatting: Use ruff for formatting
- Linting: Pass ruff checks
- Type checking: Pass mypy checks
Pull Request Process
- Update documentation if adding features
- Add tests for new functionality
- Ensure all checks pass (tests, linting, type checking)
- Update CHANGELOG if applicable
- Request review from maintainers
Areas for Contribution
- New backends: TensorRT-LLM, MLX, GGML, etc.
- New metrics: Network, disk I/O, memory bandwidth
- New reporters: JSON, database, visualization
- New reasoning formats: Support for new models
- Performance improvements: Optimization, caching
- Documentation: Examples, tutorials, guides
- Testing: More test coverage, edge cases
License & Citation
License
MIT License - see LICENSE file for details.
Citation
If you use this framework in your research, please cite:
@software{ai_energy_benchmarks,
title={AI Energy Benchmarks: A Framework for Measuring AI Model Energy Consumption},
author={NeuralWatt},
year={2025},
url={https://github.com/neuralwatt/ai_energy_benchmarks},
version={1.0.0}
}
Acknowledgments
This framework builds upon:
- CodeCarbon: For emissions tracking (Zenodo DOI: 10.5281/zenodo.17298293)
- HuggingFace: For model and dataset ecosystems
- vLLM: For high-performance serving
- PyTorch: For deep learning infrastructure
Support
Getting Help
- Documentation: You're reading it!
- GitHub Issues: Report bugs and request features
- Email: info@neuralwatt.com
- Community: Join our discussions on GitHub
Reporting Issues
When reporting issues, please include:
- System information:
python --version
nvidia-smi
pip list | grep -E "torch|vllm|codecarbon"
- Configuration file: Your YAML config
- Error message: Full error output
- Steps to reproduce: How to trigger the issue
- Expected vs actual behavior
Feature Requests
We welcome feature requests! Please:
- Check existing issues first
- Describe use case clearly
- Explain why it's beneficial
- Provide examples if possible
Changelog
Version 1.0.0 (2025-10-27)
Major Features:
- ✅ PyTorch backend with multi-GPU support
- ✅ vLLM backend for production deployments
- ✅ Unified reasoning format system (9+ model families)
- ✅ CodeCarbon integration for emissions tracking
- ✅ CSV reporting with per-GPU metrics
- ✅ Docker and Docker Compose support
- ✅ Comprehensive documentation
Supported Backends:
- PyTorch (direct inference)
- vLLM (serving infrastructure)
Supported Reasoning Models:
- gpt-oss, DeepSeek-R1, SmolLM3, Qwen, Hunyuan, Nemotron, EXAONE, Phi, Gemma
Known Limitations:
- Only CSV reporter implemented
- Only CodeCarbon metrics collector
- No streaming support yet
- No batch inference optimization yet
Thank you for using AI Energy Benchmarks! We hope this framework helps you build more energy-efficient AI systems. 🌱