GPU energy monitoring agent — per-job cost attribution and energy-efficient fine-tuning for AI teams

AluminatAI Agent

The open-source GPU energy monitoring agent for AluminatAI.

Runs on any GPU machine — NVIDIA, AMD, Intel Gaudi, Intel Arc, Apple Silicon, or CPU-only — samples power every 5 seconds, attributes energy to individual jobs, and streams dollar costs to your dashboard.

Supported Hardware

Backend	GPUs	Primary SDK	CLI Fallback
NVIDIA	A100, H100, H200, L40S, RTX 4090, T4, V100, …	`nvidia-ml-py` (NVML)	—
AMD	MI300X, MI300A, MI325X, MI250X, MI210, MI100, …	`amdsmi`	`rocm-smi`
Intel Gaudi	Gaudi, Gaudi2, Gaudi3	`pyhlml` (SynapseAI)	`hl-smi`
Intel Arc	A770, A750, A580, B580, Flex 170/140, Max 1550/1100	`xpu-smi` (oneAPI)	hwmon sysfs + `intel_gpu_top`
Apple Silicon	M1–M5, Pro/Max/Ultra	`powermetrics` (sudo)	`ioreg`
CPU-only	Any x86 (Intel/AMD)	RAPL sysfs	—

The agent auto-detects your hardware at startup, so no hardware configuration is needed: install it, set your API key, and run.

Detection cascade: NVIDIA → AMD → Gaudi → Intel Arc → Apple Silicon → RAPL (CPU-only)
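The cascade can be pictured as a first-match probe over an ordered list. The sketch below is illustrative only: the probe helper and the choice to test for a CLI binary with shutil.which are assumptions, not the agent's actual detection logic (which, per the table above, tries the Python SDKs first).

```python
import shutil
from typing import Callable, Sequence, Tuple

# Hypothetical probe: a backend "looks usable" if its CLI is on PATH.
# This stands in for the SDK import checks a real collector would do.
def has_cli(name: str) -> Callable[[], bool]:
    return lambda: shutil.which(name) is not None

CASCADE: Sequence[Tuple[str, Callable[[], bool]]] = (
    ("nvidia", has_cli("nvidia-smi")),
    ("amd", has_cli("rocm-smi")),
    ("gaudi", has_cli("hl-smi")),
    ("intel_arc", has_cli("xpu-smi")),
    ("apple_silicon", has_cli("powermetrics")),
)

def detect_backend(cascade=CASCADE, fallback: str = "rapl") -> str:
    """Return the first backend whose probe succeeds, else the CPU-only fallback."""
    for name, probe in cascade:
        if probe():
            return name
    return fallback
```

Because the list is ordered, a host with both an NVIDIA card and RAPL counters resolves to the NVIDIA backend in this sketch.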

Product Tiers

The agent supports three operating modes, each building on the previous:

Tier	Mode	What it does	Key features
Monitor	Default	Read-only metrics + attribution	Power tracking, cost dashboards, Prometheus, carbon tracking
Advisor	Opt-in	Recommendations + approval workflows	"GPU 3 is 40% idle — cap to 200W?" with one-click apply/rollback
Swarm	Opt-in	Autonomous multi-agent optimization	Fleet-wide power capping, thermal balancing, carbon-aware scheduling, leader election

All tiers share the same agent binary. Enable higher tiers via environment variables:

# Monitor (default — no extra config)
aluminatiai

# Advisor — agent uploads recommendations, polls for approved commands
AUTO_TUNE_ENABLED=1 COMMAND_POLL_ENABLED=1 aluminatiai

# Swarm — one agent becomes fleet leader, optimizes across all nodes
SWARM_ENABLED=1 COMMAND_POLL_ENABLED=1 AUTO_TUNE_ENABLED=1 aluminatiai

Install

pip install aluminatiai

Optional extras:

pip install 'aluminatiai[prometheus]'     # Prometheus /metrics endpoint
pip install 'aluminatiai[secure]'         # Encrypted WAL (AES-128 Fernet)
pip install 'aluminatiai[observability]'  # YAML config + OTEL exporter
pip install 'aluminatiai[benchmark]'      # Benchmark CLI dependencies
pip install 'aluminatiai[dcgm]'           # DCGM phase decomposition
pip install 'aluminatiai[all]'            # Everything

Quick Start

export ALUMINATAI_API_KEY=alum_your_key_here
aluminatiai

Get your API key at aluminatiai.com/dashboard.

The agent will detect your GPU, start sampling, and upload metrics to your dashboard. That's it.

CLI Commands

The aluminatiai command includes 8 subcommands:

`aluminatiai` / `aluminatiai run`

Main daemon. Collects GPU metrics, attributes energy to jobs, uploads to the cloud.

aluminatiai                            # run forever (default)
aluminatiai --interval 2               # sample every 2 seconds
aluminatiai --duration 3600            # run for 1 hour then exit
aluminatiai --output /data/metrics.csv # also write a local CSV manifest
aluminatiai --dry-run                  # collect + attribute, skip uploads
aluminatiai --prometheus-only          # local Prometheus only, no cloud

`aluminatiai benchmark`

Measure GPU power baseline and energy efficiency.

aluminatiai benchmark                              # 60s power baseline
aluminatiai benchmark --gpu 0 --duration 120       # specific GPU, 2 min
aluminatiai benchmark --upload                     # submit to Green AI Index
aluminatiai benchmark --model-tag llama-3-70b      # tag with model profile

Output includes average power (W), J/GPU-hr, kWh/GPU-hr, and roofline efficiency rating.
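The two energy figures are unit conversions of average power: 1 W sustained for one hour is 3600 J, and 1 kWh is 3.6 million J. The worked example below shows the arithmetic; the function names and the electricity price are illustrative assumptions, not part of the CLI.

```python
def energy_per_gpu_hour(avg_power_w: float) -> tuple:
    """Convert average power (W) to (J per GPU-hour, kWh per GPU-hour)."""
    joules = avg_power_w * 3600.0   # 1 W for 1 h = 3600 J
    kwh = joules / 3.6e6            # 1 kWh = 3.6e6 J
    return joules, kwh

def cost_usd(kwh: float, usd_per_kwh: float) -> float:
    """Energy cost at a flat electricity price (the price is an assumption)."""
    return kwh * usd_per_kwh

joules, kwh = energy_per_gpu_hour(300.0)
print(joules, kwh)   # 1080000.0 0.3
```

A GPU averaging 300 W therefore uses 0.3 kWh per GPU-hour, which at a hypothetical 0.10 USD/kWh is about 0.03 USD.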

`aluminatiai optimize`

Real-time efficiency analysis with actionable recommendations.

aluminatiai optimize                    # analyze all GPUs, 60s window
aluminatiai optimize --gpu 0 --json     # JSON output for automation
aluminatiai optimize --duration 300     # 5 minute analysis window

Detects compute precision, classifies memory-bound vs. compute-bound workloads, and ranks recommendations (P1/P2/P3) for power caps, precision switches, and GPU right-sizing.

`aluminatiai ab`

A/B testing framework for comparing GPU energy efficiency between configurations.

aluminatiai ab --baseline "power_limit=300" --variant "power_limit=250" --duration 120

Produces statistical comparison with confidence intervals, energy savings, and throughput impact (AEM — Adjusted Energy Metric).
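One common way to compare two runs is a confidence interval on the difference in mean power, allowing unequal variances. The sketch below uses a normal approximation (z = 1.96) and is not necessarily the method the tool uses; the function name and the sample values are made up for illustration.

```python
import math
import statistics

def mean_diff_ci(baseline, variant, z: float = 1.96):
    """Approximate 95% CI for mean(variant) - mean(baseline), unequal variances."""
    diff = statistics.fmean(variant) - statistics.fmean(baseline)
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(variant) / len(variant)
    )
    return diff, (diff - z * se, diff + z * se)

baseline_w = [298.0, 301.0, 300.0, 299.0, 302.0]   # made-up power samples (W)
variant_w = [249.0, 251.0, 250.0, 252.0, 248.0]
diff, (lo, hi) = mean_diff_ci(baseline_w, variant_w)
savings_pct = -diff / statistics.fmean(baseline_w) * 100
```

If the interval excludes zero, as it does for these samples, the power difference between the two configurations is unlikely to be sampling noise. With only a handful of samples, a t-distribution quantile would be more appropriate than 1.96.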

`aluminatiai carbon-schedule`

Recommends the optimal time to start a job based on grid carbon intensity forecasts.

aluminatiai carbon-schedule --duration 4h --zone US-CAL-CISO

Uses the Electricity Maps API to find the lowest-carbon window in the next 24 hours.
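Finding the lowest-carbon window reduces to a sliding-window minimum over an hourly forecast. The sketch below assumes the forecast is already available as a list of hourly intensities; the function name and input shape are hypothetical, and it does not call the Electricity Maps API.

```python
def lowest_carbon_start(forecast, job_hours: int):
    """Return (start_index, mean_intensity) of the cleanest contiguous window.

    forecast: hourly carbon intensity in gCO2e/kWh, index 0 = the coming hour.
    """
    if not 0 < job_hours <= len(forecast):
        raise ValueError("job_hours must fit inside the forecast horizon")
    window_sum = sum(forecast[:job_hours])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(forecast) - job_hours + 1):
        # Slide the window one hour: drop the oldest hour, add the newest.
        window_sum += forecast[start + job_hours - 1] - forecast[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / job_hours

# Made-up six-hour forecast; a 2-hour job is cleanest starting at hour 2.
print(lowest_carbon_start([400, 350, 200, 180, 220, 500], 2))   # (2, 190.0)
```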

`aluminatiai report`

Generate chargeback reports for cost attribution.

aluminatiai report --format csv --output chargeback.csv
aluminatiai report --format html --from 2026-05-01 --to 2026-05-07
aluminatiai report --format json --with-carbon

`aluminatiai query`

Query the local SQLite time-series database.

aluminatiai query --metric power --gpu 0 --from 2026-05-08 --to 2026-05-09

`aluminatiai replay`

Export and optionally clear the offline WAL.

aluminatiai replay --output metrics.csv
aluminatiai replay --output metrics.csv --clear

Configuration

Settings are read in priority order (highest wins):

1. Environment variables (ALUMINATAI_*, SAMPLE_INTERVAL, etc.)
2. Config file: JSON or YAML (via --config flag or ALUMINATAI_CONFIG env var)
3. Built-in defaults
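The precedence rule amounts to layering three dictionaries, with later layers overriding earlier ones. The sketch below is an assumption about how such a loader could look, not the agent's config.py; only three keys are shown, with defaults and env var names taken from the reference tables.

```python
import os

DEFAULTS = {"sample_interval": 5.0, "upload_interval": 60, "metrics_port": 9100}

# Config key -> env var name, as listed in the Configuration Reference.
ENV_NAMES = {
    "sample_interval": "SAMPLE_INTERVAL",
    "upload_interval": "UPLOAD_INTERVAL",
    "metrics_port": "METRICS_PORT",
}

def resolve_settings(file_cfg: dict, env=os.environ) -> dict:
    """Layer settings: defaults < config file < environment variables."""
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in file_cfg.items() if k in DEFAULTS})
    for key, env_name in ENV_NAMES.items():
        if env_name in env:
            # Coerce the string from the environment to the default's type.
            settings[key] = type(DEFAULTS[key])(env[env_name])
    return settings
```

With a config file that sets upload_interval to 30 and UPLOAD_INTERVAL=15 in the environment, the resolved value is 15.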

Config file

aluminatiai --config /etc/aluminatai.json
# or
ALUMINATAI_CONFIG=/etc/aluminatai.yaml aluminatiai

Default search order when ALUMINATAI_CONFIG is unset:

1. ./aluminatai.json
2. ./aluminatai.yaml
3. ~/.config/aluminatai/config.json

Example aluminatai.json:

{
  "api_key": "alum_your_key_here",
  "sample_interval": 2.0,
  "upload_interval": 30,
  "metrics_port": 9100,
  "log_format": "json"
}

YAML config requires pip install 'aluminatiai[observability]'.

Configuration Reference

API & Upload

Env var	Config key	Default	Description
`ALUMINATAI_API_KEY`	`api_key`	(required)	Your API key
`ALUMINATAI_API_ENDPOINT`	`api_endpoint`	`https://…/v1/metrics/ingest`	Ingest endpoint URL
`UPLOAD_INTERVAL`	`upload_interval`	`60`	Seconds between metric flushes
`UPLOAD_BATCH_SIZE`	`upload_batch_size`	`100`	Metrics per HTTP request
`UPLOAD_MAX_RETRIES`	`upload_max_retries`	`5`	Max retry attempts (exponential backoff)
`UPLOAD_MAX_RETRY_DELAY`	`upload_max_retry_delay`	`60`	Backoff cap in seconds
`UPLOAD_TIMEOUT`	—	`30`	HTTP request timeout in seconds
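UPLOAD_MAX_RETRIES and UPLOAD_MAX_RETRY_DELAY together define a capped exponential backoff. The sketch below shows the resulting delay schedule under two assumptions that the table does not state: a 1-second base delay and no jitter.

```python
def backoff_delays(max_retries: int = 5, cap_s: float = 60.0, base_s: float = 1.0):
    """Delay before each retry: base * 2**attempt, clamped to the cap."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays())    # [1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(8))   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the default of 5 retries the cap is never reached under these assumptions; it only matters when UPLOAD_MAX_RETRIES is raised.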

Sampling

Env var	Config key	Default	Description
`SAMPLE_INTERVAL`	`sample_interval`	`5.0`	Seconds between GPU samples
`NVML_TIMEOUT`	—	`2.0`	Per-GPU collection timeout

Write-Ahead Log (WAL)

Env var	Config key	Default	Description
`DATA_DIR`	`data_dir`	`./data`	Base data directory
`WAL_MAX_MB`	`wal_max_mb`	`512`	WAL size cap
`WAL_MAX_AGE_HOURS`	`wal_max_age_hours`	`24`	WAL retention period

Hardware Backends

Env var	Config key	Default	Description
`CPU_ONLY_MODE`	—	`false`	Skip GPU detection, use RAPL only
`RAPL_ENABLED`	—	`true`	Enable CPU RAPL energy counters
`RAPL_CPU_MODEL_OVERRIDE`	—	(auto)	Override CPU model name
`GAUDI_ENABLED`	—	`true`	Enable Intel Gaudi collector
`HL_SMI_PATH`	—	`hl-smi`	Custom path to hl-smi binary
`INTEL_ARC_ENABLED`	—	`true`	Enable Intel Arc collector
`XPU_SMI_PATH`	—	`xpu-smi`	Custom path to xpu-smi binary
`APPLE_POWERMETRICS_ENABLED`	—	`true`	Enable powermetrics (requires sudo NOPASSWD)
`APPLE_POWERMETRICS_INTERVAL_MS`	—	`1000`	powermetrics sampling interval (ms)
`APPLE_CHIP_TDP_OVERRIDE`	—	(auto)	Override Apple GPU TDP estimate (watts)
`DCGM_ENABLED`	`dcgm_enabled`	`true`	Enable DCGM phase decomposition (NVIDIA)

Prometheus Metrics Server

Env var	Config key	Default	Description
`METRICS_PORT`	`metrics_port`	`9100`	Scrape port (`0` = disabled)
`METRICS_BIND_HOST`	`metrics_bind_host`	(all)	Bind address
`METRICS_BASIC_AUTH`	`metrics_basic_auth`	(none)	`user:pass` for HTTP Basic Auth

Attribution

Env var	Config key	Default	Description
`ALUMINATAI_ATTRIBUTION_CONFIG`	`attribution_config`	(auto-search)	Path to `attribution_rules.json`
`ALUMINATAI_TRUSTED_UIDS`	`trusted_uids`	(all)	Comma-separated UIDs for spoofing protection
`PID_SMOOTH_WINDOW`	`pid_smooth_window`	`30`	PID stability window (seconds)
`PID_STABLE_THRESHOLD`	`pid_stable_threshold`	`0.60`	Fraction of window a PID must appear in
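PID_SMOOTH_WINDOW and PID_STABLE_THRESHOLD describe a sliding-window stability filter: a PID counts as stable only if it appears in a large enough fraction of recent samples. The class below is a minimal sketch of that idea, assuming the window is measured in samples at the default 5-second interval; the class name and the behavior while the window is still filling are hypothetical.

```python
from collections import deque

class PidSmoother:
    """Keep PIDs seen in at least `threshold` of the samples in a sliding window."""

    def __init__(self, window_s: float = 30.0, interval_s: float = 5.0,
                 threshold: float = 0.60):
        self.threshold = threshold
        # Number of samples that fit in the window (30 s / 5 s = 6 by default).
        self.samples = deque(maxlen=max(1, round(window_s / interval_s)))

    def update(self, pids):
        """Record the PIDs seen in this sample and return the stable ones."""
        self.samples.append(set(pids))
        counts = {}
        for sample in self.samples:
            for pid in sample:
                counts[pid] = counts.get(pid, 0) + 1
        needed = self.threshold * len(self.samples)
        return {pid for pid, n in counts.items() if n >= needed}
```

With the defaults, a PID must appear in at least 4 of the last 6 samples, so a short-lived helper process that shows up twice is filtered out.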

Auto-Tuning & Power Budget

Env var	Default	Description
`AUTO_TUNE_ENABLED`	`false`	Enable periodic roofline analysis
`AUTO_TUNE_INTERVAL`	`300`	Analysis interval (seconds)
`AUTO_TUNE_MIN_SAVINGS_PCT`	`10`	Min savings to recommend a power cap
`POWER_BUDGET_ENABLED`	`false`	Enable per-GPU power cap enforcement
`POWER_BUDGET_WATTS`	`0`	Per-GPU power cap (watts, 0 = disabled)

Advisor Tier (Recommendations + Commands)

Env var	Default	Description
`COMMAND_POLL_ENABLED`	`false`	Enable polling for approved commands
`COMMAND_POLL_INTERVAL`	`60`	Base poll interval (seconds); adapts up to 5 min when idle

When AUTO_TUNE_ENABLED=1 and COMMAND_POLL_ENABLED=1, the agent:

1. Runs roofline analysis every AUTO_TUNE_INTERVAL seconds
2. Uploads optimization recommendations to the cloud dashboard
3. Polls for user-approved commands (power caps, rollbacks)
4. Executes approved commands with safety validation (100–1200W range)

The dashboard shows recommendations at /dashboard/advisor with one-click approve, dismiss, and rollback.

Swarm Tier (Fleet-Wide Optimization)

Env var	Default	Description
`SWARM_ENABLED`	`false`	Enable swarm leader candidacy
`SWARM_EVAL_INTERVAL`	`300`	Seconds between fleet policy evaluations
`SWARM_MAX_RECS`	`20`	Max recommendations per eval cycle

When enabled, the agent participates in leader election: one agent per (user, cluster) becomes the swarm leader. The leader:

1. Acquires a 10-minute lease via POST /api/agent/swarm/lease
2. Fetches fleet-wide GPU state via GET /api/agent/fleet-state
3. Evaluates the 4 built-in policies across all nodes
4. Uploads cross-node recommendations, subject to blast radius limiting

The other agents then receive and execute the approved commands.

Built-in policies:

Policy	What it detects	Action
`idle_gpu_power_cap`	GPUs with <10% utilization	Cap to 40% of TDP
`thermal_balancing`	Single GPU overheating while others are cool	Reduce power 15% on hot GPU
`carbon_aware_fleet_cap`	Grid carbon >400 gCO2e/kWh	Cap non-critical GPUs to 65%
`fleet_gpu_rightsizing`	GPUs consistently underutilized	Flag for consolidation
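Each policy is a rule from fleet state to recommended actions. The sketch below illustrates the idle_gpu_power_cap row using the thresholds from the table; the GpuState fields and the shape of the returned recommendation are hypothetical, not the agent's data model.

```python
from dataclasses import dataclass

@dataclass
class GpuState:
    node: str
    index: int
    utilization_pct: float
    tdp_watts: float

def idle_gpu_power_cap(fleet, idle_below_pct: float = 10.0,
                       cap_fraction: float = 0.40):
    """Recommend a cap of 40% of TDP for every GPU under 10% utilization."""
    return [
        {"node": g.node, "gpu": g.index, "cap_watts": round(g.tdp_watts * cap_fraction)}
        for g in fleet
        if g.utilization_pct < idle_below_pct
    ]

fleet = [GpuState("n1", 0, 3.0, 700.0), GpuState("n1", 1, 85.0, 700.0)]
recs = idle_gpu_power_cap(fleet)   # only GPU 0 is idle; 40% of 700 W is 280 W
```

Note that 40% of TDP on a low-power card can fall below the Advisor tier's 100 W validation floor (a 70 W card would be capped at 28 W); how the agent reconciles the two is not stated here.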

Safety guardrails:

Blast radius: max 25% of fleet affected per eval (configurable)
Canary ramp-up: new policies start at 10% of fleet, double each successful eval
Leader election: only one leader per cluster — prevents duplicate commands
Adaptive polling: command polling backs off 60s → 300s when idle, resets on command
Priority sorting: P1 thermal/safety recs get through before P2/P3
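The blast radius and canary ramp-up can be combined into a single limit on how many nodes a policy may touch per evaluation. The sketch below assumes the ramp doubles from 10% and is clamped by the 25% blast radius, with the node count rounded down; the function name, the rounding, and the exact interaction of the two limits are assumptions.

```python
def affected_nodes(fleet_size: int, successful_evals: int,
                   canary_start_pct: int = 10, blast_radius_pct: int = 25) -> int:
    """Nodes a policy may touch: starts at 10%, doubles per success, capped at 25%."""
    pct = min(blast_radius_pct, canary_start_pct * 2 ** successful_evals)
    return fleet_size * pct // 100   # integer math avoids float rounding surprises

# For a 40-node fleet: 4 nodes at first, then 8, then 10 once the cap applies.
print([affected_nodes(40, n) for n in range(4)])   # [4, 8, 10, 10]
```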

Fleet Aggregation

Env var	Default	Description
`FLEET_AGGREGATOR_ENABLED`	`false`	Enable fleet aggregation endpoint
`FLEET_AGGREGATOR_PORT`	`9101`	Aggregator HTTP port
`FLEET_AGGREGATOR_PEERS`	(none)	Comma-separated peer URLs

Multi-Agent High-Frequency Sampling

Env var	Default	Description
`MULTI_AGENT_ENABLED`	`false`	Enable high-frequency ring buffer sampling
`FAST_SAMPLE_INTERVAL`	`0.2`	Fast sample interval (seconds)
`FAST_SAMPLE_BUFFER_SIZE`	`100`	Ring buffer size per GPU

Idle Calibration

Env var	Default	Description
`IDLE_BASELINE_WINDOW`	`30`	Seconds to calibrate idle power at startup
`WARMUP_DISCARD_SECONDS`	`45`	Discard samples in this startup window

Cluster Identity

Env var	Config key	Default	Description
`ALUMINATAI_CLUSTER_TAG`	`cluster_tag`	(none)	Cluster identifier (e.g., `aws-us-west-2`)
`ALUMINATAI_LOCATION_HINT`	`location_hint`	(none)	Free-text location (shown in UI)
`ALUMINATAI_GRID_ZONE`	`grid_zone`	(none)	Electricity Maps zone (e.g., `US-CAL-CISO`)
`HEARTBEAT_INTERVAL`	`heartbeat_interval`	`300`	Heartbeat interval (seconds)

TLS & Proxy

Env var	Config key	Default	Description
`HTTPS_PROXY`	`https_proxy`	(none)	HTTPS proxy URL
`ALUMINATAI_CA_BUNDLE`	`ca_bundle`	(none)	Path to custom CA PEM
`ALUMINATAI_CLIENT_CERT`	`client_cert`	(none)	mTLS client cert path
`ALUMINATAI_CLIENT_KEY`	`client_key`	(none)	mTLS client key path

Run Modes

Env var	Config key	Default	Description
`OFFLINE_MODE`	`offline_mode`	`false`	WAL only, no HTTP uploads
`DRY_RUN`	`dry_run`	`false`	Collect + attribute, skip uploads and WAL
`PROMETHEUS_ONLY`	`prometheus_only`	`false`	Local Prometheus only

Logging

Env var	Config key	Default	Description
`LOG_LEVEL`	`log_level`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`
`LOG_FORMAT`	`log_format`	`text`	`text` or `json` (newline-delimited JSON for ELK/Loki)
`LOG_DIR`	—	`./logs`	Log file directory

Deployment

One-line installer (Linux + systemd)

curl -sSL https://get.aluminatiai.com | bash

Flag	Effect
`--local`	Install from local source (dev / air-gapped)
`--no-service`	Package only — skip systemd setup
`--unattended` / `-y`	Non-interactive; requires `ALUMINATAI_API_KEY` env var

# CI / non-interactive (the variable must be set for bash, not for curl)
curl -sSL https://get.aluminatiai.com | ALUMINATAI_API_KEY=alum_xxx bash -s -- --unattended

# Check service health
sudo systemctl status aluminatai-agent
sudo journalctl -u aluminatai-agent -f

Manual systemd setup

pip install aluminatiai

# Create system user and directories
sudo useradd --system --no-create-home --shell /usr/sbin/nologin aluminatai
sudo install -d -m 0700 -o aluminatai -g aluminatai /var/lib/aluminatai
sudo install -d -m 0755 -o aluminatai -g aluminatai /var/log/aluminatai
sudo install -d -m 0750 /etc/aluminatai

# Write the env file (mode 600 — contains your API key)
sudo tee /etc/aluminatai/agent.env > /dev/null <<'EOF'
ALUMINATAI_API_KEY=alum_your_key_here
SAMPLE_INTERVAL=5.0
UPLOAD_INTERVAL=60
METRICS_PORT=9100
LOG_LEVEL=INFO
EOF
sudo chmod 600 /etc/aluminatai/agent.env

# Install the unit file
sudo cp deploy/aluminatai-agent.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now aluminatai-agent

The service unit includes systemd security hardening: NoNewPrivileges, ProtectSystem=strict, PrivateTmp, MemoryMax=256M, and system call filtering.

Docker (NVIDIA)

docker run --rm --runtime=nvidia --pid=host \
  -e ALUMINATAI_API_KEY=alum_your_key_here \
  ghcr.io/agentmulder404/aluminatai-agent:latest

Kubernetes DaemonSet

kubectl apply -f deploy/k8s/daemonset.yaml

The DaemonSet includes RBAC for pod metadata queries (used by the K8s attribution adapter).

Slurm

Add to your job prolog/epilog scripts:

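# /etc/slurm/prolog.d/aluminatiai.sh
set -a                              # export every variable the env file defines
source /etc/aluminatai/agent.env
set +a
aluminatiai &
echo $! > "/run/aluminatiai.${SLURM_JOB_ID}.pid"

# /etc/slurm/epilog.d/aluminatiai-stop.sh
# Stop by PID: pkill -f aluminatiai would also match this script's own path
kill "$(cat "/run/aluminatiai.${SLURM_JOB_ID}.pid")"
rm -f "/run/aluminatiai.${SLURM_JOB_ID}.pid"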

Attribution

The agent attributes GPU power to jobs using a multi-step resolution pipeline. The first step that matches wins.

Resolution pipeline

Priority	Method	Confidence	How it works
1	`ALUMINATAI_TEAM` env var	1.00	Explicit user tag — most trustworthy
1.5	`/api/v1/tag` REST registration	0.95	Background polling every 30s
2	Scheduler env vars	0.90	`SLURM_JOB_ID`, `RUNAI_JOB_NAME`, K8s pod UID
3	Scheduler poll	0.75	`gpu_to_job()` fallback query
4	Custom rules file	0.60	JSON regex patterns (see below)
5	Cmdline heuristics	0.40	Built-in patterns (jupyter, vllm, torchserve, ollama, …)
6	Memory split	0.20	Unresolved power split by GPU memory usage
7	Idle attribution	0.30	`ALUMINATAI_IDLE_TEAM` env var fallback
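"First step that matches wins" is a resolver chain: try each method in priority order and stop at the first one that returns an owner. The sketch below shows three of the steps with the confidence values from the table; the resolver functions and the shape of the `proc` dictionary are hypothetical simplifications of the real engine.

```python
from typing import Callable, Optional

Resolver = Callable[[dict], Optional[str]]

def from_team_env(proc: dict) -> Optional[str]:
    return proc.get("env", {}).get("ALUMINATAI_TEAM")

def from_scheduler_env(proc: dict) -> Optional[str]:
    job_id = proc.get("env", {}).get("SLURM_JOB_ID")
    return f"slurm-{job_id}" if job_id else None

def from_cmdline(proc: dict) -> Optional[str]:
    return "research" if "jupyter" in proc.get("cmdline", "") else None

# (method name, confidence, resolver), in priority order.
PIPELINE = [
    ("team_env", 1.00, from_team_env),
    ("scheduler_env", 0.90, from_scheduler_env),
    ("cmdline_heuristic", 0.40, from_cmdline),
]

def attribute(proc: dict):
    """Return (owner, method, confidence) from the first resolver that matches."""
    for method, confidence, resolver in PIPELINE:
        owner = resolver(proc)
        if owner:
            return owner, method, confidence
    return None
```

A Jupyter process launched inside a Slurm job resolves through the scheduler step at confidence 0.90 and never reaches the cmdline heuristic.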

Tagging workloads

# Simplest: set env vars before launching your job
ALUMINATAI_TEAM=nlp-team \
ALUMINATAI_MODEL=llama3-finetune \
python train.py

Custom attribution rules

Create an attribution_rules.json file to map command-line patterns to teams:

{
  "rules": [
    { "pattern": "python.*gpt4_train", "team": "llm-infra", "model": "gpt4",     "priority": 10 },
    { "pattern": "vllm.*llama",        "team": "inference",  "model": "llama",    "priority": 5  },
    { "pattern": "jupyter",            "team": "research",   "model": "notebook", "priority": 1  }
  ]
}
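To make the rule semantics concrete, the sketch below loads the example rules and matches a command line against them. It assumes that a larger "priority" value wins when several rules match and that patterns are searched anywhere in the command line; neither is confirmed by this README, and match_rule is a hypothetical helper.

```python
import json
import re

RULES_JSON = """
{
  "rules": [
    { "pattern": "python.*gpt4_train", "team": "llm-infra", "model": "gpt4",     "priority": 10 },
    { "pattern": "vllm.*llama",        "team": "inference", "model": "llama",    "priority": 5  },
    { "pattern": "jupyter",            "team": "research",  "model": "notebook", "priority": 1  }
  ]
}
"""

def match_rule(cmdline: str, rules):
    """Return the highest-priority rule whose regex is found in the command line."""
    # Assumption: a larger "priority" value wins when several rules match.
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if re.search(rule["pattern"], cmdline):
            return rule
    return None

rules = json.loads(RULES_JSON)["rules"]
hit = match_rule("python -m vllm.entrypoints.openai.api_server --model llama-3", rules)
print(hit["team"], hit["model"])   # inference llama
```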

Search order for the rules file:

1. ALUMINATAI_ATTRIBUTION_CONFIG env var (explicit path)
2. ./attribution_rules.json
3. ~/.config/aluminatai/attribution_rules.json

Supported schedulers

Scheduler	Detection	Job metadata source
Slurm	`SLURM_JOB_ID` env var	`scontrol show job`
Kubernetes	Pod cgroup UID	K8s API (requires RBAC)
Run:ai	`RUNAI_JOB_NAME` env var	Run:ai API

Spoofing protection

On multi-user hosts, restrict which UIDs can self-tag:

export ALUMINATAI_TRUSTED_UIDS=0,1000   # only root and UID 1000 may use ALUMINATAI_TEAM

When unset, all UIDs are trusted (backward compatible).

Prometheus Metrics

The agent exposes a /metrics endpoint (default port 9100) with these gauges and counters:

GPU metrics

Metric	Type	Description
`aluminatai_gpu_power_watts`	Gauge	Current power draw per GPU
`aluminatai_gpu_energy_joules_total`	Counter	Cumulative energy per GPU
`aluminatai_gpu_utilization_pct`	Gauge	GPU compute utilization
`aluminatai_gpu_temperature_c`	Gauge	GPU temperature
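The energy counter relates to the power gauge by integration over time: each sample adds power multiplied by the elapsed interval. The class below is a minimal sketch of that accumulation using the rectangle rule; it is an assumption about the approach, not the agent's implementation.

```python
class EnergyCounter:
    """Accumulate joules from periodic power samples (rectangle rule)."""

    def __init__(self) -> None:
        self.joules_total = 0.0
        self._last_ts = None

    def add_sample(self, timestamp_s: float, power_w: float) -> float:
        """Add power * elapsed time since the previous sample; return the total."""
        if self._last_ts is not None:
            self.joules_total += power_w * (timestamp_s - self._last_ts)
        self._last_ts = timestamp_s
        return self.joules_total

counter = EnergyCounter()
counter.add_sample(0, 300)            # first sample only sets the clock
counter.add_sample(5, 300)            # + 300 W * 5 s = 1500 J
total = counter.add_sample(10, 200)   # + 200 W * 5 s = 1000 J
```

Going the other way, applying PromQL's rate() to a joules counter yields average watts over the chosen range, since watts are joules per second.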

Phase decomposition (DCGM)

Metric	Type	Description
`aluminatai_gpu_tensor_power_watts`	Gauge	Tensor core power
`aluminatai_gpu_fp16_power_watts`	Gauge	FP16 path power
`aluminatai_gpu_memory_power_watts`	Gauge	Memory subsystem power
`aluminatai_gpu_idle_power_watts`	Gauge	Baseline idle power

Upload health

Metric	Type	Description
`aluminatai_upload_success_total`	Counter	Successful uploads
`aluminatai_upload_failure_total`	Counter	Failed uploads
`aluminatai_buffer_size`	Gauge	In-memory buffer entries pending
`aluminatai_wal_size_bytes`	Gauge	WAL file size

Attribution

Metric	Type	Labels	Description
`aluminatai_attribution_confidence`	Gauge	gpu_index, job_id, method	Confidence score (0–1)
`aluminatai_attribution_uncertainty_pct`	Gauge	gpu_index, job_id	± % uncertainty
`aluminatai_attribution_unresolved_total`	Counter	—	Unattributed power

Agent health

Metric	Type	Description
`aluminatai_agent_uptime_seconds`	Gauge	Agent uptime
`aluminatai_agent_info`	Gauge	Version, hostname, mode metadata

Prometheus scrape config

scrape_configs:
  - job_name: aluminatiai
    static_configs:
      - targets: ['gpu-host:9100']

ML Framework Integrations

MLflow

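import mlflow
from aluminatiai.integrations.mlflow_callback import AluminatiMLflowCallback

# `trainer` is your existing training object; it must expose add_callback()
# (for example, a Hugging Face transformers.Trainer).
with mlflow.start_run():
    trainer.add_callback(AluminatiMLflowCallback())
    trainer.train()
    # energy_kwh, cost_usd, co2_kg logged automatically at run end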

Weights & Biases

import wandb
from aluminatiai.integrations.wandb_callback import AluminatiWandbCallback

wandb.init(project="my-project")
# `trainer` is your existing training object exposing add_callback()
trainer.add_callback(AluminatiWandbCallback())
trainer.train()
# energy metrics logged to wandb.run.summary

OpenTelemetry

from aluminatiai.integrations.otel_exporter import AluminatiOtelExporter

exporter = AluminatiOtelExporter()
# GPU metrics exported as OTEL span attributes to Jaeger/Datadog/etc.

Hardware-Specific Notes

NVIDIA

Standard NVML-based collection. Works out of the box with any NVIDIA datacenter or consumer GPU with driver 450.80.02+.

pip install nvidia-ml-py   # or nvidia-ml-py3

AMD

Requires either amdsmi (Python bindings from ROCm 6+) or rocm-smi CLI in PATH.

pip install amdsmi   # preferred
# or ensure rocm-smi is in PATH

Intel Gaudi

Requires either pyhlml (ships with SynapseAI driver) or hl-smi CLI in PATH.

# pyhlml is installed with the Habana SynapseAI SDK
# or set HL_SMI_PATH if hl-smi is not in PATH
export HL_SMI_PATH=/opt/habanalabs/bin/hl-smi

Intel Arc

Requires xpu-smi (ships with Intel oneAPI Base Toolkit) or the xe/i915 kernel driver with hwmon sysfs.

# xpu-smi is installed with the oneAPI toolkit
# or set XPU_SMI_PATH if not in PATH
export XPU_SMI_PATH=/opt/intel/oneapi/xpu-smi/bin/xpu-smi

Apple Silicon

Uses powermetrics for accurate power reading (requires passwordless sudo) or falls back to ioreg (utilization only, estimates power from TDP).

For powermetrics access, add this line to the sudoers configuration (edit it with sudo visudo rather than directly):

your_username ALL=(ALL) NOPASSWD: /usr/bin/powermetrics

Without sudo access, the agent uses ioreg (less accurate but no privileges needed).

CPU-Only (RAPL)

For machines with no discrete GPU. Monitors CPU package power via Intel/AMD RAPL sysfs counters.

export CPU_ONLY_MODE=1   # skip GPU detection entirely
aluminatiai

Requires read access to /sys/class/powercap/intel-rapl:* or /sys/class/powercap/amd_rapl:*.

Security

Environment variable privacy

The agent reads /proc/<pid>/environ to attribute jobs. Only a small allowlist of env var keys is retained:

SLURM_JOB_ID, RUNAI_JOB_NAME, KUBERNETES_SERVICE_HOST,
ALUMINATAI_TEAM, ALUMINATAI_MODEL, ALUMINATAI_* (any key with this prefix)

All other env vars (credentials, tokens, database URLs) are dropped immediately.
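The allowlist can be applied while parsing, so that non-matching values are never stored. The sketch below parses the NUL-separated format of /proc/<pid>/environ and keeps only the keys listed above; the function names are hypothetical and this is not the agent's process_probe.py.

```python
ALLOWED_KEYS = {"SLURM_JOB_ID", "RUNAI_JOB_NAME", "KUBERNETES_SERVICE_HOST"}
ALLOWED_PREFIX = "ALUMINATAI_"

def filter_environ(raw: bytes) -> dict:
    """Parse a /proc/<pid>/environ blob and keep only allowlisted keys."""
    kept = {}
    for entry in raw.split(b"\0"):          # entries are NUL-separated KEY=VALUE
        key, sep, value = entry.partition(b"=")
        if not sep:
            continue                        # skip empty or malformed entries
        name = key.decode("utf-8", "replace")
        if name in ALLOWED_KEYS or name.startswith(ALLOWED_PREFIX):
            kept[name] = value.decode("utf-8", "replace")
    return kept

def read_allowed_env(pid: int) -> dict:
    """Linux only; needs permission to read the target process's environ."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return filter_environ(fh.read())
```

For example, a blob containing SLURM_JOB_ID, AWS_SECRET_ACCESS_KEY, and ALUMINATAI_TEAM comes back with only the first and last keys.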

WAL encryption

The write-ahead log is encrypted automatically when ALUMINATAI_API_KEY is set and the cryptography package is installed:

pip install 'aluminatiai[secure]'

The Fernet key is derived as SHA-256(API_KEY). Fernet encrypts with AES-128 in CBC mode and authenticates with HMAC-SHA256. Without the package, the agent falls back to a plaintext WAL with a one-time warning.
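The key derivation works because a Fernet key is 32 bytes of url-safe base64, which is exactly the size of a SHA-256 digest. The sketch below shows one way to perform that derivation; the function names are hypothetical, and the exact encoding the agent uses is an assumption.

```python
import base64
import hashlib

def derive_fernet_key(api_key: str) -> bytes:
    """SHA-256 the API key and encode the 32-byte digest as a Fernet key."""
    digest = hashlib.sha256(api_key.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest)   # Fernet expects url-safe base64

def encrypt_record(api_key: str, record: bytes) -> bytes:
    """Encrypt one WAL record; requires the `cryptography` package."""
    from cryptography.fernet import Fernet
    return Fernet(derive_fernet_key(api_key)).encrypt(record)

key = derive_fernet_key("alum_your_key_here")
```

One consequence of this design is that rotating the API key changes the derived key, so WAL entries written under the old key would need the old key to be read.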

Prometheus endpoint hardening

# Bind to localhost only
export METRICS_BIND_HOST=127.0.0.1

# Require HTTP Basic Auth
export METRICS_BASIC_AUTH=scrape_user:strong_password

Use a TLS-terminating reverse proxy (nginx, Caddy) in front of the metrics endpoint in production.

Offline / air-gapped clusters

# No outbound HTTP — all metrics go to WAL
OFFLINE_MODE=1 aluminatiai

# Later, on a machine with network access
aluminatiai replay --output metrics.csv --clear

Directory permissions

Data, WAL, and log directories are created with mode 0o700 (owner-only access).

Self-Hosting

Point the agent at your own ingest endpoint:

ALUMINATAI_API_ENDPOINT=https://your-api.internal/v1/metrics/ingest \
ALUMINATAI_API_KEY=your_key \
aluminatiai

Package Structure

agent/
├── agent.py              # Daemon entry point, signal handling, main loop
├── cli.py                # CLI router (subcommand dispatch)
├── config.py             # Config file + env var loader with validation
├── collector.py          # NVIDIA GPU collector (NVML)
├── amd_collector.py      # AMD GPU collector (amdsmi / rocm-smi)
├── gaudi_collector.py    # Intel Gaudi collector (pyhlml / hl-smi)
├── intel_arc_collector.py# Intel Arc collector (xpu-smi / hwmon)
├── apple_collector.py    # Apple Silicon collector (powermetrics / ioreg)
├── rapl_collector.py     # CPU-only collector (RAPL sysfs)
├── uploader.py           # HTTPS upload + WAL + exponential backoff
├── metrics_server.py     # Prometheus /metrics endpoint
├── fleet_aggregator.py   # Multi-node fleet metric rollups
├── benchmark.py          # GPU power baseline CLI
├── attribution/          # Job attribution engine
│   ├── engine.py         # 7-step resolution pipeline
│   ├── pid_resolver.py   # PID → team resolver
│   ├── pid_smoother.py   # Transient PID filtering (30s window)
│   ├── process_probe.py  # /proc reader (environ, cmdline, cgroup)
│   └── rules.py          # Custom JSON attribution rules
├── schedulers/           # Scheduler adapters
│   ├── slurm.py          # Slurm (scontrol)
│   ├── kubernetes.py     # Kubernetes (pod UID → K8s API)
│   └── runai.py          # Run:ai
├── integrations/         # ML framework callbacks
│   ├── mlflow_callback.py
│   ├── wandb_callback.py
│   └── otel_exporter.py
├── recommendation_reporter.py  # Uploads optimization recs to cloud (Advisor)
├── command_receiver.py         # Polls + executes approved commands (Advisor)
├── swarm/                      # Fleet-wide optimization (Swarm)
│   ├── policy_engine.py        # Leader election, blast radius, ramp-up
│   ├── fleet_state.py          # Fleet snapshot data model
│   └── policies.py             # 4 built-in fleet policies
├── efficiency/           # Energy analysis
│   ├── gpu_specs.py      # 45 GPU architecture specs + roofline model
│   ├── rapl.py           # Multi-socket RAPL reader
│   ├── auto_tuner.py     # Periodic power cap recommendations
│   ├── optimize.py       # Real-time efficiency analyzer
│   ├── carbon.py         # Electricity Maps carbon intensity
│   ├── carbon_scheduler.py # Carbon-aware job scheduling
│   ├── curve_builder.py  # Fleet efficiency curves
│   ├── hardware_match.py # Roofline hardware match scorer
│   └── power_control.py  # NVML power limit enforcement
├── storage/
│   └── tsdb.py           # Local SQLite time-series store
├── deploy/               # Production deployment files
│   ├── aluminatai-agent.service  # systemd unit (hardened)
│   ├── k8s/              # K8s DaemonSet + RBAC
│   └── install.sh        # One-line installer
└── tests/                # 16 test files, 300+ tests

Development

git clone https://github.com/AgentMulder404/AluminatAI.git
cd AluminatAI/agent
pip install -e ".[all]"
python -m pytest tests/ --ignore=tests/powercap_ab_test.py -v

License

Apache 2.0 — see LICENSE.

Release history

0.3.1 (May 20, 2026)
0.3.0 (May 20, 2026), this version
0.2.1 (Mar 8, 2026)
0.2.0 (Mar 8, 2026)

Download files

Source Distribution

aluminatiai-0.3.0.tar.gz (260.0 kB)

Uploaded May 20, 2026 Source

Built Distribution

aluminatiai-0.3.0-py3-none-any.whl (275.5 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file aluminatiai-0.3.0.tar.gz.

File metadata

Download URL: aluminatiai-0.3.0.tar.gz
Upload date: May 20, 2026
Size: 260.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for aluminatiai-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`739ad2f89ebe49a698044a5ade20bc3e1617b1edabb91aedf9dc9830df094724`
MD5	`aa51885ac46c717159eae314bf0c531b`
BLAKE2b-256	`3d92bee13ea3f7dc9e8864c3a044459622d4f211ba07e8b749976edab0f47fa4`

File details

Details for the file aluminatiai-0.3.0-py3-none-any.whl.

File metadata

Download URL: aluminatiai-0.3.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 275.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for aluminatiai-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d07a9ed2eecf0ea0e376c7499fe357a1f34c7f92b733d9ea08fdd2952fc266f`
MD5	`9ac55b81ad5956d16546b80ef4bce76b`
BLAKE2b-256	`15c8986defc6356f1671df233fbf3f7f496c4b945ee22eff12a50245b9abb0a3`
