Python library for LLM evaluation, observability, and cost monitoring with regression detection and budget-aware workflows.
Project description
Aegis Monitor - LLM Evaluation & Cost Governance
Aegis AI is an open-source framework for evaluating, comparing, and governing LLM systems. Engineers use Aegis AI to:
- Evaluate LLM outputs with pluggable metrics
- Monitor costs in real time and enforce budgets
- Detect regressions before deploying to production
- Compare models objectively on quality vs. cost
- Integrate evaluations into CI/CD pipelines
Built for engineering teams that want reproducible, cost-conscious LLM workflows.
Quick Start
Installation
# Core installation
pip install aegis-monitor
# With OpenAI support
pip install "aegis-monitor[openai]"
# With Anthropic (Claude) support
pip install "aegis-monitor[anthropic]"
# With all providers
pip install "aegis-monitor[all]"
1-Minute Example
Create a dataset (examples/qa.yaml):
name: qa_sample
description: Basic Q&A evaluation
cases:
  - input: "What is the capital of France?"
    expected: "Paris"
  - input: "Explain photosynthesis"
    expected: "Process where plants convert light to energy"
Run evaluation:
export OPENAI_API_KEY=your-key-here
aegis eval run \
--dataset examples/qa.yaml \
--model gpt-4
Output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dataset: qa_sample (2 cases)
Model: gpt-4
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Results:
Avg Score: 0.92
Total Cost: $0.0045
Avg Latency: 1.2s
Pass Rate: 2/2
Status: ✓ PASSED
Core Features
Evaluation Metrics
Evaluate using:
- Exact Match: Strict string comparison
- Semantic Similarity: Embedding-based comparison
- Composite: Multiple metrics combined with weights (see the sketch below)
# Compare models on quality vs cost
aegis compare \
--dataset examples/qa.yaml \
--models gpt-4,gpt-3.5-turbo,claude-3-opus
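Composite scoring is just a weighted blend of individual metric scores. Below is a minimal sketch using the BaseScorer interface shown under Custom Scorers; the class, its constructor, and the scorer names in the comment are illustrative assumptions, not the library's built-in Composite implementation.

from aegis.scoring.base import BaseScorer

class WeightedComposite(BaseScorer):
    """Illustrative weighted blend of sub-scorers (not the built-in Composite)."""

    name = "weighted_composite_demo"

    def __init__(self, weighted_scorers):
        # e.g. [(exact_match_scorer, 0.3), (SemanticSimilarityScorer(), 0.7)]
        self.weighted_scorers = weighted_scorers

    def score(self, expected: str, actual: str) -> float:
        total_weight = sum(w for _, w in self.weighted_scorers)
        if not total_weight:
            return 0.0
        blended = sum(s.score(expected, actual) * w
                      for s, w in self.weighted_scorers)
        return blended / total_weight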
Cost Transparency
Track costs across evaluations:
# Weekly cost report
aegis cost report --period week
# By-model breakdown
aegis cost report --period month
# Export for analysis
aegis cost report --period month --export costs.csv
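Once exported, the CSV can be sliced with standard tooling. Here is a short sketch that totals spend per model; the column names "model" and "cost_usd" are assumptions about the export schema, so check the actual header row first:

import csv
from collections import defaultdict

# Aggregate exported costs by model (column names are assumptions).
totals = defaultdict(float)
with open("costs.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["model"]] += float(row["cost_usd"])

# Print models from most to least expensive.
for model, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{model}: ${cost:.4f}")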
Regression Detection
Maintain quality standards:
# Set baseline
aegis baseline set --dataset qa --run-id abc123
# Compare to baseline
aegis eval run \
--dataset examples/qa.yaml \
--model gpt-4 \
--baseline qa
Result: ✓ PASS, ⚠ WARNING, or ✗ FAIL
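Conceptually, the baseline comparison checks the current run's score and cost against the stored baseline within configured tolerances; the threshold names below mirror the aegis.yaml example under Configuration. This is a sketch of the idea, not Aegis's actual implementation:

def regression_status(baseline_score: float, current_score: float,
                      baseline_cost: float, current_cost: float,
                      score_drop_pct: float = 5.0,
                      cost_increase_pct: float = 10.0) -> str:
    """Classify a run against its baseline: PASS, WARNING, or FAIL."""
    score_drop = (baseline_score - current_score) / max(baseline_score, 1e-9) * 100
    cost_rise = (current_cost - baseline_cost) / max(baseline_cost, 1e-9) * 100
    if score_drop > score_drop_pct or cost_rise > cost_increase_pct:
        return "FAIL"
    if score_drop > 0 or cost_rise > 0:
        return "WARNING"
    return "PASS"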
Budget Enforcement
Control spending:
# Set monthly budget
aegis cost budget --limit 500.00 --mode warn
# Per-feature budgets
aegis cost budget \
--limit 100.00 \
--dataset summarization \
--mode block
Modes:
- block: Raise an error if the budget is exceeded
- warn: Log a warning but continue
- log: Silent logging only
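As a sketch of how the three modes behave once spend crosses the limit (the function and exception here are illustrative, not the library's internals):

import logging

class BudgetExceededError(RuntimeError):
    """Illustrative exception for 'block' mode; the real class may differ."""

def enforce_budget(spent: float, limit: float, mode: str = "warn") -> None:
    """Apply a budget policy after each cost update."""
    if spent <= limit:
        return
    msg = f"Budget exceeded: ${spent:.2f} spent of ${limit:.2f} limit"
    if mode == "block":
        raise BudgetExceededError(msg)
    if mode == "warn":
        logging.warning(msg)
    else:  # "log"
        logging.info(msg)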
Advanced Usage
Programmatic API
from aegis.core.evaluator import Evaluator
from aegis.adapters.registry import get_adapter
from aegis.scoring.semantic_similarity import SemanticSimilarityScorer
from aegis.core.dataset import Dataset
# Load dataset
dataset = Dataset.load_from_yaml("examples/qa.yaml")
# Create adapter and scorer
adapter = get_adapter("gpt-4")
scorer = SemanticSimilarityScorer()
# Run evaluation
evaluator = Evaluator(adapter, scorer)
results = evaluator.evaluate(dataset)
# Access results
print(f"Average score: {results.avg_score}")
print(f"Total cost: ${results.total_cost:.4f}")
Custom Scorers
Create your own scoring logic:
from aegis.scoring.base import BaseScorer

class CustomScorer(BaseScorer):
    """Custom evaluation metric."""

    name = "custom"

    def score(self, expected: str, actual: str) -> float:
        """Score output (0.0 to 1.0)."""
        # Your logic here
        return 1.0 if expected.lower() == actual.lower() else 0.0

# Use in evaluation
scorer = CustomScorer()
evaluator = Evaluator(adapter, scorer)
results = evaluator.evaluate(dataset)
Custom Adapters
Integrate any LLM provider:
See ADAPTER_DEVELOPMENT.md for the complete guide.
from aegis.adapters.base import BaseModelAdapter, ModelResponse

class CustomAdapter(BaseModelAdapter):
    """Adapter for a custom LLM service."""

    async def call(self, prompt: str, **kwargs) -> ModelResponse:
        # Call your model
        response = await self._call_api(prompt)
        return ModelResponse(
            text=response.text,
            input_tokens=response.input_tokens,
            output_tokens=response.output_tokens,
            latency_ms=response.latency_ms,
            model=self.model,
        )

    def validate_connection(self) -> bool:
        # Test API connectivity
        return True

    def get_model_info(self) -> dict:
        return {
            "model": self.model,
            "provider": "custom",
            "pricing": {"input": 0.01, "output": 0.05},
        }

# Register in aegis/adapters/registry.py
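For orientation, registration might look roughly like the following; the registry internals shown here are an assumption (including the constructor call), so treat ADAPTER_DEVELOPMENT.md as authoritative:

# Sketch of aegis/adapters/registry.py internals (assumed, not verified)
ADAPTERS = {
    "my-custom-model": CustomAdapter,
}

def get_adapter(model: str):
    """Resolve a model name to an adapter instance."""
    if model not in ADAPTERS:
        raise ValueError(f"No adapter registered for model: {model}")
    # Assumes the adapter constructor accepts a `model` keyword.
    return ADAPTERS[model](model=model)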
CLI Reference
Evaluation
# Run single model
aegis eval run \
--dataset <path> \
--model <model-name> \
--provider <auto|openai|anthropic|mock> \
--output <text|json> \
--baseline <name>
# Compare models
aegis compare \
--dataset <path> \
--models <model1,model2,...>
Baselines
# Set baseline
aegis baseline set \
--dataset <name> \
--run-id <id>
# Show baseline
aegis baseline show --dataset <name>
# List baselines
aegis baseline list
Cost Intelligence
# Cost report
aegis cost report \
--period <day|week|month> \
--export <file.csv>
# Cost analysis
aegis cost analyze --period week
# Budget management
aegis cost budget \
--limit <amount> \
--mode <block|warn|log> \
--dataset <optional>
Architecture
┌─────────────────────────────────────────────────┐
│ CLI (typer) │
│ ├─ eval: Run evaluation on dataset │
│ ├─ compare: Multi-model comparison │
│ ├─ baseline: Manage baseline comparisons │
│ └─ cost: Cost tracking and budgets │
└────────────┬────────────────────────────────────┘
│
┌────────────▼────────────────────────────────────┐
│ Core Orchestration (Evaluator) │
│ ├─ Loads dataset │
│ ├─ Calls LLM adapter │
│ ├─ Scores outputs │
│ ├─ Calculates costs │
│ └─ Detects regressions │
└────────────┬────────────────────────────────────┘
│
┌───────┴───────┬──────────┬──────────┐
│ │ │ │
┌────▼──────┐ ┌────▼──────┐ ┌─▼──────┐ ┌─▼────────┐
│ Adapters │ │ Scorers │ │ Cost │ │ Storage │
├───────────┤ ├───────────┤ ├────────┤ ├──────────┤
│ OpenAI │ │ Exact │ │ Calc │ │ SQLite │
│ Anthropic │ │ Semantic │ │ Budget │ │ Aggregate│
│ Custom │ │ Composite │ │ Report │ │ Export │
└───────────┘ └───────────┘ └────────┘ └──────────┘
Dataset Format
YAML datasets define test cases:
name: my_dataset
description: "My evaluation dataset"

# Individual test cases
cases:
  - input: "User question here"
    expected: "Expected answer"
    tags: [feature-1, easy]
  - input: "Another question"
    expected: "Another answer"
    tags: [feature-2, hard]

# Scoring configuration
scoring:
  type: composite
  weights:
    exact_match: 0.3
    semantic_similarity: 0.7

# Optional thresholds for pass/fail
thresholds:
  pass: 0.8
  warning: 0.7
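The same file loads straight into the programmatic API via the documented Dataset.load_from_yaml. Iterating individual cases, as below, assumes a cases attribute with input/expected fields mirroring the YAML schema; that attribute layout is an assumption, not confirmed API:

from aegis.core.dataset import Dataset

dataset = Dataset.load_from_yaml("examples/qa.yaml")
# Attribute names below are assumptions mirroring the YAML schema.
for case in dataset.cases:
    print(case.input, "->", case.expected)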
Examples
Example 1: Q&A Evaluation
# Evaluate Q&A model
aegis eval run \
--dataset examples/qa_sample.yaml \
--model gpt-4
# Compare models
aegis compare \
--dataset examples/qa_sample.yaml \
--models gpt-4,gpt-3.5-turbo
Example 2: Cost Tracking
# Track costs over time
aegis eval run \
--dataset examples/qa_sample.yaml \
--model gpt-4
# View cost report
aegis cost report --period week
# Set budgets
aegis cost budget --limit 100.0 --mode warn
See: examples/cost_tracking_demo.py
Example 3: Model Comparison
# Compare multiple models
aegis compare \
--dataset examples/qa_sample.yaml \
--models gpt-4,claude-3-opus,gpt-3.5-turbo
# See cost-per-quality rankings
See: examples/model_compare.py
Example 4: CI/CD Integration
# .github/workflows/llm-test.yml
name: LLM Tests

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Aegis AI
        run: pip install "aegis-monitor[openai]"

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          aegis eval run \
            --dataset tests/evaluation.yaml \
            --model gpt-4 \
            --baseline production \
            --output json > results.json

      - name: Check Costs
        run: |
          aegis cost report --period day

      - name: Fail on Regression
        run: |
          # Custom logic to fail if a regression is detected
          python scripts/check_regression.py results.json
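The workflow above assumes a scripts/check_regression.py helper. A minimal sketch follows; the JSON field names are assumptions, so inspect your actual --output json payload and adjust:

# scripts/check_regression.py (sketch; field names are assumptions)
import json
import sys

def main(path: str) -> int:
    with open(path) as f:
        results = json.load(f)
    # "status" is assumed to match the PASSED/FAILED text output shown earlier.
    status = results.get("status", "")
    if status != "PASSED":
        print(f"Regression detected: status={status!r}", file=sys.stderr)
        return 1
    print("No regression detected.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))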
Configuration
Environment Variables
# OpenAI
export OPENAI_API_KEY=sk-...
# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# Aegis AI
export AEGIS_STORAGE=./aegis.db
export AEGIS_LOG_LEVEL=INFO
Configuration File
Create aegis.yaml:
models:
  gpt-4:
    provider: openai
    temperature: 0.7
    max_tokens: 1000
  claude-3-opus:
    provider: anthropic
    temperature: 0.7

storage:
  backend: sqlite
  path: ./aegis.db

thresholds:
  score_drop_pct: 5
  cost_increase_pct: 10

budget:
  monthly_limit: 1000.0
  enforcement: warn
Performance
- Per-evaluation latency: ~2-5 seconds (depends on model)
- Storage: SQLite, minimal overhead
- Memory: ~50MB for typical datasets
- Cost calculation: Real-time per request
Testing
Run the test suite:
# All tests
pytest
# Specific module
pytest tests/test_evaluator.py -v
# With coverage
pytest --cov=aegis --cov-report=html
# Integration tests
pytest tests/integration/ -v
Tests use mock LLM responses; no real API calls are required.
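For your own offline tests, a stub adapter that follows the interface from Custom Adapters can stand in for a real provider. A sketch; only the three methods shown earlier are documented, and the BaseModelAdapter constructor details are an assumption:

from aegis.adapters.base import BaseModelAdapter, ModelResponse

class StubAdapter(BaseModelAdapter):
    """Returns a canned answer: no network calls, zero cost."""

    async def call(self, prompt: str, **kwargs) -> ModelResponse:
        return ModelResponse(
            text="Paris",
            input_tokens=5,
            output_tokens=1,
            latency_ms=0.0,
            model="stub",
        )

    def validate_connection(self) -> bool:
        return True

    def get_model_info(self) -> dict:
        return {"model": "stub", "provider": "mock",
                "pricing": {"input": 0.0, "output": 0.0}}

Pass an instance to Evaluator exactly as in Programmatic API to exercise scoring and cost logic without touching a live endpoint.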
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas for contribution:
- New adapter implementations
- Custom scoring metrics
- Documentation improvements
- Example projects
- Bug fixes
Support
- Documentation: docs/
- Examples: examples/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Roadmap
✅ Complete
- Foundation, first integration, intelligence layer, cost engine
- Model comparison, Anthropic adapter
- Hardening, 80%+ coverage, comprehensive docs
Future
- Web dashboard
- Real-time monitoring
- Advanced analytics
License
MIT License - see LICENSE file
Acknowledgments
Built with:
- Typer - CLI framework
- Pydantic - Data validation
- Rich - Terminal formatting
- sentence-transformers - Sentence embeddings
Questions? Open an issue or discussion on GitHub.
Ready to get started? See Quick Start above.
Made with ❤️ for the AI engineering community.
Project details
Download files
Source Distribution: aegis_monitor-0.1.0.tar.gz
Built Distribution: aegis_monitor-0.1.0-py3-none-any.whl
File details
Details for the file aegis_monitor-0.1.0.tar.gz.
File metadata
- Download URL: aegis_monitor-0.1.0.tar.gz
- Size: 80.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86e8280367838ec7ae1f2ec6ef39d448cde8520a073e7486d416dfb54fb7700a |
| MD5 | 7a24eade6ffc7c71b94a78ee1482d26d |
| BLAKE2b-256 | 74eda31c3c310df5b8eabe2e9bc3afe2b8ffd2703ba80155773d07b6358f1531 |
Provenance
The following attestation bundles were made for aegis_monitor-0.1.0.tar.gz:
Publisher: python-publish.yml on adetorodev/aegis-monitor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aegis_monitor-0.1.0.tar.gz
- Subject digest: 86e8280367838ec7ae1f2ec6ef39d448cde8520a073e7486d416dfb54fb7700a
- Sigstore transparency entry: 1017164721
- Permalink: adetorodev/aegis-monitor@8b235901cc28e3c052c736f736cf69b8bd80bee7
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/adetorodev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8b235901cc28e3c052c736f736cf69b8bd80bee7
- Trigger Event: release
File details
Details for the file aegis_monitor-0.1.0-py3-none-any.whl.
File metadata
- Download URL: aegis_monitor-0.1.0-py3-none-any.whl
- Size: 46.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4077cff3b8cf741d27c5a7d5e12b4e9b4da5b0a20194374e8396120929c01ee2 |
| MD5 | f0494cf6d8c1b9ba09071303519eea89 |
| BLAKE2b-256 | 76cd8ff6a4425284c00804ac819d1eea1074b7b9d17a78bbe736275c83f9d91a |
Provenance
The following attestation bundles were made for aegis_monitor-0.1.0-py3-none-any.whl:
Publisher: python-publish.yml on adetorodev/aegis-monitor

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aegis_monitor-0.1.0-py3-none-any.whl
- Subject digest: 4077cff3b8cf741d27c5a7d5e12b4e9b4da5b0a20194374e8396120929c01ee2
- Sigstore transparency entry: 1017164748
- Permalink: adetorodev/aegis-monitor@8b235901cc28e3c052c736f736cf69b8bd80bee7
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/adetorodev
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8b235901cc28e3c052c736f736cf69b8bd80bee7
- Trigger Event: release