Python library for LLM evaluation, observability, and cost monitoring with regression detection and budget-aware workflows.

Aegis Monitor - LLM Evaluation & Cost Governance

Python 3.10+ · License: MIT

Aegis AI is an open-source framework for evaluating, comparing, and governing LLM systems. Engineers use Aegis AI to:

  • Evaluate LLM outputs with pluggable metrics
  • Monitor costs in real time and enforce budgets
  • Detect regressions before deploying to production
  • Compare models objectively on quality vs. cost
  • Integrate evaluations into CI/CD pipelines

Built for engineering teams that want reproducible, cost-conscious LLM workflows.

Quick Start

Installation

# Core installation
pip install aegis-monitor

# With OpenAI support
pip install "aegis-monitor[openai]"

# With Anthropic (Claude) support
pip install "aegis-monitor[anthropic]"

# With all providers
pip install "aegis-monitor[all]"

1-Minute Example

Create a dataset (examples/qa.yaml):

name: qa_sample
description: Basic Q&A evaluation
cases:
  - input: "What is the capital of France?"
    expected: "Paris"

  - input: "Explain photosynthesis"
    expected: "Process where plants convert light to energy"

Run evaluation:

export OPENAI_API_KEY=your-key-here

aegis eval run \
  --dataset examples/qa.yaml \
  --model gpt-4

Output:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dataset:     qa_sample (2 cases)
Model:       gpt-4
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Results:
  Avg Score:      0.92
  Total Cost:     $0.0045
  Avg Latency:    1.2s
  Pass Rate:      2/2

Status: ✓ PASSED

Core Features

Evaluation Metrics

Evaluate using:

  • Exact Match: Exact string comparison
  • Semantic Similarity: Embedding-based comparison
  • Composite: Multiple metrics combined with weights (see the sketch below)
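
For intuition, a composite score is just a weighted average of the individual metric scores. A minimal sketch, with weights mirroring the example scoring config in Dataset Format below (the variable names here are illustrative, not library internals):

# Illustrative only: combining per-metric scores with weights.
weights = {"exact_match": 0.3, "semantic_similarity": 0.7}
scores = {"exact_match": 0.0, "semantic_similarity": 0.91}

composite = sum(weights[name] * scores[name] for name in weights)
print(f"Composite score: {composite:.2f}")  # 0.64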
Model Comparison

Compare models directly:

# Compare models on quality vs cost
aegis compare \
  --dataset examples/qa.yaml \
  --models gpt-4,gpt-3.5-turbo,claude-3-opus

Cost Transparency

Track costs across evaluations:

# Weekly cost report
aegis cost report --period week

# By-model breakdown
aegis cost report --period month

# Export for analysis
aegis cost report --period month --export costs.csv

Regression Detection

Maintain quality standards:

# Set baseline
aegis baseline set --dataset qa --run-id abc123

# Compare to baseline
aegis eval run \
  --dataset examples/qa.yaml \
  --model gpt-4 \
  --baseline qa

Result: ✓ PASS, ⚠ WARNING, or ✗ FAIL
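
Conceptually, regression detection compares the new run's average score against the stored baseline. A minimal sketch of the decision, using the score_drop_pct value from the Configuration section below (the exact warn/fail split here is an assumption, not the library's verified behavior):

# Sketch of a baseline comparison; regression_status is illustrative.
def regression_status(baseline_score: float, new_score: float,
                      score_drop_pct: float = 5.0) -> str:
    drop_pct = (baseline_score - new_score) / baseline_score * 100
    if drop_pct <= 0:
        return "PASS"      # score held steady or improved
    if drop_pct < score_drop_pct:
        return "WARNING"   # small drop, worth investigating
    return "FAIL"          # drop beyond the allowed threshold

print(regression_status(baseline_score=0.92, new_score=0.85))  # FAIL (~7.6% drop)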

Budget Enforcement

Control spending:

# Set monthly budget
aegis cost budget --limit 500.00 --mode warn

# Per-feature budgets
aegis cost budget \
  --limit 100.00 \
  --dataset summarization \
  --mode block

Modes:

  • block: Raise error if exceeded
  • warn: Log warning but continue
  • log: Silent logging only
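
The three modes are different reactions to the same check. A rough sketch of the semantics (illustrative, not the library's actual code):

import logging

# Sketch: what each enforcement mode does once spend exceeds the limit.
def enforce_budget(spent: float, limit: float, mode: str = "warn") -> None:
    if spent <= limit:
        return
    message = f"Budget exceeded: ${spent:.2f} of ${limit:.2f}"
    if mode == "block":
        raise RuntimeError(message)  # stop the run immediately
    elif mode == "warn":
        logging.warning(message)     # continue, but loudly
    else:                            # "log"
        logging.info(message)        # record quietly and continue

enforce_budget(512.30, 500.00, mode="warn")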

Advanced Usage

Programmatic API

from aegis.core.evaluator import Evaluator
from aegis.adapters.registry import get_adapter
from aegis.scoring.semantic_similarity import SemanticSimilarityScorer
from aegis.core.dataset import Dataset

# Load dataset
dataset = Dataset.load_from_yaml("examples/qa.yaml")

# Create adapter and scorer
adapter = get_adapter("gpt-4")
scorer = SemanticSimilarityScorer()

# Run evaluation
evaluator = Evaluator(adapter, scorer)
results = evaluator.evaluate(dataset)

# Access results
print(f"Average score: {results.avg_score}")
print(f"Total cost: ${results.total_cost:.4f}")

Custom Scorers

Create your own scoring logic:

from aegis.scoring.base import BaseScorer

class CustomScorer(BaseScorer):
    """Custom evaluation metric."""

    name = "custom"

    def score(self, expected: str, actual: str) -> float:
        """Score output (0.0 to 1.0)."""
        # Your logic here
        return 1.0 if expected.lower() == actual.lower() else 0.0

# Use in evaluation
scorer = CustomScorer()
evaluator = Evaluator(adapter, scorer)
results = evaluator.evaluate(dataset)

Custom Adapters

Integrate any LLM provider:

See ADAPTER_DEVELOPMENT.md for the complete guide.

from aegis.adapters.base import BaseModelAdapter, ModelResponse

class CustomAdapter(BaseModelAdapter):
    """Adapter for custom LLM service."""

    async def call(self, prompt: str, **kwargs) -> ModelResponse:
        # Call your model
        response = await self._call_api(prompt)

        return ModelResponse(
            text=response.text,
            input_tokens=response.input_tokens,
            output_tokens=response.output_tokens,
            latency_ms=response.latency_ms,
            model=self.model,
        )

    def validate_connection(self) -> bool:
        # Test API
        return True

    def get_model_info(self) -> dict:
        return {
            "model": self.model,
            "provider": "custom",
            "pricing": {"input": 0.01, "output": 0.05},
        }

# Register in aegis/adapters/registry.py
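
The pricing dict returned by get_model_info feeds cost accounting. Assuming the prices are USD per 1,000 tokens (the unit is an assumption; confirm against the adapter docs), a request's cost follows directly from the token counts in ModelResponse:

# Illustrative cost math, assuming USD per 1K tokens.
pricing = {"input": 0.01, "output": 0.05}
input_tokens, output_tokens = 420, 180

cost = (input_tokens / 1000) * pricing["input"] \
     + (output_tokens / 1000) * pricing["output"]
print(f"Request cost: ${cost:.4f}")  # $0.0132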

CLI Reference

Evaluation

# Run single model
aegis eval run \
  --dataset <path> \
  --model <model-name> \
  --provider <auto|openai|anthropic|mock> \
  --output <text|json> \
  --baseline <name>

# Compare models
aegis compare \
  --dataset <path> \
  --models <model1,model2,...>

Baselines

# Set baseline
aegis baseline set \
  --dataset <name> \
  --run-id <id>

# Show baseline
aegis baseline show --dataset <name>

# List baselines
aegis baseline list

Cost Intelligence

# Cost report
aegis cost report \
  --period <day|week|month> \
  --export <file.csv>

# Cost analysis
aegis cost analyze --period week

# Budget management
aegis cost budget \
  --limit <amount> \
  --mode <block|warn|log> \
  --dataset <optional>

Architecture

┌─────────────────────────────────────────────────┐
│         CLI (typer)                             │
│  ├─ eval: Run evaluation on dataset             │
│  ├─ compare: Multi-model comparison             │
│  ├─ baseline: Manage baseline comparisons       │
│  └─ cost: Cost tracking and budgets             │
└────────────┬────────────────────────────────────┘
             │
┌────────────▼────────────────────────────────────┐
│  Core Orchestration (Evaluator)                 │
│  ├─ Loads dataset                               │
│  ├─ Calls LLM adapter                           │
│  ├─ Scores outputs                              │
│  ├─ Calculates costs                            │
│  └─ Detects regressions                         │
└────────────┬────────────────────────────────────┘
             │
     ┌───────┴───────┬──────────┬──────────┐
     │               │          │          │
┌────▼──────┐  ┌────▼──────┐ ┌─▼──────┐ ┌─▼────────┐
│ Adapters  │  │  Scorers  │ │ Cost   │ │ Storage  │
├───────────┤  ├───────────┤ ├────────┤ ├──────────┤
│ OpenAI    │  │ Exact     │ │ Calc   │ │ SQLite   │
│ Anthropic │  │ Semantic  │ │ Budget │ │ Aggregate│
│ Custom    │  │ Composite │ │ Report │ │ Export   │
└───────────┘  └───────────┘ └────────┘ └──────────┘

Dataset Format

YAML datasets define test cases:

name: my_dataset
description: "My evaluation dataset"

# Individual test cases
cases:
  - input: "User question here"
    expected: "Expected answer"
    tags: [feature-1, easy]

  - input: "Another question"
    expected: "Another answer"
    tags: [feature-2, hard]

# Scoring configuration
scoring:
  type: composite
  weights:
    exact_match: 0.3
    semantic_similarity: 0.7

# Optional thresholds for pass/fail
thresholds:
  pass: 0.8
  warning: 0.7
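
The thresholds map a run's average score to the status shown in the CLI output. Roughly (a sketch; the exact boundary handling is an assumption):

# Sketch: mapping an average score to a run status via the thresholds above.
def run_status(avg_score: float, pass_at: float = 0.8,
               warn_at: float = 0.7) -> str:
    if avg_score >= pass_at:
        return "PASSED"
    if avg_score >= warn_at:
        return "WARNING"
    return "FAILED"

print(run_status(0.92))  # PASSED, matching the Quick Start output above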

Examples

Example 1: Q&A Evaluation

# Evaluate Q&A model
aegis eval run \
  --dataset examples/qa_sample.yaml \
  --model gpt-4

# Compare models
aegis compare \
  --dataset examples/qa_sample.yaml \
  --models gpt-4,gpt-3.5-turbo

See: examples/simple_eval.py

Example 2: Cost Tracking

# Track costs over time
aegis eval run \
  --dataset examples/qa_sample.yaml \
  --model gpt-4

# View cost report
aegis cost report --period week

# Set budgets
aegis cost budget --limit 100.0 --mode warn

See: examples/cost_tracking_demo.py

Example 3: Model Comparison

# Compare multiple models
aegis compare \
  --dataset examples/qa_sample.yaml \
  --models gpt-4,claude-3-opus,gpt-3.5-turbo

# See cost-per-quality rankings

See: examples/model_compare.py

Example 4: CI/CD Integration

# .github/workflows/llm-test.yml
name: LLM Tests

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install Aegis AI
        run: pip install "aegis-monitor[openai]"

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          aegis eval run \
            --dataset tests/evaluation.yaml \
            --model gpt-4 \
            --baseline production \
            --output json > results.json

      - name: Check Costs
        run: |
          aegis cost report --period day

      - name: Fail on Regression
        run: |
          # Custom logic to fail if regression detected
          python scripts/check_regression.py results.json
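
The check_regression.py script is left to you. A minimal hypothetical version parses the JSON results and sets the exit code; the "status" key is an assumption about the --output json schema, so verify against your actual output:

# scripts/check_regression.py -- hypothetical sketch, not shipped with Aegis.
# Assumes the JSON results carry a top-level "status" field.
import json
import sys

with open(sys.argv[1]) as f:
    results = json.load(f)

if results.get("status") == "FAIL":
    print("Regression detected, failing the build.")
    sys.exit(1)
print("No regression detected.")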

Configuration

Environment Variables

# OpenAI
export OPENAI_API_KEY=sk-...

# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# Aegis AI
export AEGIS_STORAGE=./aegis.db
export AEGIS_LOG_LEVEL=INFO

Configuration File

Create aegis.yaml:

models:
  gpt-4:
    provider: openai
    temperature: 0.7
    max_tokens: 1000

  claude-3-opus:
    provider: anthropic
    temperature: 0.7

storage:
  backend: sqlite
  path: ./aegis.db

thresholds:
  score_drop_pct: 5
  cost_increase_pct: 10

budget:
  monthly_limit: 1000.0
  enforcement: warn

Performance

  • Per-evaluation latency: ~2-5 seconds (depends on model)
  • Storage: SQLite, minimal overhead
  • Memory: ~50MB for typical datasets
  • Cost calculation: Real-time per request

Testing

Run the test suite:

# All tests
pytest

# Specific module
pytest tests/test_evaluator.py -v

# With coverage
pytest --cov=aegis --cov-report=html

# Integration tests
pytest tests/integration/ -v

Tests use mock LLM responses; no real API calls are required.
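
Custom metrics are easy to cover the same way: scorers are pure functions of (expected, actual), so they need no network at all. An illustrative test (the function mirrors the CustomScorer pattern from Advanced Usage):

# Illustrative pytest case for a scorer; no API access needed.
def case_insensitive_match(expected: str, actual: str) -> float:
    return 1.0 if expected.lower() == actual.lower() else 0.0

def test_case_insensitive_match():
    assert case_insensitive_match("Paris", "paris") == 1.0
    assert case_insensitive_match("Paris", "Lyon") == 0.0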

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Areas for contribution:

  • New adapter implementations
  • Custom scoring metrics
  • Documentation improvements
  • Example projects
  • Bug fixes

Roadmap

✅ Complete

  • Foundation, first integration, intelligence layer, cost engine
  • Model comparison, Anthropic adapter
  • Hardening, 80%+ coverage, comprehensive docs

Future

  • Web dashboard
  • Real-time monitoring
  • Advanced analytics

License

MIT License - see LICENSE file

Questions? Open an issue or discussion on GitHub.

Ready to get started? See Quick Start above.

Made with ❤️ for the AI engineering community.
