
ARGUS-AI

Production-Grade LLM Observability in 3 Lines of Code

PyPI version Python 3.9+ License: Apache 2.0 CI

ARGUS-AI is the G-ARVIS scoring engine for monitoring LLM application quality in production. It evaluates every LLM response across six orthogonal dimensions: Groundedness, Accuracy, Reliability, Variance, Inference Cost, and Safety.

Your LLM app is degrading right now. You just can't see it yet.

import argus_ai

argus = argus_ai.init()
result = argus.evaluate(prompt=prompt, response=response, context=context)

That's it. Every LLM call now has a quality score.


Why ARGUS

LLM outputs degrade silently. Models update, prompts drift, context windows overflow, and costs creep. Traditional monitoring catches latency and errors. It does not catch a model that starts hallucinating 12% more after a provider update, or a prompt that silently loses grounding when context exceeds 80K tokens.

G-ARVIS catches it.

| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Groundedness | Is the response grounded in the provided context? | Hallucination detection |
| Accuracy | Does it match ground truth / internal consistency? | Factual correctness |
| Reliability | Format consistency, completeness, latency SLA | Structural quality |
| Variance | Output determinism and confidence signals | Consistency across runs |
| Inference Cost | Token efficiency, cost-per-word, latency-to-value | Budget control |
| Safety | PII leakage, toxicity, injection, harmful content | Compliance and trust |
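Conceptually, the six dimension scores roll up into the composite as a weighted sum. A minimal sketch, assuming a plain weighted average over scores in [0, 1] (the library's actual aggregation and field names may differ); the weights mirror the enterprise profile listed under Weight Profiles:

```python
# Sketch: weighted G-ARVIS composite from six per-dimension scores in [0, 1].
# Weights mirror the "enterprise" profile; the real scorer may aggregate differently.
ENTERPRISE_WEIGHTS = {
    "groundedness": 0.20, "accuracy": 0.20, "reliability": 0.15,
    "variance": 0.15, "inference_cost": 0.10, "safety": 0.20,
}

def garvis_composite(scores: dict, weights: dict) -> float:
    """Weighted sum of dimension scores; weights are assumed to sum to 1."""
    return sum(weights[dim] * scores[dim] for dim in weights)

scores = {
    "groundedness": 0.90, "accuracy": 0.85, "reliability": 0.80,
    "variance": 0.75, "inference_cost": 0.70, "safety": 0.95,
}
print(f"{garvis_composite(scores, ENTERPRISE_WEIGHTS):.3f}")
```

Because the weights sum to 1, a composite stays on the same 0-to-1 scale as the individual dimensions.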

Plus three agentic evaluation metrics for autonomous workflows:

| Metric | Formula | What It Measures |
|---|---|---|
| ASF (Agent Stability Factor) | completion × (1 - failure_rate) × consistency | Workflow completion reliability |
| ERR (Error Recovery Rate) | recovered / failed | Self-healing capability |
| CPCS (Cost Per Completed Step) | total_cost / completed_steps | Economic efficiency per step |
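The three formulas translate almost directly into code. A minimal sketch, assuming completion and failure_rate are both normalized by steps_planned (the library likely maps these raw values onto normalized [0, 1] scores, so its reported numbers may differ):

```python
def agent_stability_factor(planned: int, completed: int, failed: int,
                           consistency: float) -> float:
    """ASF = completion x (1 - failure_rate) x consistency."""
    completion = completed / planned
    failure_rate = failed / planned
    return completion * (1 - failure_rate) * consistency

def error_recovery_rate(recovered: int, failed: int) -> float:
    """ERR = recovered / failed; a workflow with no failures counts as fully recovered."""
    return recovered / failed if failed else 1.0

def cost_per_completed_step(total_cost_usd: float, completed: int) -> float:
    """CPCS = total_cost / completed_steps, in USD per completed step."""
    return total_cost_usd / completed if completed else float("inf")

# Raw values for a workflow with 8 planned steps, 7 completed, 2 failed, 1 recovered:
print(agent_stability_factor(8, 7, 2, consistency=1.0))  # 0.65625
print(error_recovery_rate(1, 2))                         # 0.5
print(cost_per_completed_step(0.45, 7))
```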

Install

pip install argus-ai

With optional extras (quoted so the brackets survive shells like zsh):

pip install "argus-ai[anthropic]"       # Anthropic Claude wrapper
pip install "argus-ai[openai]"          # OpenAI wrapper
pip install "argus-ai[prometheus]"      # Prometheus export
pip install "argus-ai[opentelemetry]"   # OTEL export
pip install "argus-ai[all]"             # Everything

Quick Start

Basic Evaluation

import argus_ai

argus = argus_ai.init(profile="enterprise")

result = argus.evaluate(
    prompt="What causes climate change?",
    response="Greenhouse gas emissions from fossil fuels are the primary driver.",
    context="Climate change is driven by human activities releasing greenhouse gases.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
    input_tokens=45,
    output_tokens=30,
    cost_usd=0.002,
)

print(f"Composite: {result.garvis_composite:.3f}")  # 0.847
print(f"Passing:   {result.passing}")                # True
print(f"Safety:    {result.safety:.3f}")             # 0.950

Agentic Workflow Evaluation

from argus_ai.types import AgenticEvalRequest

argus = argus_ai.init(profile="agentic")

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)

result, agentic_metrics = argus.evaluate_agentic(workflow)

for m in agentic_metrics:
    print(f"{m.name}: {m.score:.3f}")
# AgentStabilityFactor: 0.612
# ErrorRecoveryRate: 0.600
# CostPerCompletedStep: 0.357

Drop-In Provider Wrappers

Anthropic Claude:

from argus_ai.integrations.anthropic import InstrumentedAnthropic

argus = argus_ai.init()
client = InstrumentedAnthropic(argus=argus)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers"}],
)

# G-ARVIS score automatically attached
print(response._argus_score.garvis_composite)

OpenAI:

from argus_ai.integrations.openai import InstrumentedOpenAI

client = InstrumentedOpenAI(argus=argus_ai.init())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers"}],
)

print(response._argus_score.garvis_composite)

Threshold Monitoring with Alerting

from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

config = ThresholdConfig(
    composite_min=0.80,
    safety_min=0.90,
    window_size=100,
    breach_ratio=0.15,
)

rules = [
    AlertRule(
        dimension="safety",
        threshold=0.85,
        severity=AlertSeverity.CRITICAL,
        message="Safety below critical threshold",
    ),
]

argus = argus_ai.init(
    profile="healthcare",
    thresholds=config,
    alert_rules=rules,
    exporters=["console", "prometheus"],
    on_alert=lambda msg, result: pagerduty.trigger(msg),  # your alerting client
)

Decorator Instrumentation

from argus_ai.sdk.decorators import argus_evaluate

argus = argus_ai.init()

@argus_evaluate(argus, model_name="claude-sonnet-4")
def generate(prompt: str) -> str:
    return anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

Weight Profiles

G-ARVIS weights are configurable per deployment scenario:

| Profile | G | A | R | V | I | S | Best For |
|---|---|---|---|---|---|---|---|
| enterprise | 0.20 | 0.20 | 0.15 | 0.15 | 0.10 | 0.20 | General production |
| healthcare | 0.25 | 0.25 | 0.15 | 0.10 | 0.05 | 0.20 | HIPAA workloads |
| finance | 0.20 | 0.25 | 0.20 | 0.10 | 0.05 | 0.20 | SOX/GAAP compliance |
| consumer | 0.15 | 0.15 | 0.20 | 0.15 | 0.20 | 0.15 | Cost-sensitive apps |
| agentic | 0.15 | 0.15 | 0.25 | 0.20 | 0.10 | 0.15 | Autonomous agents |

Custom weights:

from argus_ai.scoring.garvis import GarvisWeights

argus = argus_ai.init(weights=GarvisWeights(safety=0.40, accuracy=0.30))

Metrics Export

Prometheus

argus = argus_ai.init(exporters=["prometheus"])

Exposes: argus_garvis_composite, argus_garvis_{dimension}, argus_evaluation_duration_ms, argus_alerts_total
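Those gauge names plug into standard Prometheus alerting. A hypothetical rule (the metric name comes from the list above; the threshold, duration, and labels are illustrative):

```yaml
groups:
  - name: argus
    rules:
      - alert: GarvisCompositeLow
        # Fire when the composite quality score stays below 0.80 for 5 minutes.
        expr: argus_garvis_composite < 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "G-ARVIS composite below 0.80"
```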

OpenTelemetry

argus = argus_ai.init(exporters=["opentelemetry"])

Compatible with Datadog, New Relic, Honeycomb, Grafana Cloud, and any OTLP backend.


Architecture

See ARCHITECTURE.md for the full system design and open core split.

argus-ai (Open Source)          ARGUS Platform (Commercial)
├── G-ARVIS Scorer              ├── Autonomous Correction Loop
├── 3-Line SDK                  ├── Prompt Optimizer
├── Threshold Monitor           ├── LLM-as-Judge Evaluation
├── ASF/ERR/CPCS Metrics        ├── Multi-Run Variance Analysis
├── Prometheus/OTEL Export      ├── Dashboard UI
└── Anthropic/OpenAI Wrappers   └── SOC2/HIPAA Compliance

Performance

G-ARVIS heuristic scoring runs in sub-5ms per evaluation with zero external dependencies at runtime.

| Benchmark | Value |
|---|---|
| Single evaluation | < 3 ms |
| Batch (1,000 requests) | < 2.5 s |
| Memory overhead | < 5 MB |
| Dependencies (core) | 3 (pydantic, numpy, structlog) |

Roadmap

  • LiteLLM integration
  • LangChain callback handler
  • Grafana dashboard templates
  • Custom scorer plugin API
  • Async evaluation support
  • CLI tool for offline batch scoring

Contributing

Contributions welcome. See CONTRIBUTING.md for guidelines.

git clone https://github.com/anilatambharii/argus-ai.git
cd argus-ai
pip install -e ".[dev]"
pytest tests/ -v

About

Built by Anil Prasad at Ambharii Labs.

G-ARVIS framework published in "Field Notes: Production AI" on LinkedIn.

ARGUS: Autonomous Runtime Guardian for Unified Systems.


License

Apache 2.0. See LICENSE.

Download files

Source Distribution: argus_llm-0.1.0.tar.gz (47.0 kB)

Built Distribution: argus_llm-0.1.0-py3-none-any.whl (36.6 kB)

File details

Details for the file argus_llm-0.1.0.tar.gz.

File metadata

  • Download URL: argus_llm-0.1.0.tar.gz
  • Size: 47.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for argus_llm-0.1.0.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0e5be5d5779142822a72fdf660422ad4a1127e6f60a7e7561acafa37128794e1 |
| MD5 | e54bb56eda6832dc55443324b50365f1 |
| BLAKE2b-256 | 84d477bdee5d2c3b3937ee538916ca5d34d8d4d97d77931fae7aada0a18e56f8 |
File details

Details for the file argus_llm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: argus_llm-0.1.0-py3-none-any.whl
  • Size: 36.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for argus_llm-0.1.0-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a64a1910ed2279165f6ccdf1903460cef45a6ef6237491109ea588a38941f57 |
| MD5 | 315ee0b0418fbbd6b17bf4422578f90c |
| BLAKE2b-256 | 36534fc698ff30588c0be9c4fde8ebd1ee73bf30d061d7e8bf159d6c4bf638aa |
