Project description

🧠 AgentOps Client - AI Reliability Engineering for LLM Agents

Python 3.8+ · License: MIT

A flexible hallucination detection system for LLM responses that works both with and without RAG (Retrieval Augmented Generation). Monitor truth quality, latency, and throughput for production LLM systems.

🎯 Features

Truth Detection

  • Dual-mode operation: Works with or without retrieved documents
  • Multi-signal detection: Combines semantic drift, uncertainty analysis, and factual checking
  • Explainable scores: Returns detailed breakdown of all metrics

Reliability Engineering 🆕

  • Latency tracking: Measures end-to-end evaluation time
  • Throughput monitoring: Calculates requests per second (single-run or batch)
  • Full observability: Datadog-style metrics for LLM systems

Technical

  • OpenAI-powered: Uses embeddings and GPT-4o-mini for evaluation
  • Thread-safe: Concurrent throughput tracking with locks
  • Simple API: Single function call with optional parameters

🏗️ Architecture

User Prompt + Response
   ↓
[ ⏱️  Latency Timer Start ]
   ↓
[ Embedding Drift Check ]   → always active
[ Uncertainty Analysis ]    → always active
[ Evidence Entailment ]     → only if retrieved_docs
[ Factual Self-Check LLM ]  → fallback when no evidence
   ↓
Weighted Fusion (0.4 factual + 0.4 drift + 0.2 uncertainty)
   ↓
[ ⏱️  Latency Timer End ]
[ 📊 Throughput Calculation ]
   ↓
→ truth metrics + reliability metrics
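The weighted fusion step in the diagram above can be sketched in plain Python. This is an illustration of the documented weights, not the library's internal code:

```python
# Minimal sketch of the weighted fusion step (illustrative, not library code).
# Weights mirror the documented formula: 0.4 factual + 0.4 drift + 0.2 uncertainty.
def fuse_scores(factual: float, drift: float, uncert: float) -> float:
    """Combine the three signals into a hallucination probability (0-1)."""
    # Factual support is inverted: strong support lowers the probability.
    return round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)

# Weak factual support plus high drift crosses the 0.45 threshold...
print(fuse_scores(factual=0.2, drift=0.6, uncert=0.5))  # → 0.66
# ...while a well-grounded, on-topic answer scores low.
print(fuse_scores(factual=0.9, drift=0.1, uncert=0.0))  # → 0.08
```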

📦 Installation

Quick Install from PyPI

pip install agentops-client

That's it! You're ready to start monitoring your LLM agents.

Environment Setup

Create a .env file or set the environment variable:

export OPENAI_API_KEY=your_openai_api_key_here

Or in Python:

import os
os.environ['OPENAI_API_KEY'] = 'your_openai_api_key_here'

Development Installation

If you want to contribute or modify the code:

git clone https://github.com/ezazahamad2003/agentops.git
cd agentops
pip install -e .

🚀 Quick Start

SDK Usage (Recommended)

from agentops import AgentOps

# Initialize the SDK
ops = AgentOps()

# Evaluate a single response
result = ops.evaluate(
    prompt="Who discovered penicillin?",
    response="Penicillin was discovered by Alexander Fleming in 1928."
)

print(f"Hallucinated: {result['hallucinated']}")
print(f"Latency: {result['latency_sec']}s")

🌐 Production Backend (NEW in v0.2.1)

Connect to our production backend for automatic metrics storage and analytics:

from agentops import AgentOps

# Initialize with production backend
ops = AgentOps(
    api_key="your_api_key",  # Get from /register endpoint
    api_url="https://agentops-api-1081133763032.us-central1.run.app"
)

# Evaluations are automatically uploaded to backend
result = ops.evaluate(
    prompt="What is AI?",
    response="AI is artificial intelligence...",
    model_name="gpt-4o-mini"
)

# Data is stored in Supabase for analytics! ✅

Get Your API Key:

curl -X POST https://agentops-api-1081133763032.us-central1.run.app/register \
  -H "Content-Type: application/json" \
  -d '{"name":"my_agent"}'

RAG Mode with Retrieved Documents

from agentops import AgentOps

ops = AgentOps()

# RAG Mode evaluation
result = ops.evaluate(
    prompt="What are the side effects of aspirin?",
    response="Aspirin causes stomach upset, nausea, and heartburn.",
    retrieved_docs=[
        "Common side effects include stomach upset and nausea.",
        "Some people may experience allergic reactions."
    ]
)

print(result)

Batch Monitoring with Sessions

from agentops import AgentOps

ops = AgentOps()

# Start a monitoring session
ops.start_session()

# Run multiple evaluations
for prompt, response in your_test_cases:
    result = ops.evaluate(prompt, response)
    print(f"Latency: {result['latency_sec']}s")

# Get session statistics
stats = ops.end_session()
print(f"Total evaluations: {stats['total_evaluations']}")
print(f"Average throughput: {stats['throughput_qps']} req/sec")

Context Manager (Auto Sessions)

from agentops import AgentOps

with AgentOps() as ops:
    result = ops.evaluate(prompt, response)
    # Session automatically closed after block

Direct Function Access

from agentops import detect_hallucination

# Direct function call (lower-level API)
result = detect_hallucination(prompt, response, retrieved_docs)
print(result)

📊 Return Format

{
    # Truth Metrics
    "semantic_drift": 0.22,         # 0-1: semantic distance from prompt
    "uncertainty": 0.0,             # 0-1: uncertainty language score
    "factual_support": 0.52,        # 0-1: factual grounding score
    "mode": "retrieved-doc entailment",  # or "self-check"
    "hallucination_probability": 0.57,   # 0-1: overall score
    "hallucinated": True,           # True if probability > 0.45
    
    # Reliability Metrics 🆕
    "latency_sec": 2.34,            # End-to-end evaluation time in seconds
    "throughput_qps": 0.427         # Requests per second (queries per second)
}
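Callers typically gate on this dict, for example to block flagged responses or alert on slow evaluations. A hypothetical helper (not part of the SDK), using the sample values above:

```python
# Hypothetical post-processing of an evaluation result dict (not part of the SDK).
def should_block(result: dict, max_latency_sec: float = 5.0) -> bool:
    """Block a response if it was flagged as hallucinated or evaluation was too slow."""
    return result["hallucinated"] or result["latency_sec"] > max_latency_sec

sample = {
    "semantic_drift": 0.22,
    "uncertainty": 0.0,
    "factual_support": 0.52,
    "mode": "retrieved-doc entailment",
    "hallucination_probability": 0.57,
    "hallucinated": True,
    "latency_sec": 2.34,
    "throughput_qps": 0.427,
}
print(should_block(sample))  # flagged as hallucinated → True
```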

🎯 Detection Modes

Mode         retrieved_docs    Truth Checks                                               Reliability Metrics
RAG mode     List of chunks    Semantic drift + entailment (evidence-based factuality)    Latency + Throughput (tracked)
No-RAG mode  None              Semantic drift + uncertainty + factual self-check (LLM)    Latency + Throughput (tracked)

📈 Reliability Metrics

Latency Tracking

  • What: End-to-end time from request to response
  • Why: Shows model responsiveness and performance degradation
  • Unit: Seconds (rounded to 3 decimal places)

Throughput Tracking

  • What: Number of evaluations processed per second
  • Why: Measures system capacity and parallel efficiency
  • Unit: Queries per second (QPS)
  • Modes:
    • Single-run (track_throughput=False): throughput = 1 / latency
    • Batch mode (track_throughput=True): throughput = total_evaluations / total_time
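The two modes reduce to simple arithmetic; a self-contained sketch of the documented formulas:

```python
# Illustration of the two documented throughput formulas (not library code).
def single_run_qps(latency_sec: float) -> float:
    # Single-run mode: throughput is the inverse of one evaluation's latency.
    return round(1 / latency_sec, 3)

def batch_qps(total_evaluations: int, total_time_sec: float) -> float:
    # Batch mode: cumulative evaluations divided by cumulative wall-clock time.
    return round(total_evaluations / total_time_sec, 3)

print(single_run_qps(2.34))  # → 0.427, matching the return-format example
print(batch_qps(10, 25.0))   # → 0.4
```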

🧪 Testing

Run the test suite:

pytest test_detector.py -v

Run example scenarios:

python examples.py

📈 Scoring System

Components

  1. Semantic Drift (weight: 0.4)

    • Measures cosine distance between prompt and response embeddings
    • High drift = response is semantically distant from question
  2. Uncertainty (weight: 0.2)

    • Detects uncertainty language: "maybe", "probably", "might", etc.
    • Higher score = more uncertain language
  3. Factual Support (weight: 0.4)

    • RAG mode: Entailment check against retrieved docs
    • No-RAG mode: LLM self-check for factual accuracy
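For intuition, the semantic drift in component 1 is a cosine distance. A minimal sketch over plain vectors (the library itself computes this over OpenAI embeddings):

```python
import math

# Cosine distance between two embedding vectors:
# 0 = same direction, 1 = orthogonal. Illustrative only; the library
# applies this to OpenAI embeddings of the prompt and the response.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical direction → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 1.0
```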

Threshold

  • Hallucination threshold: 0.45
  • Scores above 0.45 are flagged as potential hallucinations

🔧 Configuration

Adjusting Weights

Edit the fusion weights in detector_flexible.py:

halluc_prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
#                   ^^^                    ^^^          ^^^
#                   factual weight         drift        uncertainty

Adjusting Threshold

Change the threshold in the return statement:

"hallucinated": halluc_prob > 0.45  # Change 0.45 to desired threshold

Throughput Tracking Modes

# Single-run mode (throughput = 1/latency)
result = detect_hallucination(prompt, response, track_throughput=False)

# Batch mode (cumulative tracking)
result = detect_hallucination(prompt, response, track_throughput=True)

# Reset cumulative tracker
from detector_flexible import reset_throughput_tracker
reset_throughput_tracker()

# Get current stats
from detector_flexible import get_throughput_stats
stats = get_throughput_stats()
# Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}

📝 Example Use Cases

Case 1: Medical RAG System

prompt = "What are Ozempic side effects?"
docs = ["Common: nausea, vomiting", "Rare: pancreatitis"]
response = "Causes nausea and heart palpitations"  # ⚠️ heart palpitations not in docs

result = detect_hallucination(prompt, response, docs)
# High hallucination probability due to unsupported claim

Case 2: General Knowledge

prompt = "Who invented the telephone?"
response = "Alexander Graham Bell invented the telephone."

result = detect_hallucination(prompt, response)
# Low hallucination probability - factually correct

Case 3: Uncertain Response

prompt = "What's the weather like?"
response = "Maybe it's probably sunny, I'm not sure."

result = detect_hallucination(prompt, response)
# High uncertainty score detected

🛠️ API Reference

AgentOps SDK Client

AgentOps(api_key=None, api_url=None, track_throughput=True)

Initialize the AgentOps SDK client.

Parameters:

  • api_key (str, optional): API key for the production backend (see "Production Backend" above)
  • api_url (str, optional): Base URL of the production backend for automatic metrics upload
  • track_throughput (bool, default=True): Enable cumulative throughput tracking

Methods:

evaluate(prompt, response, retrieved_docs=None, model_name=None)

Evaluate an agent's response for hallucinations and reliability.

Returns: dict with truth and reliability metrics

metrics()

Get current cumulative statistics.

Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}

reset_metrics()

Reset throughput tracker for new session.

start_session()

Start a new monitoring session with fresh metrics.

end_session()

End current session and return final statistics.

Context Manager Support:

with AgentOps() as ops:
    result = ops.evaluate(prompt, response)

Direct Function API

detect_hallucination(prompt, response, retrieved_docs=None, track_throughput=True)

Low-level detection function with reliability metrics.

Parameters:

  • prompt (str): Original user question/prompt
  • response (str): LLM's generated response
  • retrieved_docs (list[str], optional): Retrieved evidence chunks for RAG mode
  • track_throughput (bool, default=True): Enable cumulative throughput tracking

Returns:

  • dict: Detection results with truth metrics and reliability metrics

Utility Functions

  • reset_throughput_tracker(): Reset cumulative throughput counters
  • get_throughput_stats(): Get current throughput statistics
  • uncertainty_score(text): Calculate uncertainty language score
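Per the scoring section, `uncertainty_score` looks for hedging language. A hypothetical minimal version of the idea (the packaged implementation may differ):

```python
# Hypothetical keyword-based uncertainty scorer, mirroring the documented idea.
# The packaged uncertainty_score implementation may differ.
HEDGE_WORDS = {"maybe", "probably", "might", "possibly", "perhaps", "unsure"}

def simple_uncertainty_score(text: str) -> float:
    """Fraction of words that are hedging language, capped at 1.0."""
    words = text.lower().split()
    if not words:
        return 0.0
    hedges = sum(1 for w in words if w.strip(".,!?") in HEDGE_WORDS)
    return round(min(1.0, hedges / len(words)), 3)

print(simple_uncertainty_score("Maybe it's probably sunny."))  # → 0.5
```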

🚧 Roadmap

Phase 1: Core Detection ✅

  • Dual-mode hallucination detection
  • Semantic drift, uncertainty, factual support
  • Comprehensive test suite

Phase 2: Reliability Metrics ✅

  • Latency tracking
  • Throughput calculation (single-run and batch)
  • Thread-safe cumulative tracking

Phase 3: Integration (Next)

  • FastAPI endpoint for HTTP access
  • Supabase/database logging
  • AgentOps SDK for automatic instrumentation
  • Visual dashboard (metrics over time)

Phase 4: Advanced Features

  • Sentence-level breakdown (flag specific hallucinated sentences)
  • Custom model support (non-OpenAI)
  • Async/concurrent evaluation
  • Performance optimization for large-scale deployment
  • Alerting on anomalies (latency spikes, hallucination rate)

📄 License

MIT License - feel free to use in your projects!

🤝 Contributing

Contributions welcome! Please test your changes with the test suite before submitting.


Built with ❤️ using OpenAI APIs

Download files

Download the file for your platform.

Source Distribution

agentops_client-0.2.2.tar.gz (23.6 kB)

Uploaded Source

Built Distribution

agentops_client-0.2.2-py3-none-any.whl (12.5 kB)

Uploaded Python 3

File details

Details for the file agentops_client-0.2.2.tar.gz.

File metadata

  • Download URL: agentops_client-0.2.2.tar.gz
  • Size: 23.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for agentops_client-0.2.2.tar.gz
Algorithm Hash digest
SHA256 55773730b4c6cbbfba5defbe94934b41cf3e4a534e2550f3d9a3ea86ced05f64
MD5 24210c80f0ad8b0158e9f932d0da40ec
BLAKE2b-256 9f9c85021ad89a9e667c2dc8468f76601092d5f669bce06af919bd83215f07a3

File details

Details for the file agentops_client-0.2.2-py3-none-any.whl.

File hashes

Hashes for agentops_client-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 66e1f5c714a8841359dc7926efb619d9a22df2471ce1e4c2875d2b641f6d2533
MD5 345e4d76c39f0b8d19d62280ec0f7c08
BLAKE2b-256 a6fa46d8435f3c2db3c95bb5d3d41876916c509d0c317483cda5d91920a1dfc2