
AI Reliability Engineering for LLM Agents - Monitor hallucinations, latency, and throughput


🧠 Dual-Mode Hallucination Detector with Reliability Metrics

A flexible hallucination detection system for LLM responses that works both with and without RAG (Retrieval-Augmented Generation). It now also includes AI reliability engineering metrics for full observability.

🎯 Features

Truth Detection

  • Dual-mode operation: Works with or without retrieved documents
  • Multi-signal detection: Combines semantic drift, uncertainty analysis, and factual checking
  • Explainable scores: Returns detailed breakdown of all metrics

Reliability Engineering 🆕

  • Latency tracking: Measures end-to-end evaluation time
  • Throughput monitoring: Calculates requests per second (single-run or batch)
  • Full observability: Datadog-style metrics for LLM systems

Technical

  • OpenAI-powered: Uses embeddings and GPT-4o-mini for evaluation
  • Thread-safe: Concurrent throughput tracking with locks
  • Simple API: Single function call with optional parameters
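The thread-safety claim above can be illustrated with a lock-protected cumulative tracker. This is a minimal sketch of the pattern, not the package's actual internals; the names `ThroughputTracker`, `record`, and `stats` are hypothetical:

```python
import threading

class ThroughputTracker:
    """Cumulative counters guarded by a lock (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0
        self._elapsed = 0.0

    def record(self, latency_sec: float) -> None:
        # Lock so concurrent evaluations don't race on the counters.
        with self._lock:
            self._total += 1
            self._elapsed += latency_sec

    def stats(self) -> dict:
        with self._lock:
            qps = self._total / self._elapsed if self._elapsed else 0.0
            return {
                "total_evaluations": self._total,
                "total_time_sec": round(self._elapsed, 3),
                "throughput_qps": round(qps, 3),
            }

tracker = ThroughputTracker()
tracker.record(0.5)
tracker.record(0.5)
print(tracker.stats())  # {'total_evaluations': 2, 'total_time_sec': 1.0, 'throughput_qps': 2.0}
```

Holding the lock in both `record` and `stats` keeps the two counters consistent with each other even under concurrent callers.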

🏗️ Architecture

User Prompt + Response
   ↓
[ ⏱️  Latency Timer Start ]
   ↓
[ Embedding Drift Check ]   → always active
[ Uncertainty Analysis ]    → always active
[ Evidence Entailment ]     → only if retrieved_docs
[ Factual Self-Check LLM ]  → fallback when no evidence
   ↓
Weighted Fusion (0.4 factual + 0.4 drift + 0.2 uncertainty)
   ↓
[ ⏱️  Latency Timer End ]
[ 📊 Throughput Calculation ]
   ↓
→ truth metrics + reliability metrics
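The control flow above can be sketched in a few lines. The scorer functions below are stubs so the sketch runs standalone (in the real detector they call OpenAI embeddings and GPT-4o-mini); the function names are illustrative, not the library's internals:

```python
import time

# Stub scorers so the sketch is self-contained (illustrative only).
def embedding_drift(prompt, response): return 0.2
def uncertainty_score(response): return 0.0
def entailment_check(response, docs): return 0.8
def self_check(prompt, response): return 0.9

def detect(prompt, response, retrieved_docs=None):
    start = time.perf_counter()
    drift = embedding_drift(prompt, response)    # always active
    uncert = uncertainty_score(response)         # always active
    if retrieved_docs:                           # evidence entailment path
        factual = entailment_check(response, retrieved_docs)
        mode = "retrieved-doc entailment"
    else:                                        # LLM self-check fallback
        factual = self_check(prompt, response)
        mode = "self-check"
    # Weighted fusion: 0.4 factual + 0.4 drift + 0.2 uncertainty
    prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
    return {
        "mode": mode,
        "hallucination_probability": prob,
        "hallucinated": prob > 0.45,
        "latency_sec": round(time.perf_counter() - start, 3),
    }

print(detect("Who discovered penicillin?", "Alexander Fleming, in 1928."))
```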

📦 Installation

Option 1: Install as Package (Recommended)

# Clone the repository
git clone https://github.com/yourusername/agentops.git
cd agentops

# Install in development mode
pip install -e .

Option 2: Install Dependencies Only

pip install -r requirements.txt

Environment Setup

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key_here
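A fail-fast check for the key can save debugging time. This helper is illustrative (the SDK may read the variable differently); loading the `.env` file itself is typically done with the python-dotenv package's `load_dotenv()`:

```python
import os

def require_api_key() -> str:
    """Return OPENAI_API_KEY or raise a clear error (illustrative helper)."""
    key = os.getenv("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set - create a .env file or export it"
        )
    return key
```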

🚀 Quick Start

SDK Usage (Recommended)

from agentops import AgentOps

# Initialize the SDK
ops = AgentOps()

# Evaluate a single response
result = ops.evaluate(
    prompt="Who discovered penicillin?",
    response="Penicillin was discovered by Alexander Fleming in 1928."
)

print(f"Hallucinated: {result['hallucinated']}")
print(f"Latency: {result['latency_sec']}s")

RAG Mode with Retrieved Documents

from agentops import AgentOps

ops = AgentOps()

# RAG Mode evaluation
result = ops.evaluate(
    prompt="What are the side effects of aspirin?",
    response="Aspirin causes stomach upset, nausea, and heartburn.",
    retrieved_docs=[
        "Common side effects include stomach upset and nausea.",
        "Some people may experience allergic reactions."
    ]
)

print(result)

Batch Monitoring with Sessions

from agentops import AgentOps

ops = AgentOps()

# Start a monitoring session
ops.start_session()

# Run multiple evaluations
for prompt, response in your_test_cases:
    result = ops.evaluate(prompt, response)
    print(f"Latency: {result['latency_sec']}s")

# Get session statistics
stats = ops.end_session()
print(f"Total evaluations: {stats['total_evaluations']}")
print(f"Average throughput: {stats['throughput_qps']} req/sec")

Context Manager (Auto Sessions)

from agentops import AgentOps

with AgentOps() as ops:
    result = ops.evaluate(prompt, response)
    # Session automatically closed after block

Direct Function Access

from agentops import detect_hallucination

# Direct function call (lower-level API)
result = detect_hallucination(prompt, response, retrieved_docs)
print(result)

📊 Return Format

{
    # Truth Metrics
    "semantic_drift": 0.22,         # 0-1: semantic distance from prompt
    "uncertainty": 0.0,             # 0-1: uncertainty language score
    "factual_support": 0.52,        # 0-1: factual grounding score
    "mode": "retrieved-doc entailment",  # or "self-check"
    "hallucination_probability": 0.28,   # 0-1: weighted fusion of the above
    "hallucinated": False,          # True if probability > 0.45
    
    # Reliability Metrics 🆕
    "latency_sec": 2.34,            # End-to-end evaluation time in seconds
    "throughput_qps": 0.427         # Requests per second (queries per second)
}

🎯 Detection Modes

Mode         retrieved_docs   Truth Checks                                              Reliability Metrics
RAG mode     List of chunks   Semantic drift + entailment (evidence-based factuality)   Latency + throughput (tracked)
No-RAG mode  None             Semantic drift + uncertainty + factual self-check (LLM)   Latency + throughput (tracked)

📈 Reliability Metrics

Latency Tracking

  • What: End-to-end time from request to response
  • Why: Shows model responsiveness and performance degradation
  • Unit: Seconds (rounded to 3 decimal places)

Throughput Tracking

  • What: Number of evaluations processed per second
  • Why: Measures system capacity and parallel efficiency
  • Unit: Queries per second (QPS)
  • Modes:
    • Single-run (track_throughput=False): throughput = 1 / latency
    • Batch mode (track_throughput=True): throughput = total_evaluations / total_time
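Both formulas reduce to simple arithmetic (a worked check with made-up numbers, not library code):

```python
# Single-run mode: one evaluation, throughput is the reciprocal of latency.
latency_sec = 2.34
single_run_qps = round(1 / latency_sec, 3)   # ~0.427 req/sec

# Batch mode: cumulative evaluations over cumulative wall time.
total_evaluations = 50
total_time_sec = 20.0
batch_qps = round(total_evaluations / total_time_sec, 3)  # 2.5 req/sec

print(single_run_qps, batch_qps)
```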

🧪 Testing

Run the test suite:

pytest test_detector.py -v

Run example scenarios:

python examples.py

📈 Scoring System

Components

  1. Semantic Drift (weight: 0.4)

    • Measures cosine distance between prompt and response embeddings
    • High drift = response is semantically distant from question
  2. Uncertainty (weight: 0.2)

    • Detects uncertainty language: "maybe", "probably", "might", etc.
    • Higher score = more uncertain language
  3. Factual Support (weight: 0.4)

    • RAG mode: Entailment check against retrieved docs
    • No-RAG mode: LLM self-check for factual accuracy

Threshold

  • Hallucination threshold: 0.45
  • Scores above 0.45 are flagged as potential hallucinations
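Putting the weights and the threshold together, a worked example (component scores chosen for illustration):

```python
# Example component scores: weak factual support, high drift, some hedging.
factual, drift, uncert = 0.3, 0.5, 0.25

halluc_prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
# 0.4*0.7 + 0.4*0.5 + 0.2*0.25 = 0.28 + 0.20 + 0.05 = 0.53

print(halluc_prob, halluc_prob > 0.45)  # 0.53 True -> flagged as hallucination
```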

🔧 Configuration

Adjusting Weights

Edit the fusion weights in detector_flexible.py:

halluc_prob = round(0.4 * (1 - factual) + 0.4 * drift + 0.2 * uncert, 3)
#                   ^^^                    ^^^          ^^^
#                   factual weight         drift        uncertainty

Adjusting Threshold

Change the threshold in the return statement:

"hallucinated": halluc_prob > 0.45  # Change 0.45 to desired threshold

Throughput Tracking Modes

# Single-run mode (throughput = 1/latency)
result = detect_hallucination(prompt, response, track_throughput=False)

# Batch mode (cumulative tracking)
result = detect_hallucination(prompt, response, track_throughput=True)

# Reset cumulative tracker
from detector_flexible import reset_throughput_tracker
reset_throughput_tracker()

# Get current stats
from detector_flexible import get_throughput_stats
stats = get_throughput_stats()
# Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}

📝 Example Use Cases

Case 1: Medical RAG System

prompt = "What are Ozempic side effects?"
docs = ["Common: nausea, vomiting", "Rare: pancreatitis"]
response = "Causes nausea and heart palpitations"  # ⚠️ heart palpitations not in docs

result = detect_hallucination(prompt, response, docs)
# High hallucination probability due to unsupported claim

Case 2: General Knowledge

prompt = "Who invented the telephone?"
response = "Alexander Graham Bell invented the telephone."

result = detect_hallucination(prompt, response)
# Low hallucination probability - factually correct

Case 3: Uncertain Response

prompt = "What's the weather like?"
response = "Maybe it's probably sunny, I'm not sure."

result = detect_hallucination(prompt, response)
# High uncertainty score detected

🛠️ API Reference

AgentOps SDK Client

AgentOps(api_key=None, track_throughput=True)

Initialize the AgentOps SDK client.

Parameters:

  • api_key (str, optional): API key for future cloud features
  • track_throughput (bool, default=True): Enable cumulative throughput tracking

Methods:

evaluate(prompt, response, retrieved_docs=None)

Evaluate an agent's response for hallucinations and reliability.

Returns: dict with truth and reliability metrics

metrics()

Get current cumulative statistics.

Returns: {'total_evaluations': int, 'total_time_sec': float, 'throughput_qps': float}

reset_metrics()

Reset throughput tracker for new session.

start_session()

Start a new monitoring session with fresh metrics.

end_session()

End current session and return final statistics.

Context Manager Support:

with AgentOps() as ops:
    result = ops.evaluate(prompt, response)
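Under the hood, context-manager support can be a thin wrapper around the session methods. A minimal sketch of the pattern (`SessionClient` and its attributes are hypothetical, not the SDK's actual source):

```python
class SessionClient:
    """Illustrative skeleton of how `with AgentOps() as ops:` can work."""

    def start_session(self):
        # Fresh metrics for the new session.
        self.stats = {"total_evaluations": 0}

    def end_session(self):
        return self.stats

    def __enter__(self):
        self.start_session()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.end_session()   # close the session even if the block raised
        return False         # never swallow exceptions

with SessionClient() as ops:
    ops.stats["total_evaluations"] += 1
```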

Direct Function API

detect_hallucination(prompt, response, retrieved_docs=None, track_throughput=True)

Low-level detection function with reliability metrics.

Parameters:

  • prompt (str): Original user question/prompt
  • response (str): LLM's generated response
  • retrieved_docs (list[str], optional): Retrieved evidence chunks for RAG mode
  • track_throughput (bool, default=True): Enable cumulative throughput tracking

Returns:

  • dict: Detection results with truth metrics and reliability metrics

Utility Functions

  • reset_throughput_tracker(): Reset cumulative throughput counters
  • get_throughput_stats(): Get current throughput statistics
  • uncertainty_score(text): Calculate uncertainty language score
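The exact heuristic behind `uncertainty_score` isn't documented here; a plausible hedge-word sketch is shown below. This is illustrative only (naive substring matching, arbitrary normalization), not the package's implementation:

```python
import re

HEDGE_PHRASES = {"maybe", "probably", "might", "possibly", "perhaps",
                 "i think", "not sure", "could be"}

def uncertainty_score_sketch(text: str) -> float:
    """Hedge-phrase density clamped to [0, 1] (illustrative heuristic)."""
    lowered = text.lower()
    hits = sum(1 for phrase in HEDGE_PHRASES if phrase in lowered)
    words = len(re.findall(r"\w+", text)) or 1
    # Normalize by text length so short, heavily hedged answers score high.
    return round(min(1.0, hits / max(1, words / 5)), 3)

print(uncertainty_score_sketch("Maybe it's probably sunny, I'm not sure."))  # 1.0
```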

🚧 Roadmap

Phase 1: Core Detection ✅

  • Dual-mode hallucination detection
  • Semantic drift, uncertainty, factual support
  • Comprehensive test suite

Phase 2: Reliability Metrics ✅

  • Latency tracking
  • Throughput calculation (single-run and batch)
  • Thread-safe cumulative tracking

Phase 3: Integration (Next)

  • FastAPI endpoint for HTTP access
  • Supabase/database logging
  • AgentOps SDK for automatic instrumentation
  • Visual dashboard (metrics over time)

Phase 4: Advanced Features

  • Sentence-level breakdown (flag specific hallucinated sentences)
  • Custom model support (non-OpenAI)
  • Async/concurrent evaluation
  • Performance optimization for large-scale deployment
  • Alerting on anomalies (latency spikes, hallucination rate)

📄 License

MIT License - feel free to use in your projects!

🤝 Contributing

Contributions welcome! Please test your changes with the test suite before submitting.


Built with ❤️ using OpenAI APIs

