Framework-agnostic observability, audit, and eval for AI agent applications

agent-observe

Enterprise-grade observability for AI agents. Deploy with confidence. Improve continuously.

Python 3.9+ License: MIT


The Problem

You're deploying AI agents in production. But:

  • What is it doing? You can't see inside the black box
  • Is it safe? No way to enforce policies or block dangerous operations
  • Is it improving? No systematic way to evaluate and iterate
  • Can you prove it? No audit trail for compliance
  • What does it cost? No visibility into token usage and spend

The Solution

agent-observe is an embeddable observability layer for AI agents. Not a platform you connect to—a library you embed directly in your agent code.

from agent_observe import observe, tool, model_call

observe.install()

# db, agent, and user_request below are placeholders for your own objects
@tool(name="query_database", kind="db")
def query_database(sql: str) -> list:
    return db.execute(sql)

with observe.run("my-agent", user_id="jane") as run:
    run.set_input(user_request)
    result = agent.run(user_request)
    run.set_output(result)

# Now you have: traces, policy enforcement, audit logs, extensibility hooks

Why agent-observe?

For Enterprise Deployment

Challenge        How agent-observe Helps
Visibility       Full tracing of every tool call, LLM request, and decision
Control          Policy engine blocks dangerous operations before they execute
Compliance       Immutable audit trail with user attribution and timestamps
Improvement      Extensible hooks for eval, feedback, and iteration
Cost Visibility  Track token usage and spend via span attributes

vs. Platforms like Langfuse

Aspect              Langfuse                            agent-observe
Architecture        External SaaS platform              Embeddable library
Data location       Their cloud or self-hosted service  Your database (SQLite, Postgres, OTLP)
Policy enforcement  ❌ None                             ✅ Block/allow rules, call limits
Extensibility       Limited                             ✅ Full lifecycle hooks (v0.2)
PII handling        ❌ None                             ✅ Pre-storage redaction (v0.2)
Replay testing      ❌ None                             ✅ Deterministic replay
UI/Dashboard        ✅ Built-in                         ❌ Bring your own (or use OTLP → Jaeger/Grafana)

Use Langfuse if you want an all-in-one platform with UI, prompt management, and built-in evals.

Use agent-observe if you want:

  • Full control over your data
  • Policy enforcement and safety guardrails
  • Extensibility to build your own workflows
  • Lightweight library without external dependencies

Installation

pip install agent-observe

# With PostgreSQL support
pip install agent-observe[postgres]

# With viewer UI
pip install agent-observe[viewer]

Enterprise Deployment Lifecycle

agent-observe supports the full lifecycle of deploying and improving agents in enterprise settings:

┌─────────────────────────────────────────────────────────────────────────────┐
│                 TRUSTABILITY + OBSERVABILITY FOR ENTERPRISE                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   TRUSTABILITY: Can you trust the agent?                                    │
│   ──────────────────────────────────────                                    │
│      ├── Policy engine blocks dangerous operations before execution         │
│      ├── Call limits prevent runaway loops and cost explosions              │
│      ├── Approval workflows for high-risk actions (v0.3)                    │
│      ├── PII redaction before data leaves your control (v0.2)               │
│      └── Immutable audit trail proves what happened                         │
│                                                                             │
│   OBSERVABILITY: Can you see what's happening?                              │
│   ─────────────────────────────────────────────                             │
│      ├── Full traces: every tool call, model request, decision              │
│      ├── Lifecycle hooks: inject logic at any point (v0.2)                  │
│      ├── User & session attribution for multi-tenant systems                │
│      ├── Error tracking with full context                                   │
│      └── Export to any backend: Postgres, OTLP, Jaeger, Grafana             │
│                                                                             │
│   IMPROVEMENT: Can you make it better?                                      │
│   ────────────────────────────────────                                      │
│      ├── Hooks for eval, feedback, custom metrics (v0.2)                    │
│      ├── Replay testing for deterministic agent testing                     │
│      └── Query traces to find patterns and regressions                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Quick Start

import openai
import requests

from agent_observe import observe, tool, model_call

# Initialize (zero-config, defaults to full capture)
observe.install()

# Wrap your tools
@tool(name="search", kind="http")
def search_web(query: str) -> list:
    # Pass the query via params so it is URL-encoded
    response = requests.get("https://api.search.com", params={"q": query})
    return response.json()

# Wrap your LLM calls
@model_call(provider="openai", model="gpt-4")
def call_llm(messages: list) -> str:
    return openai.chat.completions.create(
        model="gpt-4",
        messages=messages,
    ).choices[0].message.content

# Run your agent with full context
with observe.run(
    "my-agent",
    user_id="jane",              # Who triggered this?
    session_id="conv_123",       # Part of which conversation?
) as run:
    run.set_input("Research AI agents")  # Capture user request

    results = search_web("AI agents")
    analysis = call_llm([
        {"role": "system", "content": "You are a research assistant"},
        {"role": "user", "content": f"Analyze: {results}"},
    ])

    run.set_output(analysis)  # Capture final response

View traces:

agent-observe view
# Open http://localhost:8765

Documentation

Document           Description
Examples           Runnable code examples (basic usage, async, policies, hooks, PII)
Guide              Data model, capture modes, policies, risk scoring, querying
Configuration      Environment variables and Config options
Patterns           Enterprise patterns and recipes
Integration Guide  How to integrate with OpenAI, Anthropic, LangChain, etc.

Key Concepts

Runs, Spans, and Events

┌─────────────────────────────────────────────────────────────┐
│                        observe.run()                         │
│                           (Run)                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ @tool       │  │ @model_call │  │ emit_event  │          │
│  │  (Span)     │  │   (Span)    │  │  (Event)    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘
  • Run = One agent execution (start to finish)
  • Span = One tool or model call within a run
  • Event = Custom occurrence you emit

See Guide for details.
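
To make the relationship between the three concepts concrete, here is a minimal sketch of the data model as plain dataclasses. The class layout and most field names (duration_ms, payload, spans, events) are assumptions for illustration, not the library's actual schema; only user_id, session_id, and event_type appear elsewhere in this document.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One tool or model call within a run (illustrative fields)."""
    name: str                 # e.g. "search"
    kind: str                 # "tool" or "model"
    duration_ms: float = 0.0


@dataclass
class Event:
    """A custom occurrence emitted during a run (illustrative fields)."""
    event_type: str           # e.g. "eval", "feedback", "audit"
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Run:
    """One agent execution, owning its spans and events."""
    agent: str
    user_id: Optional[str] = None
    session_id: Optional[str] = None
    spans: List[Span] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)


run = Run(agent="my-agent", user_id="jane", session_id="conv_123")
run.spans.append(Span(name="search", kind="tool", duration_ms=120.0))
run.spans.append(Span(name="gpt-4", kind="model", duration_ms=950.0))
run.events.append(Event("eval", {"score": 0.9}))
print(len(run.spans), len(run.events))  # 2 1
```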

Capture Modes

Mode           What's Stored                        Use Case
full           Everything (default as of v0.1.7)    Development, debugging
evidence_only  Small content + hashes (64KB limit)  Production with audit needs
metadata_only  Hashes, timings only                 High-security production

The default is full as of v0.1.7, on the premise that you install observability because you want to see what happened.

For minimal storage: observe.install(mode="metadata_only")

See Guide for details.
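
As a rough illustration of how the three modes differ, the sketch below decides what to keep for a single piece of captured content. The capture function, the record layout, and the choice of SHA-256 are assumptions; the source states only that evidence_only keeps small content plus hashes with a 64KB limit and that metadata_only keeps hashes and timings.

```python
import hashlib

EVIDENCE_LIMIT_BYTES = 64 * 1024  # the 64KB limit from the table above


def capture(content: str, mode: str = "full") -> dict:
    """Return the record that would be stored for content under each mode."""
    data = content.encode("utf-8")
    # Every mode keeps a hash and size so records can be verified later.
    record = {"sha256": hashlib.sha256(data).hexdigest(), "size_bytes": len(data)}
    if mode == "full":
        record["content"] = content
    elif mode == "evidence_only":
        # Assumption: content above the limit is dropped and only its hash kept.
        if len(data) <= EVIDENCE_LIMIT_BYTES:
            record["content"] = content
    elif mode != "metadata_only":
        raise ValueError(f"unknown capture mode: {mode}")
    return record


print(sorted(capture("hello", "full")))           # ['content', 'sha256', 'size_bytes']
print(sorted(capture("hello", "metadata_only")))  # ['sha256', 'size_bytes']
```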

Risk Scoring

Automatic risk scoring (0-100) based on:

Signal                       Weight
Policy violations            +40
Tool success rate < 90%      +25
Repeated tool calls (loops)  +15
5+ retries                   +10
Latency exceeds budget       +10
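
One way to read the weights above is as an additive score capped at 100. The sketch below follows that reading; the RunSignals fields and the risk_score function are illustrative names, and the additive, capped combination is an assumption rather than the library's documented formula.

```python
from dataclasses import dataclass


@dataclass
class RunSignals:
    """Hypothetical per-run signals; field names are illustrative."""
    policy_violations: int = 0
    tool_success_rate: float = 1.0   # fraction in [0, 1]
    repeated_tool_calls: bool = False
    retries: int = 0
    latency_over_budget: bool = False


def risk_score(s: RunSignals) -> int:
    """Sum the documented weight of each triggered signal, capped at 100."""
    score = 0
    if s.policy_violations > 0:
        score += 40
    if s.tool_success_rate < 0.90:
        score += 25
    if s.repeated_tool_calls:
        score += 15
    if s.retries >= 5:
        score += 10
    if s.latency_over_budget:
        score += 10
    return min(score, 100)


print(risk_score(RunSignals(policy_violations=1, retries=6)))  # 50
```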

How to Improve Your Agent

agent-observe provides the data and hooks you need to continuously improve your agents. Here's how:

1. Track Token Usage & Costs

Add token/cost attributes to any span:

from agent_observe.context import get_current_span

# Inside your model call wrapper
response = openai.chat.completions.create(model="gpt-4", messages=messages)

span = get_current_span()
span.set_attribute("tokens_input", response.usage.prompt_tokens)
span.set_attribute("tokens_output", response.usage.completion_tokens)
span.set_attribute("cost_usd", calculate_cost(response.usage))  # Your pricing logic
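
The calculate_cost helper in the snippet above is yours to write. A minimal sketch follows, assuming the usage object exposes prompt_tokens and completion_tokens as in the snippet; the per-1,000-token prices are placeholders, not real provider pricing.

```python
from types import SimpleNamespace

# Placeholder prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03


def calculate_cost(usage) -> float:
    """Return the estimated cost in USD for one model call."""
    input_cost = usage.prompt_tokens / 1000 * PRICE_PER_1K_INPUT
    output_cost = usage.completion_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return round(input_cost + output_cost, 6)


# Stand-in for response.usage so the example runs without an API call.
usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=300)
print(calculate_cost(usage))
```

Keeping the prices in named constants makes it easy to swap in a per-model lookup table later.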

2. Run Evaluations

Emit eval events to track quality:

# After agent completes
score = my_evaluator(run.input, run.output)
observe.emit_event("eval", {
    "score": score.overall,
    "correctness": score.correctness,
    "helpfulness": score.helpfulness,
    "passed": score.overall > 0.7,
})
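
The shape of my_evaluator is up to you. The sketch below is a deliberately simple stand-in that returns an object with the overall, correctness, and helpfulness fields used above. The EvalScore class and the keyword-overlap heuristic are assumptions for illustration; in practice you would likely replace them with an LLM judge or a labelled test set.

```python
from dataclasses import dataclass


@dataclass
class EvalScore:
    correctness: float
    helpfulness: float

    @property
    def overall(self) -> float:
        # Unweighted mean; choose weights that reflect your priorities.
        return (self.correctness + self.helpfulness) / 2


def my_evaluator(agent_input: str, agent_output: str) -> EvalScore:
    """Toy heuristic: keyword overlap for correctness, length for helpfulness."""
    keywords = {w.lower() for w in agent_input.split() if len(w) > 3}
    hits = sum(1 for w in keywords if w in agent_output.lower())
    correctness = hits / len(keywords) if keywords else 0.0
    helpfulness = 1.0 if len(agent_output.split()) >= 5 else 0.5
    return EvalScore(correctness=correctness, helpfulness=helpfulness)


score = my_evaluator(
    "Research AI agents",
    "AI agents are programs that research and act on goals.",
)
print(score.overall, score.overall > 0.7)
```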

3. Collect User Feedback

Capture ratings and feedback:

# When user provides feedback
observe.emit_event("feedback", {
    "rating": 5,
    "comment": "This was helpful!",
    "run_id": run.run_id,
})

4. Audit Data Access

Log sensitive operations:

observe.emit_event("audit", {
    "action": "data_access",
    "resource": "users_table",
    "actor": run.user_id,
    "query": sanitized_query,
})

5. Use Lifecycle Hooks

Automate tasks with hooks that run at key points in the execution lifecycle:

from agent_observe import observe, HookResult

# Block dangerous operations
@observe.hooks.before_tool
def security_check(ctx):
    if "DROP" in str(ctx.args).upper():
        return HookResult.block("SQL DROP statements are blocked")
    return HookResult.proceed()

# Auto-eval after each run
@observe.hooks.on_run_end
def auto_eval(ctx):
    if ctx.status == "ok":
        score = evaluate(ctx.run.output)
        observe.emit_event("eval", {"score": score})

# Track cost on every model call
@observe.hooks.after_model
def track_cost(ctx, result):
    cost = calculate_cost(result.usage)
    ctx.span.set_attribute("cost_usd", cost)
    return result

# Modify inputs before execution
@observe.hooks.before_tool
def sanitize_inputs(ctx):
    if ctx.tool_name == "search":
        cleaned_query = sanitize(ctx.args[0])
        return HookResult.modify(args=(cleaned_query,), kwargs=ctx.kwargs)
    return HookResult.proceed()

6. Circuit Breaker for Hook Resilience

Protect your agent from failing hooks with automatic circuit breakers:

from agent_observe import observe, CircuitBreakerConfig

observe.install()

# Configure circuit breaker for hooks
observe.hooks.set_circuit_breaker(CircuitBreakerConfig(
    enabled=True,
    failure_threshold=5,    # Open circuit after 5 failures
    window_seconds=60,      # Within 60 seconds
    recovery_seconds=300,   # Try again after 5 minutes
))

@observe.hooks.before_tool
def flaky_external_check(ctx):
    # If this fails 5 times in 60s, it's automatically skipped
    # until the circuit breaker recovers
    return external_service.validate(ctx.tool_name)
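
To show what the three settings mean in practice, here is a standalone sketch of a sliding-window circuit breaker. It is not the library's implementation: SimpleCircuitBreaker, its methods, and the injectable clock are hypothetical, and only the parameter names mirror CircuitBreakerConfig.

```python
import time
from collections import deque


class SimpleCircuitBreaker:
    """Illustrative breaker: opens after N failures inside a sliding window."""

    def __init__(self, failure_threshold=5, window_seconds=60,
                 recovery_seconds=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.failures = deque()   # timestamps of recent failures
        self.opened_at = None     # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if the hook should run, False if it should be skipped."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.recovery_seconds:
            # Recovery period elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self.opened_at = now


# Simulated clock so the example runs instantly.
now = [0.0]
breaker = SimpleCircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.allow())   # False: five failures within 60 s opened the circuit
now[0] = 301.0
print(breaker.allow())   # True: the recovery period has elapsed
```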

Configuration

Zero-Config (Recommended)

observe.install()  # Reads from environment variables

Environment Variables

AGENT_OBSERVE_MODE=full             # Capture mode (default: full as of v0.1.7)
AGENT_OBSERVE_ENV=prod              # Environment
DATABASE_URL=postgresql://...       # Enables Postgres sink

See Configuration for all options.

Explicit Config

import os

from agent_observe import observe
from agent_observe.config import Config, CaptureMode, SinkType

config = Config(
    mode=CaptureMode.FULL,
    sink_type=SinkType.POSTGRES,
    database_url=os.environ.get("DATABASE_URL"),
)
observe.install(config=config)

Sinks (Storage Backends)

Sink        Use Case
SQLite      Local development
PostgreSQL  Production
JSONL       Simple fallback
OTLP        OpenTelemetry export (Jaeger, Honeycomb, Datadog)

Auto-selected based on available connections.
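
The exact selection rules are not spelled out here, so the following is only a sketch of what auto-selection could look like. It assumes DATABASE_URL takes priority (as the Environment Variables section suggests), then the standard OpenTelemetry variable OTEL_EXPORTER_OTLP_ENDPOINT, then local SQLite. The select_sink function and this precedence are assumptions, and the JSONL fallback is left out because its trigger is not documented here.

```python
import os


def select_sink(env=None) -> str:
    """Pick a sink name from environment variables (illustrative precedence)."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL", "")
    if database_url.startswith(("postgres://", "postgresql://")):
        return "postgres"
    if env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        return "otlp"
    # Nothing configured: fall back to a local SQLite file.
    return "sqlite"


print(select_sink({"DATABASE_URL": "postgresql://localhost/agents"}))  # postgres
print(select_sink({}))                                                 # sqlite
```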

Policy Engine (Safety Guardrails)

Enterprises need guardrails. The policy engine lets you enforce rules before execution:

# .riff/observe.policy.yml
tools:
  allow:
    - "db.read_*"      # Allow read operations
    - "http.get_*"     # Allow GET requests
  deny:
    - "shell.*"        # Block all shell commands
    - "db.drop_*"      # Block destructive DB ops
    - "*.delete"       # Block anything ending in delete

limits:
  max_tool_calls: 100   # Prevent infinite loops
  max_model_calls: 50   # Cap LLM spend

When a policy violation occurs:

from agent_observe import PolicyViolationError

try:
    dangerous_tool()  # Blocked by policy
except PolicyViolationError as e:
    print(f"Blocked: {e.reason}")
    # Log for audit, alert security team, etc.
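
The allow/deny patterns in the policy file read like shell-style globs. The sketch below shows how such patterns could be evaluated with Python's fnmatch module. It assumes a tool is identified as "kind.name", that deny rules take precedence over allow rules, and that anything not explicitly allowed is blocked; none of this is confirmed by the source, and is_allowed is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns copied from the example policy file above.
ALLOW = ["db.read_*", "http.get_*"]
DENY = ["shell.*", "db.drop_*", "*.delete"]


def is_allowed(tool_id: str, allow=ALLOW, deny=DENY) -> bool:
    """Deny patterns win; otherwise the tool must match an allow pattern."""
    if any(fnmatchcase(tool_id, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(tool_id, pattern) for pattern in allow)


for tool_id in ["db.read_users", "db.drop_table", "shell.exec",
                "files.delete", "http.post_form"]:
    print(tool_id, is_allowed(tool_id))
```

Checking deny rules first means a broad allow pattern can never re-enable a tool that a deny pattern blocks.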

Compliance & Audit

User Attribution

Every run tracks who triggered it:

with observe.run("agent", user_id="jane@company.com", session_id="conv_123"):
    # All spans and events created here are attributed to this user
    ...  # your agent logic goes here

Immutable Audit Trail

All traces are stored with:

  • Timestamps (ms precision)
  • User ID
  • Session ID
  • Full input/output (configurable)
  • Policy violations

Query for Compliance

-- Find all runs by a specific user
SELECT * FROM runs WHERE user_id = 'jane@company.com';

-- Find all policy violations
SELECT * FROM spans WHERE violation_type IS NOT NULL;

-- Find all data access events
SELECT * FROM events WHERE event_type = 'audit';
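
Queries like these can also be run from Python. The sketch below uses an in-memory SQLite database with a deliberately minimal schema so that it runs on its own. The run_id, span_id, and name columns and the 'denied_tool' value are assumptions; only the table names and the user_id and violation_type columns come from the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema and sample rows for demonstration only.
conn.executescript("""
    CREATE TABLE runs (run_id TEXT PRIMARY KEY, user_id TEXT);
    CREATE TABLE spans (
        span_id TEXT PRIMARY KEY, run_id TEXT, name TEXT, violation_type TEXT
    );
    INSERT INTO runs VALUES ('r1', 'jane@company.com'), ('r2', 'bob@company.com');
    INSERT INTO spans VALUES
        ('s1', 'r1', 'db.read_users', NULL),
        ('s2', 'r1', 'db.drop_table', 'denied_tool'),
        ('s3', 'r2', 'http.get_page', NULL);
""")


def violations_by_user(conn, user_id):
    """Return (span name, violation type) pairs for one user's runs."""
    return conn.execute(
        """
        SELECT s.name, s.violation_type
        FROM spans AS s
        JOIN runs AS r ON r.run_id = s.run_id
        WHERE r.user_id = ? AND s.violation_type IS NOT NULL
        """,
        (user_id,),  # parameterized to avoid SQL injection
    ).fetchall()


print(violations_by_user(conn, "jane@company.com"))  # [('db.drop_table', 'denied_tool')]
```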

PII Handling

Automatically redact or hash PII before it's stored:

from agent_observe import observe, PIIConfig

# Configure PII handling at install time
observe.install(
    pii=PIIConfig(
        enabled=True,
        action="redact",  # "redact", "hash", "tokenize", or "flag"
        patterns={
            "email": True,      # Built-in pattern
            "phone": True,      # Built-in pattern
            "ssn": True,        # Built-in pattern
            "credit_card": True,
            # Custom patterns
            "employee_id": r"EMP-\d{6}",
        },
    )
)

# All data is automatically processed before storage
with observe.run("support-agent", user_id="jane"):
    # Emails in tool args/results are redacted as [EMAIL_REDACTED]
    send_email("user@example.com", "Hello!")  # Stored as [EMAIL_REDACTED]

PII Actions:

Action    Description
redact    Replace with [EMAIL_REDACTED], [PHONE_REDACTED], etc.
hash      Replace with consistent hash: [EMAIL:a1b2c3d4...]
tokenize  Replace with reversible token (for later recovery)
flag      Keep original but mark as [PII:email]user@example.com[/PII]
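
To illustrate what three of these actions do to a string, here is a standalone sketch built on Python's re and hashlib modules. It is not the library's implementation: the scrub function, the simplified email pattern, and the 8-character SHA-256 prefix are assumptions. The tokenize action is omitted because it needs a token store.

```python
import hashlib
import re

# Simplified patterns for illustration; production patterns need more care.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "employee_id": re.compile(r"EMP-\d{6}"),
}


def scrub(text: str, action: str = "redact") -> str:
    """Apply one PII action to every pattern match in text."""
    for name, pattern in PATTERNS.items():
        label = name.upper()

        def replace(match, name=name, label=label):
            value = match.group(0)
            if action == "redact":
                return f"[{label}_REDACTED]"
            if action == "hash":
                # Same input always yields the same short digest.
                digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
                return f"[{label}:{digest}]"
            if action == "flag":
                return f"[PII:{name}]{value}[/PII]"
            raise ValueError(f"unsupported action: {action}")

        text = pattern.sub(replace, text)
    return text


message = "Contact user@example.com about EMP-123456"
print(scrub(message, "redact"))  # Contact [EMAIL_REDACTED] about [EMPLOYEE_ID_REDACTED]
print(scrub(message, "flag"))
```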

CLI

# Start viewer
agent-observe view

# Export to JSONL
agent-observe export-jsonl -o ./export

Architecture

agent_observe/
├── observe.py      # Core runtime
├── decorators.py   # @tool, @model_call
├── policy.py       # YAML policy engine
├── metrics.py      # Risk scoring
├── replay.py       # Tool result caching
├── sinks/          # Storage backends
└── viewer/         # FastAPI UI

Development

pip install -e ".[dev]"
pytest
ruff check .

Roadmap

v0.1.x - Observability Foundation ✅

  • Full tracing (tools, models, runs)
  • Multiple sinks (SQLite, Postgres, JSONL, OTLP)
  • Policy engine with allow/deny rules
  • Risk scoring
  • Replay mode for testing

v0.2 - Trustability + Extensibility ✅

  • Lifecycle hooks (before/after tool, model, run)
  • Hook actions (block, skip, modify execution)
  • Circuit breaker (auto-disable failing hooks)
  • PII handling (redact/hash/tokenize before storage)

v0.3 (Next) - Production Hardening

  • 🔄 Approval workflows (pause for human approval)
  • 📋 Enhanced policy engine (dynamic rules)
  • 📋 Session analytics
  • 📋 OpenTelemetry semantic conventions

Philosophy

We build what you can't do yourself.

You CAN add cost tracking, run evals, collect feedback, and log audits yourself using our existing APIs (set_attribute, emit_event). So we don't build those as features; we give you the hooks to build them your way.

What you CAN'T do yourself:

  • Intercept before execution → We provide before_tool, before_model hooks
  • Block or skip operations → We provide HookResult.block(), HookResult.skip()
  • Redact PII before storage → We provide pre-sink interception
  • Pause for approval → We provide HookResult.pending()

This keeps the library lightweight and flexible while giving you the power to build exactly what your enterprise needs.


License

MIT License - Use it, embed it, extend it. No restrictions.
