Skip to main content

Service for computing metrics on AI agent performance data

Project description

Metric Computation Engine

The Metric Computation Engine (MCE) is a tool for computing metrics from observability telemetry collected from our instrumentation SDK (https://github.com/agntcy/observe). The list of currently supported metrics is defined below, but the MCE was designed to make it easy to implement new metrics and extend the library over time.

Prerequisites

  • Python 3.11 or higher
  • uv package manager for dependency management
  • LLM API Key (OpenAI, or custom endpoint) for LLM-based metrics
  • Mock LLM Proxy (optional) use mock-llm-proxy CLI for local testing without real API keys
  • Agentic App: Get started with coffeeAgntcy reference Agentic App implementation using the AGNTCY ecosystem.
  • Instrumentation: Agentic apps must be instrumented with AGNTCY's observe SDK as the MCE relies on its observability data schema

Supported metrics

Metrics can be computed at three levels of aggregation: span level, session level and population level (which is a batch of sessions).

The current supported metrics are listed in the table below, along with their aggregation levels.

Core Metrics

Span-Level Metrics

Metric Name Description
Tool Utilization Accuracy Measures tool selection and usage efficiency

Session-Level Metrics

Metric Name Description
Agent to Agent Interactions Counts interactions between pairs of agents
Agent to Tool Interactions Counts interactions between agents and tools
Tool Error Rate Rate of tool errors throughout a session
Cycles Count How many times an entity returns to previous entity

Population-Level Metrics

Metric Name Description
Graph Determinism Score Measures variance in execution patterns across multiple sessions

Plugin Architecture

The MCE uses a plugin-based architecture for extensibility:

  • Native Metrics Plugins: Unique agent metrics to evaluate conversation, orchestration, tool usage quality
  • Third-party Adapter Plugins: Third-party framework integrations (RAGAS, DeepEval, Opik)

Native Metrics Plugin

The MCE includes a comprehensive native metrics plugin that provides 13 advanced session-level and span-level metrics for AI agent evaluation. These metrics use LLM-as-a-Judge techniques and confidence analysis for comprehensive assessment. For additional plugin metrics and detailed descriptions, see the Native Metrics Plugin README: plugins/mce_metrics_plugin/README.md.

Third-party Adapters

The MCE supports integration with popular evaluation frameworks through adapter plugins:

Python Package Installation

For local development or custom deployments, you can install the Metrics Computation Engine and its plugins directly via pip:

Quick Start - Complete Platform

# Install everything - core MCE + all adapters + native metrics
pip install "metrics-computation-engine[all]"

Selective Installation

# Core MCE only
pip install metrics-computation-engine

# Core + specific adapters
pip install "metrics-computation-engine[deepeval]"
pip install "metrics-computation-engine[ragas]"
pip install "metrics-computation-engine[opik]"

# Core + native LLM-based metrics
pip install "metrics-computation-engine[metrics-plugin]"

# Mix and match as needed
pip install "metrics-computation-engine[deepeval,metrics-plugin]"

Note for zsh users: If you encounter zsh: no matches found errors, quote the package name with extras (e.g., "metrics-computation-engine[opik]").

Getting started

Environment Configuration

Configure the following variables in your .env file:

# Server Configuration
HOST=0.0.0.0                    # MCE Server bind address
PORT=8000                       # MCE Server port
RELOAD=false                    # Enable auto-reload for development
LOG_LEVEL=info                  # Logging level (debug, info, warning, error)

# Data Access Configuration
API_BASE_URL=http://localhost:8080       # API-layer endpoint
PAGINATION_LIMIT=50                      # Max sessions per API request
PAGINATION_DEFAULT_MAX_SESSIONS=50       # Default max sessions when not specified
SESSIONS_TRACES_MAX=20                   # Max sessions per batch for trace retrieval

# LLM Configuration
LLM_BASE_MODEL_URL=https://api.openai.com/v1  # LLM API endpoint
LLM_MODEL_NAME=gpt-4o                          # LLM model name
LLM_API_KEY=sk-...                             # LLM API key

Note: LLM configuration can be provided via environment variables (global defaults) or per-request in the llm_judge_config parameter. Request-level configuration takes precedence.

Mock LiteLLM Proxy

For local development you can avoid using real API keys by starting the bundled mock proxy. It implements the POST /chat/completions endpoint expected by LiteLLM and returns deterministic scores.

uv run mock-llm-proxy --port 8010

Update your .env or per-request config to point at the proxy:

"llm_judge_config": {
  "LLM_BASE_MODEL_URL": "http://localhost:8010",
  "LLM_MODEL_NAME": "mock-model",
  "LLM_API_KEY": "test"
}

CLI options let you tune the score and reasoning. Run uv run mock-llm-proxy --help for the full list.

Examples Directory

Several example scripts are available to help you get started with the MCE:

  • Basic usage — service (service_test.py): Sends a request to a running MCE server (POST /compute_metrics) with metrics, llm_judge_config, and data_fetching_infos.batch_config.time_range.
  • Basic usage — library (mce-demo.py): Runs MCE in-process. Loads data/sample_data.json, builds a MetricRegistry, registers core and native plugin metrics, demonstrates 3rd‑party adapters (DeepEval, Opik), and executes MetricsProcessor with LLMJudgeConfig from .env.
  • Sample data (data/sample_data.json): Synthetic raw spans used by mce-demo.py.

MCE usage

The MCE can be used in two ways: as a REST API service or as a Python module. Both methods allow you to compute various metrics on your agent telemetry data.

There are three main input parameters to the MCE, as shown in the examples above: metrics, llm_judge_config, and data_fetching_infos.

Metrics Parameter

The metrics parameter is a list of metric names that you want to compute. Each metric operates at different levels (span, session, or population) and may have different computational requirements. You can specify any combination of the available metrics:

"metrics": [
    "ToolUtilizationAccuracy",
    "ToolError",
    "ToolErrorRate",
    "AgentToToolInteractions",
    "AgentToAgentInteractions",
    "CyclesCount",
    "Groundedness",
]
Using 3rd‑party adapters (RAGAS, DeepEval, Opik)

You can request 3rd‑party framework metrics through adapter plugins by using a dotted identifier in metrics:

  • deepeval.<MetricName> (e.g., deepeval.AnswerRelevancyMetric)
  • opik.<MetricName> (e.g., opik.Hallucination)
  • ragas.<MetricName> (see adapter README for available names)
"metrics": [
    "deepeval.AnswerRelevancyMetric",
]

LLM Judge Config

The llm_judge_config parameter configures the LLM used for metrics that require LLM-as-a-Judge evaluation (such as ToolUtilizationAccuracy and Groundedness):

"llm_judge_config": {
    "LLM_API_KEY": "your_api_key", # API key for your LLM provider
    "LLM_MODEL_NAME": "gpt-4o", # The specific model to use (e.g., "gpt-4o")
    "LLM_BASE_MODEL_URL": "https://api.openai.com/v1" # API endpoint URL (supports OpenAI-compatible APIs)
}

Data Fetching Infos

Use data_fetching_infos to select which sessions to evaluate. You can provide a time range via batch_config.time_range, explicit session_ids, or both.

By time range

"data_fetching_infos": {
  "batch_config": {
    "time_range": {
      "start": "2024-01-01T00:00:00Z",
      "end": "2024-12-31T23:59:59Z"
    }
  },
  "session_ids": []
}

By explicit session IDs

"data_fetching_infos": {
  "batch_config": {},
  "session_ids": ["<session_id_1>", "<session_id_2>", ... "<session_id_n>"]
}

Deployment as a service

There are two ways to run the MCE service:

  1. Docker Compose (recommended for a full local stack)

    • Use the provided docker compose file to start OTel Collector, ClickHouse, the API layer, and the MCE.
    • Once up, instrument an app with our Observe SDK to generate traces.
  2. Run the server locally

    • Activate your virtual environment and start the server:
      source .venv/bin/activate
      mce-server
      
      or
      .venv/bin/activate
      uv run --env-file .env mce-server
      

API Endpoints

  • GET / - Returns available endpoints
  • GET /metrics - List all available metrics and their metadata
  • GET /status - Health check and server status
  • POST /compute_metrics - Compute metrics from JSON configuration (see examples/service_test.py)

The server provides automatic OpenAPI documentation at http://<HOST>:<PORT>/docs when running.

You can run the MCE by making a curl call to the endpoint <HOST>:<PORT> as defined in the .env. Perform an evaluation by sending a POST request to /compute_metrics:

Example:

curl -sS -X POST "http://<HOST>:<PORT>/compute_metrics" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": [
      "Groundedness"
    ],
    "llm_judge_config": {
      "LLM_BASE_MODEL_URL": "https://api.openai.com/v1",
      "LLM_MODEL_NAME": "gpt-4o",
      "LLM_API_KEY": "api-key"
    },
    "data_fetching_infos": {
      "batch_config": {"time_range": {"start": "2000-06-20T15:04:05Z", "end": "2040-06-29T08:52:55Z"}},
      "session_ids": []
    },
    "metric_options": {
      "computation_level": ["session"],
      "write_to_db": false
    }
  }'

The payload for this POST request must be in JSON format, and contains at least the two following fields:

  • metrics: a list containing the name of the metrics that should be computed.
  • data_fetching_infos: a dictionary containing the information to select a set of sessions. This is achieved by either providing a batch_config, which consist of a time_range with a start and end time; or a list of session ids, through the session_ids field (see the example above).

In addition to this, there are two optional fields:

  • llm_judge_config: a dictionary that holds the information related to the configuration of the LLM as a Judge. if not provided, the information provided by the environment variables will be used.
  • metric_options: a dictionary for the different options for the metrics. Currently, there are two options, computation_level and write_to_db. The computation_level is a list of levels at which the metric computation should happen. The MCE currently supports session and agent levels. By default, the session level is enforced. The write_to_db is a boolean to indicate if the results of this query should be stored into the DB. By default, this is set to false, but if the environment variable METRICS_CACHE_ENABLED is set to true, the results will always be stored into the DB.

Troubleshooting

Common Issues:

  • ModuleNotFoundError: Ensure virtual environment is activated and dependencies installed via ./install.sh
  • LLM API Errors: Verify API keys in .env file and check rate limits
  • Plugin Load Failures: Run ./install-plugins.sh to install required adapter plugins
  • Memory Issues: Reduce batch sizes in configuration for large datasets
  • Docker Build Failures: Check Docker daemon is running and remove any cached layers

For detailed debugging, enable verbose logging by setting LOG_LEVEL=DEBUG in your environment.

Contributing

Contributions are welcome! Please follow these steps to contribute:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Commit your changes (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metrics_computation_engine-1.2.3.tar.gz (2.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

metrics_computation_engine-1.2.3-py3-none-any.whl (1.6 MB view details)

Uploaded Python 3

File details

Details for the file metrics_computation_engine-1.2.3.tar.gz.

File metadata

File hashes

Hashes for metrics_computation_engine-1.2.3.tar.gz
Algorithm Hash digest
SHA256 0d65e1adeedabf3cd0b23d8d2fdfe355927a66056659e8cef723cdee528da46f
MD5 cac512024e3f77cc925f13846aa6edfe
BLAKE2b-256 964097892bdab45ba1ced1131cd9640690f2de7cbe6cc05b6e4b4f83206f13c9

See more details on using hashes here.

File details

Details for the file metrics_computation_engine-1.2.3-py3-none-any.whl.

File metadata

File hashes

Hashes for metrics_computation_engine-1.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 4b84d21c263c40c26cb56af8b7451f593b99e361d77177f4857497baf416d00f
MD5 55fd96217dfac378a4fff6eb6e0c3b72
BLAKE2b-256 3ba4d14e78ef8b9c2c60b56c32f29b98696c4e7d2b3e7d57bf543984a5be5fdb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page