Service for computing metrics on AI agent performance data

Metric Computation Engine

The Metric Computation Engine (MCE) is a tool for computing metrics from observability telemetry collected by our instrumentation SDK (https://github.com/agntcy/observe). The currently supported metrics are listed below, and the MCE is designed to make it easy to implement new metrics and extend the library over time.

Prerequisites

Python 3.11 or higher
uv package manager for dependency management
LLM API Key (OpenAI, or custom endpoint) for LLM-based metrics
Mock LLM Proxy (optional): use the mock-llm-proxy CLI for local testing without real API keys
Agentic App: get started with coffeeAgntcy, a reference agentic app implementation built on the AGNTCY ecosystem
Instrumentation: agentic apps must be instrumented with AGNTCY's observe SDK, because the MCE relies on its observability data schema

Supported metrics

Metrics can be computed at three levels of aggregation: span level, session level, and population level (a batch of sessions).

The currently supported metrics are listed in the tables below, grouped by aggregation level.

Core Metrics

Span-Level Metrics

Metric Name	Description
Tool Utilization Accuracy	Measures tool selection and usage efficiency

Session-Level Metrics

Metric Name	Description
Agent to Agent Interactions	Counts interactions between pairs of agents
Agent to Tool Interactions	Counts interactions between agents and tools
Tool Error Rate	Rate of tool errors throughout a session
Cycles Count	Number of times an entity returns to a previous entity

Population-Level Metrics

Metric Name	Description
Graph Determinism Score	Measures variance in execution patterns across multiple sessions

Plugin Architecture

The MCE uses a plugin-based architecture for extensibility:

Native Metrics Plugins: Unique agent metrics to evaluate conversation, orchestration, tool usage quality
Third-party Adapter Plugins: Third-party framework integrations (RAGAS, DeepEval, Opik)

Native Metrics Plugin

The MCE includes a native metrics plugin that provides 13 session-level and span-level metrics for AI agent evaluation. These metrics use LLM-as-a-Judge techniques and confidence analysis. For the full list of plugin metrics and detailed descriptions, see the Native Metrics Plugin README: plugins/mce_metrics_plugin/README.md.

Third-party Adapters

The MCE supports integration with popular evaluation frameworks through adapter plugins:

DeepEval - plugins/adapters/deepeval_adapter/README.md
Opik - plugins/adapters/opik_adapter/README.md
RAGAS - plugins/adapters/ragas_adapter/README.md

Python Package Installation

For local development or custom deployments, you can install the Metric Computation Engine and its plugins directly with pip, using one of the options below.

Quick Start - Complete Platform

# Install everything - core MCE + all adapters + native metrics
pip install "metrics-computation-engine[all]"

Selective Installation

# Core MCE only
pip install metrics-computation-engine

# Core + specific adapters
pip install "metrics-computation-engine[deepeval]"
pip install "metrics-computation-engine[ragas]"
pip install "metrics-computation-engine[opik]"

# Core + native LLM-based metrics
pip install "metrics-computation-engine[metrics-plugin]"

# Mix and match as needed
pip install "metrics-computation-engine[deepeval,metrics-plugin]"

Note for zsh users: If you encounter zsh: no matches found errors, quote the package name with extras (e.g., "metrics-computation-engine[opik]").

Getting started

Environment Configuration

Configure the following variables in your .env file:

# Server Configuration
HOST=0.0.0.0                    # MCE Server bind address
PORT=8000                       # MCE Server port
RELOAD=false                    # Enable auto-reload for development
LOG_LEVEL=info                  # Logging level (debug, info, warning, error)

# Data Access Configuration
API_BASE_URL=http://localhost:8080       # API-layer endpoint
PAGINATION_LIMIT=50                      # Max sessions per API request
PAGINATION_DEFAULT_MAX_SESSIONS=50       # Default max sessions when not specified
SESSIONS_TRACES_MAX=20                   # Max sessions per batch for trace retrieval

# LLM Configuration
LLM_BASE_MODEL_URL=https://api.openai.com/v1  # LLM API endpoint
LLM_MODEL_NAME=gpt-4o                          # LLM model name
LLM_API_KEY=sk-...                             # LLM API key

Note: LLM configuration can be provided via environment variables (global defaults) or per-request in the llm_judge_config parameter. Request-level configuration takes precedence.
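
The precedence rule in the note above can be pictured as a simple merge. The following is a minimal sketch, not the MCE's own code: the helper name resolve_llm_config is hypothetical, and it assumes only the three variable names listed above, with any non-empty per-request value overriding the environment default.

```python
import os

# Keys shared by the .env file and the per-request llm_judge_config.
LLM_KEYS = ("LLM_BASE_MODEL_URL", "LLM_MODEL_NAME", "LLM_API_KEY")


def resolve_llm_config(request_config=None, env=None):
    """Hypothetical helper: merge environment defaults with per-request overrides."""
    env = os.environ if env is None else env
    request_config = request_config or {}
    resolved = {}
    for key in LLM_KEYS:
        # A non-empty value supplied in the request wins over the environment default.
        value = request_config.get(key) or env.get(key)
        if value is not None:
            resolved[key] = value
    return resolved
```

Calling resolve_llm_config({"LLM_MODEL_NAME": "mock-model"}) would, under these assumptions, keep the URL and API key from the environment and replace only the model name.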

Mock LiteLLM Proxy

For local development you can avoid using real API keys by starting the bundled mock proxy. It implements the POST /chat/completions endpoint expected by LiteLLM and returns deterministic scores.

uv run mock-llm-proxy --port 8010

Update your .env or per-request config to point at the proxy:

"llm_judge_config": {
  "LLM_BASE_MODEL_URL": "http://localhost:8010",
  "LLM_MODEL_NAME": "mock-model",
  "LLM_API_KEY": "test"
}

CLI options let you tune the score and reasoning. Run uv run mock-llm-proxy --help for the full list.

Examples Directory

Several example scripts are available to help you get started with the MCE:

Basic usage — service (service_test.py): Sends a request to a running MCE server (POST /compute_metrics) with metrics, llm_judge_config, and data_fetching_infos.batch_config.time_range.
Basic usage — library (mce-demo.py): Runs MCE in-process. Loads data/sample_data.json, builds a MetricRegistry, registers core and native plugin metrics, demonstrates 3rd‑party adapters (DeepEval, Opik), and executes MetricsProcessor with LLMJudgeConfig from .env.
Sample data (data/sample_data.json): Synthetic raw spans used by mce-demo.py.

MCE usage

The MCE can be used in two ways: as a REST API service or as a Python module. Both methods allow you to compute various metrics on your agent telemetry data.

There are three main input parameters to the MCE, as shown in the examples above: metrics, llm_judge_config, and data_fetching_infos.

Metrics Parameter

The metrics parameter is a list of the metric names you want to compute. Each metric operates at a specific level (span, session, or population) and may have different computational requirements. You can specify any combination of the available metrics:

"metrics": [
    "ToolUtilizationAccuracy",
    "ToolError",
    "ToolErrorRate",
    "AgentToToolInteractions",
    "AgentToAgentInteractions",
    "CyclesCount",
    "Groundedness"
]

Using 3rd‑party adapters (RAGAS, DeepEval, Opik)

You can request 3rd‑party framework metrics through adapter plugins by using a dotted identifier in metrics:

deepeval.<MetricName> (e.g., deepeval.AnswerRelevancyMetric)
opik.<MetricName> (e.g., opik.Hallucination)
ragas.<MetricName> (see adapter README for available names)

"metrics": [
    "deepeval.AnswerRelevancyMetric"
]
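
The dotted naming convention can be illustrated with a small parser. This is only a sketch of the convention described above: split_metric_identifier and KNOWN_ADAPTERS are hypothetical names, and how the MCE itself resolves adapter metrics is not shown in this README.

```python
# Adapter prefixes named in this README; treat the set as illustrative.
KNOWN_ADAPTERS = {"deepeval", "opik", "ragas"}


def split_metric_identifier(identifier):
    """Hypothetical helper: return (adapter, metric_name).

    The adapter is None for plain metric names without a known prefix.
    """
    prefix, sep, name = identifier.partition(".")
    if sep and prefix in KNOWN_ADAPTERS and name:
        return prefix, name
    return None, identifier
```

A request can mix both kinds of names in the same metrics list; a helper like this only makes the distinction between them explicit.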

LLM Judge Config

The llm_judge_config parameter configures the LLM used for metrics that require LLM-as-a-Judge evaluation (such as ToolUtilizationAccuracy and Groundedness):

"llm_judge_config": {
    "LLM_API_KEY": "your_api_key", # API key for your LLM provider
    "LLM_MODEL_NAME": "gpt-4o", # The specific model to use (e.g., "gpt-4o")
    "LLM_BASE_MODEL_URL": "https://api.openai.com/v1" # API endpoint URL (supports OpenAI-compatible APIs)
}

Data Fetching Infos

Use data_fetching_infos to select which sessions to evaluate. You can provide a time range via batch_config.time_range, explicit session_ids, or both.

By time range

"data_fetching_infos": {
  "batch_config": {
    "time_range": {
      "start": "2024-01-01T00:00:00Z",
      "end": "2024-12-31T23:59:59Z"
    }
  },
  "session_ids": []
}

By explicit session IDs

"data_fetching_infos": {
  "batch_config": {},
  "session_ids": ["<session_id_1>", "<session_id_2>", ... "<session_id_n>"]
}
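
When building requests from Python, the two selection modes can be produced by one small function. A minimal sketch, assuming the field names shown above; build_data_fetching_infos is a hypothetical helper and not part of the MCE package.

```python
from datetime import datetime, timezone


def _to_utc_string(moment):
    # Format a timezone-aware datetime as ISO 8601 in UTC with a trailing "Z",
    # matching the timestamp style used in the examples above.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_data_fetching_infos(start=None, end=None, session_ids=None):
    """Hypothetical helper: build the data_fetching_infos dictionary."""
    batch_config = {}
    # A time range is only added when both bounds are given.
    if start is not None and end is not None:
        batch_config["time_range"] = {
            "start": _to_utc_string(start),
            "end": _to_utc_string(end),
        }
    return {"batch_config": batch_config, "session_ids": list(session_ids or [])}
```

Passing only session_ids yields an empty batch_config, as in the second example above; passing start and end yields the time-range form.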

Deployment as a service

There are two ways to run the MCE service:

Docker Compose (recommended for a full local stack)
- Use the provided docker compose file to start OTel Collector, ClickHouse, the API layer, and the MCE.
- Once up, instrument an app with our Observe SDK to generate traces.

Run the server locally

Activate your virtual environment and start the server:

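# Option 1: activate the virtual environment, then start the server
source .venv/bin/activate
mce-server

# Option 2: start the server through uv, loading variables from .env
source .venv/bin/activate
uv run --env-file .env mce-server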

API Endpoints

GET / - Returns available endpoints
GET /metrics - List all available metrics and their metadata
GET /status - Health check and server status
POST /compute_metrics - Compute metrics from JSON configuration (see examples/service_test.py)

The server provides automatic OpenAPI documentation at http://<HOST>:<PORT>/docs when running.

You can call the MCE with curl at <HOST>:<PORT>, as configured in your .env file. To run an evaluation, send a POST request to /compute_metrics:

Example:

curl -sS -X POST "http://<HOST>:<PORT>/compute_metrics" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": [
      "Groundedness"
    ],
    "llm_judge_config": {
      "LLM_BASE_MODEL_URL": "https://api.openai.com/v1",
      "LLM_MODEL_NAME": "gpt-4o",
      "LLM_API_KEY": "api-key"
    },
    "data_fetching_infos": {
      "batch_config": {"time_range": {"start": "2000-06-20T15:04:05Z", "end": "2040-06-29T08:52:55Z"}},
      "session_ids": []
    },
    "metric_options": {
      "computation_level": ["session"],
      "write_to_db": false
    }
  }'

The payload for this POST request must be JSON and contain at least the following two fields:

metrics: a list containing the names of the metrics to compute.
data_fetching_infos: a dictionary containing the information used to select a set of sessions. Provide either a batch_config, which consists of a time_range with a start and end time, or a list of session IDs in the session_ids field (see the example above).

In addition to this, there are two optional fields:

llm_judge_config: a dictionary that holds the configuration of the LLM used as a judge. If not provided, the values from the environment variables are used.
metric_options: a dictionary of options for the metric computation. Two options are currently available: computation_level and write_to_db. computation_level is a list of the levels at which metrics should be computed; the MCE currently supports the session and agent levels, and defaults to session. write_to_db is a boolean indicating whether the results of this query should be stored in the database; it defaults to false, but if the environment variable METRICS_CACHE_ENABLED is set to true, the results are always stored.
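
The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: build_compute_request and compute_metrics are hypothetical helper names, the field names come from the description above, and the response is assumed to be a JSON body, which this README does not specify.

```python
import json
import urllib.request


def build_compute_request(base_url, metrics, data_fetching_infos,
                          llm_judge_config=None, metric_options=None):
    """Build (without sending) a POST request for /compute_metrics."""
    payload = {"metrics": list(metrics), "data_fetching_infos": data_fetching_infos}
    # Optional fields are only included when provided.
    if llm_judge_config is not None:
        payload["llm_judge_config"] = llm_judge_config
    if metric_options is not None:
        payload["metric_options"] = metric_options
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/compute_metrics",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def compute_metrics(request, timeout=60):
    """Send the request and decode the response, assumed here to be JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made; compute_metrics requires a running MCE server at the given base URL.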

Troubleshooting

Common Issues:

ModuleNotFoundError: Ensure virtual environment is activated and dependencies installed via ./install.sh
LLM API Errors: Verify API keys in .env file and check rate limits
Plugin Load Failures: Run ./install-plugins.sh to install required adapter plugins
Memory Issues: Reduce batch sizes in configuration for large datasets
Docker Build Failures: Check Docker daemon is running and remove any cached layers

For detailed debugging, enable verbose logging by setting LOG_LEVEL=debug in your environment.

Contributing

Contributions are welcome! Please follow these steps to contribute:

Fork the repository.
Create a new branch (git checkout -b feature-branch).
Commit your changes (git commit -am 'Add new feature').
Push to the branch (git push origin feature-branch).
Create a new Pull Request.


