Service for computing metrics on AI agent performance data
Project description
Metric Computation Engine
The Metric Computation Engine (MCE) is a tool for computing metrics from observability telemetry collected from our instrumentation SDK (https://github.com/agntcy/observe). The list of currently supported metrics is defined below, but the MCE was designed to make it easy to implement new metrics and extend the library over time.
Prerequisites
- Python 3.11 or higher
- uv package manager for dependency management
- LLM API Key (OpenAI, or custom endpoint) for LLM-based metrics
- Mock LLM Proxy (optional) use
mock-llm-proxyCLI for local testing without real API keys - Agentic App: Get started with coffeeAgntcy reference Agentic App implementation using the AGNTCY ecosystem.
- Instrumentation: Agentic apps must be instrumented with AGNTCY's observe SDK as the MCE relies on its observability data schema
Supported metrics
Metrics can be computed at three levels of aggregation: span level, session level and population level (which is a batch of sessions).
The current supported metrics are listed in the table below, along with their aggregation levels.
Core Metrics
Span-Level Metrics
| Metric Name | Description |
|---|---|
| Tool Utilization Accuracy | Measures tool selection and usage efficiency |
Session-Level Metrics
| Metric Name | Description |
|---|---|
| Agent to Agent Interactions | Counts interactions between pairs of agents |
| Agent to Tool Interactions | Counts interactions between agents and tools |
| Tool Error Rate | Rate of tool errors throughout a session |
| Cycles Count | How many times an entity returns to previous entity |
Population-Level Metrics
| Metric Name | Description |
|---|---|
| Graph Determinism Score | Measures variance in execution patterns across multiple sessions |
Plugin Architecture
The MCE uses a plugin-based architecture for extensibility:
- Native Metrics Plugins: Unique agent metrics to evaluate conversation, orchestration, tool usage quality
- Third-party Adapter Plugins: Third-party framework integrations (RAGAS, DeepEval, Opik)
Native Metrics Plugin
The MCE includes a comprehensive native metrics plugin that provides 13 advanced session-level and span-level metrics for AI agent evaluation. These metrics use LLM-as-a-Judge techniques and confidence analysis for comprehensive assessment. For additional plugin metrics and detailed descriptions, see the Native Metrics Plugin README: plugins/mce_metrics_plugin/README.md.
Third-party Adapters
The MCE supports integration with popular evaluation frameworks through adapter plugins:
- DeepEval - plugins/adapters/deepeval_adapter/README.md
- Opik - plugins/adapters/opik_adapter/README.md
- RAGAS - plugins/adapters/ragas_adapter/README.md
Python Package Installation
For local development or custom deployments, you can install the Metrics Computation Engine and its plugins directly via pip:
Quick Start - Complete Platform
# Install everything - core MCE + all adapters + native metrics
pip install "metrics-computation-engine[all]"
Selective Installation
# Core MCE only
pip install metrics-computation-engine
# Core + specific adapters
pip install "metrics-computation-engine[deepeval]"
pip install "metrics-computation-engine[ragas]"
pip install "metrics-computation-engine[opik]"
# Core + native LLM-based metrics
pip install "metrics-computation-engine[metrics-plugin]"
# Mix and match as needed
pip install "metrics-computation-engine[deepeval,metrics-plugin]"
Note for zsh users: If you encounter zsh: no matches found errors, quote the package name with extras (e.g., "metrics-computation-engine[opik]").
Getting started
Environment Configuration
Configure the following variables in your .env file:
# Server Configuration
HOST=0.0.0.0 # MCE Server bind address
PORT=8000 # MCE Server port
RELOAD=false # Enable auto-reload for development
LOG_LEVEL=info # Logging level (debug, info, warning, error)
# Data Access Configuration
API_BASE_URL=http://localhost:8080 # API-layer endpoint
PAGINATION_LIMIT=50 # Max sessions per API request
PAGINATION_DEFAULT_MAX_SESSIONS=50 # Default max sessions when not specified
SESSIONS_TRACES_MAX=20 # Max sessions per batch for trace retrieval
# LLM Configuration
LLM_BASE_MODEL_URL=https://api.openai.com/v1 # LLM API endpoint
LLM_MODEL_NAME=gpt-4o # LLM model name
LLM_API_KEY=sk-... # LLM API key
Note: LLM configuration can be provided via environment variables (global defaults) or per-request in the llm_judge_config parameter. Request-level configuration takes precedence.
Mock LiteLLM Proxy
For local development you can avoid using real API keys by starting the bundled mock proxy. It implements the POST /chat/completions endpoint expected by LiteLLM and returns deterministic scores.
uv run mock-llm-proxy --port 8010
Update your .env or per-request config to point at the proxy:
"llm_judge_config": {
"LLM_BASE_MODEL_URL": "http://localhost:8010",
"LLM_MODEL_NAME": "openai/mock-model",
"LLM_API_KEY": "test"
}
CLI options let you tune the score and reasoning. Run uv run mock-llm-proxy --help for the full list.
Examples Directory
Several example scripts are available to help you get started with the MCE:
- Basic usage — service (
service_test.py): Sends a request to a running MCE server (POST/compute_metrics) withmetrics,llm_judge_config, anddata_fetching_infos.batch_config.time_range. - Basic usage — library (
mce-demo.py): Runs MCE in-process. Loadsdata/sample_data.json, builds aMetricRegistry, registers core and native plugin metrics, demonstrates 3rd‑party adapters (DeepEval, Opik), and executesMetricsProcessorwithLLMJudgeConfigfrom.env. - Sample data (
data/sample_data.json): Synthetic raw spans used bymce-demo.py.
MCE usage
The MCE can be used in two ways: as a REST API service or as a Python module. Both methods allow you to compute various metrics on your agent telemetry data.
There are three main input parameters to the MCE, as shown in the examples above: metrics, llm_judge_config, and data_fetching_infos.
Metrics Parameter
The metrics parameter is a list of metric names that you want to compute. Each metric operates at different levels (span, session, or population) and may have different computational requirements. You can specify any combination of the available metrics:
"metrics": [
"ToolUtilizationAccuracy",
"ToolError",
"ToolErrorRate",
"AgentToToolInteractions",
"AgentToAgentInteractions",
"CyclesCount",
"Groundedness",
]
Using 3rd‑party adapters (RAGAS, DeepEval, Opik)
You can request 3rd‑party framework metrics through adapter plugins by using a dotted identifier in metrics:
deepeval.<MetricName>(e.g.,deepeval.AnswerRelevancyMetric)opik.<MetricName>(e.g.,opik.Hallucination)ragas.<MetricName>(see adapter README for available names)
"metrics": [
"deepeval.AnswerRelevancyMetric",
]
LLM Judge Config
The llm_judge_config parameter configures the LLM used for metrics that require LLM-as-a-Judge evaluation (such as ToolUtilizationAccuracy and Groundedness):
"llm_judge_config": {
"LLM_API_KEY": "your_api_key", # API key for your LLM provider
"LLM_MODEL_NAME": "gpt-4o", # The specific model to use (e.g., "gpt-4o")
"LLM_BASE_MODEL_URL": "https://api.openai.com/v1" # API endpoint URL (supports OpenAI-compatible APIs)
}
Data Fetching Infos
Use data_fetching_infos to select which sessions to evaluate. You can provide a time range via batch_config.time_range, explicit session_ids, or both.
By time range
"data_fetching_infos": {
"batch_config": {
"time_range": {
"start": "2024-01-01T00:00:00Z",
"end": "2024-12-31T23:59:59Z"
}
},
"session_ids": []
}
By explicit session IDs
"data_fetching_infos": {
"batch_config": {},
"session_ids": ["<session_id_1>", "<session_id_2>", ... "<session_id_n>"]
}
Deployment as a service
There are two ways to run the MCE service:
-
Docker Compose (recommended for a full local stack)
- Use the provided docker compose file to start OTel Collector, ClickHouse, the API layer, and the MCE.
- Once up, instrument an app with our Observe SDK to generate traces.
-
Run the server locally
- Activate your virtual environment and start the server:
source .venv/bin/activate mce-server
or.venv/bin/activate uv run --env-file .env mce-server
- Activate your virtual environment and start the server:
API Endpoints
GET /- Returns available endpointsGET /metrics- List all available metrics and their metadataGET /status- Health check and server statusPOST /compute_metrics- Compute metrics from JSON configuration (see examples/service_test.py)
The server provides automatic OpenAPI documentation at http://<HOST>:<PORT>/docs when running.
You can run the MCE by making a curl call to the endpoint <HOST>:<PORT> as defined in the .env. Perform an evaluation by sending a POST request to /compute_metrics:
Example:
curl -sS -X POST "http://<HOST>:<PORT>/compute_metrics" \
-H "Content-Type: application/json" \
-d '{
"metrics": [
"Groundedness"
],
"llm_judge_config": {
"LLM_BASE_MODEL_URL": "https://api.openai.com/v1",
"LLM_MODEL_NAME": "gpt-4o",
"LLM_API_KEY": "api-key"
},
"data_fetching_infos": {
"batch_config": {"time_range": {"start": "2000-06-20T15:04:05Z", "end": "2040-06-29T08:52:55Z"}},
"session_ids": []
},
"metric_options": {
"computation_level": ["session"],
"write_to_db": false
}
}'
The payload for this POST request must be in JSON format, and contains at least the two following fields:
metrics: a list containing the name of the metrics that should be computed.data_fetching_infos: a dictionary containing the information to select a set of sessions. This is achieved by either providing abatch_config, which consist of atime_rangewith astartandendtime; or a list of session ids, through thesession_idsfield (see the example above).
In addition to this, there are two optional fields:
llm_judge_config: a dictionary that holds the information related to the configuration of the LLM as a Judge. if not provided, the information provided by the environment variables will be used.metric_options: a dictionary for the different options for the metrics. Currently, there are two options,computation_levelandwrite_to_db. Thecomputation_levelis a list of levels at which the metric computation should happen. The MCE currently supportssessionandagentlevels. By default, thesessionlevel is enforced. Thewrite_to_dbis a boolean to indicate if the results of this query should be stored into the DB. By default, this is set tofalse, but if the environment variableMETRICS_CACHE_ENABLEDis set to true, the results will always be stored into the DB.
Troubleshooting
Common Issues:
ModuleNotFoundError: Ensure virtual environment is activated and dependencies installed via./install.sh- LLM API Errors: Verify API keys in
.envfile and check rate limits - Plugin Load Failures: Run
./install-plugins.shto install required adapter plugins - Memory Issues: Reduce batch sizes in configuration for large datasets
- Docker Build Failures: Check Docker daemon is running and remove any cached layers
For detailed debugging, enable verbose logging by setting LOG_LEVEL=DEBUG in your environment.
Contributing
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch (
git checkout -b feature-branch). - Commit your changes (
git commit -am 'Add new feature'). - Push to the branch (
git push origin feature-branch). - Create a new Pull Request.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file metrics_computation_engine-1.2.5.tar.gz.
File metadata
- Download URL: metrics_computation_engine-1.2.5.tar.gz
- Upload date:
- Size: 2.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5c63535a6009404b8e8430b9db1e01abd2525375f4553172ae9e8100ebbaa37a
|
|
| MD5 |
097e6d164f0e8a1706183e7eeea9225d
|
|
| BLAKE2b-256 |
be166dcafb5698e3ff50c9565c80e1c37bd771d5ba6668050ad875097e7631e3
|
File details
Details for the file metrics_computation_engine-1.2.5-py3-none-any.whl.
File metadata
- Download URL: metrics_computation_engine-1.2.5-py3-none-any.whl
- Upload date:
- Size: 1.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d5b363e96c4bd896fd33fd6daf957e3e2f729835db337b5c1d233f9787b9af4c
|
|
| MD5 |
1452dd50c2ba48e146a353c692d84551
|
|
| BLAKE2b-256 |
fb942e55554985f32e5d94f8efbf5363165c0bd9837491a522285ef066d0fff3
|