
bifrost-eval


MCP pipeline evaluation toolkit — grade AI agent workflows on accuracy, cost, and reliability.

What It Does

bifrost-eval evaluates multi-agent MCP pipelines as complete workflows, not just individual prompts. It answers:

  • Did the pipeline get the right answer? (accuracy scoring)
  • Did agents use the right tools in the right order? (tool correctness)
  • How fast was it? (latency breakdown per agent/tool)
  • How much did it cost? (cost attribution per agent/tool)
  • How do different configurations compare? (A/B testing)
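The tool-correctness idea can be pictured with a small sketch (plain Python for illustration; this is an assumed scoring scheme, not bifrost-eval's internal formula):

```python
def tool_order_score(expected: list[str], actual: list[str]) -> float:
    """Score 1.0 when the pipeline called exactly the expected tools in
    the expected order; otherwise credit the longest matching prefix."""
    if expected == actual:
        return 1.0
    matched = 0
    for want, got in zip(expected, actual):
        if want != got:
            break
        matched += 1
    return matched / max(len(expected), len(actual), 1)

print(tool_order_score(["search", "calculator"], ["search", "calculator"]))  # 1.0
print(tool_order_score(["search", "calculator"], ["calculator"]))            # 0.0
```

A prefix-based score like this rewards pipelines that start down the right tool path while still penalizing missing or extra calls.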

Install

pip install bifrost-eval

With agent-mcp-framework integration:

pip install "bifrost-eval[amf]"

Quick Start

import asyncio
from bifrost_eval import (
    AccuracyMetric,
    CostEfficiencyMetric,
    EvalRunner,
    EvalSuite,
    LatencyMetric,
    Scenario,
    ToolCorrectnessMetric,
)

# Define test scenarios
suite = EvalSuite(
    name="my-agent-eval",
    scenarios=[
        Scenario(
            name="basic-query",
            input_data={"query": "What is 2+2?"},
            expected_output=4,
            expected_tool_calls=["calculator"],
        ),
    ],
)

# Implement the PipelineExecutor protocol for your agent
class MyExecutor:
    async def execute(self, scenario):
        from bifrost_eval import ExecutionTrace
        # Run your agent pipeline here; record tool calls, latency,
        # and cost on the trace so each metric has data to score
        return ExecutionTrace(output=4, success=True)

# Run evaluation
runner = EvalRunner(
    executor=MyExecutor(),
    metrics=[
        AccuracyMetric(weight=2.0),
        ToolCorrectnessMetric(weight=1.0),
        LatencyMetric(target_ms=5000),
        CostEfficiencyMetric(budget_usd=0.10),
    ],
)

result = asyncio.run(runner.run_suite(suite))
print(f"Pass rate: {result.pass_rate:.0%}")
print(f"Grade: {result.grade.value}")
print(f"Total cost: ${result.total_cost.total_usd:.4f}")
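Per-agent cost attribution of the kind summarized by `result.total_cost` can be pictured as a simple roll-up over per-call records. This is a sketch only; the record fields below are assumptions for illustration, not bifrost-eval's schema:

```python
from collections import defaultdict

# Hypothetical per-call cost records collected during one pipeline run
calls = [
    {"agent": "planner", "tool": "llm", "usd": 0.012},
    {"agent": "solver", "tool": "calculator", "usd": 0.0},
    {"agent": "solver", "tool": "llm", "usd": 0.021},
]

# Roll individual call costs up to the agent level
per_agent = defaultdict(float)
for call in calls:
    per_agent[call["agent"]] += call["usd"]

total_usd = sum(per_agent.values())
print(dict(per_agent))      # {'planner': 0.012, 'solver': 0.021}
print(f"${total_usd:.4f}")  # $0.0330
```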

A/B Comparison

from bifrost_eval.adapters.comparison import ComparisonRunner

comparator = ComparisonRunner(metrics=[AccuracyMetric(), CostEfficiencyMetric()])
result = asyncio.run(comparator.compare(
    suite,
    {"config-a": executor_a, "config-b": executor_b},
))
print(f"Winner: {result.winner}")

agent-mcp-framework Integration

from agent_mcp_framework import SequentialPipeline
from bifrost_eval.adapters.amf_adapter import AMFAdapter

pipeline = SequentialPipeline("my-pipeline", agents=[...])
adapter = AMFAdapter(pipeline)
runner = EvalRunner(executor=adapter, metrics=[...])

CLI

# Validate a suite file
bifrost-eval validate suite.json

# Show version
bifrost-eval --version
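A suite file for `bifrost-eval validate` presumably mirrors the `EvalSuite` and `Scenario` fields shown in the Quick Start. The example below is a guess based on those Python field names, not a confirmed schema:

```json
{
  "name": "my-agent-eval",
  "scenarios": [
    {
      "name": "basic-query",
      "input_data": {"query": "What is 2+2?"},
      "expected_output": 4,
      "expected_tool_calls": ["calculator"]
    }
  ]
}
```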

Metrics

Metric                  What It Measures            Default Weight
AccuracyMetric          Output correctness          1.0
ToolCorrectnessMetric   Right tools, right order    1.0
LatencyMetric           Speed vs target             1.0
CostEfficiencyMetric    Cost vs budget              1.0
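How per-metric weights might combine into an overall score can be sketched as a weighted average. This is illustrative arithmetic only; bifrost-eval's exact grading formula is not documented here:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-metric scores (each 0.0-1.0) by their weights."""
    total_weight = sum(weights.values())
    return sum(scores[m] * weights[m] for m in scores) / total_weight

# Accuracy weighted 2.0, as in the Quick Start runner configuration
scores = {"accuracy": 1.0, "tool_correctness": 1.0, "latency": 0.8, "cost": 0.5}
weights = {"accuracy": 2.0, "tool_correctness": 1.0, "latency": 1.0, "cost": 1.0}
print(round(weighted_score(scores, weights), 2))  # 0.86
```

Doubling the accuracy weight, as in the Quick Start, pulls the overall score toward output correctness and away from the speed and cost dimensions.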

License

MIT
