
FlotorchEval: the comprehensive evaluation framework for the Flotorch ecosystem



FlotorchEval is a comprehensive evaluation framework for the Flotorch ecosystem. It evaluates both LLM outputs, using industry-standard metrics from DeepEval and Ragas, and agent behaviors, using custom metrics, with support for OpenTelemetry traces and advanced cost/usage analysis.


✨ Features

🎯 LLM Evaluation

| Feature | Description |
|---------|-------------|
| 🔧 Multi-Engine Support | DeepEval and Ragas metrics with automatic engine selection |
| 📊 RAG Metrics | Faithfulness, context relevancy, context precision, context recall, answer relevance, and hallucination detection |
| 🔌 Flexible Architecture | Pluggable metric system with configurable thresholds |
| 🎯 Priority-Based Routing | Automatic metric-to-engine mapping based on priority |

🤖 Agent Evaluation

| Feature | Description |
|---------|-------------|
| 🎨 Custom Evaluation Framework | Purpose-built evaluation system for agent trajectories |
| 📈 Trajectory Analysis | Evaluate agent behavior using OpenTelemetry traces |
| 🧠 LLM-Based Metrics | Trajectory evaluation with and without reference comparisons |
| ✅ Goal Accuracy | Measure whether the agent achieves its intended goals |
| 🛠️ Tool Call Tracking | Analyze tool usage and accuracy |
| ⚡ Latency & Cost Metrics | Track performance and resource usage |

📦 Installation

Install the base package:

pip install flotorch-eval

With development tools:

pip install "flotorch-eval[dev]"

🚀 Quick Start

LLM Evaluation

Evaluate RAG system outputs using DeepEval and Ragas metrics:

from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey

# Initialize evaluator
evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model"
)

# Prepare evaluation data
data = [
    EvaluationItem(
        question="What is machine learning?",
        generated_answer="Machine learning is a subset of AI...",
        expected_answer="Machine learning is a method of data analysis...",
        context=["Machine learning (ML) is a field of artificial intelligence..."]
    )
]

# Evaluate with specific metrics
results = evaluator.evaluate(
    data=data,
    metrics=[
        MetricKey.FAITHFULNESS,
        MetricKey.ANSWER_RELEVANCE,
        MetricKey.CONTEXT_RELEVANCY
    ]
)

print(results)

💡 Tip: Passing metrics is optional. If no metrics are provided, the data is evaluated on all available metrics from both evaluation engines.

⚠️ Note: Aspect critic requires a metric configuration to be passed to the LLMEvaluator, so it is not included by default. The configuration structure is shown below.

Advanced LLM Evaluation with Custom Thresholds

DeepEval metrics can be configured with specific threshold values, which directly affect the score. The default threshold is 0.7.

# Configure metric-specific arguments
metric_args = {
    "faithfulness": {"threshold": 0.8},
    "answer_relevance": {"threshold": 0.7},
    "hallucination": {"threshold": 0.3}
}

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    metric_args=metric_args
)

# Get all available metrics
available_metrics = evaluator.get_all_metrics()
print(f"Available metrics: {available_metrics}")

# Evaluate with all available metrics
results = evaluator.evaluate(data=data)

Engine Selection Modes

Flotorch currently supports two evaluation engines: Ragas and DeepEval. Each offers both distinct and overlapping metrics, and you can choose how to run them.

Auto Mode (Default - Recommended)

Automatically routes metrics to the best engine with priority-based selection:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='auto'  # Default behavior
)

# Automatically routes metrics to appropriate engines
# Ragas has priority for overlapping metrics (faithfulness, answer_relevance, context_precision)
results = evaluator.evaluate(data=data)

How Auto Mode Works:

  • Metrics supported by multiple engines are routed to Ragas (priority 1)
  • Metrics unique to an engine use that specific engine
  • Example: FAITHFULNESS → Ragas, HALLUCINATION → DeepEval
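The routing rule can be pictured as a small pure-Python sketch. This is illustrative only, not the library's actual implementation; the engine support sets are taken from the metric tables in this README.

```python
# Illustrative sketch of priority-based metric routing -- NOT the actual
# FlotorchEval implementation. Support sets follow this README's metric tables.
RAGAS_METRICS = {
    "faithfulness", "answer_relevance", "context_precision", "aspect_critic",
}
DEEPEVAL_METRICS = {
    "faithfulness", "answer_relevance", "context_relevancy",
    "context_precision", "context_recall", "hallucination",
}

def route_metric(metric: str) -> str:
    """Route a metric to an engine: Ragas first (priority 1), then DeepEval."""
    if metric in RAGAS_METRICS:
        return "ragas"
    if metric in DEEPEVAL_METRICS:
        return "deepeval"
    raise ValueError(f"Unsupported metric: {metric}")

print(route_metric("faithfulness"))   # supported by both -> "ragas"
print(route_metric("hallucination"))  # DeepEval-only -> "deepeval"
```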

Ragas-Only Mode

Use only Ragas metrics:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='ragas'
)

# Only Ragas metrics will be evaluated
# Metrics: faithfulness, answer_relevance, context_precision, aspect_critic

DeepEval-Only Mode

Use only DeepEval metrics:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='deepeval'
)

# Only DeepEval metrics will be evaluated
# Metrics: faithfulness, answer_relevance, context_relevancy, context_precision, 
#          context_recall, hallucination

Engine Priority

When using auto mode, metrics are routed based on priority:

| Priority | Engine | Overlapping Metrics |
|----------|--------|---------------------|
| 1 (Highest) | Ragas | faithfulness, answer_relevance, context_precision |
| 2 | DeepEval | faithfulness, answer_relevance, context_precision |

⚠️ Note: Ragas requires an embedding model for most metrics. If no embedding model is provided, only DeepEval metrics will be available.

Agent Evaluation

Evaluate agent trajectories using OpenTelemetry traces. The AgentEvaluator supports evaluating agent behavior across multiple dimensions including trajectory quality, tool usage, goal achievement, latency, and cost.

Basic Agent Evaluation

from flotorch_eval.agent_eval import AgentEvaluator

# Initialize the evaluation client
client = AgentEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    default_evaluator="flotorch/inference_model"  # LLM model for evaluation metrics
)

# Fetch trace data from Flotorch API using trace ID
trace = client.fetch_traces(trace_id="your-trace-id")

# Evaluate with all default metrics
results = await client.evaluate(trace=trace)

# Access results
for metric_result in results.scores:
    print(f"Metric: {metric_result.name}, Score: {metric_result.score}")
    print(f"Details: {metric_result.details}")

Agent Evaluation with Custom Metrics

You can specify which metrics to evaluate by passing a list of metric instances:

from flotorch_eval.agent_eval.metrics.llm_evaluators import (
    TrajectoryEvalWithLLM,
    ToolCallAccuracy,
    AgentGoalAccuracy
)
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric

# Define custom metrics
custom_metrics = [
    TrajectoryEvalWithLLM(),
    ToolCallAccuracy(),
    AgentGoalAccuracy(),
    LatencyMetric(),
    UsageMetric()
]

# Evaluate with specific metrics
trace = client.fetch_traces(trace_id="your-trace-id")
results = await client.evaluate(trace=trace, metrics=custom_metrics)

Agent Evaluation with Reference Trajectory

Compare agent performance against a reference trajectory:

# Define a reference trajectory
reference_trajectory = {
    "input": "What is AWS Bedrock?",
    "expected_steps": [
        {
            "thought": "I need to search for information about AWS Bedrock",
            "tool_call": {
                "name": "search_tool",
                "arguments": {"query": "AWS Bedrock"}
            }
        },
        {
            "thought": "Now I can provide a comprehensive answer",
            "final_response": "AWS Bedrock is a fully managed service..."
        }
    ]
}

# Evaluate with reference
trace = client.fetch_traces(trace_id="your-trace-id")
results = await client.evaluate(
    trace=trace,
    reference=reference_trajectory
)

Alternatively, you can fetch a reference trace by providing a reference trace ID:

# Evaluate using reference trace ID
results = await client.evaluate(
    trace=trace,
    reference_trace_id="reference-trace-id"
)

Working with Agent Traces

If you're using Flotorch agents, you can retrieve traces directly from the agent client:

from flotorch.adk.agent import FlotorchADKAgent

# Initialize agent client
agent_client = FlotorchADKAgent(
    agent_name="your-agent-name",
    base_url="flotorch-base-url",
    api_key="your-api-key"
)

# Get trace IDs from agent
trace_ids = agent_client.get_tracer_ids()

# Evaluate each trace
for trace_id in trace_ids:
    trace = client.fetch_traces(trace_id=trace_id)
    results = await client.evaluate(trace=trace)
    print(f"Evaluation results for trace {trace_id}: {results}")

📚 Examples & Cookbooks

We maintain a separate collection of notebooks to help you get started:

| Task | Notebook |
|------|----------|
| RAG Evaluation | Evaluate RAG Pipelines |
| LLM Metrics Deep Dive | Advanced LLM Evaluation |

🔗 See all examples in the Resources Repository


📊 Available Metrics

LLM/RAG Metrics

| Metric | Engine | Description |
|--------|--------|-------------|
| FAITHFULNESS | DeepEval/Ragas | Measures if the answer is factually consistent with the context |
| ANSWER_RELEVANCE | DeepEval/Ragas | Evaluates how relevant the answer is to the question |
| CONTEXT_RELEVANCY | DeepEval | Assesses if the retrieved context is relevant to the question |
| CONTEXT_PRECISION | DeepEval/Ragas | Measures whether retrieved contexts are relevant |
| CONTEXT_RECALL | DeepEval | Measures how well the retrieved context covers the expected answer |
| HALLUCINATION | DeepEval | Detects if the model generates information not in the context |
| ASPECT_CRITIC | Ragas | Custom aspect-based evaluation (requires configuration) |
| LATENCY | Gateway | Measures total and average latency across LLM calls |
| COST | Gateway | Tracks total cost of LLM operations |
| TOKEN_USAGE | Gateway | Monitors total token consumption |

Agent Metrics

| Metric | Description | Requires LLM |
|--------|-------------|--------------|
| TrajectoryEvalWithLLM | Evaluates agent trajectory quality by inferring the agent's goal from its actions and assessing whether it was successfully completed. Returns a binary score (0 or 1) with a detailed explanation. | ✅ Yes |
| TrajectoryEvalWithLLMWithReference | Compares the agent trajectory against a reference trajectory to evaluate performance. Requires a reference trajectory to be provided. | ✅ Yes |
| ToolCallAccuracy | Assesses the accuracy and appropriateness of tool calls made by the agent. Evaluates whether tools were used correctly and when they should have been used. | ✅ Yes |
| AgentGoalAccuracy | Validates whether the agent accomplished the user's intended goal. Evaluates goal perception, plan soundness, execution coherence, and final outcome. | ✅ Yes |
| LatencyMetric | Tracks latency metrics including total latency, average step latency, and a hierarchical latency breakdown across all steps in the trajectory. | ❌ No |
| UsageMetric | Monitors cost and token usage. Provides total cost, average cost per call, and a detailed cost breakdown per model and span. | ❌ No |

Default Metrics: When no metrics are specified, the evaluator uses all available metrics by default: TrajectoryEvalWithLLM, ToolCallAccuracy, AgentGoalAccuracy, UsageMetric, and LatencyMetric. If a reference is provided, TrajectoryEvalWithLLMWithReference is automatically added.
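The default-selection rule described above can be sketched as follows. This is illustrative only; the real AgentEvaluator works with metric instances, not names.

```python
# Illustrative sketch of default metric selection -- NOT the actual
# AgentEvaluator implementation, which works with metric instances.
DEFAULT_METRICS = [
    "TrajectoryEvalWithLLM",
    "ToolCallAccuracy",
    "AgentGoalAccuracy",
    "UsageMetric",
    "LatencyMetric",
]

def select_metrics(metrics=None, reference=None):
    """Use the caller's metrics, or the defaults; add the reference-based
    trajectory metric whenever a reference is supplied."""
    selected = list(metrics) if metrics is not None else list(DEFAULT_METRICS)
    if reference is not None:
        selected.append("TrajectoryEvalWithLLMWithReference")
    return selected

print(select_metrics())                            # the five defaults
print(select_metrics(reference={"input": "..."}))  # defaults + reference metric
```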


⚙️ Configuration

Gateway Metrics (Latency, Cost, Token Usage)

Gateway metrics automatically track performance and usage statistics from your LLM calls. To enable these metrics, pass the response headers from FlotorchLLM as metadata:

from flotorch.sdk.llm import FlotorchLLM
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem

# Initialize FlotorchLLM
llm = FlotorchLLM(
    model_id="flotorch/gpt-4",
    api_key="your-api-key",
    base_url="flotorch-base-url"
)

# Make LLM call with return_headers=True to get metadata
response, headers = llm.invoke(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    return_headers=True  # This returns headers containing latency, cost, and token info
)

# Create evaluation item with headers as metadata
eval_item = EvaluationItem(
    question="What is machine learning?",
    generated_answer=response.content,
    expected_answer="Machine learning is...",
    context=["Context documents..."],
    metadata=headers  # Pass headers directly as metadata
)

# Evaluate - Gateway metrics will be automatically computed
evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    inferencer_model="flotorch/gpt-4",
    embedding_model="flotorch/embedding-model"
)

results = evaluator.evaluate(data=[eval_item])

The results will include the gateway metrics: total cost, average latency, and total tokens.

💡 Note: Gateway metrics are computed automatically when metadata is present. No additional configuration is required.

Metric Arguments

Customize metric thresholds and behavior:

metric_args = {
    "faithfulness": {
        "threshold": 0.8,
        "truths_extraction_limit": 30
    },
    "answer_relevance": {
        "threshold": 0.7
    },
    "hallucination": {
        "threshold": 0.5
    },
    "context_precision": {
        "threshold": 0.7
    }
}

Ragas Aspect Critic

Configure custom evaluation aspects:

metric_args = {
    "aspect_critic": {
        "harmfulness": {
            "name": "harmfulness",
            "definition": "Does the response contain harmful content?"
        },
        "bias": {
            "name": "bias",
            "definition": "Does the response show bias or discrimination?"
        }
    }
}
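Threshold overrides and aspect-critic definitions can live in the same `metric_args` dict; the combined layout below is an assumption based on the two separate examples above.

```python
# Combined metric_args: threshold overrides plus aspect_critic definitions.
# The combined layout is assumed from the separate examples above.
metric_args = {
    "faithfulness": {"threshold": 0.8},
    "aspect_critic": {
        "harmfulness": {
            "name": "harmfulness",
            "definition": "Does the response contain harmful content?",
        },
    },
}

# Passed at construction time, as in the earlier threshold example:
# evaluator = LLMEvaluator(
#     api_key="your-api-key",
#     base_url="flotorch-base-url",
#     evaluator_llm="flotorch/inference_model",
#     embedding_model="flotorch/embedding_model",
#     metric_args=metric_args,
# )
```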

📖 API Reference

LLMEvaluator

LLMEvaluator(
    api_key: str,
    base_url: str,
    inferencer_model: str,
    embedding_model: str,
    evaluation_engine: str = 'auto',
    metrics: Optional[List[MetricKey]] = None,
    metric_configs: Optional[Dict] = None
)

Parameters:

  • api_key (str): API key for authentication
  • base_url (str): Base URL for the Flotorch service
  • inferencer_model (str): The LLM model to use for evaluation (e.g., "flotorch/gpt-4")
  • embedding_model (str): The embedding model for metrics requiring embeddings
  • evaluation_engine (str): Engine selection mode
    • 'auto' (default): Automatically routes metrics with priority-based selection
    • 'ragas': Use only Ragas metrics
    • 'deepeval': Use only DeepEval metrics
  • metrics (Optional[List[MetricKey]]): Default metrics to evaluate
    • If None with auto: Uses all available metrics from all engines
    • If None with specific engine: Uses all metrics from that engine
  • metric_configs (Optional[Dict]): Configuration for metrics requiring additional parameters (e.g., AspectCritic)

Methods:

  • evaluate(data: List[EvaluationItem], metrics: Optional[List[MetricKey]] = None) -> Dict[str, Any]
    • Evaluate the provided data using specified or default metrics
  • get_all_metrics() -> List[MetricKey]
    • Returns all available metrics based on current engine configuration
  • set_metrics(metrics: List[MetricKey]) -> None
    • Update the default metrics to use for evaluation

EvaluationItem

from dataclasses import dataclass, field

@dataclass
class EvaluationItem:
    question: str                    # The input question
    generated_answer: str            # Model's generated answer
    expected_answer: str             # Ground truth/expected answer
    context: List[str] = field(default_factory=list)        # Retrieved context documents
    metadata: Dict[str, Any] = field(default_factory=dict)  # Additional metadata

AgentEvaluator

AgentEvaluator(
    api_key: str,
    base_url: str,
    default_evaluator: Optional[str] = None
)

Parameters:

  • api_key (str): API key for authentication
  • base_url (str): Base URL for the Flotorch service
  • default_evaluator (Optional[str]): Default LLM model identifier for metrics requiring LLM evaluation (e.g., "flotorch/inference_model")

Methods:

  • fetch_traces(trace_id: str) -> Dict[str, Any]
    • Fetches trace data from Flotorch API for a given trace ID
    • Returns the trace data as a dictionary
  • evaluate(trace: Dict[str, Any], metrics: Optional[List[LLMBaseEval]] = None, reference: Optional[Dict[str, Any]] = None, reference_trace_id: Optional[str] = None) -> EvaluationResult
    • Evaluates a trace using the provided metrics
    • If metrics is None, uses all default metrics
    • If reference or reference_trace_id is provided, automatically includes TrajectoryEvalWithLLMWithReference metric
    • Returns an EvaluationResult object containing scores and details for each metric
  • set_default_evaluator(default_evaluator: str) -> None
    • Updates the default evaluator model for the client

EvaluationResult:

class EvaluationResult:
    trajectory_id: str               # Unique identifier for the trajectory
    scores: List[MetricResult]       # List of metric evaluation results
    timestamp: datetime              # Timestamp of evaluation
    metadata: Dict[str, Any]         # Additional metadata

MetricResult:

class MetricResult:
    name: str                        # Name of the metric
    score: float                     # Score (0.0 to 1.0 for most metrics)
    details: Dict[str, Any]          # Detailed evaluation information

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/FloTorch/flotorch-eval.git
cd flotorch-eval

# Install in development mode
pip install -e ".[dev]"

# Run linters
pylint flotorch_eval/
black flotorch_eval/

📚 Documentation

Full documentation is available at docs.flotorch.ai

Visit our website: flotorch.ai


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙏 Acknowledgments

  • DeepEval: Industry-standard evaluation metrics for LLMs
  • Ragas: RAG-specific evaluation framework
  • OpenTelemetry: Distributed tracing for agent evaluation

Made with ❤️ by the Flotorch Team
