
FlotorchEval: the comprehensive evaluation framework for the Flotorch ecosystem



FlotorchEval is a comprehensive evaluation framework for the Flotorch ecosystem. It evaluates both LLM outputs, using industry-standard metrics from DeepEval and Ragas, and agent behaviors, using custom metrics, with support for OpenTelemetry traces and advanced cost/usage analysis.


✨ Features

🎯 LLM Evaluation

| Feature | Description |
|---------|-------------|
| 🔧 Multi-Engine Support | DeepEval and Ragas metrics with automatic engine selection |
| 📊 RAG Metrics | Faithfulness, context relevancy, context precision, context recall, answer relevance, and hallucination detection |
| 🔌 Flexible Architecture | Pluggable metric system with configurable thresholds |
| 🎯 Priority-Based Routing | Automatic metric-to-engine mapping based on priority |

🤖 Agent Evaluation

| Feature | Description |
|---------|-------------|
| 🎨 Custom Evaluation Framework | Purpose-built evaluation system for agent trajectories |
| 📈 Trajectory Analysis | Evaluate agent behavior using OpenTelemetry traces |
| 🧠 LLM-Based Metrics | Trajectory evaluation with and without reference comparisons |
| ✅ Goal Accuracy | Measure whether the agent achieves its intended goals |
| 🛠️ Tool Call Tracking | Analyze tool usage and accuracy |
| ⚡ Latency & Cost Metrics | Track performance and resource usage |

📦 Installation

Install the base package:

pip install flotorch-eval

With development tools:

pip install "flotorch-eval[dev]"

🚀 Quick Start

LLM Evaluation

Evaluate RAG system outputs using DeepEval and Ragas metrics:

from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey

# Initialize evaluator
evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model"
)

# Prepare evaluation data
data = [
    EvaluationItem(
        question="What is machine learning?",
        generated_answer="Machine learning is a subset of AI...",
        expected_answer="Machine learning is a method of data analysis...",
        context=["Machine learning (ML) is a field of artificial intelligence..."]
    )
]

# Evaluate with specific metrics
results = evaluator.evaluate(
    data=data,
    metrics=[
        MetricKey.FAITHFULNESS,
        MetricKey.ANSWER_RELEVANCE,
        MetricKey.CONTEXT_RELEVANCY
    ]
)

print(results)

💡 Tip: Passing metrics is optional. If no metrics are provided, the data is evaluated on all available metrics from both evaluation engines.

⚠️ Note: Aspect critic requires a metric configuration to be passed to the LLMEvaluator, so it is not included by default. The configuration structure is shown below.

Advanced LLM Evaluation with Custom Thresholds

DeepEval metrics can be configured with specific threshold values, which directly affect the score. The default threshold is 0.7.

# Configure metric-specific arguments
metric_args = {
    "faithfulness": {"threshold": 0.8},
    "answer_relevance": {"threshold": 0.7},
    "hallucination": {"threshold": 0.3}
}

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    metric_args=metric_args
)

# Get all available metrics
available_metrics = evaluator.get_all_metrics()
print(f"Available metrics: {available_metrics}")

# Evaluate with all available metrics
results = evaluator.evaluate(data=data)

Engine Selection Modes

Flotorch currently supports two evaluation engines: Ragas and DeepEval. Each offers both distinct and overlapping metrics, and you can choose how to run them.

Auto Mode (Default - Recommended)

Automatically routes metrics to the best engine with priority-based selection:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='auto'  # Default behavior
)

# Automatically routes metrics to appropriate engines
# Ragas has priority for overlapping metrics (faithfulness, answer_relevance, context_precision)
results = evaluator.evaluate(data=data)

How Auto Mode Works:

  • Metrics supported by multiple engines are routed to Ragas (priority 1)
  • Metrics unique to an engine use that specific engine
  • Example: FAITHFULNESS → Ragas, HALLUCINATION → DeepEval
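The routing rule can be pictured as a small pure-Python sketch. This is illustrative only, not the library's actual implementation; the engine support sets are taken from the metric tables in this README.

```python
# Illustrative sketch of priority-based metric routing -- NOT the actual
# FlotorchEval implementation. Support sets follow this README's metric tables.
RAGAS_METRICS = {
    "faithfulness", "answer_relevance", "context_precision", "aspect_critic",
}
DEEPEVAL_METRICS = {
    "faithfulness", "answer_relevance", "context_relevancy",
    "context_precision", "context_recall", "hallucination",
}

def route_metric(metric: str) -> str:
    """Route a metric to an engine: Ragas first (priority 1), then DeepEval."""
    if metric in RAGAS_METRICS:
        return "ragas"
    if metric in DEEPEVAL_METRICS:
        return "deepeval"
    raise ValueError(f"Unsupported metric: {metric}")

print(route_metric("faithfulness"))   # supported by both -> "ragas"
print(route_metric("hallucination"))  # DeepEval-only -> "deepeval"
```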

Ragas-Only Mode

Use only Ragas metrics:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='ragas'
)

# Only Ragas metrics will be evaluated
# Metrics: faithfulness, answer_relevance, context_precision, aspect_critic

DeepEval-Only Mode

Use only DeepEval metrics:

evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    evaluator_llm="flotorch/inference_model",
    embedding_model="flotorch/embedding_model",
    evaluation_engine='deepeval'
)

# Only DeepEval metrics will be evaluated
# Metrics: faithfulness, answer_relevance, context_relevancy, context_precision, 
#          context_recall, hallucination

Engine Priority

When using auto mode, metrics are routed based on priority:

| Priority | Engine | Overlapping Metrics |
|----------|--------|---------------------|
| 1 (Highest) | Ragas | faithfulness, answer_relevance, context_precision |
| 2 | DeepEval | faithfulness, answer_relevance, context_precision |

⚠️ Note: Ragas requires an embedding model for most metrics. If no embedding model is provided, only DeepEval metrics will be available.

Agent Evaluation

Evaluate agent trajectories using OpenTelemetry traces. The AgentEvaluator supports evaluating agent behavior across multiple dimensions including trajectory quality, tool usage, goal achievement, latency, and cost.

Basic Agent Evaluation

from flotorch_eval.agent_eval import AgentEvaluator

# Initialize the evaluation client
client = AgentEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    default_evaluator="flotorch/inference_model"  # LLM model for evaluation metrics
)

# Fetch trace data from Flotorch API using trace ID
trace = client.fetch_traces(trace_id="your-trace-id")

# Evaluate with all default metrics
results = await client.evaluate(trace=trace)

# Access results
for metric_result in results.scores:
    print(f"Metric: {metric_result.name}, Score: {metric_result.score}")
    print(f"Details: {metric_result.details}")

Agent Evaluation with Custom Metrics

You can specify which metrics to evaluate by passing a list of metric instances:

from flotorch_eval.agent_eval.metrics.llm_evaluators import (
    TrajectoryEvalWithLLM,
    ToolCallAccuracy,
    AgentGoalAccuracy
)
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric

# Define custom metrics
custom_metrics = [
    TrajectoryEvalWithLLM(),
    ToolCallAccuracy(),
    AgentGoalAccuracy(),
    LatencyMetric(),
    UsageMetric()
]

# Evaluate with specific metrics
trace = client.fetch_traces(trace_id="your-trace-id")
results = await client.evaluate(trace=trace, metrics=custom_metrics)

Agent Evaluation with Reference Trajectory

Compare agent performance against a reference trajectory:

# Define a reference trajectory
reference_trajectory = {
    "input": "What is AWS Bedrock?",
    "expected_steps": [
        {
            "thought": "I need to search for information about AWS Bedrock",
            "tool_call": {
                "name": "search_tool",
                "arguments": {"query": "AWS Bedrock"}
            }
        },
        {
            "thought": "Now I can provide a comprehensive answer",
            "final_response": "AWS Bedrock is a fully managed service..."
        }
    ]
}

# Evaluate with reference
trace = client.fetch_traces(trace_id="your-trace-id")
results = await client.evaluate(
    trace=trace,
    reference=reference_trajectory
)

Alternatively, you can fetch a reference trace by providing a reference trace ID:

# Evaluate using reference trace ID
results = await client.evaluate(
    trace=trace,
    reference_trace_id="reference-trace-id"
)

Working with Agent Traces

If you're using Flotorch agents, you can retrieve traces directly from the agent client:

from flotorch.adk.agent import FlotorchADKAgent

# Initialize agent client
agent_client = FlotorchADKAgent(
    agent_name="your-agent-name",
    base_url="flotorch-base-url",
    api_key="your-api-key"
)

# Get trace IDs from agent
trace_ids = agent_client.get_tracer_ids()

# Evaluate each trace
for trace_id in trace_ids:
    trace = client.fetch_traces(trace_id=trace_id)
    results = await client.evaluate(trace=trace)
    print(f"Evaluation results for trace {trace_id}: {results}")

📚 Examples & Cookbooks

We maintain a separate collection of notebooks to help you get started:

| Task | Notebook |
|------|----------|
| RAG Evaluation | Evaluate RAG Pipelines |
| LLM Metrics Deep Dive | Advanced LLM Evaluation |

🔗 See all examples in the Resources Repository


📊 Available Metrics

LLM/RAG Metrics

| Metric | Engine | Description |
|--------|--------|-------------|
| FAITHFULNESS | DeepEval/Ragas | Measures if the answer is factually consistent with the context |
| ANSWER_RELEVANCE | DeepEval/Ragas | Evaluates how relevant the answer is to the question |
| CONTEXT_RELEVANCY | DeepEval | Assesses if the retrieved context is relevant to the question |
| CONTEXT_PRECISION | DeepEval/Ragas | Measures whether retrieved contexts are relevant |
| CONTEXT_RECALL | DeepEval | Measures how well the retrieved context covers the expected answer |
| HALLUCINATION | DeepEval | Detects if the model generates information not in the context |
| ASPECT_CRITIC | Ragas | Custom aspect-based evaluation (requires configuration) |
| LATENCY | Gateway | Measures total and average latency across LLM calls |
| COST | Gateway | Tracks total cost of LLM operations |
| TOKEN_USAGE | Gateway | Monitors total token consumption |

Agent Metrics

| Metric | Description | Requires LLM |
|--------|-------------|--------------|
| TrajectoryEvalWithLLM | Evaluates agent trajectory quality by inferring the agent's goal from its actions and assessing whether it was successfully completed. Returns a binary score (0 or 1) with a detailed explanation. | ✅ Yes |
| TrajectoryEvalWithLLMWithReference | Compares the agent trajectory against a reference trajectory to evaluate performance. Requires a reference trajectory to be provided. | ✅ Yes |
| ToolCallAccuracy | Assesses the accuracy and appropriateness of tool calls made by the agent. Evaluates whether tools were used correctly and when they should have been used. | ✅ Yes |
| AgentGoalAccuracy | Validates whether the agent accomplished the user's intended goal. Evaluates goal perception, plan soundness, execution coherence, and final outcome. | ✅ Yes |
| LatencyMetric | Tracks latency metrics including total latency, average step latency, and a hierarchical latency breakdown across all steps in the trajectory. | ❌ No |
| UsageMetric | Monitors cost and token usage. Provides total cost, average cost per call, and a detailed cost breakdown per model and span. | ❌ No |

Default Metrics: When no metrics are specified, the evaluator uses all available metrics by default: TrajectoryEvalWithLLM, ToolCallAccuracy, AgentGoalAccuracy, UsageMetric, and LatencyMetric. If a reference is provided, TrajectoryEvalWithLLMWithReference is automatically added.
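The default-selection rule described above can be sketched as follows. This is illustrative only; the real AgentEvaluator works with metric instances, not names.

```python
# Illustrative sketch of default metric selection -- NOT the actual
# AgentEvaluator implementation, which works with metric instances.
DEFAULT_METRICS = [
    "TrajectoryEvalWithLLM",
    "ToolCallAccuracy",
    "AgentGoalAccuracy",
    "UsageMetric",
    "LatencyMetric",
]

def select_metrics(metrics=None, reference=None):
    """Use the caller's metrics, or the defaults; add the reference-based
    trajectory metric whenever a reference is supplied."""
    selected = list(metrics) if metrics is not None else list(DEFAULT_METRICS)
    if reference is not None:
        selected.append("TrajectoryEvalWithLLMWithReference")
    return selected

print(select_metrics())                            # the five defaults
print(select_metrics(reference={"input": "..."}))  # defaults + reference metric
```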


⚙️ Configuration

Gateway Metrics (Latency, Cost, Token Usage)

Gateway metrics automatically track performance and usage statistics from your LLM calls. To enable these metrics, pass the response headers from FlotorchLLM as metadata:

from flotorch.sdk.llm import FlotorchLLM
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem

# Initialize FlotorchLLM
llm = FlotorchLLM(
    model_id="flotorch/gpt-4",
    api_key="your-api-key",
    base_url="flotorch-base-url"
)

# Make LLM call with return_headers=True to get metadata
response, headers = llm.invoke(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    return_headers=True  # This returns headers containing latency, cost, and token info
)

# Create evaluation item with headers as metadata
eval_item = EvaluationItem(
    question="What is machine learning?",
    generated_answer=response.content,
    expected_answer="Machine learning is...",
    context=["Context documents..."],
    metadata=headers  # Pass headers directly as metadata
)

# Evaluate - Gateway metrics will be automatically computed
evaluator = LLMEvaluator(
    api_key="your-api-key",
    base_url="flotorch-base-url",
    inferencer_model="flotorch/gpt-4",
    embedding_model="flotorch/embedding-model"
)

results = evaluator.evaluate(data=[eval_item])

The results will include the gateway metrics: total cost, average latency, and total tokens.

💡 Note: Gateway metrics are computed automatically when metadata is present. No additional configuration is required.

Metric Arguments

Customize metric thresholds and behavior:

metric_args = {
    "faithfulness": {
        "threshold": 0.8,
        "truths_extraction_limit": 30
    },
    "answer_relevance": {
        "threshold": 0.7
    },
    "hallucination": {
        "threshold": 0.5
    },
    "context_precision": {
        "threshold": 0.7
    }
}

Ragas Aspect Critic

Configure custom evaluation aspects:

metric_args = {
    "aspect_critic": {
        "harmfulness": {
            "name": "harmfulness",
            "definition": "Does the response contain harmful content?"
        },
        "bias": {
            "name": "bias",
            "definition": "Does the response show bias or discrimination?"
        }
    }
}
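Threshold overrides and aspect-critic definitions can live in the same `metric_args` dict; the combined layout below is an assumption based on the two separate examples above.

```python
# Combined metric_args: threshold overrides plus aspect_critic definitions.
# The combined layout is assumed from the separate examples above.
metric_args = {
    "faithfulness": {"threshold": 0.8},
    "aspect_critic": {
        "harmfulness": {
            "name": "harmfulness",
            "definition": "Does the response contain harmful content?",
        },
    },
}

# Passed at construction time, as in the earlier threshold example:
# evaluator = LLMEvaluator(
#     api_key="your-api-key",
#     base_url="flotorch-base-url",
#     evaluator_llm="flotorch/inference_model",
#     embedding_model="flotorch/embedding_model",
#     metric_args=metric_args,
# )
```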

📖 API Reference

LLMEvaluator

LLMEvaluator(
    api_key: str,
    base_url: str,
    inferencer_model: str,
    embedding_model: str,
    evaluation_engine: str = 'auto',
    metrics: Optional[List[MetricKey]] = None,
    metric_configs: Optional[Dict] = None
)

Parameters:

  • api_key (str): API key for authentication
  • base_url (str): Base URL for the Flotorch service
  • inferencer_model (str): The LLM model to use for evaluation (e.g., "flotorch/gpt-4")
  • embedding_model (str): The embedding model for metrics requiring embeddings
  • evaluation_engine (str): Engine selection mode
    • 'auto' (default): Automatically routes metrics with priority-based selection
    • 'ragas': Use only Ragas metrics
    • 'deepeval': Use only DeepEval metrics
  • metrics (Optional[List[MetricKey]]): Default metrics to evaluate
    • If None with auto: Uses all available metrics from all engines
    • If None with specific engine: Uses all metrics from that engine
  • metric_configs (Optional[Dict]): Configuration for metrics requiring additional parameters (e.g., AspectCritic)

Methods:

  • evaluate(data: List[EvaluationItem], metrics: Optional[List[MetricKey]] = None) -> Dict[str, Any]
    • Evaluate the provided data using specified or default metrics
  • get_all_metrics() -> List[MetricKey]
    • Returns all available metrics based on current engine configuration
  • set_metrics(metrics: List[MetricKey]) -> None
    • Update the default metrics to use for evaluation

EvaluationItem

from dataclasses import dataclass, field

@dataclass
class EvaluationItem:
    question: str                    # The input question
    generated_answer: str            # Model's generated answer
    expected_answer: str             # Ground truth/expected answer
    context: List[str] = field(default_factory=list)        # Retrieved context documents
    metadata: Dict[str, Any] = field(default_factory=dict)  # Additional metadata

AgentEvaluator

AgentEvaluator(
    api_key: str,
    base_url: str,
    default_evaluator: Optional[str] = None
)

Parameters:

  • api_key (str): API key for authentication
  • base_url (str): Base URL for the Flotorch service
  • default_evaluator (Optional[str]): Default LLM model identifier for metrics requiring LLM evaluation (e.g., "flotorch/inference_model")

Methods:

  • fetch_traces(trace_id: str) -> Dict[str, Any]
    • Fetches trace data from Flotorch API for a given trace ID
    • Returns the trace data as a dictionary
  • evaluate(trace: Dict[str, Any], metrics: Optional[List[LLMBaseEval]] = None, reference: Optional[Dict[str, Any]] = None, reference_trace_id: Optional[str] = None) -> EvaluationResult
    • Evaluates a trace using the provided metrics
    • If metrics is None, uses all default metrics
    • If reference or reference_trace_id is provided, automatically includes TrajectoryEvalWithLLMWithReference metric
    • Returns an EvaluationResult object containing scores and details for each metric
  • set_default_evaluator(default_evaluator: str) -> None
    • Updates the default evaluator model for the client

EvaluationResult:

class EvaluationResult:
    trajectory_id: str               # Unique identifier for the trajectory
    scores: List[MetricResult]       # List of metric evaluation results
    timestamp: datetime              # Timestamp of evaluation
    metadata: Dict[str, Any]         # Additional metadata

MetricResult:

class MetricResult:
    name: str                        # Name of the metric
    score: float                     # Score (0.0 to 1.0 for most metrics)
    details: Dict[str, Any]          # Detailed evaluation information

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/FloTorch/flotorch-eval.git
cd flotorch-eval

# Install in development mode
pip install -e ".[dev]"

# Run linters
pylint flotorch_eval/
black flotorch_eval/

📚 Documentation

Full documentation is available at docs.flotorch.ai

Visit our website: flotorch.ai


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🙏 Acknowledgments

  • DeepEval: Industry-standard evaluation metrics for LLMs
  • Ragas: RAG-specific evaluation framework
  • OpenTelemetry: Distributed tracing for agent evaluation

Made with ❤️ by the Flotorch Team
