Floeval

Multi-backend evaluation framework for LLM, RAG, prompt, and agent systems.

Overview

Floeval supports five evaluation types:

Eval type        | What you are scoring                                       | Key dataset fields
-----------------|------------------------------------------------------------|---------------------------------------------------
LLM              | Direct question-answer quality without retrieval           | user_input, llm_response
RAG              | Answer quality and retrieval performance with context      | user_input, llm_response, contexts
Prompt           | One or more system prompts against the same dataset        | Partial dataset + prompts_file (with or without RAG)
Agent            | Single-agent trace quality, tool use, and goal achievement | AgentDataset (full or partial)
Agentic Workflow | Multi-agent DAG pipelines scored end-to-end                | AgentDataset + DAG config

Floeval supports the following workflows:

  • evaluating full datasets that already contain llm_response
  • generating responses from partial datasets and scoring them in the same run
  • expanding partial datasets across prompt variants with prompt_ids and prompts_file
  • routing metrics across ragas, deepeval, builtin, and custom
  • evaluating single-agent traces (pre-captured, Python callable, or FloTorch runner)
  • evaluating multi-agent DAG workflows with WorkflowRunner
  • capturing traces from Python callables, LangChain-style agents, or optional FloTorch runners
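
The prompt-variant expansion in the list above can be pictured as pairing each sample with every entry in its prompt_ids. The following is a minimal sketch of that idea, not Floeval's implementation: the function name expand_by_prompt, the system_prompt key, and the plain dict standing in for the contents of prompts_file are all assumptions made for illustration.

```python
from typing import Dict, List


def expand_by_prompt(samples: List[dict], prompts: Dict[str, str]) -> List[dict]:
    """Return one row per (sample, prompt_id) pair.

    `prompts` maps a prompt id to system prompt text. It is a hypothetical
    stand-in for whatever prompts_file holds; its real schema is not shown here.
    """
    rows = []
    for sample in samples:
        for prompt_id in sample.get("prompt_ids", []):
            if prompt_id not in prompts:
                raise KeyError(f"unknown prompt id: {prompt_id}")
            # Copy every field except the list of ids, then attach one id.
            row = {k: v for k, v in sample.items() if k != "prompt_ids"}
            row["prompt_id"] = prompt_id
            row["system_prompt"] = prompts[prompt_id]
            rows.append(row)
    return rows


samples = [
    {"user_input": "Summarize this customer ticket.", "prompt_ids": ["concise", "detailed"]}
]
prompts = {"concise": "Answer in one sentence.", "detailed": "Answer thoroughly."}
rows = expand_by_prompt(samples, prompts)
```

Here one partial sample with two prompt ids becomes two rows, each of which would need its own generated response before scoring.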

Features

  • CLI and Python API: run evaluations from config files or integrate directly into code
  • Five eval types: LLM, RAG, Prompt (with and without RAG), Agent, and Agentic Workflow
  • Multi-provider metrics: mix ragas, deepeval, builtin, and custom metrics in one evaluation
  • Prompt-aware generation: compare system-prompt variants at scale with prompt_ids and prompts_file
  • Agent evaluation: score pre-captured traces or collect traces at runtime
  • Agentic workflow evaluation: evaluate multi-agent DAG pipelines with WorkflowRunner
  • Custom metrics: define function-based metrics or LLM-as-judge criteria
  • Dataset format flexibility: accepts {"samples": [...]}, a JSON array, or JSONL; the field aliases question and answer are also supported
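
To make the three dataset shapes in the last bullet concrete, here is a stdlib-only sketch that reads any of them into a list of dicts. It is not Floeval's DatasetLoader. The helper name parse_samples is hypothetical, and the alias targets (question to user_input, answer to llm_response) are an assumption, since the text above names the aliases but not what they map to.

```python
import json
from typing import List

# Assumed mapping: only the alias names come from the README; the target
# field names are a guess made for this illustration.
ALIASES = {"question": "user_input", "answer": "llm_response"}


def parse_samples(text: str) -> List[dict]:
    """Parse {"samples": [...]}, a JSON array, or JSONL into a list of dicts."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as JSONL, one object per line.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(data, dict):
        # Either the {"samples": [...]} wrapper or a one-line JSONL file.
        data = data["samples"] if "samples" in data else [data]
    # Rename aliased keys and leave every other key untouched.
    return [{ALIASES.get(k, k): v for k, v in row.items()} for row in data]
```

Trying json.loads first and falling back to line-by-line parsing keeps the sketch short; a real loader would more likely dispatch on the file extension.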

Installation

Version 1.1.1b1 is a pre-release. Installing from PyPI may require --pre:

pip install --pre floeval

Optional FloTorch support for agent Mode 4 and agentic workflow evaluation:

pip install "floeval[flotorch]"

Development install:

pip install -e .
pip install -e ".[dev]"

Quick Start

Python API — LLM / RAG evaluation

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "What is RAG?",
            "llm_response": "RAG stands for Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        }
    ],
    partial_dataset=False,  # full dataset: llm_response is already present
)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)

Python API — Prompt evaluation (multi-prompt)

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "Summarize this customer ticket.",
            "prompt_ids": ["concise", "detailed"]
        }
    ],
    partial_dataset=True,  # partial dataset: responses are generated during the run
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()
for row in results.sample_results:
    print(row["prompt_id"], row["metrics"]["answer_relevancy"]["score"])
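
When comparing prompt variants, a per-prompt average is usually more useful than a row-by-row printout. The sketch below groups rows by prompt_id and averages one metric. It assumes only the row shape used in the loop above (row["prompt_id"] and row["metrics"][name]["score"]); the helper name mean_score_by_prompt is hypothetical and the scores are made-up illustration values, not real results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_score_by_prompt(rows: List[dict], metric: str) -> Dict[str, float]:
    """Average one metric's score for each prompt_id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prompt_id"]].append(row["metrics"][metric]["score"])
    return {prompt_id: mean(scores) for prompt_id, scores in grouped.items()}


# Made-up rows in the same shape as results.sample_results above.
example_rows = [
    {"prompt_id": "concise", "metrics": {"answer_relevancy": {"score": 0.5}}},
    {"prompt_id": "concise", "metrics": {"answer_relevancy": {"score": 1.0}}},
    {"prompt_id": "detailed", "metrics": {"answer_relevancy": {"score": 0.75}}},
]
per_prompt = mean_score_by_prompt(example_rows, "answer_relevancy")
```

With real output, passing results.sample_results in place of example_rows should work as long as the rows behave like the dicts indexed in the loop above.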

Python API — Agent evaluation

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import capture_trace, log_turn

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
)

@capture_trace
def my_agent(user_input: str) -> str:
    response = f"Handled: {user_input}"
    log_turn(response)
    return response

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="Reset my password",
            reference_outcome="Password reset instructions were provided.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent=my_agent,
    llm_config=llm_config,
    metrics=["goal_achievement"],
)

results = evaluation.run()
print(results.summary)
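
For readers curious how decorator-based trace capture can work in general, here is a stdlib-only sketch of the pattern. It is not Floeval's capture_trace or log_turn, and it says nothing about their internals; record_trace and note_turn are hypothetical stand-ins used only to show the mechanism.

```python
import functools
from contextvars import ContextVar

# Turns for the call currently being traced; None when no call is active.
_current_turns: ContextVar = ContextVar("current_turns", default=None)


def note_turn(message: str) -> None:
    """Append a turn to the active trace; do nothing outside a traced call."""
    turns = _current_turns.get()
    if turns is not None:
        turns.append(message)


def record_trace(func):
    """Wrap func so each call returns (result, turns logged during that call)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        turns = []
        token = _current_turns.set(turns)
        try:
            result = func(*args, **kwargs)
        finally:
            # Restore the previous state even if the agent raised.
            _current_turns.reset(token)
        return result, turns
    return wrapper


@record_trace
def toy_agent(user_input: str) -> str:
    response = f"Handled: {user_input}"
    note_turn(response)
    return response
```

Using a ContextVar rather than a module-level list keeps traces from concurrent calls separate, which matters if an evaluator runs samples in parallel.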

Python API — Agentic Workflow evaluation

import json
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner  # requires floeval[flotorch]

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.example/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

# Load the DAG definition; the context manager closes the file promptly
with open("workflow_config.json") as f:
    dag_config = json.load(f)
runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="What is the status of order #12345?",
            reference_outcome="The order is shipped and arriving tomorrow.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "ragas:agent_goal_accuracy"],
)

results = evaluation.run()
print(results.summary)

CLI

# Evaluate a full LLM/RAG dataset
floeval evaluate -c config.yaml -d dataset.json -o results.json

# Evaluate a partial dataset (generate + score in one run)
floeval evaluate -c config.yaml -d partial_dataset.json -o results.json

# Generate first, then evaluate later
floeval generate -c config.yaml -d partial_dataset.json -o complete.json
floeval evaluate -c config.yaml -d complete.json -o results.json

# Prompt evaluation with a prompts file
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json

# Single-agent evaluation
floeval evaluate -c agent_config.yaml -d agent_dataset.json --agent -o agent_results.json

# Agentic workflow evaluation
floeval evaluate -c workflow_config.yaml -d agent_dataset.json --agent -o workflow_results.json
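
The generate-then-evaluate flow above can also be scripted from Python. This sketch only assembles the same two commands with the flags shown (-c, -d, -o) and hands them to subprocess.run; the function names are hypothetical, and it assumes the floeval executable is on PATH when run_all is actually called.

```python
import subprocess
from typing import Callable, List


def generate_then_evaluate_cmds(
    config: str, partial: str, complete: str, results: str
) -> List[List[str]]:
    """Build the two CLI invocations: generate responses, then score them."""
    return [
        ["floeval", "generate", "-c", config, "-d", partial, "-o", complete],
        ["floeval", "evaluate", "-c", config, "-d", complete, "-o", results],
    ]


def run_all(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)


cmds = generate_then_evaluate_cmds(
    "config.yaml", "partial_dataset.json", "complete.json", "results.json"
)
# run_all(cmds)  # uncomment once floeval is installed and the files exist
```

Passing the runner as a parameter makes the sequence easy to dry-run or test without invoking the real CLI.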

Project Structure

  • api/ - public evaluation APIs and dataset loaders
  • core/execution/ - response generation and execution internals
  • metric_providers/ - provider-specific metric implementations
  • config/schemas/ - config, dataset, and prompt schemas
  • cli/ - command-line entry points
  • utils/ - trace capture, loaders, and helper utilities
  • flotorch/ - optional FloTorch integration (WorkflowRunner, FloTorchRunner)

Documentation

Detailed docs live in docs/.

License

MIT
