A multi-backend evaluation framework for LLM, RAG, and agentic systems.

Floeval

Multi-backend evaluation framework for LLM, RAG, prompt, and agent systems.

Overview

Floeval supports five evaluation types:

| Eval type | What you are scoring | Key dataset fields |
| --- | --- | --- |
| LLM | Direct question-answer quality without retrieval | user_input, llm_response |
| RAG | Answer quality and retrieval performance with context | user_input, llm_response, contexts |
| Prompt | One or more system prompts against the same dataset | Partial dataset + prompts_file (with or without RAG) |
| Agent | Single-agent trace quality, tool use, and goal achievement | AgentDataset (full or partial) |
| Agentic Workflow | Multi-agent DAG pipelines scored end-to-end | AgentDataset + DAG config |
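
The dataset fields in the table can be illustrated with plain Python dictionaries. The sketch below is not part of Floeval: is_full_sample is a hypothetical helper that only mirrors the distinction this README draws, where a full sample already carries llm_response and a partial one does not.

```python
# Illustrative sample shapes; field names come from the table above.
llm_sample = {
    "user_input": "What is RAG?",
    "llm_response": "RAG stands for Retrieval-Augmented Generation.",
}

# A RAG sample adds the retrieved contexts to the same fields.
rag_sample = {
    **llm_sample,
    "contexts": ["RAG combines retrieval with generation."],
}

# A partial sample has no llm_response yet; it would be generated at run time.
partial_sample = {"user_input": "What is RAG?"}


def is_full_sample(sample: dict) -> bool:
    """Hypothetical helper: True when the sample already has a response."""
    return bool(sample.get("llm_response"))


print(is_full_sample(rag_sample))      # True
print(is_full_sample(partial_sample))  # False
```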

Floeval supports the following workflows:

  • evaluating full datasets that already contain llm_response
  • generating responses from partial datasets and scoring them in the same run
  • expanding partial datasets across prompt variants with prompt_ids and prompts_file
  • routing metrics across ragas, deepeval, builtin, and custom
  • evaluating single-agent traces (pre-captured, Python callable, or FloTorch runner)
  • evaluating multi-agent DAG workflows with WorkflowRunner
  • capturing traces from Python callables, LangChain-style agents, or optional FloTorch runners
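
Metric routing can be pictured as splitting each metric name into a provider and a metric. The examples in this README use bare names such as "faithfulness" together with default_provider, and prefixed names such as "ragas:agent_goal_accuracy". The following is a minimal sketch of that naming convention, not Floeval's implementation: split_metric is a hypothetical helper, and the fallback to a default provider for bare names is an assumption.

```python
from typing import Tuple

# Provider names listed in this README.
KNOWN_PROVIDERS = {"ragas", "deepeval", "builtin", "custom"}


def split_metric(spec: str, default_provider: str = "ragas") -> Tuple[str, str]:
    """Split 'provider:metric' into its parts; bare names use the default."""
    provider, sep, name = spec.partition(":")
    if sep and provider in KNOWN_PROVIDERS:
        return provider, name
    return default_provider, spec


print(split_metric("ragas:agent_goal_accuracy"))  # ('ragas', 'agent_goal_accuracy')
print(split_metric("faithfulness"))               # ('ragas', 'faithfulness')
```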

Features

  • CLI and Python API: run evaluations from config files or integrate directly into code
  • Five eval types: LLM, RAG, Prompt (with and without RAG), Agent, and Agentic Workflow
  • Multi-provider metrics: mix ragas, deepeval, builtin, and custom metrics in one evaluation
  • Prompt-aware generation: compare system-prompt variants at scale with prompt_ids and prompts_file
  • Agent evaluation: score pre-captured traces or collect traces at runtime
  • Agentic workflow evaluation: evaluate multi-agent DAG pipelines with WorkflowRunner
  • Custom metrics: define function-based metrics or LLM-as-judge criteria
  • Dataset format flexibility: accepts {"samples": [...]}, a JSON array, or JSONL; the field aliases question and answer are also supported
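
The three accepted dataset layouts can be normalized to one list of samples. This is a rough sketch of the idea rather than Floeval's loader: parse_samples is a hypothetical helper, and the alias targets (question to user_input, answer to llm_response) are an assumption, since this README names the aliases but not what they map to.

```python
import json
from typing import Any, Dict, List

# Assumed alias mapping; the README does not state the alias targets.
ALIASES = {"question": "user_input", "answer": "llm_response"}


def parse_samples(text: str) -> List[Dict[str, Any]]:
    """Parse {"samples": [...]}, a JSON array, or JSONL into a list of dicts."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document, so treat each non-empty line as a sample.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(data, dict):
        # Either the {"samples": [...]} wrapper or a one-line JSONL file.
        data = data["samples"] if "samples" in data else [data]
    # Rename aliased keys and leave all other keys untouched.
    return [{ALIASES.get(k, k): v for k, v in row.items()} for row in data]


wrapped = '{"samples": [{"question": "What is RAG?", "answer": "A technique."}]}'
jsonl = '{"user_input": "a"}\n{"user_input": "b"}\n'
print(parse_samples(wrapped))
# [{'user_input': 'What is RAG?', 'llm_response': 'A technique.'}]
print(len(parse_samples(jsonl)))  # 2
```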

Installation

Version 0.3.0b1 is a pre-release. Installing from PyPI may require --pre:

pip install --pre floeval

Optional FloTorch support, used for agent evaluation with a FloTorch runner (Mode 4) and for agentic workflow evaluation:

pip install "floeval[flotorch]"

Development install:

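# Editable install from a local checkout
pip install -e .
# Editable install with the dev extras (quoted so shells like zsh do not expand the brackets)
pip install -e ".[dev]"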
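
After installing, it can help to confirm which version pip actually resolved, since pre-releases are skipped unless requested. The snippet below is a small sketch using only the standard library; installed_version is a hypothetical helper name, not something Floeval provides.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "floeval") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found is not None else "floeval is not installed")
```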

Quick Start

Python API — LLM / RAG evaluation

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "What is RAG?",
            "llm_response": "RAG stands for Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        }
    ],
    partial_dataset=False,
)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)

results = evaluation.run()
print(results.aggregate_scores)
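
To make the relationship between per-sample and aggregate scores concrete, here is a sketch that averages scores per metric. It assumes rows shaped like row["metrics"][name]["score"], the shape used in the prompt evaluation example in this README. Whether Floeval computes aggregate_scores as a plain mean is an assumption, and mean_scores is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_scores(rows: List[dict]) -> Dict[str, float]:
    """Average each metric's score across rows, skipping missing scores."""
    buckets = defaultdict(list)
    for row in rows:
        for name, result in row.get("metrics", {}).items():
            score = result.get("score")
            if score is not None:
                buckets[name].append(score)
    return {name: mean(scores) for name, scores in buckets.items()}


rows = [
    {"metrics": {"faithfulness": {"score": 1.0}, "answer_relevancy": {"score": 0.5}}},
    {"metrics": {"faithfulness": {"score": 0.5}, "answer_relevancy": {"score": None}}},
]
print(mean_scores(rows))  # {'faithfulness': 0.75, 'answer_relevancy': 0.5}
```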

Python API — Prompt evaluation (multi-prompt)

from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "Summarize this customer ticket.",
            "prompt_ids": ["concise", "detailed"]
        }
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)

results = evaluation.run()
for row in results.sample_results:
    print(row["prompt_id"], row["metrics"]["answer_relevancy"]["score"])
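
Conceptually, prompt evaluation expands each partial sample into one row per entry in prompt_ids. The sketch below illustrates that expansion in plain Python. It is not Floeval's implementation: expand_by_prompt and the PROMPTS mapping are hypothetical, and the real prompts_file schema is not shown in this README.

```python
from typing import Dict, List

# Hypothetical prompt texts keyed by prompt id.
PROMPTS = {
    "concise": "Answer in one sentence.",
    "detailed": "Answer thoroughly, with reasoning.",
}


def expand_by_prompt(samples: List[dict], prompts: Dict[str, str]) -> List[dict]:
    """Produce one row per (sample, prompt_id) pair."""
    rows = []
    for sample in samples:
        for prompt_id in sample.get("prompt_ids", []):
            rows.append(
                {
                    "user_input": sample["user_input"],
                    "prompt_id": prompt_id,
                    "system_prompt": prompts[prompt_id],
                }
            )
    return rows


samples = [
    {
        "user_input": "Summarize this customer ticket.",
        "prompt_ids": ["concise", "detailed"],
    }
]
for row in expand_by_prompt(samples, PROMPTS):
    print(row["prompt_id"], "->", row["system_prompt"])
```

One sample with two prompt ids yields two rows, so each prompt variant is scored against the same input.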

Python API — Agent evaluation

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import capture_trace, log_turn

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
)

@capture_trace
def my_agent(user_input: str) -> str:
    response = f"Handled: {user_input}"
    log_turn(response)
    return response

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="Reset my password",
            reference_outcome="Password reset instructions were provided.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent=my_agent,
    llm_config=llm_config,
    metrics=["goal_achievement"],
)

results = evaluation.run()
print(results.summary)

Python API — Agentic Workflow evaluation

import json
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner  # requires floeval[flotorch]

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.example/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

# Load the DAG definition; the context manager closes the file handle.
with open("workflow_config.json") as f:
    dag_config = json.load(f)
runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="What is the status of order #12345?",
            reference_outcome="The order is shipped and arriving tomorrow.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "ragas:agent_goal_accuracy"],
)

results = evaluation.run()
print(results.summary)
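
A multi-agent DAG implies that each node runs only after the nodes it depends on. The schema of workflow_config.json is not shown in this README, so the sketch below uses a hypothetical three-node workflow and the standard library's graphlib to show one valid execution order. It is a conceptual illustration, not a description of WorkflowRunner.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key maps to the set of nodes it depends on.
dag = {
    "classify": set(),
    "lookup_order": {"classify"},
    "draft_reply": {"lookup_order"},
}

# static_order() yields nodes so that dependencies always come first.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['classify', 'lookup_order', 'draft_reply']
```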

CLI

# Evaluate a full LLM/RAG dataset
floeval evaluate -c config.yaml -d dataset.json -o results.json

# Evaluate a partial dataset (generate + score in one run)
floeval evaluate -c config.yaml -d partial_dataset.json -o results.json

# Generate first, then evaluate later
floeval generate -c config.yaml -d partial_dataset.json -o complete.json
floeval evaluate -c config.yaml -d complete.json -o results.json

# Prompt evaluation with a prompts file
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json

# Single-agent evaluation
floeval evaluate -c agent_config.yaml -d agent_dataset.json --agent -o agent_results.json

# Agentic workflow evaluation
floeval evaluate -c workflow_config.yaml -d agent_dataset.json --agent -o workflow_results.json
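
The same commands can be driven from a Python script, for example in a CI job. This is a small sketch that builds the argument list using only the flags shown above (-c, -d, -o, --agent); evaluate_cmd is a hypothetical helper, and the file names are the placeholders from the examples.

```python
import subprocess
from typing import List


def evaluate_cmd(config: str, dataset: str, output: str, agent: bool = False) -> List[str]:
    """Build a `floeval evaluate` command using only the flags shown above."""
    cmd = ["floeval", "evaluate", "-c", config, "-d", dataset]
    if agent:
        cmd.append("--agent")
    cmd += ["-o", output]
    return cmd


cmd = evaluate_cmd("agent_config.yaml", "agent_dataset.json", "agent_results.json", agent=True)
print(" ".join(cmd))
# floeval evaluate -c agent_config.yaml -d agent_dataset.json --agent -o agent_results.json

# Uncomment to run; requires floeval on PATH and the referenced files to exist.
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.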

Project Structure

  • api/ - public evaluation APIs and dataset loaders
  • core/execution/ - response generation and execution internals
  • metric_providers/ - provider-specific metric implementations
  • config/schemas/ - config, dataset, and prompt schemas
  • cli/ - command-line entry points
  • utils/ - trace capture, loaders, and helper utilities
  • flotorch/ - optional FloTorch integration (WorkflowRunner, FloTorchRunner)

Documentation

Detailed docs live in docs/.

License

MIT
