# Floeval

Multi-backend evaluation framework for LLM, RAG, prompt, and agent systems.

## Overview

Floeval supports five evaluation types:
| Eval type | What you are scoring | Key dataset fields |
|---|---|---|
| LLM | Direct question-answer quality without retrieval | `user_input`, `llm_response` |
| RAG | Answer quality and retrieval performance with context | `user_input`, `llm_response`, `contexts` |
| Prompt | One or more system prompts against the same dataset | Partial dataset + `prompts_file` (with or without RAG) |
| Agent | Single-agent trace quality, tool use, and goal achievement | `AgentDataset` (full or partial) |
| Agentic Workflow | Multi-agent DAG pipelines scored end-to-end | `AgentDataset` + DAG config |
Floeval supports the following workflows:

- evaluating full datasets that already contain `llm_response`
- generating responses from partial datasets and scoring them in the same run
- expanding partial datasets across prompt variants with `prompt_ids` and `prompts_file`
- routing metrics across `ragas`, `deepeval`, `builtin`, and `custom` providers
- evaluating single-agent traces (pre-captured, Python callable, or FloTorch runner)
- evaluating multi-agent DAG workflows with `WorkflowRunner`
- capturing traces from Python callables, LangChain-style agents, or optional FloTorch runners
## Features

- CLI and Python API: run evaluations from config files or integrate directly into code
- Five eval types: LLM, RAG, Prompt (with and without RAG), Agent, and Agentic Workflow
- Multi-provider metrics: mix `ragas`, `deepeval`, `builtin`, and `custom` metrics in one evaluation
- Prompt-aware generation: compare system-prompt variants at scale with `prompt_ids` and `prompts_file`
- Agent evaluation: score pre-captured traces or collect traces at runtime
- Agentic workflow evaluation: evaluate multi-agent DAG pipelines with `WorkflowRunner`
- Custom metrics: define function-based metrics or LLM-as-judge criteria
- Dataset format flexibility: accepts `{"samples": [...]}`, a JSON array, or JSONL; the field aliases `question`/`answer` are supported
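To make the dataset-format flexibility concrete, the sketch below shows the same sample stored in all three accepted shapes. The `normalize` function is illustrative only (it is not floeval's actual loader); it mimics the documented behavior of unwrapping `{"samples": [...]}`, accepting bare arrays or JSONL, and mapping the `question`/`answer` aliases onto `user_input`/`llm_response`:

```python
import json

# The same sample, stored three ways on disk.
wrapped = '{"samples": [{"user_input": "What is RAG?", "llm_response": "..."}]}'
bare_array = '[{"user_input": "What is RAG?", "llm_response": "..."}]'
jsonl = '{"question": "What is RAG?", "answer": "..."}'

# Documented field aliases: question -> user_input, answer -> llm_response.
ALIASES = {"question": "user_input", "answer": "llm_response"}

def normalize(text: str) -> list[dict]:
    """Illustrative sketch: accept {"samples": [...]}, a JSON array, or JSONL."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat each non-empty line as a JSONL record.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(data, dict):
        # Unwrap {"samples": [...]}; a bare object is a single JSONL-style record.
        data = data.get("samples", [data])
    return [{ALIASES.get(k, k): v for k, v in sample.items()} for sample in data]

# All three formats collapse to the same canonical sample list.
assert normalize(wrapped) == normalize(bare_array) == normalize(jsonl)
```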
## Installation

Floeval is currently in pre-release, so installing from PyPI may require the `--pre` flag:

```shell
pip install --pre floeval
```

Optional FloTorch support for agent Mode 4 and agentic workflow evaluation:

```shell
pip install "floeval[flotorch]"
```

Development install:

```shell
pip install -e .
pip install -e ".[dev]"
```
## Quick Start

### Python API — LLM / RAG evaluation

```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "What is RAG?",
            "llm_response": "RAG stands for Retrieval-Augmented Generation.",
            "contexts": ["RAG combines retrieval with generation."],
        }
    ],
    partial_dataset=False,
)

evaluation = Evaluation(
    dataset=dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy", "faithfulness"],
    default_provider="ragas",
)
results = evaluation.run()
print(results.aggregate_scores)
```
### Python API — Prompt evaluation (multi-prompt)

```python
from floeval import Evaluation, DatasetLoader
from floeval.config.schemas.io.llm import OpenAIProviderConfig

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
    embedding_model="text-embedding-3-small",
)

partial_dataset = DatasetLoader.from_samples(
    [
        {
            "user_input": "Summarize this customer ticket.",
            "prompt_ids": ["concise", "detailed"],
        }
    ],
    partial_dataset=True,
)

evaluation = Evaluation(
    dataset=partial_dataset,
    llm_config=llm_config,
    metrics=["answer_relevancy"],
    default_provider="ragas",
    dataset_generator_model="gpt-4o-mini",
    prompts_file="prompts.yaml",
)
results = evaluation.run()
for row in results.sample_results:
    print(row["prompt_id"], row["metrics"]["answer_relevancy"]["score"])
```
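The `prompts_file` above maps each id in `prompt_ids` to a system prompt. This README does not show the file's schema, so the layout below is an assumption; only the ids `concise` and `detailed` come from the example (see the Prompt Evaluation docs for the authoritative format):

```yaml
# Hypothetical prompts.yaml — exact keys are an assumption.
prompts:
  - id: concise
    system_prompt: "You are a support assistant. Answer in one sentence."
  - id: detailed
    system_prompt: "You are a support assistant. Explain step by step."
```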
### Python API — Agent evaluation

```python
from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.utils.agent_trace import capture_trace, log_turn

llm_config = OpenAIProviderConfig(
    base_url="https://api.openai.com/v1",
    api_key="your-api-key",
    chat_model="gpt-4o-mini",
)

@capture_trace
def my_agent(user_input: str) -> str:
    response = f"Handled: {user_input}"
    log_turn(response)
    return response

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="Reset my password",
            reference_outcome="Password reset instructions were provided.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent=my_agent,
    llm_config=llm_config,
    metrics=["goal_achievement"],
)
results = evaluation.run()
print(results.summary)
```
### Python API — Agentic Workflow evaluation

```python
import json

from floeval.api.agent_evaluation import AgentEvaluation
from floeval.config.schemas.io.agent_dataset import AgentDataset, PartialAgentSample
from floeval.config.schemas.io.llm import OpenAIProviderConfig
from floeval.flotorch import WorkflowRunner  # requires floeval[flotorch]

llm_config = OpenAIProviderConfig(
    base_url="https://gateway.example/openai/v1",
    api_key="your-gateway-key",
    chat_model="gpt-4o-mini",
)

with open("workflow_config.json") as f:
    dag_config = json.load(f)

runner = WorkflowRunner(dag_config=dag_config, llm_config=llm_config)

dataset = AgentDataset(
    samples=[
        PartialAgentSample(
            user_input="What is the status of order #12345?",
            reference_outcome="The order is shipped and arriving tomorrow.",
        )
    ]
)

evaluation = AgentEvaluation(
    dataset=dataset,
    agent_runner=runner,
    llm_config=llm_config,
    metrics=["goal_achievement", "ragas:agent_goal_accuracy"],
)
results = evaluation.run()
print(results.summary)
```
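`WorkflowRunner` consumes a DAG description from `workflow_config.json`. The real schema belongs to the FloTorch integration and is covered in the Agentic Workflow docs; purely to illustrate the idea of a multi-agent DAG, a node-and-edge layout could look like:

```json
{
  "nodes": [
    {"id": "router", "agent": "intent-router"},
    {"id": "order_lookup", "agent": "order-status"}
  ],
  "edges": [
    {"from": "router", "to": "order_lookup"}
  ]
}
```

Every key here is a placeholder; do not treat this as the FloTorch schema.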
## CLI

```shell
# Evaluate a full LLM/RAG dataset
floeval evaluate -c config.yaml -d dataset.json -o results.json

# Evaluate a partial dataset (generate + score in one run)
floeval evaluate -c config.yaml -d partial_dataset.json -o results.json

# Generate first, then evaluate later
floeval generate -c config.yaml -d partial_dataset.json -o complete.json
floeval evaluate -c config.yaml -d complete.json -o results.json

# Prompt evaluation with a prompts file
floeval evaluate -c prompt_config.yaml -d partial_dataset.json -o prompt_results.json

# Single-agent evaluation
floeval evaluate -c agent_config.yaml -d agent_dataset.json --agent -o agent_results.json

# Agentic workflow evaluation
floeval evaluate -c workflow_config.yaml -d agent_dataset.json --agent -o workflow_results.json
```
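The `-c` config files bundle the same settings the Python API takes as keyword arguments. The YAML keys below are an assumption modeled on that API (`llm_config`, `metrics`, `default_provider`); consult the API Reference docs for the authoritative schema:

```yaml
# Hypothetical config.yaml — key names are an assumption.
llm:
  base_url: https://api.openai.com/v1
  api_key: ${OPENAI_API_KEY}
  chat_model: gpt-4o-mini
  embedding_model: text-embedding-3-small
metrics:
  - answer_relevancy
  - faithfulness
default_provider: ragas
```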
## Project Structure

- `api/` - public evaluation APIs and dataset loaders
- `core/execution/` - response generation and execution internals
- `metric_providers/` - provider-specific metric implementations
- `config/schemas/` - config, dataset, and prompt schemas
- `cli/` - command-line entry points
- `utils/` - trace capture, loaders, and helper utilities
- `flotorch/` - optional FloTorch integration (`WorkflowRunner`, `FloTorchRunner`)
## Documentation

Detailed docs live in `docs/`:
- Setup & Prerequisites
- Examples
- Prompt Evaluation
- Agent Evaluation
- Agentic Workflow
- Agent Tracing
- Metrics
- Custom Metrics
- API Reference
- Troubleshooting
## License

MIT
## File details

Details for the file `floeval-1.1.6b1.tar.gz`.

### File metadata

- Download URL: floeval-1.1.6b1.tar.gz
- Upload date:
- Size: 88.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `2943c84c78f63cf30ccf7ed865322b163bb622938fbe966ba54871c4b073413f` |
| MD5 | `ea268fbd455bd975e9e4e79b286020fd` |
| BLAKE2b-256 | `90658ec18791b6eb68bed0e89bb60a9d9d83b345ae66ad06662266a6f8aec7e2` |
## File details

Details for the file `floeval-1.1.6b1-py3-none-any.whl`.

### File metadata

- Download URL: floeval-1.1.6b1-py3-none-any.whl
- Upload date:
- Size: 114.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `95d49c40fc6c435c03f76e9ae073cffbbd2d00304583f09f68148ac278589aaa` |
| MD5 | `27399e6270e04cb9d6af13d614b78e31` |
| BLAKE2b-256 | `1a8101481797c7fe1b259f3f82feed6b7fe68378e7d5c6a8de42a235e5d133a9` |