
flex-evals

A Python implementation of the Flexible Evaluation Protocol (FEP) - a vendor-neutral, schema-driven standard for evaluating any system that produces complex or variable outputs, from deterministic APIs to non-deterministic LLMs and agentic workflows.


Quick Start

from flex_evals import evaluate, TestCase, Output, Check, CheckType

# Define your test cases
test_cases = [
    TestCase(
        id="test_001",
        input="What is the capital of France?",
        expected="Paris"
    )
]

# System outputs to evaluate
outputs = [
    Output(value="The capital of France is Paris.")
]

# Define evaluation criteria using enums (strings also supported)
checks = [
    Check(
        type=CheckType.EXACT_MATCH,  # Can also use 'exact_match' string
        arguments={
            "actual": "$.output.value",  # JSONPath expression
            "expected": "$.test_case.expected",
            "case_sensitive": False
        }
    )
]

# Run evaluation
results = evaluate(test_cases, outputs, checks)
print(f"Evaluation completed: {results.status}")
print(f"Passed: {results.results[0].check_results[0].results['passed']}")

Features

Protocol Compliance

  • Full FEP Implementation - Complete implementation of the Flexible Evaluation Protocol specification
  • Structured Results - Comprehensive result format with metadata, timestamps, and error details
  • Reproducible Evaluations - Consistent, auditable evaluation runs

Flexible Data Access

  • JSONPath Expressions - Dynamic data extraction with $.test_case.input, $.output.value, etc.
  • Multiple Input Types - Support for strings, objects, and complex nested data structures
  • Custom Metadata - Attach arbitrary metadata to test cases, outputs, and evaluations

Built-in Checks

  • Standard Checks - exact_match, contains, regex, threshold
  • LLM Checks - semantic_similarity, llm_judge (with user-provided async functions)
  • Extensible - Easy to add custom check implementations

Performance & Scalability

  • Async Support - Automatic detection and optimal execution of sync/async checks
  • Parallel Execution - Batch processing for large evaluation runs
  • Memory Efficient - Streaming support for large datasets

Developer Experience

  • Pythonic API - Clean, type-safe interfaces with excellent IDE support
  • Test-Friendly - Easy unit testing of individual checks
  • Comprehensive Documentation - Detailed examples and API reference

Installation

# With uv
uv add flex-evals

# Or with pip
pip install flex-evals
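
As a quick sanity check after installing, the imports used in the Quick Start above should resolve:

# Smoke test: names taken from the Quick Start example above
from flex_evals import evaluate, TestCase, Output, Check, CheckType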

Requirements

  • Python 3.11+
  • Dependencies: jsonpath-ng, pydantic, pyyaml, requests

Core Concepts

Test Cases

Define the inputs and expected outputs for evaluation:

test_case = TestCase(
    id="unique_identifier",
    input="System input data",
    expected="Expected output",  # Optional
    metadata={"category": "reasoning"}  # Optional
)

Outputs

Represent the actual system responses being evaluated:

output = Output(
    value="System generated response",
    metadata={"model": "gpt-4", "tokens": 150}  # Optional
)

Checks

Define evaluation criteria with flexible argument resolution:

from flex_evals import Check, CheckType

check = Check(
    type=CheckType.EXACT_MATCH,  # Can also use 'exact_match' string
    arguments={
        "actual": "$.output.value",  # JSONPath to extract data
        "expected": "Paris",         # Literal value
        "case_sensitive": False
    },
)

Usage Examples

Simple Text Comparison

from flex_evals import evaluate, TestCase, Output, Check, CheckType

# Geography quiz evaluation
test_cases = [TestCase(id="q1", input="Capital of France?", expected="Paris")]
outputs = [Output(value="Paris")]
checks = [Check(type=CheckType.EXACT_MATCH, arguments={"actual": "$.output.value", "expected": "$.test_case.expected"})]

results = evaluate(test_cases, outputs, checks)

Multi-Criteria Evaluation

# Evaluate both correctness and format using enums (strings also supported)
checks = [
    # Check if answer is correct
    Check(
        type=CheckType.CONTAINS,  # Can also use 'contains' string
        arguments={
            "text": "$.output.value",
            "phrases": ["Paris"],
            "case_sensitive": False
        }
    ),
    # Check if response is properly formatted
    Check(
        type=CheckType.REGEX,  # Can also use 'regex' string
        arguments={
            "text": "$.output.value",
            "pattern": r"The capital of .+ is .+\.",
            "flags": {"case_insensitive": True}
        }
    )
]

results = evaluate(test_cases, outputs, checks)

Per-Test-Case Checks

# Different evaluation criteria for each test case using enums (strings also supported)
test_cases = [
    TestCase(
        id="math_problem",
        input="What is 2+2?",
        checks=[
            Check(type=CheckType.EXACT_MATCH, arguments={"actual": "$.output.value", "expected": "4"})
        ]
    ),
    TestCase(
        id="creative_writing",
        input="Write a haiku about code",
        checks=[
            Check(type=CheckType.REGEX, arguments={"text": "$.output.value", "pattern": r"(.+\n){2}.+"})
        ]
    )
]

outputs = [Output(value="4"), Output(value="Code flows like stream\nBugs dance in morning sunlight\nCommit, push, deploy")]

# No global checks needed - using per-test-case checks
results = evaluate(test_cases, outputs, checks=None)

Complex Data Structures

# Evaluate structured outputs
test_case = TestCase(
    id="api_test",
    input={"endpoint": "/users", "method": "GET"},
    expected={"status": 200, "count": 5}
)

output = Output(
    value={"status": 200, "data": {"users": [...]}, "count": 5},
    metadata={"response_time": 245}
)

checks = [
    Check(
        type=CheckType.EXACT_MATCH,  # Can also use 'exact_match' string
        arguments={
            "actual": "$.output.value.status",
            "expected": "$.test_case.expected.status"
        }
    ),
    Check(
        type=CheckType.THRESHOLD,  # Can also use 'threshold' string
        arguments={
            "value": "$.output.metadata.response_time",
            "max_value": 500
        }
    )
]
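
As in the earlier examples, the evaluation is then run by passing the test case, output, and checks to evaluate (wrapping the single test case and output in lists):

results = evaluate([test_case], [output], checks)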

LLM Evaluation with Semantic Similarity

import openai

client = openai.AsyncOpenAI()

async def get_embedding(text):
    """User-provided embedding function"""
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Semantic similarity check using enums (strings also supported)
checks = [
    Check(
        type=CheckType.SEMANTIC_SIMILARITY,  # Can also use 'semantic_similarity' string
        arguments={
            "text": "$.output.value",
            "reference": "$.test_case.expected",
            "embedding_function": get_embedding,
            'threshold': {"min_value": 0.8}
        }
    )
]

results = evaluate(test_cases, outputs, checks)  # Automatically runs async

Available Checks

Standard Checks

exact_match

Compare two values for exact equality:

Check(type='exact_match', arguments={
    "actual": "$.output.value",
    "expected": "Paris",
    "case_sensitive": True,  # Default
    "negate": False          # Default
})

contains

Check if text contains all specified phrases:

Check(type='contains', arguments={
    "text": "$.output.value",
    "phrases": ["Paris", "France"],
    "case_sensitive": True,  # Default
    "negate": False          # Pass if ALL phrases found
})

regex

Test text against regular expression patterns:

Check(type='regex', arguments={
    "text": "$.output.value",
    "pattern": r"^[A-Z][a-z]+$",
    "flags": {
        "case_insensitive": False,
        "multiline": False,
        "dot_all": False
    },
    "negate": False
})

threshold

Validate numeric values against bounds:

Check(type='threshold', arguments={
    "value": "$.output.confidence",
    "min_value": 0.8,
    "max_value": 1.0,
    "min_inclusive": True,   # Default
    "max_inclusive": True,   # Default
    "negate": False
})

Extended Checks (Async)

semantic_similarity

Measure semantic similarity using embeddings:

Check(type='semantic_similarity', arguments={
    "text": "$.output.value",
    "reference": "$.test_case.expected",
    "embedding_function": your_async_embedding_function,
    'threshold': {"min_value": 0.8},
    "similarity_metric": 'cosine'  # Default
})

llm_judge

Use LLM for qualitative evaluation:

Check(type='llm_judge', arguments={
    "prompt": "Rate this response for helpfulness: {{$.output.value}}",
    "response_format": {
        "type": "object",
        "properties": {
            "score": {"type": "number"},
            "reasoning": {"type": "string"}
        }
    },
    "llm_function": your_async_llm_function
})
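
Like the embedding function above, llm_function is supplied by the user. The sketch below is hypothetical: it uses the OpenAI async client and assumes llm_judge passes the rendered prompt string and expects back a dict matching response_format (the exact calling convention is not specified here):

import json
import openai

client = openai.AsyncOpenAI()

async def judge_response(prompt: str) -> dict:
    """Hypothetical user-provided LLM judge function."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with a JSON object containing 'score' and 'reasoning'."},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)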

JSONPath Support

Access data anywhere in the evaluation context using JSONPath expressions:

# Evaluation context structure:
{
    "test_case": {
        "id": "test_001",
        "input": "What is the capital of France?",
        "expected": "Paris",
        "metadata": {"category": "geography"}
    },
    "output": {
        "value": "The capital of France is Paris",
        "metadata": {"model": "gpt-4", "tokens": 25}
    }
}

# JSONPath examples:
"$.test_case.input"              # "What is the capital of France?"
"$.test_case.expected"           # "Paris"
"$.output.value"                 # "The capital of France is Paris"
"$.output.metadata.model"        # "gpt-4"
"$.test_case.metadata.category"  # "geography"

Literal vs JSONPath

  • Strings starting with $. are JSONPath expressions
  • Use \\$. to escape literal strings that start with $.
  • All other values are treated as literals (see the example below)
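
For example (assuming the escape form above is written literally in the argument string):

check = Check(
    type=CheckType.EXACT_MATCH,
    arguments={
        "actual": "$.output.value",      # JSONPath: resolved against the evaluation context
        "expected": "\\$.output.value",  # escaped per the rule above: compared as a literal, not resolved
    },
)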

Async Evaluation

flex-evals automatically detects and optimizes async checks:

# Mix of sync and async checks
checks = [
    Check(type='exact_match', arguments={"actual": "$.output.value", "expected": "Paris"}),  # Sync
    Check(type='semantic_similarity', arguments={...})  # Async
]

# Engine automatically:
# 1. Detects async checks  
# 2. Runs everything in async context
# 3. Maintains proper result ordering
results = evaluate(test_cases, outputs, checks)

Custom Async Checks

import httpx

from flex_evals.checks.base import BaseAsyncCheck
from flex_evals.registry import register

@register("custom_async_check", version="1.0.0")
class CustomAsyncCheck(BaseAsyncCheck):
    async def __call__(self, text: str, api_endpoint: str) -> dict:
        # Your async implementation
        async with httpx.AsyncClient() as client:
            response = await client.post(api_endpoint, json={"text": text})
            return {"score": response.json()["score"]}

Architecture

Core Components

src/flex_evals/
├── schemas/          # Pydantic models for FEP protocol
├── engine.py         # Main evaluate() function
├── checks/
│   ├── base.py       # BaseCheck and BaseAsyncCheck
│   ├── standard/     # Built-in synchronous checks
│   └── extended/     # Async checks (LLM, API calls)
├── jsonpath_resolver.py  # JSONPath expression handling
├── registry.py       # Check registration and discovery
└── exceptions.py     # Custom exception hierarchy

Evaluation Flow

  1. Validation - Ensure inputs meet protocol requirements
  2. Check Resolution - Map check types to implementations
  3. Async Detection - Determine execution strategy
  4. Execution - Run checks with proper error handling
  5. Aggregation - Collect results and compute summaries

Result Format

EvaluationRunResult(
    evaluation_id="uuid",
    started_at="2025-01-01T00:00:00Z",
    completed_at="2025-01-01T00:00:05Z", 
    status='completed',  # completed | error | skip
    summary=EvaluationSummary(
        total_test_cases=100,
        completed_test_cases=95,
        error_test_cases=3,
        skipped_test_cases=2
    ),
    results=[TestCaseResult(...), ...],
    experiment=ExperimentMetadata(...)
)
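
Based on the fields shown above and the Quick Start example, results can be consumed roughly like this (an illustrative sketch; attribute names follow the structure shown):

results = evaluate(test_cases, outputs, checks)

print(results.status)                    # completed | error | skip
print(results.summary.total_test_cases)

for test_case_result in results.results:
    for check_result in test_case_result.check_results:
        print(check_result.results.get("passed"))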

Development

Setup

# Clone repository
git clone https://github.com/your-org/flex-evals.git
cd flex-evals

# Install with development dependencies (uv includes the dev group by default)
uv sync

# Run tests
make unittests

# Run linting
make linting

# Run all quality checks
make tests

Project Commands

# Development workflow
make linting          # Run ruff linting
make unittests        # Run pytest with coverage  
make tests            # Run all quality checks

# Package management
uv add <package>      # Add dependency
uv add --dev <tool>   # Add development dependency
uv run <command>      # Run command in environment

Adding Custom Checks

  1. Create check implementation:
from flex_evals.checks.base import BaseCheck
from flex_evals.registry import register

@register("my_check", version="1.0.0")
class MyCheck(BaseCheck):
    def __call__(self, text: str, pattern: str, threshold: float = 0.5) -> dict:
        # Your check logic here
        score = your_analysis(text, pattern)
        return {"score": score, "passed": score >= threshold}
  2. Write tests:
def test_my_check():
    check = MyCheck()
    result = check(text="test input", pattern="test")
    assert "score" in result
    assert "passed" in result
  3. Register and use:
# Import registers the check automatically
from my_package.my_check import MyCheck

check = Check(type="my_check", arguments={
    "text": "$.output.value",
    "pattern": "success",
    "threshold": 0.8
})

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contribution Steps

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run quality checks (make tests)
  5. Commit changes (git commit -m 'Add amazing feature')
  6. Push to branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Principles

  • Test-driven development - Write tests first, run frequently
  • Protocol compliance - Maintain full FEP specification adherence
  • Clean interfaces - Pythonic, type-safe APIs
  • Performance - Optimize for large-scale evaluations
  • Documentation - Clear examples and comprehensive docs

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
