Python SDK for evaluating LLM Applications

These details have not been verified by PyPI

Project links

Project description

Fiddler Evals SDK

A toolkit for evaluating Large Language Model (LLM) applications, RAG systems, and AI agents. The Fiddler Evals SDK supports systematic evaluation through built-in evaluators, custom evaluation logic, and experiment tracking.

Key Features

🧪 Systematic Evaluation: Run structured experiments on your AI applications
📊 Built-in Evaluators: Production-ready evaluators for common AI tasks
🔧 Custom Evaluators: Build evaluation logic for your specific use cases
📈 Result Tracking: Comprehensive experiment tracking and analysis
🚀 Scale: Evaluate across large datasets with concurrent processing
📁 Multiple Data Sources: Import test cases from CSV, JSONL, and pandas DataFrames

Requirements

Python 3.10 or higher
Access to a Fiddler Platform instance
API token from Fiddler Platform

Installation

pip install fiddler-evals

For pre-release versions:

pip install --upgrade --pre fiddler-evals

Quick Start

1. Connect to Fiddler

from fiddler_evals import init

# Initialize connection
init(url='https://your-instance.fiddler.ai', token='your-api-token')
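
Hard-coding the token in source code makes it easy to leak. A minimal sketch, assuming the URL and token are stored in environment variables; the names FIDDLER_URL and FIDDLER_API_TOKEN are hypothetical and chosen only for this example, not defined by the SDK:

```python
import os


def load_fiddler_config(env=None) -> dict:
    """Read connection settings from environment variables.

    FIDDLER_URL and FIDDLER_API_TOKEN are hypothetical names used
    for this sketch; the SDK does not define them.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("FIDDLER_URL", "FIDDLER_API_TOKEN") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"url": env["FIDDLER_URL"], "token": env["FIDDLER_API_TOKEN"]}


def connect() -> None:
    """Initialize the SDK using settings from the environment."""
    # Imported lazily so load_fiddler_config() can be tested without the SDK
    from fiddler_evals import init

    init(**load_fiddler_config())
```

Calling connect() once at startup then replaces the literal init(...) call shown above.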

2. Create Project Structure

from fiddler_evals import Project, Application, Dataset

# Create project and application
project = Project.get_or_create(name='my-eval-project')
app = Application.get_or_create(
    name='my-llm-app',
    project_id=project.id
)

# Create dataset
dataset = Dataset.create(
    name='evaluation-dataset',
    application_id=app.id,
    description='Test cases for LLM evaluation'
)

3. Add Test Cases

from fiddler_evals.pydantic_models.dataset import NewDatasetItem


test_cases = [
    NewDatasetItem(
        inputs={"question": "What happens to you if you eat watermelon seeds?"},
        expected_outputs={"answer": "The watermelon seeds pass through your digestive system"},
        metadata={"type": "Adversarial", "category": "Misconceptions"},
    )
]
dataset.insert(test_cases)
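
When test cases start out as flat records (for example, rows exported from a spreadsheet), they need to be split into the three groups that NewDatasetItem takes. A minimal sketch; the record layout and the helper name to_item_kwargs are assumptions for illustration:

```python
def to_item_kwargs(record: dict, input_keys=("question",), output_keys=("answer",)) -> dict:
    """Split a flat record into inputs, expected_outputs, and metadata."""
    grouped = set(input_keys) | set(output_keys)
    return {
        "inputs": {k: record[k] for k in input_keys},
        "expected_outputs": {k: record[k] for k in output_keys},
        # Everything that is not an input or expected output becomes metadata
        "metadata": {k: v for k, v in record.items() if k not in grouped},
    }


records = [
    {"question": "What is the capital of France?", "answer": "Paris", "category": "Geography"},
]
item_kwargs = [to_item_kwargs(r) for r in records]
# With the SDK installed, each entry can be passed on as NewDatasetItem(**kwargs)
```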

4. Use Built-in Evaluators

Configure LLM Gateway provider:

Add an LLM provider in the UI (Settings > LLM Gateway) to use Fiddler's pre-built LLM-as-a-Judge evaluators. LLM-as-a-Judge evaluators require a model parameter in the format {provider}/{model} (e.g., openai/gpt-4o) and accept an optional credential parameter for LLM Gateway authentication.

from fiddler_evals.evaluators import (
    AnswerRelevance, Coherence, Conciseness,
    Sentiment, RegexSearch
)

# Test LLM-as-a-Judge evaluators (require model parameter)
relevance_evaluator = AnswerRelevance(
    model="openai/gpt-4o",           # Required: LLM Gateway model in {provider}/{model} format
    credential="my-openai-cred"      # Optional: LLM Gateway credential name
)
score = relevance_evaluator.score(
    prompt="What is the capital of France?",
    response="Paris is the capital of France."
)
print(f"Score: {score.value} - {score.reasoning}")

# Test other evaluators (no model parameter needed)
sentiment_evaluator = Sentiment()
scores = sentiment_evaluator.score(text="This is a helpful response.")
print("Sentiments:", [f'{score.name}: {score.value}' for score in scores])
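
A malformed model string is easier to catch before an evaluator is constructed. The sketch below checks the {provider}/{model} format described above; split_model_id is a hypothetical helper, and the SDK may perform its own validation:

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Split a '{provider}/{model}' string and reject malformed values."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"Expected '{{provider}}/{{model}}', got {model!r}")
    return provider, name


print(split_model_id("openai/gpt-4o"))  # ('openai', 'gpt-4o')
```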

5. Create Custom Evaluators

from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score

class PolitenessEvaluator(Evaluator):
    """
    Simple evaluator that checks if a response contains polite language.
    Useful for customer service or chatbot applications.
    """

    def __init__(self, score_name_prefix: str | None = None, score_fn_kwargs_mapping: dict | None = None):
        super().__init__(
            score_name_prefix=score_name_prefix,
            score_fn_kwargs_mapping=score_fn_kwargs_mapping
        )
        self.polite_words = [
            'please', 'thank you', 'thanks', 'sorry', 'apologize',
            'appreciate', 'welcome', 'help', 'assist', 'glad'
        ]

    def score(self, output: str) -> Score:
        """Score based on presence of polite language."""
        output_lower = output.lower()

        # Count polite words
        polite_count = sum(1 for word in self.polite_words if word in output_lower)

        # Simple scoring: 1.0 if any polite words found, 0.0 otherwise
        if polite_count > 0:
            score_value = 1.0
            reasoning = f"Contains {polite_count} polite word(s)"
        else:
            score_value = 0.0
            reasoning = "No polite language detected"

        return Score(
            name=f"{self.score_name_prefix}politeness",
            evaluator_name=self.name,
            value=score_value,
            reasoning=reasoning
        )

# Test the evaluator with different configurations
politeness_evaluator = PolitenessEvaluator()

polite_response = "Thank you for your question! I'd be happy to help you with that."
impolite_response = "I don't know. Figure it out yourself."

print(f"Polite response score: {politeness_evaluator.score(polite_response).value}")
print(f"Impolite response score: {politeness_evaluator.score(impolite_response).value}")

# Use with different configurations
customer_service_evaluator = PolitenessEvaluator(
    score_name_prefix="customer_service",
    score_fn_kwargs_mapping={"output": "response"}
)

support_evaluator = PolitenessEvaluator(
    score_name_prefix="support",
    score_fn_kwargs_mapping={"output": "answer"}
)
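
Note that PolitenessEvaluator matches substrings, so 'help' is also found inside 'unhelpful'. If that matters for your data, one alternative is to match whole words with a regular expression. A sketch of that swapped-in technique using only the standard library; count_polite_phrases is a hypothetical helper that could replace the counting line in score():

```python
import re

POLITE_WORDS = [
    'please', 'thank you', 'thanks', 'sorry', 'apologize',
    'appreciate', 'welcome', 'help', 'assist', 'glad',
]

# \b anchors each phrase at word boundaries; re.escape guards special characters
_POLITE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in POLITE_WORDS) + r")\b",
    re.IGNORECASE,
)


def count_polite_phrases(text: str) -> int:
    """Count whole-word occurrences of polite phrases in the text."""
    return len(_POLITE_RE.findall(text))


print(count_polite_phrases("Thank you for your question! I'd be happy to help you with that."))  # 2
print(count_polite_phrases("That was unhelpful."))  # 0
```

Unlike the original count, this counts occurrences rather than distinct phrases, which makes no difference to the 1.0 / 0.0 score.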

5.1. Function-Based Evaluators

You can also use simple functions as evaluators instead of creating full evaluator classes. Functions are automatically wrapped with EvalFn internally:

import re

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness

def word_count_evaluator(output: str) -> float:
    """Simple function that returns word count as a score."""
    word_count = len(output.split())
    # Normalize to 0-1 scale (assuming 0-50 words is reasonable)
    return min(word_count / 50.0, 1.0)

def contains_number_evaluator(output: str) -> float:
    """Check if response contains any numbers."""
    return 1.0 if re.search(r'\d+', output) else 0.0

# Use functions directly in evaluators list
evaluators = [
    AnswerRelevance(model="openai/gpt-4o", credential="my-openai-cred"),
    Conciseness(model="openai/gpt-4o", credential="my-openai-cred"),
    word_count_evaluator,        # Function evaluator
    contains_number_evaluator,   # Function evaluator
]

# The evaluate() function automatically wraps these with EvalFn.
# `dataset` comes from step 2; `my_llm_task` is defined in step 6.
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        "output": "answer",      # Maps to function parameter
        "response": "answer",    # Maps to class evaluator parameter
        "prompt": lambda x: x["inputs"]["question"],  # AnswerRelevance also needs 'prompt'
    }
)
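
Function evaluators can also be generated by a small factory when several checks differ only in their parameters. A sketch, assuming any callable that takes output and returns a float is accepted in the same way as the two functions above; make_keyword_evaluator is a hypothetical helper, not part of the SDK:

```python
from typing import Callable, Iterable


def make_keyword_evaluator(keywords: Iterable[str]) -> Callable[[str], float]:
    """Build a function evaluator that scores the fraction of keywords present."""
    wanted = [keyword.lower() for keyword in keywords]
    if not wanted:
        raise ValueError("keywords must not be empty")

    def keyword_coverage(output: str) -> float:
        text = output.lower()
        hits = sum(1 for keyword in wanted if keyword in text)
        return hits / len(wanted)

    return keyword_coverage


refund_coverage = make_keyword_evaluator(["refund", "policy"])
print(refund_coverage("Our refund policy is simple."))  # 1.0
```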

5.2. Custom LLM-as-a-Judge (CustomJudge)

The CustomJudge evaluator lets you define your own LLM-based evaluation criteria with a custom prompt and structured output schema.

Simple usage — string prompt with output fields:

from fiddler_evals.evaluators import CustomJudge

evaluator = CustomJudge(
    model="openai/gpt-4o",
    credential="my-openai-cred",
    prompt_template="""
        Analyze the sentiment of this customer review:

        Review: {{ review_text }}
    """,
    output_fields={
        "sentiment": {
            "type": "string",
            "choices": ["positive", "negative", "neutral"],
        },
        "confidence": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Confidence score between 0 and 1",
        },
    },
)

scores = evaluator.score(inputs={"review_text": "Great product, fast shipping!"})
for s in scores:
    # Check for None explicitly so a numeric value of 0 is still printed
    print(f"{s.name}: {s.value if s.value is not None else s.label}")
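
A template placeholder with no matching key in inputs is an easy mistake to make. A small pre-flight check, assuming placeholders always use the double-brace form shown above; missing_inputs is a hypothetical helper and does not reflect how the SDK itself parses templates:

```python
import re

# Matches {{ name }} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def missing_inputs(template: str, inputs: dict) -> list[str]:
    """Return placeholder names in the template that have no entry in inputs."""
    return sorted(set(PLACEHOLDER.findall(template)) - set(inputs))


template = "Analyze the sentiment of this customer review:\n\nReview: {{ review_text }}"
print(missing_inputs(template, {"review_text": "Great product, fast shipping!"}))  # []
print(missing_inputs(template, {}))  # ['review_text']
```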

Multi-message prompts — use Message for system/user/assistant roles, and InputFieldSpec for input metadata:

from fiddler_evals.evaluators.custom_judge import (
    CustomJudge, Message, InputFieldSpec,
)

evaluator = CustomJudge(
    model="openai/gpt-4o",
    prompt_template=[
        Message(role="system", content="You are an expert code reviewer."),
        Message(role="user", content="Review this code:\n{{ code }}"),
    ],
    inputs={
        "code": InputFieldSpec(required=True, description="The code to review"),
    },
    output_fields={
        "quality": {
            "type": "integer",
            "minimum": 1,
            "maximum": 10,
            "description": "Code quality score from 1 to 10",
        },
        "feedback": {"type": "string"},
    },
)

scores = evaluator.score(inputs={"code": "def add(a, b): return a + b"})

Output transforms — sometimes the field names or values you want the LLM to produce aren't the same as what you want in your final output. For example, LLMs are more reliable at classifying into labeled categories than producing precise numeric scores — so you can have the LLM return a sentiment label and map it to a number. Use llm_response_fields to define what the LLM returns, then use OutputFieldTransform in output_fields to map those values:

from fiddler_evals.evaluators.custom_judge import (
    CustomJudge, Message, InputFieldSpec, OutputFieldTransform,
)

evaluator = CustomJudge(
    model="openai/gpt-4o",
    prompt_template=[
        Message(role="system", content="Classify the sentiment of the text."),
        Message(role="user", content="Text: {{ text }}"),
    ],
    inputs={
        "text": InputFieldSpec(required=True),
    },
    # What the LLM returns: a categorical label
    llm_response_fields={
        "sentiment": {
            "type": "string",
            "choices": ["negative", "neutral", "positive"],
        },
        "reasoning": {"type": "string"},
    },
    # Final output: map labels to a 0-1 score
    output_fields={
        "score": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "transform": OutputFieldTransform(
                source_field="sentiment",
                value_map={"negative": 0, "neutral": 0.5, "positive": 1},
            ),
        },
        "reasoning": {"type": "string"},
    },
)

scores = evaluator.score(inputs={"text": "Great product, fast shipping!"})
# If the LLM labels the text "positive", the "score" field is mapped to 1.0
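
Conceptually, the transform is a lookup from the LLM's label to the final value. The plain-Python sketch below only illustrates that idea; it is not the SDK's implementation, and apply_value_map is a hypothetical name:

```python
def apply_value_map(llm_response: dict, source_field: str, value_map: dict) -> float:
    """Map the label found in `source_field` to its numeric value."""
    label = llm_response[source_field]
    if label not in value_map:
        raise ValueError(f"Unexpected label {label!r} for field {source_field!r}")
    return float(value_map[label])


llm_response = {"sentiment": "positive", "reasoning": "Praises product and shipping."}
value_map = {"negative": 0, "neutral": 0.5, "positive": 1}
print(apply_value_map(llm_response, "sentiment", value_map))  # 1.0
```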

6. Run Experiments

from fiddler_evals import evaluate

# Define your AI application task
def my_llm_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    question = inputs.get("question", "")
    # Your LLM API call here
    answer = call_your_llm(question)
    return {"answer": answer}

# Set up evaluators with different configurations
evaluators = [
    # LLM-as-a-Judge evaluators (require model parameter)
    AnswerRelevance(
        model="openai/gpt-4o",
        credential="my-openai-cred",
        score_name_prefix="primary"
    ),
    Conciseness(
        model="openai/gpt-4o",
        credential="my-openai-cred",
        score_name_prefix="primary"
    ),

    # Other evaluators
    Sentiment(score_name_prefix="primary"),

    # Custom evaluators with specific mappings
    PolitenessEvaluator(
        score_name_prefix="quality",
        score_fn_kwargs_mapping={"output": "answer"}
    ),

    # Multiple instances of same evaluator for different fields
    RegexSearch(
        pattern=r"\d+",
        score_name_prefix="question",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "question"}
    ),
    RegexSearch(
        pattern=r"\d+",
        score_name_prefix="answer",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "answer"}
    ),
]

# Run evaluation
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    name_prefix="my_evaluation",
    description="Comprehensive LLM evaluation",
    score_fn_kwargs_mapping={
        "question": lambda x: x["inputs"]["question"],
        "response": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    }
)

print(f"Evaluated {len(experiment_result.results)} test cases")
print(f"Generated {sum(len(result.scores) for result in experiment_result.results)} scores")

# Results in organized score names:
# "primary_answer_relevance", "primary_conciseness", "primary_sentiment",
# "quality_politeness", "question_has_number", "answer_has_number"
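
Per-test-case scores are usually summarized before they are compared across experiments. A minimal sketch that averages numeric values per score name, assuming only the attributes used in the snippets above (result.scores, score.name, score.value); the stand-in objects are hypothetical test data:

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace


def mean_score_by_name(results) -> dict:
    """Average numeric score values per score name across all test cases."""
    values = defaultdict(list)
    for result in results:
        for score in result.scores:
            if score.value is not None:  # skip scores without a numeric value
                values[score.name].append(score.value)
    return {name: mean(vals) for name, vals in values.items()}


# Stand-in objects with the same attributes as experiment_result.results
fake_results = [
    SimpleNamespace(scores=[SimpleNamespace(name="quality_politeness", value=1.0)]),
    SimpleNamespace(scores=[SimpleNamespace(name="quality_politeness", value=0.0)]),
]
print(mean_score_by_name(fake_results))  # {'quality_politeness': 0.5}
```

With a real run, the call would be mean_score_by_name(experiment_result.results).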

Built-in Evaluators

Evaluator	Purpose	Constructor Parameters	Score Parameters
`AnswerRelevance`	Checks if response addresses the question	`model` (required), `credential` (optional)	`prompt`, `response`
`Coherence`	Evaluates logical flow and consistency	`model` (required), `credential` (optional)	`response`, `prompt` (optional)
`Conciseness`	Measures response brevity and clarity	`model` (required), `credential` (optional)	`response`
`Sentiment`	Analyzes emotional tone	-	`text`
`RegexSearch`	Pattern matching for specific formats	`pattern` (required)	`output`
`FTLPromptSafety`	Compute safety scores for prompts	-	`text`
`FTLResponseFaithfulness`	Evaluate faithfulness of LLM responses	-	`response`, `context`
`RAGFaithfulness`	Checks if response is grounded in context	`model` (required), `credential` (optional)	`user_query`, `rag_response`, `retrieved_documents`
`CustomJudge`	Fully customizable LLM-as-a-Judge	`model` (required), `credential` (optional), `prompt_template` + `output_fields` or `prompt_spec`	`inputs` (dict)

Note: Evaluators marked with model and credential parameters are LLM-as-a-Judge evaluators that require an LLM Gateway model. The model parameter should be in {provider}/{model} format (e.g., openai/gpt-4o). The credential parameter is the name of the LLM Gateway credential for authentication.

Data Import Options

CSV Files

dataset.insert_from_csv_file(
    file_path='data.csv',
    input_columns=['question'],
    expected_output_columns=['answer'],
    metadata_columns=['category']
)

JSONL Files

dataset.insert_from_jsonl_file(
    file_path='data.jsonl',
    input_keys=['question'],
    expected_output_keys=['answer'],
    metadata_keys=['category']
)
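
For reference, a JSONL file holds one JSON object per line. The sketch below writes a small data.jsonl, assuming the keys named in input_keys, expected_output_keys, and metadata_keys sit at the top level of each object; the sample rows are made up for illustration:

```python
import json


def write_sample_jsonl(path: str) -> None:
    """Write one JSON object per line; keys match the arguments above."""
    rows = [
        {"question": "What is the capital of France?", "answer": "Paris", "category": "Geography"},
        {"question": "What is 2 + 2?", "answer": "4", "category": "Math"},
    ]
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


write_sample_jsonl("data.jsonl")
```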

Pandas DataFrames

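import pandas as pd

df = pd.DataFrame([
    {'question': 'What is the capital of France?', 'answer': 'Paris', 'category': 'geography'},
])

dataset.insert_from_pandas(
    df=df,
    input_columns=['question'],
    expected_output_columns=['answer'],
    metadata_columns=['category']
)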

Advanced Usage

Concurrent Processing

experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    max_workers=4  # Process 4 test cases concurrently
)
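
The SDK's scheduling internals are not described here. As a rough mental model only, processing test cases with a bounded pool of workers resembles the standard-library pattern below; run_concurrently and fake_task are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(task, items, max_workers=4):
    """Apply `task` to each item using at most `max_workers` threads.

    executor.map returns results in input order, regardless of which
    call finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(task, items))


def fake_task(inputs: dict) -> dict:
    # Stand-in for an LLM call
    return {"answer": inputs["question"].upper()}


outputs = run_concurrently(fake_task, [{"question": "a"}, {"question": "b"}])
print(outputs)  # [{'answer': 'A'}, {'answer': 'B'}]
```

Whatever mechanism the SDK uses, a task function that may run concurrently should avoid mutating shared state.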

Custom Score Mapping

The score_fn_kwargs_mapping parameter is essential for connecting your task outputs to evaluator inputs. Different evaluators expect different parameter names, but your task function returns outputs with specific keys.

# Your task returns:
{"answer": "Paris is the capital of France"}

# But evaluators expect different parameter names:
AnswerRelevance.score(prompt="...", response="...")  # Needs 'prompt' and 'response'
Conciseness.score(response="...")                    # Needs 'response'
Sentiment.score(text="...")                         # Needs 'text'

The Solution: Map your output keys to evaluator parameter names:

score_fn_kwargs_mapping={
    "question": "question",           # Map 'question' parameter to 'question' key
    "response": "answer",            # Map 'response' parameter to 'answer' key
    "text": "answer",                # Map 'text' parameter to 'answer' key
    "prompt": lambda x: x["inputs"]["question"],  # Map 'prompt' to input question
    "context": lambda x: x["extras"]["context"]   # Map 'context' to extras
}
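
One plausible reading of how such a mapping is resolved: a string value names a key in the task output, and a callable receives the whole record. The sketch below illustrates that reading in plain Python; the record layout (including the 'outputs' key) and resolve_kwargs are assumptions, not the SDK's implementation:

```python
def resolve_kwargs(mapping: dict, record: dict) -> dict:
    """Turn a parameter mapping into keyword arguments for a score function."""
    kwargs = {}
    for param, source in mapping.items():
        if callable(source):
            # Callable: receives the full record, e.g. lambda x: x["inputs"]["question"]
            kwargs[param] = source(record)
        else:
            # String: treated here as a key in the task output
            kwargs[param] = record["outputs"][source]
    return kwargs


record = {
    "inputs": {"question": "What is the capital of France?"},
    "extras": {"context": "France is a country in Europe."},
    "outputs": {"answer": "Paris is the capital of France"},
}
mapping = {
    "response": "answer",
    "prompt": lambda x: x["inputs"]["question"],
    "context": lambda x: x["extras"]["context"],
}
resolved = resolve_kwargs(mapping, record)
print(resolved["response"])  # Paris is the capital of France
```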

Multiple Evaluator Instances with Different Mappings

You can create multiple instances of the same evaluator with different parameter mappings and score name prefixes to evaluate different aspects of your outputs. Use score_name_prefix to organize and distinguish scores when using multiple evaluator instances:

from fiddler_evals.evaluators import RegexSearch

# Create multiple RegexSearch evaluators for different fields
evaluators = [
    # Check for numbers in the question
    RegexSearch(
        pattern=r"\d+",
        score_name_prefix="question",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "question"}
    ),
    # Check for numbers in the answer
    RegexSearch(
        pattern=r"\d+",
        score_name_prefix="answer",
        score_name="has_number",
        score_fn_kwargs_mapping={"output": "answer"}
    ),
    # Check for capital letters in the answer
    RegexSearch(
        pattern=r"[A-Z]",
        score_name_prefix="answer",
        score_name="has_caps",
        score_fn_kwargs_mapping={"output": "answer"}
    )
]

# Run evaluation
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        "question": lambda x: x["inputs"]["question"]
    }
)

# Results in scores named:
# "question_has_number", "answer_has_number", "answer_has_caps"

Parameter Mapping Priority

When both evaluator-level and evaluation-level mappings are present, evaluator-level mappings take precedence:

# Evaluator-level mapping (higher priority)
evaluator = RegexSearch(
    pattern=r"\d+",
    score_fn_kwargs_mapping={"output": "answer"}  # This takes precedence
)

# Evaluation-level mapping (lower priority)
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=[evaluator],
    score_fn_kwargs_mapping={
        "output": "question"  # This is ignored due to evaluator-level mapping
    }
)

Mapping Priority (highest to lowest):

1. Evaluator-level score_fn_kwargs_mapping (set in evaluator constructor)
2. Evaluation-level score_fn_kwargs_mapping (passed to evaluate function)
3. Default parameter resolution
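
The first two levels behave like a dictionary merge in which evaluator-level entries override evaluation-level ones. A plain-Python illustration of that precedence, not the SDK's code; effective_mapping is a hypothetical name:

```python
def effective_mapping(evaluation_level: dict, evaluator_level: dict) -> dict:
    """Merge mappings; evaluator-level entries win on conflicts."""
    return {**evaluation_level, **evaluator_level}


merged = effective_mapping(
    evaluation_level={"output": "question", "prompt": "question"},
    evaluator_level={"output": "answer"},
)
print(merged)  # {'output': 'answer', 'prompt': 'question'}
```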

Experiment Metadata

experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    metadata={
        "model_version": "gpt-4",
        "evaluation_date": "2024-01-15",
        "temperature": 0.7
    }
)
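
Hand-typed dates in metadata tend to go stale. A small sketch that fills in the evaluation date at run time, assuming the same three keys as the example above; build_metadata is a hypothetical helper:

```python
from datetime import date


def build_metadata(model_version: str, temperature: float) -> dict:
    """Assemble experiment metadata with today's date in ISO format."""
    return {
        "model_version": model_version,
        "evaluation_date": date.today().isoformat(),  # formatted as YYYY-MM-DD
        "temperature": temperature,
    }


metadata = build_metadata("gpt-4", 0.7)
```

The resulting dictionary can be passed as the metadata argument of evaluate().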

Release history

Version	Release date	Notes
0.3.0	Feb 11, 2026	-
0.3.0rc2	Apr 30, 2026	pre-release (this version)
0.3.0rc1	Feb 11, 2026	pre-release
0.2.0	Nov 19, 2025	-
0.2.0rc2	Nov 11, 2025	pre-release
0.2.0rc1	Nov 10, 2025	pre-release
0.1.1	Oct 8, 2025	-
0.1.1.dev14	Oct 8, 2025	pre-release
0.1.1.dev13	Oct 1, 2025	pre-release
0.1.1.dev12	Oct 1, 2025	pre-release
0.1.1.dev11	Oct 1, 2025	pre-release
0.1.1.dev10	Oct 1, 2025	pre-release
0.1.1.dev9	Oct 1, 2025	pre-release
0.1.1.dev2	Sep 29, 2025	pre-release
0.1.1.dev1	Sep 29, 2025	pre-release
0.1.0	Sep 29, 2025	-

Download files

Source Distribution

fiddler_evals-0.3.0rc2.tar.gz (130.6 kB), uploaded Apr 30, 2026, Source

Built Distribution

fiddler_evals-0.3.0rc2-py3-none-any.whl (168.7 kB), uploaded Apr 30, 2026, Python 3

File details

Details for the file fiddler_evals-0.3.0rc2.tar.gz.

File metadata

Download URL: fiddler_evals-0.3.0rc2.tar.gz
Upload date: Apr 30, 2026
Size: 130.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fiddler_evals-0.3.0rc2.tar.gz
Algorithm	Hash digest
SHA256	`7f5a6d56f8d56858a874f12ccb63b6ec267ecac2e4d541c1908c1581ebe589fd`
MD5	`a143b730c4a5700276d40c211e368472`
BLAKE2b-256	`3dfcb5a77a6c30be7d16bd226ce99687433f93c8d5ed59f1ea6e2d5ef42efedf`

Provenance

The following attestation bundles were made for fiddler_evals-0.3.0rc2.tar.gz:

Publisher: release.yaml on fiddler-labs/fiddler-evals-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fiddler_evals-0.3.0rc2.tar.gz
- Subject digest: 7f5a6d56f8d56858a874f12ccb63b6ec267ecac2e4d541c1908c1581ebe589fd
- Sigstore transparency entry: 1413968827
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: fiddler-labs/fiddler-evals-sdk@0af6653ff363235af90b006a9afc36fa9d4e9bf9
- Branch / Tag: refs/heads/release/0.3
- Owner: https://github.com/fiddler-labs
- Access: internal
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yaml@0af6653ff363235af90b006a9afc36fa9d4e9bf9
- Trigger Event: workflow_dispatch

File details

Details for the file fiddler_evals-0.3.0rc2-py3-none-any.whl.

File metadata

Download URL: fiddler_evals-0.3.0rc2-py3-none-any.whl
Upload date: Apr 30, 2026
Size: 168.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fiddler_evals-0.3.0rc2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9162c8bfe98d0be2ad3ed986878d0a5ee5cdb7892f3ddffdbcd5e5d089ab20e5`
MD5	`d087d2390018064341ed40f0c9613125`
BLAKE2b-256	`cf418dbe93f221be36b78ab0000ba51c5eef580d19ad3c040d46cdd50c88a48b`

Provenance

The following attestation bundles were made for fiddler_evals-0.3.0rc2-py3-none-any.whl:

Publisher: release.yaml on fiddler-labs/fiddler-evals-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fiddler_evals-0.3.0rc2-py3-none-any.whl
- Subject digest: 9162c8bfe98d0be2ad3ed986878d0a5ee5cdb7892f3ddffdbcd5e5d089ab20e5
- Sigstore transparency entry: 1413968986
- Sigstore integration time: Apr 30, 2026
Source repository:
- Permalink: fiddler-labs/fiddler-evals-sdk@0af6653ff363235af90b006a9afc36fa9d4e9bf9
- Branch / Tag: refs/heads/release/0.3
- Owner: https://github.com/fiddler-labs
- Access: internal
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: self-hosted
- Publication workflow: release.yaml@0af6653ff363235af90b006a9afc36fa9d4e9bf9
- Trigger Event: workflow_dispatch
