Python SDK for evaluating LLM Applications

Fiddler Evals SDK

A toolkit for evaluating Large Language Model (LLM) applications, RAG systems, and AI agents. The Fiddler Evals SDK supports systematic evaluation with built-in evaluators, custom evaluation logic, and experiment tracking.

Key Features

  • 🧪 Systematic Evaluation: Run structured experiments on your AI applications
  • 📊 Built-in Evaluators: Production-ready evaluators for common AI tasks
  • 🔧 Custom Evaluators: Build evaluation logic for your specific use cases
  • 📈 Result Tracking: Comprehensive experiment tracking and analysis
  • 🚀 Scale: Evaluate across large datasets with concurrent processing
  • 📁 Multiple Data Sources: Import test cases from CSV, JSONL, and pandas DataFrames

Requirements

  • Python 3.10 or higher
  • Access to a Fiddler Platform instance
  • API token from Fiddler Platform

Installation

pip install fiddler-evals

For pre-release versions:

pip install --upgrade --pre fiddler-evals

Quick Start

1. Connect to Fiddler

from fiddler_evals import init

# Initialize connection
init(url='https://your-org.fiddler.ai', token='your-api-token')
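
Hard-coding the token works for a quick test, but a shared script would more likely read it from the environment. Here is a minimal sketch of that. The variable names FIDDLER_URL and FIDDLER_API_TOKEN and the helper load_fiddler_credentials are hypothetical, chosen only for illustration; they are not defined by the SDK.

```python
import os
from typing import Mapping


def load_fiddler_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Read connection settings from environment variables.

    FIDDLER_URL and FIDDLER_API_TOKEN are hypothetical names used for this
    sketch; substitute whatever names your deployment already uses.
    """
    required = ("FIDDLER_URL", "FIDDLER_API_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"url": env["FIDDLER_URL"], "token": env["FIDDLER_API_TOKEN"]}


# Usage with the init() call shown above:
# from fiddler_evals import init
# init(**load_fiddler_credentials())
```

Failing early on a missing variable keeps a misconfigured environment from surfacing later as a connection error.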

2. Create Project Structure

from fiddler_evals import Project, Application, Dataset

# Create project and application
project = Project.get_or_create(name='my-eval-project')
app = Application.get_or_create(
    name='my-llm-app',
    project_id=project.id
)

# Create dataset
dataset = Dataset.create(
    name='evaluation-dataset',
    application_id=app.id,
    description='Test cases for LLM evaluation'
)

3. Add Test Cases

from fiddler_evals.pydantic_models.dataset import NewDatasetItem


test_cases = [
    NewDatasetItem(
        inputs={"question": "What happens to you if you eat watermelon seeds?"},
        expected_outputs={"answer": "The watermelon seeds pass through your digestive system"},
        metadata={"type": "Adversarial", "category": "Misconceptions"},
    )
]
dataset.insert(test_cases)

4. Use Built-in Evaluators

from fiddler_evals.evaluators import (
    AnswerRelevance, Coherence, Conciseness,
    Toxicity, Sentiment, RegexSearch
)

# Test individual evaluators
relevance_evaluator = AnswerRelevance()
score = relevance_evaluator.score(
    prompt="What is the capital of France?",
    response="Paris is the capital of France."
)
print(f"Score: {score.value} - {score.reasoning}")

5. Create Custom Evaluators

from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score

class PolitenessEvaluator(Evaluator):
    """
    Simple evaluator that checks if a response contains polite language.
    Useful for customer service or chatbot applications.
    """

    def __init__(self):
        super().__init__()
        self.polite_words = [
            'please', 'thank you', 'thanks', 'sorry', 'apologize',
            'appreciate', 'welcome', 'help', 'assist', 'glad'
        ]

    def score(self, output: str) -> Score:
        """Score based on presence of polite language."""
        output_lower = output.lower()

        # Count polite words
        polite_count = sum(1 for word in self.polite_words if word in output_lower)

        # Simple scoring: 1.0 if any polite words found, 0.0 otherwise
        if polite_count > 0:
            score_value = 1.0
            reasoning = f"Contains {polite_count} polite word(s)"
        else:
            score_value = 0.0
            reasoning = "No polite language detected"

        return Score(
            name="politeness",
            evaluator_name=self.name,
            value=score_value,
            reasoning=reasoning
        )

# Test the evaluator
politeness_evaluator = PolitenessEvaluator()

polite_response = "Thank you for your question! I'd be happy to help you with that."
impolite_response = "I don't know. Figure it out yourself."

print(f"Polite response score: {politeness_evaluator.score(polite_response).value}")
print(f"Impolite response score: {politeness_evaluator.score(impolite_response).value}")
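
Note that the `word in output_lower` check above is a substring test, so a response containing "unhelpful" still counts as containing "help". If that matters for your use case, matching on word boundaries is one option. This is a minimal standalone sketch; count_polite_terms is a hypothetical helper, not part of the SDK.

```python
import re


def count_polite_terms(text: str, terms: list[str]) -> int:
    """Count how many terms appear in text as whole words or phrases."""
    lowered = text.lower()
    return sum(
        1
        for term in terms
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    )


terms = ["please", "thank you", "help"]
print(count_polite_terms("Thank you, I am glad to help.", terms))  # 2
print(count_polite_terms("That was unhelpful.", terms))            # 0
```

The helper could replace the `sum(...)` line inside `PolitenessEvaluator.score` without changing the rest of the class.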

5.1. Function-Based Evaluators

You can also use simple functions as evaluators instead of creating full evaluator classes. Functions are automatically wrapped with EvalFn internally:

import re

from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, Conciseness

def word_count_evaluator(output: str) -> float:
    """Simple function that returns word count as a score."""
    word_count = len(output.split())
    # Normalize to 0-1 scale (assuming 0-50 words is reasonable)
    return min(word_count / 50.0, 1.0)

def contains_number_evaluator(output: str) -> float:
    """Check if response contains any numbers."""
    return 1.0 if re.search(r'\d+', output) else 0.0

# Use functions directly in evaluators list
evaluators = [
    AnswerRelevance(),
    Conciseness(),
    word_count_evaluator,        # Function evaluator
    contains_number_evaluator,   # Function evaluator
]

# The evaluate() function automatically wraps these with EvalFn.
# my_llm_task is the task function defined in step 6 below.
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping={
        "output": "answer",      # Maps to function parameter
        "response": "answer",    # Maps to class evaluator parameter
        "prompt": lambda x: x["inputs"]["question"],  # Needed by AnswerRelevance
    }
)
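
Because function evaluators are plain Python, they can be checked locally before an experiment is run. This minimal sketch repeats the two functions from this section and calls them directly, with no Fiddler connection involved; the sample strings are illustrative.

```python
import re


def word_count_evaluator(output: str) -> float:
    """Word count normalized to a 0-1 scale, capped at 50 words."""
    return min(len(output.split()) / 50.0, 1.0)


def contains_number_evaluator(output: str) -> float:
    """Return 1.0 if the output contains a digit, else 0.0."""
    return 1.0 if re.search(r"\d+", output) else 0.0


sample = "Paris has about 2 million residents."
print(word_count_evaluator(sample))         # 6 words / 50 = 0.12
print(contains_number_evaluator(sample))    # 1.0
print(word_count_evaluator("word " * 80))   # capped at 1.0
```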

6. Run Experiments

from fiddler_evals import evaluate

# Define your AI application task
def my_llm_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    question = inputs.get("question", "")
    # call_your_llm is a placeholder; replace it with your own LLM API call
    answer = call_your_llm(question)
    return {"answer": answer}

# Set up evaluators
evaluators = [
    AnswerRelevance(),
    Conciseness(),
    Sentiment(),
    PolitenessEvaluator(),
]

# Run evaluation
experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    name_prefix="my_evaluation",
    description="Comprehensive LLM evaluation",
    score_fn_kwargs_mapping={
        "question": "question",
        "response": "answer",
        "output": "answer",
        "text": "answer",
        "prompt": lambda x: x["inputs"]["question"],
    }
)

print(f"Evaluated {len(experiment_result.results)} test cases")
print(f"Generated {sum(len(result.scores) for result in experiment_result.results)} scores")
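
The per-case scores can also be rolled up into one average per score name. The sketch below uses stand-in dataclasses that mimic only the attributes used above (`results`, `scores`, `name`, `value`); the real result objects may carry more fields. mean_score_by_name is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class FakeScore:  # stand-in for the SDK's Score object
    name: str
    value: float


@dataclass
class FakeResult:  # stand-in for one entry of experiment_result.results
    scores: list[FakeScore] = field(default_factory=list)


def mean_score_by_name(results) -> dict[str, float]:
    """Average the score values per score name across all results."""
    grouped = defaultdict(list)
    for result in results:
        for score in result.scores:
            grouped[score.name].append(score.value)
    return {name: mean(values) for name, values in grouped.items()}


results = [
    FakeResult([FakeScore("politeness", 1.0), FakeScore("word_count", 0.5)]),
    FakeResult([FakeScore("politeness", 0.0), FakeScore("word_count", 1.0)]),
]
print(mean_score_by_name(results))  # {'politeness': 0.5, 'word_count': 0.75}
```

With a real run you would pass `experiment_result.results` instead, assuming every score carries a numeric `value`.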

Built-in Evaluators

Evaluator               | Purpose                                   | Key Parameters
AnswerRelevance         | Checks if response addresses the question | prompt, response
Coherence               | Evaluates logical flow and consistency    | response, prompt
Conciseness             | Measures response brevity and clarity     | response
Toxicity                | Detects harmful or toxic content          | text
Sentiment               | Analyzes emotional tone                   | text
RegexSearch             | Pattern matching for specific formats     | output, pattern
FTLPromptSafety         | Compute safety scores for prompts         | text
FTLResponseFaithfulness | Evaluate faithfulness of LLM responses    | response, context

Data Import Options

CSV Files

dataset.insert_from_csv_file(
    file_path='data.csv',
    input_columns=['question'],
    expected_output_columns=['answer'],
    metadata_columns=['category']
)
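
The call above refers to columns by name, so the CSV needs a header row containing `question`, `answer`, and `category`. This minimal sketch writes such a file with the standard library; the row reuses the test case from step 3 as sample data.

```python
import csv

rows = [
    {
        "question": "What happens to you if you eat watermelon seeds?",
        "answer": "The watermelon seeds pass through your digestive system",
        "category": "Misconceptions",
    },
]

# Column names match input_columns, expected_output_columns, and metadata_columns.
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "category"])
    writer.writeheader()
    writer.writerows(rows)
```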

JSONL Files

dataset.insert_from_jsonl_file(
    file_path='data.jsonl',
    input_keys=['question'],
    expected_output_keys=['answer'],
    metadata_keys=['category']
)
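
Here is a matching input file. It assumes the loader expects one flat JSON object per line whose keys line up with `input_keys`, `expected_output_keys`, and `metadata_keys`; that layout is an assumption for this sketch, and the record is sample data.

```python
import json

records = [
    {
        "question": "What happens to you if you eat watermelon seeds?",
        "answer": "The watermelon seeds pass through your digestive system",
        "category": "Misconceptions",
    },
]

# JSONL: one JSON object per line, newline-terminated.
with open("data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```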

Pandas DataFrames

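import pandas as pd

# Example data; column names must match the lists passed below
df = pd.DataFrame([{'question': 'What is the capital of France?', 'answer': 'Paris', 'category': 'Geography'}])
dataset.insert_from_pandas(
    df=df,
    input_columns=['question'],
    expected_output_columns=['answer'],
    metadata_columns=['category']
)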

Advanced Usage

Concurrent Processing

experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    max_workers=4  # Process 4 test cases concurrently
)

Custom Score Mapping

The score_fn_kwargs_mapping parameter is essential for connecting your task outputs to evaluator inputs. Different evaluators expect different parameter names, but your task function returns outputs with specific keys.

# Your task returns:
{"answer": "Paris is the capital of France"}

# But evaluators expect different parameter names:
AnswerRelevance.score(prompt="...", response="...")  # Needs 'prompt' and 'response'
Conciseness.score(response="...")                    # Needs 'response'
Sentiment.score(text="...")                         # Needs 'text'

The Solution: Map your output keys to evaluator parameter names:

score_fn_kwargs_mapping={
    "question": "question",           # Map 'question' parameter to 'question' key
    "response": "answer",            # Map 'response' parameter to 'answer' key
    "text": "answer",                # Map 'text' parameter to 'answer' key
    "prompt": lambda x: x["inputs"]["question"],  # Map 'prompt' to input question
    "context": lambda x: x["extras"]["context"]   # Map 'context' to extras
}
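
One way to picture the mapping: a string value names a key in the task's output, while a callable receives a dict describing the test case and returns the value itself. The sketch below illustrates that reading and is not the SDK's implementation. The helper resolve_kwargs and the exact shape of the context dict (`inputs`, `extras`, `outputs`) are assumptions inferred from the lambdas above.

```python
from typing import Any, Callable, Union

KwargsMapping = dict[str, Union[str, Callable[[dict], Any]]]


def resolve_kwargs(mapping: KwargsMapping, context: dict) -> dict:
    """Build evaluator keyword arguments from a mapping (illustrative only)."""
    resolved = {}
    for param, source in mapping.items():
        if callable(source):
            # Callables pull a value out of the full context themselves.
            resolved[param] = source(context)
        elif source in context["outputs"]:
            # Strings are treated as keys of the task's output dict.
            resolved[param] = context["outputs"][source]
    return resolved


context = {
    "inputs": {"question": "What is the capital of France?"},
    "extras": {"context": "France's capital city is Paris."},
    "outputs": {"answer": "Paris is the capital of France"},
}
mapping = {
    "response": "answer",
    "text": "answer",
    "prompt": lambda x: x["inputs"]["question"],
    "context": lambda x: x["extras"]["context"],
}
kwargs = resolve_kwargs(mapping, context)
print(kwargs["response"])  # Paris is the capital of France
print(kwargs["prompt"])    # What is the capital of France?
```

Under this reading, one mapping can serve every evaluator in the list, because each evaluator picks out only the parameter names its `score` method declares.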

Experiment Metadata

experiment_result = evaluate(
    dataset=dataset,
    task=my_llm_task,
    evaluators=evaluators,
    metadata={
        "model_version": "gpt-4",
        "evaluation_date": "2024-01-15",
        "temperature": 0.7
    }
)
