LLM Evaluations

Project description

arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

  • Includes pre-tested templates and convenience functions for a set of common Eval "tasks" (see the sketch after this list)
  • Data science rigor applied to the testing of model and template combinations
  • Designed to run as fast as possible on batches of data
  • Includes benchmark datasets and tests for each eval function
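
Each pre-tested task ships as a prompt template constant paired with a "rails" map that constrains the LLM's output to a fixed set of class labels. A minimal sketch using the RAG relevancy task (other constants such as HALLUCINATION_PROMPT_TEMPLATE and TOXICITY_PROMPT_TEMPLATE are assumed to follow the same pattern; verify the names against your installed version):

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
)

# The template carries the prompt text; the rails map pins the judge's
# answer to a fixed label set, e.g. ["relevant", "unrelated"].
# (The exact printed repr depends on the installed version.)
print(RAG_RELEVANCY_PROMPT_TEMPLATE)
print(list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()))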

Installation

Install the arize-phoenix-evals sub-package via pip:

pip install arize-phoenix-evals

Note that you will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI models you will need to install the OpenAI Python SDK:

pip install 'openai>=1.0.0'
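
Each vendor SDK pairs with a matching model wrapper in phoenix.evals. A hedged sketch (the AnthropicModel wrapper and the model names below are assumptions based on Phoenix's documented wrappers; check your installed version):

from phoenix.evals import AnthropicModel, OpenAIModel

# Construct whichever wrapper matches the SDK you installed;
# the model names here are illustrative.
openai_model = OpenAIModel(model="gpt-4o-mini")
anthropic_model = AnthropicModel(model="claude-3-5-sonnet-latest")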

Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:

This example also uses scikit-learn, so install it via pip:

pip install scikit-learn

import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose the LLM that will perform the relevance classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Download a benchmark dataset of question-answer pairs and sample 100 examples
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100, random_state=0)  # fixed seed for reproducibility
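# Rename columns to match the template's expected {input} and {reference} variables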
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the LLM to classify each example in the dataset; the rails
# restrict its output to the allowed class names
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the ground-truth booleans to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels produced by the LLM judge
y_pred = result_df["label"]

# Score the judge's predictions against the ground truth
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names
)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")

To learn more about LLM Evals, see the LLM Evals documentation.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arize_phoenix_evals-0.22.0.tar.gz (50.1 kB)

Built Distribution

arize_phoenix_evals-0.22.0-py3-none-any.whl (64.3 kB)

File details

Details for the file arize_phoenix_evals-0.22.0.tar.gz.

File metadata

  • Download URL: arize_phoenix_evals-0.22.0.tar.gz
  • Size: 50.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for arize_phoenix_evals-0.22.0.tar.gz

  • SHA256: bc84fafc53ffc01f7832c3313ee22418d397543d0b20520b29f175d45968ef23
  • MD5: d02f2adc8af3ac845fb3724bac71a5f4
  • BLAKE2b-256: e0bb2bac66f6471f87a765cacda532e5200a94adfc3dbce7ec8c0a6aafdf1f18

See more details on using hashes here.
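
To check a downloaded file against the SHA256 digest above, Python's standard library is enough. A minimal sketch (the local file path is illustrative):

import hashlib

EXPECTED = "bc84fafc53ffc01f7832c3313ee22418d397543d0b20520b29f175d45968ef23"

# Hash the downloaded archive and compare against the published digest
with open("arize_phoenix_evals-0.22.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == EXPECTED, "hash mismatch: the download may be corrupted"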

File details

Details for the file arize_phoenix_evals-0.22.0-py3-none-any.whl.

File hashes

Hashes for arize_phoenix_evals-0.22.0-py3-none-any.whl

  • SHA256: 30bde46b3c79b15cc8bce42dc2fb213242a30f820f9d7e4717c3192ebb3158a8
  • MD5: 86884bc742b7a075770219c39636617b
  • BLAKE2b-256: a2ca838c91b093bb4f56458eb79c3482d104a5554575bef8a70b97aa11cf0c36

See more details on using hashes here.
