LLM Evaluations
Project description
arize-phoenix-evals
Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.
Phoenix's approach to LLM evals is notable for the following reasons:
- Includes pre-tested templates and convenience functions for a set of common Eval "tasks"
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function
Installation
Install the arize-phoenix-evals sub-package via pip:

```shell
pip install arize-phoenix-evals
```
Note that you will also need to install the SDK for the LLM vendor you would like to use with LLM Evals. For example, to use an OpenAI model, you will need the OpenAI Python SDK:

```shell
pip install 'openai>=1.0.0'
```
Usage
Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers. The example uses scikit-learn, so install it via pip:

```shell
pip install scikit-learn
```
```python
import os

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)

# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names
)

print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall: {recall[idx]:.2f}")
    print(f"  F1 Score: {f1[idx]:.2f}\n")
```
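Beyond per-class precision, recall, and F1, a confusion matrix is often the quickest way to see where the eval model's predicted labels disagree with the ground truth. A minimal sketch, using scikit-learn with small hard-coded label lists standing in for the `y_true` and `y_pred` columns above (the class names here are illustrative placeholders, not necessarily the exact rail values):

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels standing in for df["relevant"] (ground truth) and
# result_df["label"] (model predictions) from the example above.
class_names = ["relevant", "unrelated"]
y_true = ["relevant", "relevant", "unrelated", "unrelated"]
y_pred = ["relevant", "unrelated", "unrelated", "unrelated"]

# Rows are true classes, columns are predicted classes,
# in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)
```

Reading the matrix row by row shows, for each true class, how many examples the eval model assigned to each predicted class.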
To learn more about LLM Evals, see the LLM Evals documentation.
File details
Details for the file arize_phoenix_evals-0.22.0.tar.gz
File metadata
- Download URL: arize_phoenix_evals-0.22.0.tar.gz
- Upload date:
- Size: 50.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | bc84fafc53ffc01f7832c3313ee22418d397543d0b20520b29f175d45968ef23
MD5 | d02f2adc8af3ac845fb3724bac71a5f4
BLAKE2b-256 | e0bb2bac66f6471f87a765cacda532e5200a94adfc3dbce7ec8c0a6aafdf1f18
File details
Details for the file arize_phoenix_evals-0.22.0-py3-none-any.whl
File metadata
- Download URL: arize_phoenix_evals-0.22.0-py3-none-any.whl
- Upload date:
- Size: 64.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 30bde46b3c79b15cc8bce42dc2fb213242a30f820f9d7e4717c3192ebb3158a8
MD5 | 86884bc742b7a075770219c39636617b
BLAKE2b-256 | a2ca838c91b093bb4f56458eb79c3482d104a5554575bef8a70b97aa11cf0c36
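The digests above can be used to verify a downloaded file before installing it. A minimal sketch using only the standard library (the demo hashes a throwaway temp file; for a real check, pass the path to the downloaded archive and compare the result against the SHA256 digest listed above):

```python
import hashlib
import tempfile

def sha256_of_file(path: str) -> str:
    """Stream the file through SHA-256 in chunks so large downloads
    never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    path = tmp.name

print(sha256_of_file(path))
```

If the computed digest does not match the published one, the download is corrupt or has been tampered with and should not be installed.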