LLM Evaluations

arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine whether documents retrieved by a retrieval-augmented generation (RAG) application are relevant, whether a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

Includes pre-tested templates and convenience functions for a set of common Eval "tasks"
Data science rigor applied to the testing of model and template combinations
Designed to run as fast as possible on batches of data
Includes benchmark datasets and tests for each eval function

Installation

Install the arize-phoenix-evals sub-package via pip:

pip install arize-phoenix-evals

You will also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:

pip install 'openai>=1.0.0'
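Before running the examples, it can help to confirm that both distributions are visible to the interpreter. The following is a minimal sketch that uses only the standard library's importlib.metadata; the helper name installed_version is illustrative and is not part of Phoenix.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Report on the two distributions named in the installation steps above.
for name in ("arize-phoenix-evals", "openai"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If either line reports "not installed", rerun the corresponding pip command in the same environment that runs your code.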

Usage

The following example runs the RAG relevance eval on a benchmark dataset of Wikipedia questions and answers. It uses scikit-learn to compute the metrics, so install it via pip first:

pip install scikit-learn

import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=class_names)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")
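To make the reported numbers easier to interpret, the sketch below recomputes per-class precision, recall, F1, and support by hand on a tiny hand-made label set. It is an illustration only, not scikit-learn's implementation: the helper per_class_metrics is hypothetical, and the label strings are placeholders rather than values read from the rails map.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1, support)} from paired labels."""
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    metrics = {}
    for label in labels:
        true_positives = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        # Fall back to 0.0 when a class was never predicted or never present.
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1, actual)
    return metrics


# Placeholder labels (assumed for illustration only).
y_true = ["relevant", "relevant", "unrelated", "unrelated", "relevant"]
y_pred = ["relevant", "unrelated", "unrelated", "relevant", "relevant"]

for label, (p, r, f1, n) in per_class_metrics(
    y_true, y_pred, ["relevant", "unrelated"]
).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

Precision answers "of the examples the model labeled with this class, how many were correct?", while recall answers "of the examples that truly belong to this class, how many did the model find?".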

To learn more about LLM Evals, see the LLM Evals documentation.

Release history

Version	Release date
0.22.0 (this version)	Jul 2, 2025
0.21.1	Jul 2, 2025
0.21.0	Jun 21, 2025
0.20.8	Jun 4, 2025
0.20.7	May 28, 2025
0.20.6	Apr 17, 2025
0.20.5	Apr 16, 2025
0.20.4	Mar 24, 2025
0.20.3	Feb 13, 2025
0.20.2	Feb 6, 2025
0.20.1	Feb 6, 2025
0.20.0	Feb 5, 2025
0.19.0	Jan 16, 2025
0.18.1	Jan 7, 2025
0.18.0	Dec 20, 2024
0.17.5	Nov 19, 2024
0.17.4	Nov 12, 2024
0.17.3	Nov 6, 2024
0.17.2	Oct 18, 2024
0.17.1	Oct 17, 2024
0.17.0	Oct 9, 2024
0.16.1	Sep 27, 2024
0.16.0	Sep 17, 2024
0.15.1	Aug 27, 2024
0.15.0	Aug 15, 2024
0.14.1	Jul 16, 2024
0.14.0	Jul 12, 2024
0.13.2	Jul 3, 2024
0.13.1	Jun 30, 2024
0.13.0	Jun 26, 2024
0.12.0	Jun 6, 2024
0.11.0	May 31, 2024
0.10.0	May 29, 2024
0.9.2	May 21, 2024
0.9.1	May 21, 2024
0.9.0	May 17, 2024
0.8.2	May 14, 2024
0.8.1	May 4, 2024
0.8.0	Apr 22, 2024
0.7.0	Apr 13, 2024
0.6.1	Apr 4, 2024
0.6.0	Mar 29, 2024
0.5.0	Mar 20, 2024
0.4.0	Mar 20, 2024
0.3.1	Mar 16, 2024
0.3.0	Mar 13, 2024
0.2.0	Mar 7, 2024
0.1.0	Mar 5, 2024
0.0.5	Feb 24, 2024
0.0.4	Feb 24, 2024
0.0.3	Feb 23, 2024
0.0.2	Feb 23, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arize_phoenix_evals-0.22.0.tar.gz (50.1 kB)

Uploaded Jul 2, 2025 Source

Built Distribution

arize_phoenix_evals-0.22.0-py3-none-any.whl (64.3 kB)

Uploaded Jul 2, 2025 Python 3

File details

Details for the file arize_phoenix_evals-0.22.0.tar.gz.

File metadata

Download URL: arize_phoenix_evals-0.22.0.tar.gz
Upload date: Jul 2, 2025
Size: 50.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for arize_phoenix_evals-0.22.0.tar.gz
Algorithm	Hash digest
SHA256	`bc84fafc53ffc01f7832c3313ee22418d397543d0b20520b29f175d45968ef23`
MD5	`d02f2adc8af3ac845fb3724bac71a5f4`
BLAKE2b-256	`e0bb2bac66f6471f87a765cacda532e5200a94adfc3dbce7ec8c0a6aafdf1f18`

See more details on using hashes here.

File details

Details for the file arize_phoenix_evals-0.22.0-py3-none-any.whl.

File metadata

Download URL: arize_phoenix_evals-0.22.0-py3-none-any.whl
Upload date: Jul 2, 2025
Size: 64.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for arize_phoenix_evals-0.22.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`30bde46b3c79b15cc8bce42dc2fb213242a30f820f9d7e4717c3192ebb3158a8`
MD5	`86884bc742b7a075770219c39636617b`
BLAKE2b-256	`a2ca838c91b093bb4f56458eb79c3482d104a5554575bef8a70b97aa11cf0c36`

See more details on using hashes here.
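One way to use the digests listed above is to recompute the SHA256 of each downloaded file and compare it with the published value. The sketch below does this with the standard library's hashlib. It assumes the files were saved under their original names in the current working directory; the helper name sha256_of_file is illustrative.

```python
import hashlib
import os

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "arize_phoenix_evals-0.22.0.tar.gz": (
        "bc84fafc53ffc01f7832c3313ee22418d397543d0b20520b29f175d45968ef23"
    ),
    "arize_phoenix_evals-0.22.0-py3-none-any.whl": (
        "30bde46b3c79b15cc8bce42dc2fb213242a30f820f9d7e4717c3192ebb3158a8"
    ),
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumption: the downloads sit in the current working directory.
for filename, expected in EXPECTED_SHA256.items():
    if not os.path.exists(filename):
        print(f"{filename}: not found in the current directory, skipping")
        continue
    status = "OK" if sha256_of_file(filename) == expected else "MISMATCH"
    print(f"{filename}: {status}")
```

A "MISMATCH" result suggests the download is incomplete or does not correspond to the 0.22.0 release files described here.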
