LLM Evaluations

arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

  • Includes pre-tested templates and convenience functions for a set of common Eval “tasks”
  • Applies data science rigor to the testing of model and template combinations
  • Designed to run as fast as possible on batches of data
  • Includes benchmark datasets and tests for each eval function

Installation

Install the arize-phoenix-evals sub-package via pip:

pip install arize-phoenix-evals

Note that you will also need to install the SDK of the LLM vendor you would like to use with LLM Evals. For example, to use OpenAI's GPT-4, you will need to install the OpenAI Python SDK:

pip install 'openai>=1.0.0'
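
To confirm that both installs succeeded before running an eval, a minimal sketch along the following lines may help. It relies only on the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of Phoenix.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; not part of arize-phoenix-evals.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the pip commands above.
for name in ("arize-phoenix-evals", "openai"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports "not installed", rerun the corresponding pip command in the same environment that runs your Python interpreter.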

Usage

Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:

import os

from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Download the benchmark golden dataset
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
# Sample 100 rows and rename the columns to match the template variables
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)
model = OpenAIModel(
    model="gpt-4",
    temperature=0.0,
)

# The rails restrict the eval output to the labels the template allows
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
# llm_classify returns a DataFrame; the predicted class is in its "label" column
df["eval_relevance"] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)["label"]

# The golden dataset stores relevance as True/False. Map True -> "relevant" and
# False -> "irrelevant" so the ground truth has the same format as the eval
# output and the two can be compared with scikit-learn.
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]

# Compute per-class precision, recall, F1 score, and support
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
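
The example above needs an API key and network access. To see the same flow without either, the following dependency-free sketch walks toy data through the two steps: snapping raw model text onto a rail, and computing per-class metrics against golden labels. It is an illustration under stated assumptions, not Phoenix's implementation. The label set mirrors the mapping used above. The NOT_PARSABLE sentinel, the helper names, and the sample outputs are all hypothetical.

```python
RAILS = ["relevant", "irrelevant"]  # assumed label set, mirroring the mapping above
NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for outputs that match no rail


def snap_to_rail(raw_output: str, rails: list) -> str:
    """Map raw model text onto one allowed label, or the sentinel if none matches."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    for rail in rails:
        # Exact match, so "irrelevant" is never mistaken for "relevant".
        if cleaned == rail.lower():
            return rail
    return NOT_PARSABLE


def per_class_metrics(y_true: list, y_pred: list, labels: list) -> dict:
    """Compute precision, recall, F1, and support for each label."""
    results = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,
        }
    return results


# Toy golden labels (True/False) and made-up raw model outputs.
golden = [True, True, False, False, True]
raw_outputs = ["relevant", " Relevant. ", "irrelevant", "relevant", "I am not sure"]

y_true = ["relevant" if g else "irrelevant" for g in golden]
y_pred = [snap_to_rail(raw, RAILS) for raw in raw_outputs]

metrics = per_class_metrics(y_true, y_pred, RAILS)
for label, scores in metrics.items():
    print(label, scores)
```

Passing the rails explicitly as the label list keeps off-rail outputs from appearing as a class of their own in the report, while still counting them as misses for the true class.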

To learn more about LLM Evals, see the LLM Evals documentation.
