Library with LangChain instrumentation to evaluate LLM-based applications.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Welcome to TruLens-Eval!

TruLens

Evaluate and track your LLM experiments with TruLens. As you work on your models and prompts, TruLens-Eval supports the iterative development of a wide range of LLM applications by wrapping your application to log key metadata across the entire chain (or off-chain if your project does not use chains) on your local machine.

Using feedback functions, you can objectively evaluate the quality of the responses provided by an LLM to your requests. Evaluation runs in a separate call that follows your application's call, so it adds minimal latency to the application itself, and evaluations are logged to your local machine. Finally, we provide an easy-to-use Streamlit dashboard that runs locally on your machine so you can better understand your LLM’s performance.

[Figure: Architecture Diagram]

Quick Usage

To quickly play around with the TruLens Eval library, download this notebook: quickstart.ipynb.

Installation and Setup

Install the trulens-eval pip package from PyPI.

    pip install trulens-eval

API Keys

Our example chat app and feedback functions call external APIs such as OpenAI or HuggingFace. You can add keys by setting the environment variables.

In Python

import os
os.environ["OPENAI_API_KEY"] = "..."

In Terminal

export OPENAI_API_KEY="..."

Quickstart

In this quickstart you will create a simple LLM Chain and learn how to log it and get feedback on an LLM response.

Setup

Add API keys

For this quickstart you will need OpenAI and HuggingFace keys.

import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."
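Before running the rest of the quickstart, it can help to confirm that both keys are visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `missing_keys` is hypothetical and not part of TruLens-Eval.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# The two variable names below are the ones set in this quickstart.
missing = missing_keys(["OPENAI_API_KEY", "HUGGINGFACE_API_KEY"])
if missing:
    print("Missing API keys:", ", ".join(missing))
```

Passing an explicit `env` dictionary makes the helper easy to check without touching the real environment.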

Import from LangChain and TruLens

from IPython.display import JSON

# Imports main tools:
from trulens_eval import TruChain, Feedback, Huggingface, Tru
tru = Tru()

# imports from langchain to build app
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts.chat import ChatPromptTemplate, PromptTemplate
from langchain.prompts.chat import HumanMessagePromptTemplate

Create Simple LLM Application

This example uses the LangChain framework and an OpenAI LLM.

full_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=
        "Provide a helpful response with relevant background information for the following: {prompt}",
        input_variables=["prompt"],
    )
)

chat_prompt_template = ChatPromptTemplate.from_messages([full_prompt])

llm = OpenAI(temperature=0.9, max_tokens=128)

chain = LLMChain(llm=llm, prompt=chat_prompt_template, verbose=True)

Send your first request

prompt_input = '¿que hora es?'

llm_response = chain(prompt_input)

display(llm_response)

Initialize Feedback Function(s)

# Initialize Huggingface-based feedback function collection class:
hugs = Huggingface()

# Define a language match feedback function using HuggingFace.
f_lang_match = Feedback(hugs.language_match).on(
    text1="prompt", text2="response"
)

Instrument chain for logging with TruLens

truchain = TruChain(chain,
    chain_id='Chain3_ChatApplication',
    feedbacks=[f_lang_match],
    tru = tru)

# Instrumented chain can operate like the original:
llm_response = truchain(prompt_input)

display(llm_response)

Explore in a Dashboard

tru.run_dashboard() # open a local streamlit app to explore

# tru.run_dashboard(_dev=True) # if running from repo
# tru.stop_dashboard() # stop if needed

Chain Leaderboard

Understand how your LLM application is performing at a glance. Once you've set up logging and evaluation in your application, you can view key performance statistics, including cost and average feedback value, across all of your LLM apps using the chain leaderboard. As you iterate on new versions of your LLM application, you can compare their performance across all of the quality metrics you've set up.

Note: Average feedback values are returned and displayed in a range from 0 (worst) to 1 (best).
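To illustrate how a leaderboard value in that range can arise, here is a minimal sketch that averages per-record feedback scores for each chain. The record layout (dictionaries with `chain_id` and `score` keys) and the function name are assumptions made for illustration; they are not the TruLens-Eval storage format or API.

```python
from collections import defaultdict


def average_feedback(records):
    """Average the 0-1 feedback scores of the records, grouped by chain_id."""
    totals = defaultdict(lambda: [0.0, 0])  # chain_id -> [sum of scores, count]
    for rec in records:
        entry = totals[rec["chain_id"]]
        entry[0] += rec["score"]
        entry[1] += 1
    return {chain_id: total / count for chain_id, (total, count) in totals.items()}


# Hypothetical records for two chains.
records = [
    {"chain_id": "Chain1", "score": 1.0},
    {"chain_id": "Chain1", "score": 0.0},
    {"chain_id": "Chain2", "score": 0.5},
]
print(average_feedback(records))
```

Because every individual score lies between 0 and 1, each average does as well.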

[Figure: Chain Leaderboard]

To dive deeper on a particular chain, click "Select Chain".

Understand chain performance with Evaluations

To learn more about the performance of a particular chain or LLM, select it to view its evaluations at the record level. LLM quality is assessed through feedback functions. Feedback functions are extensible methods for determining the quality of LLM responses and can be applied to any downstream LLM task. Out of the box, we provide a number of feedback functions for assessing model agreement, sentiment, relevance, and more.

The evaluations tab provides record-level metadata and feedback on the quality of your LLM application.

[Figure: Evaluations]

Deep dive into full chain metadata

Click on a record to dive deep into all of the details of your chain stack and underlying LLM, captured by tru_chain.

[Figure: Explore a Chain]

If you prefer the raw format, you can quickly get it using the "Display full chain json" or "Display full record json" buttons at the bottom of the page.

Note: Feedback functions evaluated in the deferred manner can be seen in the "Progress" page of the TruLens dashboard.

Or view results directly in your notebook

tru.get_records_and_feedback(chain_ids=[])[0] # pass an empty list of chain_ids to get all

Logging

Automatic Logging

The simplest method for logging with TruLens is by wrapping with TruChain and including the tru argument, as shown in the quickstart.

This is done like so:

truchain = TruChain(
    chain,
    chain_id='Chain1_ChatApplication',
    tru=tru
)
truchain("This will be automatically logged.")

Feedback functions can also be logged automatically by providing them in a list to the feedbacks arg.

truchain = TruChain(
    chain,
    chain_id='Chain1_ChatApplication',
    feedbacks=[f_lang_match], # feedback functions
    tru=tru
)
truchain("This will be automatically logged.")

Manual Logging

Wrap with TruChain to instrument your chain

tc = TruChain(chain, chain_id='Chain1_ChatApplication')

Set up logging and instrumentation

Making the first call to your wrapped LLM Application will now also produce a log or "record" of the chain execution.

prompt_input = 'que hora es?'
gpt3_response, record = tc(prompt_input)

We can log the records but first we need to log the chain itself.

tru.add_chain(chain_json=tc.json)

Then we can log the record:

tru.add_record(
    prompt=prompt_input, # prompt input
    response=gpt3_response['text'], # LLM response
    record_json=record # record is returned by the TruChain wrapper
)

Evaluate Quality

Following the request to your app, you can then evaluate LLM quality using feedback functions. Evaluation runs in a separate call after the application call, which keeps the added latency for your application minimal, and evaluations are also logged to your local machine.

To get feedback on the quality of your LLM, you can use any of the provided feedback functions or add your own.

To assess your LLM quality, pass the feedback functions as a list to the feedback_functions argument of tru.run_feedback_functions().

feedback_results = tru.run_feedback_functions(
    record_json=record,
    feedback_functions=[f_lang_match]
)
display(feedback_results)

After capturing feedback, you can then log it to your local database.

tru.add_feedback(feedback_results)

Out-of-band Feedback evaluation

In the above example, the feedback function evaluation is done in the same process as the chain evaluation. The alternative approach is to use the provided persistent evaluator, started via tru.start_evaluator. Then set the feedback_mode for TruChain to "deferred" to let the evaluator handle the feedback functions.

For demonstration purposes, we start the evaluator here but it can be started in another process.

truchain: TruChain = TruChain(
    chain,
    chain_id='Chain1_ChatApplication',
    feedbacks=[f_lang_match],
    tru=tru,
    feedback_mode="deferred"
)

tru.start_evaluator()
truchain("This will be logged by deferred evaluator.")
tru.stop_evaluator()
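The deferred mode separates running the chain from scoring its output. As a rough, library-independent sketch of that general pattern (not how TruLens-Eval implements its evaluator), the following uses a queue and a worker thread. The name `run_deferred` and the `(record_id, prompt, response)` tuples are hypothetical and chosen only for illustration.

```python
import queue
import threading


def run_deferred(records, feedback_fn):
    """Score each (record_id, prompt, response) tuple in a background thread."""
    pending = queue.Queue()
    results = {}

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel value: stop the evaluator
                break
            record_id, prompt, response = item
            results[record_id] = feedback_fn(prompt, response)

    evaluator = threading.Thread(target=worker)
    evaluator.start()      # comparable to starting the evaluator
    for rec in records:    # the application only enqueues records
        pending.put(rec)
    pending.put(None)      # comparable to stopping the evaluator
    evaluator.join()       # wait until all queued records are scored
    return results


# Hypothetical feedback function: 1.0 if prompt and response are identical.
scores = run_deferred(
    [(1, "hola", "hola"), (2, "hola", "hello")],
    lambda prompt, response: float(prompt == response),
)
print(scores)
```

The application thread never waits on an individual score; it only blocks at the end when the evaluator is shut down.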

Out-of-the-box Feedback Functions

See: https://www.trulens.org/trulens_eval/api/tru_feedback/

Relevance

This evaluates the relevance of the LLM response to the given text by LLM prompting.

Relevance is currently only available with OpenAI ChatCompletion API.

Sentiment

This evaluates the positive sentiment of either the prompt or response.

Sentiment is currently available to use with OpenAI, HuggingFace or Cohere as the model provider.

- The OpenAI sentiment feedback function prompts a Chat Completion model to rate the sentiment from 1 to 10, and then scales the response down to 0-1.
- The HuggingFace sentiment feedback function returns a raw score from 0 to 1.
- The Cohere sentiment feedback function uses the classification endpoint and a small set of examples stored in feedback_prompts.py to return either a 0 or a 1.
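The rescaling step in the OpenAI variant can be sketched as follows. This assumes the model's reply contains an integer from 1 to 10 and that "scaling down" means dividing by 10; the function name and the parsing logic are hypothetical and are not taken from the library.

```python
import re


def scale_rating(reply: str) -> float:
    """Extract the first standalone integer from 1 to 10 and scale it to 0-1."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    if match is None:
        raise ValueError(f"no rating between 1 and 10 found in {reply!r}")
    return int(match.group(1)) / 10


print(scale_rating("Rating: 7"))
```

Under this assumption the lowest possible rating of 1 maps to 0.1, so the scaled score never reaches exactly 0.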

Model Agreement

Model agreement uses OpenAI to attempt an honest answer at your prompt with system prompts for correctness, and then evaluates the agreement of your LLM response to this model on a scale from 1 to 10. The agreement with each honest bot is then averaged and scaled from 0 to 1.
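The final averaging and scaling step can be sketched as below. The function name is hypothetical, and dividing the mean rating by 10 is an assumption about how the scaling is done rather than a description of the library's code.

```python
from statistics import mean


def agreement_score(ratings):
    """Average 1-10 agreement ratings and scale the result to 0-1."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 1 and 10")
    return mean(ratings) / 10


# Hypothetical ratings from two comparison answers.
print(agreement_score([8, 6]))
```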

Language Match

This evaluates if the language of the prompt and response match.

Language match is currently only available to use with HuggingFace as the model provider. This feedback function returns a score in the range from 0 to 1, where 1 indicates match and 0 indicates mismatch.

Toxicity

This evaluates the toxicity of the prompt or response.

Toxicity is currently only available to be used with HuggingFace, and uses a classification endpoint to return a score from 0 to 1. The feedback function is negated as not_toxicity, and returns a 1 if not toxic and a 0 if toxic.

Moderation

The OpenAI Moderation API is made available for use as feedback functions. The categories are hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic. Each is negated (e.g., not_hate) so that a 0 indicates that the moderation rule is violated. These feedback functions return a score in the range 0 to 1.
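The negation convention can be illustrated with a small sketch. It assumes that each raw category score lies between 0 and 1, with higher meaning a stronger violation, and that negation is computed as one minus the raw score; the function name and the input dictionary are hypothetical.

```python
def negate_scores(category_scores):
    """Map each raw score s in [0, 1] to a `not_<category>` score of 1 - s."""
    negated = {}
    for category, score in category_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {category!r} must be between 0 and 1")
        negated[f"not_{category}"] = 1.0 - score
    return negated


# Hypothetical raw scores, not real API output.
print(negate_scores({"hate": 0.25, "violence": 0.0}))
```

With this convention, all feedback values point the same way: higher is better.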

Adding new feedback functions

Feedback functions are an extensible framework for evaluating LLMs. You can add your own feedback functions to evaluate the qualities required by your application by updating trulens_eval/tru_feedback.py. If your contributions would be useful for others, we encourage you to contribute to TruLens!

Feedback functions are organized by model provider into Provider classes.

The process for adding new feedback functions is:

Create a new Provider class or locate an existing one that applies to your feedback function. If your feedback function does not rely on a model provider, you can create a standalone class:

class StandAlone(Provider):
    def __init__(self):
        pass

Add a new feedback function method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).

def feedback(self, text: str) -> float:
    """
    Describe how the model works.

    Parameters:
        text (str): Text to evaluate.
        Can also be prompt (str) and response (str).

    Returns:
        float: A value between 0 (worst) and 1 (best).
    """
    score = 0.0  # placeholder: replace with your own scoring logic
    return score
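As a concrete illustration of the two steps above, here is a minimal sketch of a standalone feedback function that rewards concise text. The `Provider` class below is a stand-in defined only so that the example runs on its own; in the library you would subclass the real Provider in trulens_eval/tru_feedback.py. The `conciseness` method, its `max_words` setting, and the scoring rule are all hypothetical.

```python
class Provider:
    """Stand-in for the library's Provider base class (assumption for this sketch)."""


class StandAlone(Provider):
    def __init__(self, max_words: int = 50):
        # Texts with at most this many words receive the best score.
        self.max_words = max_words

    def conciseness(self, text: str) -> float:
        """
        Score 1.0 for texts of up to max_words words, decreasing linearly
        to 0.0 at twice that length.

        Parameters:
            text (str): Text to evaluate.

        Returns:
            float: A value between 0 (worst) and 1 (best).
        """
        n_words = len(text.split())
        if n_words <= self.max_words:
            return 1.0
        overshoot = (n_words - self.max_words) / self.max_words
        return max(0.0, 1.0 - overshoot)


provider = StandAlone()
print(provider.conciseness("A short answer."))
```

Keeping the tunable setting on the instance leaves the method signature as a single text argument, matching the shape described above.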

Release history

1.5.3

Jul 1, 2025

1.5.2

Jun 18, 2025

1.5.1

Jun 5, 2025

1.5.0

Jun 2, 2025

1.4.9

Apr 10, 2025

1.4.8

Apr 3, 2025

1.4.7

Mar 20, 2025

1.4.6

Mar 13, 2025

1.4.5

Mar 6, 2025

1.4.4

Feb 27, 2025

1.4.3

Feb 26, 2025

1.4.2

Feb 20, 2025

1.4.1

Feb 20, 2025

1.4.1a1 pre-release

Feb 18, 2025

1.4.1a0 pre-release

Feb 18, 2025

1.4.0

Feb 13, 2025

1.4.0a0 pre-release

Feb 12, 2025

1.3.5

Feb 6, 2025

1.3.4

Feb 6, 2025

1.3.3

Jan 27, 2025

1.3.3a0 pre-release

Jan 24, 2025

1.3.2

Jan 16, 2025

1.3.1

Jan 16, 2025

1.3.0

Jan 10, 2025

1.2.11

Dec 16, 2024

1.2.10

Dec 5, 2024

1.2.9

Nov 27, 2024

1.2.8

Nov 19, 2024

1.2.7

Nov 14, 2024

1.2.6

Nov 6, 2024

1.2.5

Nov 6, 2024

1.2.4

Oct 31, 2024

1.2.3

Oct 31, 2024

1.2.2

Oct 30, 2024

1.2.1

Oct 29, 2024

1.2.0

Oct 28, 2024

1.1.0

Oct 10, 2024

1.1.0a4 pre-release

Oct 8, 2024

1.1.0a3 pre-release

Oct 8, 2024

1.1.0a2 pre-release

Oct 8, 2024

1.1.0a1 pre-release

Oct 3, 2024

1.1.0a0 pre-release

Sep 21, 2024

1.0.11

Oct 9, 2024

1.0.10

Oct 7, 2024

1.0.9

Oct 7, 2024

1.0.8

Oct 4, 2024

1.0.7

Oct 2, 2024

1.0.6

Sep 26, 2024

1.0.5

Sep 26, 2024

1.0.4

Sep 25, 2024

1.0.3

Sep 21, 2024

1.0.2

Sep 18, 2024

1.0.1

Aug 30, 2024

1.0.1a6 pre-release

Aug 30, 2024

1.0.1a5 pre-release

Aug 29, 2024

1.0.1a4 pre-release

Aug 28, 2024

1.0.1a1 pre-release

Aug 28, 2024

0.33.1

Aug 28, 2024

0.33.0

Jul 16, 2024

0.32.1

Aug 28, 2024

0.32.0

Jun 24, 2024

0.31.1

Aug 28, 2024

0.31.0

Jun 10, 2024

0.30.1

May 25, 2024

0.30.0

May 25, 2024

0.29.0

May 16, 2024

0.28.2

Apr 24, 2024

0.28.1

Apr 22, 2024

0.28.0

Apr 17, 2024

0.27.2

Apr 4, 2024

0.27.1

Apr 4, 2024

0.27.0

Mar 23, 2024

0.26.0

Mar 15, 2024

0.25.1

Mar 8, 2024

0.25.0

Mar 7, 2024

0.24.1

Feb 23, 2024

0.24.0

Feb 23, 2024

0.23.0

Feb 16, 2024

0.22.2

Feb 13, 2024

0.22.1

Feb 9, 2024

0.22.0

Feb 3, 2024

0.21.0

Jan 26, 2024

0.20.3

Jan 10, 2024

0.20.2

Jan 9, 2024

0.20.1

Jan 5, 2024

0.20.0

Dec 23, 2023

0.19.2

Dec 18, 2023

0.19.1

Dec 15, 2023

0.19.0

Dec 15, 2023

0.18.3

Dec 7, 2023

0.18.2

Dec 1, 2023

0.18.1

Nov 23, 2023

0.18.0

Nov 16, 2023

0.17.0

Nov 2, 2023

0.17.0b0 pre-release

Nov 2, 2023

0.17.0a0 pre-release

Nov 2, 2023

0.16.0

Oct 20, 2023

0.15.3

Oct 11, 2023

0.15.1 yanked

Oct 6, 2023

Reason this release was yanked:

Unstable release

0.15.0 yanked

Oct 6, 2023

Reason this release was yanked:

Unstable release

0.14.0

Sep 28, 2023

0.14.0b0 pre-release

Sep 28, 2023

0.14.0a0 pre-release

Sep 28, 2023

0.13.0

Sep 22, 2023

0.13.0a0 pre-release

Sep 22, 2023

0.12.0

Sep 7, 2023

0.12.0a0 pre-release

Sep 7, 2023

0.11.0

Aug 31, 2023

0.11.0b0 pre-release

Aug 31, 2023

0.11.0a0 pre-release

Aug 31, 2023

0.10.0

Aug 18, 2023

0.9.0

Aug 10, 2023

0.9.0a0 pre-release

Aug 10, 2023

0.8.0

Aug 3, 2023

0.8.0a0 pre-release

Aug 3, 2023

0.7.0

Jul 27, 2023

0.7.0a0 pre-release

Jul 27, 2023

0.6.0

Jul 21, 2023

0.6.0a0 pre-release

Jul 21, 2023

0.5.0

Jul 12, 2023

0.5.0a0 pre-release

Jul 12, 2023

0.4.1b0 pre-release

Jun 30, 2023

0.4.1a0 pre-release

Jun 30, 2023

0.4.0

Jun 29, 2023

0.4.0a0 pre-release

Jun 29, 2023

0.3.0

Jun 23, 2023

0.3.0rc0 pre-release

Jun 23, 2023

0.3.0b0 pre-release

Jun 22, 2023

0.3.0a0 pre-release

Jun 22, 2023

0.2.2

Jun 15, 2023

0.2.2b0 pre-release

Jun 15, 2023

0.2.2a0 pre-release

Jun 15, 2023

0.2.1

Jun 14, 2023

0.2.1a0 pre-release

Jun 14, 2023

0.2.0

Jun 14, 2023

0.2.0a0 pre-release

Jun 14, 2023

This version

0.1.2

Jun 7, 2023

0.1.2a0 pre-release

Jun 2, 2023

0.1.1

May 24, 2023

0.1.1a0 pre-release

May 24, 2023

0.0.1

May 23, 2023

0.0.1a0 pre-release

May 24, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

trulens_eval-0.1.2-py3-none-any.whl (78.7 kB view details)

Uploaded Jun 7, 2023 Python 3

File details

Details for the file trulens_eval-0.1.2-py3-none-any.whl.

File metadata

Download URL: trulens_eval-0.1.2-py3-none-any.whl
Upload date: Jun 7, 2023
Size: 78.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for trulens_eval-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e11589366e17d2ad46eb9a8191f82b239e0fd2deca47ba14888651f1f50e398`
MD5	`0905b141898ced2e194b14a2b78e79cf`
BLAKE2b-256	`0ad74c480d047a313d71519776130bd4b7a0f7eddef6ef8b059600b76e3c53df`

See more details on using hashes here.

trulens-eval 0.1.2
