
UpTrain - tool to evaluate LLM applications on aspects like factual accuracy, response quality, retrieval quality, tonality, etc.


Try out Evaluations - Read Docs - Slack Community - Feature Request


Demo of UpTrain's LLM evaluations with scores for hallucinations, retrieved-context quality, response tonality for a customer support chatbot

UpTrain is an open-source tool to evaluate LLM applications. It provides pre-built metrics to check LLM responses on aspects such as correctness, hallucination, and toxicity, as well as an easy-to-use framework to configure custom checks.

Pre-built Evaluations We Offer 📝

| Evaluation | Description |
| --- | --- |
| Factual Accuracy | Checks if the response is grounded in the context provided. |
| Response Completeness | Grades whether the response fully answers the given question. |
| Response Completeness wrt Context | Grades how complete the response is for the question specified, with respect to the information present in the context. |
| Context Relevance | Evaluates if the context has all the information needed to answer the given question. |
| Response Relevance | Grades how relevant the generated response is, or whether it has any additional irrelevant information for the question asked. |
| Tone Critique | Assesses if the tone of machine-generated responses matches the desired persona. |
| Language Critique | Scores machine-generated responses in a conversation on multiple aspects: fluency, politeness, grammar, and coherence. |
| Response Conciseness | Grades how concise the generated response is, or whether it has any additional irrelevant information for the question asked. |
| Response Consistency | Grades how consistent the response is with the question asked as well as with the context provided. |
| Guideline Adherence | Grades how well the LLM adheres to a provided guideline when giving a response. |
| Conversation Satisfaction | Measures the user's satisfaction with the conversation with the LLM/AI assistant, based on completeness and the user's acceptance. |
| Response Matching | Operator to compare the LLM-generated text with the gold (ideal) response using the defined score metric. |

Get started 🙌

Install the package through pip:

pip install uptrain

How to use UpTrain:

There are two ways to use UpTrain:

  1. Open-source framework: You can evaluate your responses via the open-source version by providing your OpenAI API key. The evaluations are run through a pipeline of GPT-3.5 calls. Note that the evaluation pipeline runs on UpTrain's server, but none of the data is logged.

  2. UpTrain API: You can use UpTrain's managed service to log and evaluate your LLM responses. Just provide your UpTrain API key (no need for OpenAI keys) and UpTrain manages running evaluations for you with real-time dashboards and deep insights.
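Both options need an API key. The snippets in this README paste the key into the script for brevity; a minimal sketch of reading it from the environment instead is shown here. `load_api_key` is a hypothetical helper, not part of UpTrain, and the environment variable names in the comments are assumptions.

```python
import os


def load_api_key(env_var):
    """Return the API key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running evaluations."
        )
    return key


# Example usage (variable names are assumptions; export them in your shell first):
# OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
# UPTRAIN_API_KEY = load_api_key("UPTRAIN_API_KEY")
```

This keeps keys out of source control; the rest of each snippet stays unchanged.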

Open-source framework:

Follow the code snippet below to get started with UpTrain.

from uptrain import EvalLLM, Evals, CritiqueTone
import json

OPENAI_API_KEY = "sk-***************"

data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people. Cricket is particularly popular in countries like India, Pakistan, Australia, and England. The ICC Cricket World Cup and Indian Premier League (IPL) have substantial viewership. The NBA has made basketball popular worldwide, especially in countries like the USA, Canada, China, and the Philippines. Major tennis tournaments like Wimbledon, the US Open, French Open, and Australian Open have large global audiences. Players like Roger Federer, Serena Williams, and Rafael Nadal have boosted the sport's popularity. Field Hockey is very popular in countries like India, Netherlands, and Australia. It has a considerable following in many parts of the world.",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE, CritiqueTone(persona="teacher")]
)

print(json.dumps(results, indent=3))
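The returned `results` can be post-processed with ordinary Python. The sketch below assumes each result row is a dict whose numeric scores sit under keys prefixed with `score_` (for example `score_factual_accuracy`); that key naming and the sample values are assumptions for illustration, so check them against the output you actually get. `mean_scores` is a hypothetical helper, not part of UpTrain.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(rows, prefix="score_"):
    """Average every numeric field whose key starts with `prefix`.

    Missing or non-numeric values (e.g. None for a failed check) are skipped.
    """
    collected = defaultdict(list)
    for row in rows:
        for key, value in row.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key.startswith(prefix) and is_number:
                collected[key].append(value)
    return {key: mean(values) for key, values in collected.items()}


# Made-up rows standing in for `results`; real values will differ.
sample_results = [
    {"question": "q1", "score_context_relevance": 1.0, "score_factual_accuracy": 0.5},
    {"question": "q2", "score_context_relevance": 0.5, "score_factual_accuracy": None},
]

print(mean_scores(sample_results))
```

Averaging per check gives one number per metric for a whole dataset, which is easier to track over time than per-row output.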

If you have any questions, please join our Slack community

UpTrain API:

  1. Get your free UpTrain API Key here.

  2. Follow the code snippets below to get started with UpTrain.

from uptrain import APIClient, Evals, CritiqueTone
import json

UPTRAIN_API_KEY = "up-***************" 

data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people. Cricket is particularly popular in countries like India, Pakistan, Australia, and England. The ICC Cricket World Cup and Indian Premier League (IPL) have substantial viewership. The NBA has made basketball popular worldwide, especially in countries like the USA, Canada, China, and the Philippines. Major tennis tournaments like Wimbledon, the US Open, French Open, and Australian Open have large global audiences. Players like Roger Federer, Serena Williams, and Rafael Nadal have boosted the sport's popularity. Field Hockey is very popular in countries like India, Netherlands, and Australia. It has a considerable following in many parts of the world.",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]

client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)

results = client.log_and_evaluate(
    project_name="Sample-Project",
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE, CritiqueTone(persona="teacher")]
)

print(json.dumps(results, indent=3))

To have a customized onboarding, please book a demo call here.

Performing experiments with UpTrain:

Experiments help you perform A/B testing with prompts, so you can compare and choose the options most suitable for you.

from uptrain import APIClient, Evals, CritiqueTone
import json

UPTRAIN_API_KEY = "up-***************" 

data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people. Cricket is particularly popular in countries like India, Pakistan, Australia, and England. The ICC Cricket World Cup and Indian Premier League (IPL) have substantial viewership. The NBA has made basketball popular worldwide, especially in countries like the USA, Canada, China, and the Philippines. Major tennis tournaments like Wimbledon, the US Open, French Open, and Australian Open have large global audiences. Players like Roger Federer, Serena Williams, and Rafael Nadal have boosted the sport's popularity. Field Hockey is very popular in countries like India, Netherlands, and Australia. It has a considerable following in many parts of the world.",
    'response': "1. The most popular global sport is determined by factors such as TV viewership, social media presence, number of participants, and economic impact. 2. Football is considered the most popular sport in the world, with events like the FIFA World Cup and star players like Ronaldo and Messi attracting over 4 billion followers. 3. Cricket is particularly popular in countries like India, Pakistan, Australia, and England, with events like the ICC Cricket World Cup and the Indian Premier League (IPL) having substantial viewership. 4. Basketball has gained global popularity, especially in the USA, Canada, China, and the Philippines, largely thanks to the NBA. 5. Tennis also has a significant global audience, with major tournaments like Wimbledon, the US Open, French Open, and Australian Open, and notable players like Roger Federer, Serena Williams, and Rafael Nadal. 6. Field Hockey enjoys popularity in countries like India, Netherlands, and Australia and has followers in many other parts of the world. So, the most popular global sport is football, followed by cricket, basketball, tennis, and field hockey, depending on the region and various factors.",
    'prompt_variation': 'chain-of-thought'
},
{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people. Cricket is particularly popular in countries like India, Pakistan, Australia, and England. The ICC Cricket World Cup and Indian Premier League (IPL) have substantial viewership. The NBA has made basketball popular worldwide, especially in countries like the USA, Canada, China, and the Philippines. Major tennis tournaments like Wimbledon, the US Open, French Open, and Australian Open have large global audiences. Players like Roger Federer, Serena Williams, and Rafael Nadal have boosted the sport's popularity. Field Hockey is very popular in countries like India, Netherlands, and Australia. It has a considerable following in many parts of the world.",
    'response': "- Which is the most popular global sport?  - Popularity of sports can be measured in different ways:    - TV viewership    - Social media presence    - Number of participants    - Economic impact  - Football:    - FIFA World Cup    - Ronaldo and Messi draw over 4 billion followers  - Cricket:    - Popular in India, Pakistan, Australia, and England    - ICC Cricket World Cup    - Indian Premier League (IPL)  - Basketball:    - NBA    - Popularity in the USA, Canada, China, Philippines  - Tennis:    - Major tournaments: Wimbledon, US Open, French Open, Australian Open    - Players: Roger Federer, Serena Williams, Rafael Nadal  - Field Hockey:    - Popular in India, Netherlands, Australia    - Followers in many parts of the world  In summary, football is the most popular global sport, followed by cricket, basketball, tennis, and field hockey, with variations in popularity depending on region and measurement criteria.",
    'prompt_variation': 'tree-of-thought'
}]

client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)

results = client.evaluate_experiments(
    project_name="Sample-Experiment",
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE, CritiqueTone()],
    exp_columns=['prompt_variation']
)

print(json.dumps(results, indent=3))
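To make the comparison concrete, here is a minimal sketch of what an A/B readout over `prompt_variation` could look like. It does not reproduce what `evaluate_experiments` returns; it assumes you have one scored row per input that still carries its `prompt_variation` value and a numeric score under a key such as `score_response_relevance`. The helper name, the key name, and the sample scores are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def compare_variations(rows, exp_column, score_key):
    """Return the mean of `score_key` for each distinct value of `exp_column`."""
    grouped = defaultdict(list)
    for row in rows:
        score = row.get(score_key)
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            grouped[row[exp_column]].append(score)
    return {variation: mean(scores) for variation, scores in grouped.items()}


# Made-up scored rows; real scores come from running the checks.
scored_rows = [
    {"prompt_variation": "chain-of-thought", "score_response_relevance": 0.5},
    {"prompt_variation": "chain-of-thought", "score_response_relevance": 1.0},
    {"prompt_variation": "tree-of-thought", "score_response_relevance": 0.25},
]

averages = compare_variations(scored_rows, "prompt_variation", "score_response_relevance")
best = max(averages, key=averages.get)  # variation with the highest mean score
print(averages, best)
```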

Key Features 💡

Dimensions of LLM Evaluations 💡

Different dimensions, metrics or criteria for LLM evaluations

We recently wrote about different criteria to evaluate LLM applications and explored grouping them into categories. Read more about it.

Integrations

| Eval Frameworks | LLM Providers | LLM Packages | Serving frameworks |
| --- | --- | --- | --- |
| OpenAI Evals ✅ | GPT-3.5-turbo ✅ | Langchain 🔜 | HuggingFace 🔜 |
| EleutherAI LM Eval 🔜 | GPT-4 ✅ | Llama Index 🔜 | Replicate 🔜 |
| BIG-Bench 🔜 | Claude ✅ | AutoGPT 🔜 | |
| | Cohere ✅ | | |

Why UpTrain 🤔?

Large language models are trained on billions of data points and perform well across a wide variety of tasks. One thing they are not good at, however, is being deterministic. Even with the most carefully crafted prompts, a model can misbehave on certain inputs: hallucinations, wrong output structure, toxic or biased responses, or irrelevant responses. The range of possible error modes is large.

To ensure your LLM applications work reliably and correctly, UpTrain makes it easy for developers to evaluate the responses of their applications on multiple criteria. UpTrain's evaluation framework can be used to:

  1. Improve performance by 20% - You can’t improve what you can’t measure. UpTrain continuously monitors your application's performance on multiple evaluation criteria and alerts you to any regressions, with automatic root cause analysis.

  2. Iterate 3x faster - UpTrain enables fast and robust experimentation across multiple prompts, model providers, and custom configurations, by calculating quantitative scores for direct comparison and optimal prompt selection.

  3. Mitigate LLM Hallucinations - Hallucinations have plagued LLMs since their inception. By quantifying the degree of hallucination and the quality of the retrieved context, UpTrain helps detect responses with low factual accuracy and stop them before they are served to end users.
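As an illustration of the third point, one way to act on a factual accuracy score is a simple threshold gate in front of the serving path. This is a minimal sketch, not UpTrain functionality: the `score_factual_accuracy` key, the 0.5 threshold, and the helper name are assumptions chosen for the example.

```python
def split_by_factual_accuracy(rows, threshold=0.5, score_key="score_factual_accuracy"):
    """Split rows into (serve, hold_back) based on a minimum score.

    Rows without a numeric score are held back, on the assumption that an
    unscored response should not be treated as verified.
    """
    serve, hold_back = [], []
    for row in rows:
        score = row.get(score_key)
        if isinstance(score, (int, float)) and score >= threshold:
            serve.append(row)
        else:
            hold_back.append(row)
    return serve, hold_back


# Made-up scored responses for illustration.
scored = [
    {"response": "A", "score_factual_accuracy": 1.0},
    {"response": "B", "score_factual_accuracy": 0.2},
    {"response": "C"},  # check did not produce a score
]

serve, hold_back = split_by_factual_accuracy(scored)
print([r["response"] for r in serve], [r["response"] for r in hold_back])
```

Held-back responses could be regenerated, routed to a fallback answer, or sent for review; the right threshold depends on the application.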

What does UpTrain have to offer? 🚀

To make it easy for you to evaluate your LLM applications, UpTrain offers:

  1. Diverse LLM Evaluations - UpTrain provides a diverse set of pre-built metrics, such as response relevance, context quality, factual accuracy, and language quality, to evaluate your LLM applications.

  2. Single-line Integration - With UpTrain's wide array of pre-built metrics, you can run LLM evaluations in less than two minutes.

  3. Customization - UpTrain is built with customization at its core, allowing you to configure custom grading prompts and operators with just a Python function.
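To give a feel for the kind of logic a custom check can hold, here is a minimal sketch of a scoring function written as plain Python. It is not UpTrain's operator interface, which is not shown in this README; the function name, the `score_`/`explanation_` output keys, and the word limit are all hypothetical.

```python
def score_word_limit(row, max_words=30):
    """Hypothetical custom check: 1.0 if the response fits within `max_words`, else 0.0."""
    word_count = len(row["response"].split())
    return {
        "score_word_limit": 1.0 if word_count <= max_words else 0.0,
        "explanation_word_limit": f"Response has {word_count} words (limit {max_words}).",
    }


row = {"response": "Football is the most popular sport with around 4 billion followers worldwide"}
print(score_word_limit(row))
```

Returning an explanation alongside the score mirrors the idea of a graded check: the number is comparable across runs, and the text says why a row scored as it did.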

We are constantly working to make UpTrain better. Want a new feature or need any integrations? Feel free to create an issue or contribute directly to the repository.

License 💻

This repo is published under the Apache 2.0 license, and we are committed to adding more functionality to the UpTrain open-source repo. By popular demand, we have also rolled out a no-code self-serve console. For customized onboarding, please book a demo call here.

Stay Updated ☎️

We are continuously adding tons of features and use cases. Please support us by giving the project a star ⭐!

Provide feedback (the harsher the better 😉)

We are building UpTrain in public. Help us improve by giving your feedback here.

Contributors 🖥️

We welcome contributions to UpTrain. Please see our contribution guide for details.

