UpTrain - tool to evaluate LLM applications on aspects like factual accuracy, response quality, retrieval quality, tonality, etc.
Try out Evaluations - Read Docs - Quickstart Tutorials - Slack Community - Feature Request
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root cause analysis on failure cases, and give insights on how to resolve them.
Key Features 🔑
All the evaluations and analysis run locally on your system, ensuring that the data never leaves your secure environment (except for LLM calls while using model grading checks)
Experiment with different embedding models like text-embedding-3-large/small, text-embedding-ada-002, baai/bge-large, etc. UpTrain supports HuggingFace models, Replicate endpoints, or custom models hosted on your endpoint.
By combining model grading with an additional 'Unclear' grade, we can use GPT-3.5-turbo-1106 as the default evaluator and get high-quality yet cost-effective scores.
You can perform root cause analysis on cases with either negative user feedback or low evaluation scores to understand which part of your LLM pipeline is giving suboptimal results. Check out the supported RCA templates.
You can use any of OpenAI, Anthropic, Mistral, Azure OpenAI endpoints, or open-source LLMs hosted on Anyscale as the evaluator.
UpTrain provides many ways to customize evaluations. You can customize the evaluation method (chain of thought vs. classify) and the few-shot examples, add a scenario description, and create custom evaluators.
Support for 40+ operators such as BLEU, ROUGE, Embeddings Similarity, Exact match, etc.
Coming Soon:
- Experiment Dashboards
- Collaborate with your team
- Embedding visualization via UMAP and Clustering
- Pattern recognition among failure cases
- Prompt improvement suggestions
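To give a feel for the kind of text-comparison operators listed above (Exact match, ROUGE, and so on), here is a minimal pure-Python sketch. It is not UpTrain's implementation: the function names `exact_match` and `unigram_f1` are hypothetical, and the second function is only a simplified ROUGE-1-style score over whitespace tokens, without stemming or stopword handling.

```python
from collections import Counter


def exact_match(response: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(response.strip().lower() == reference.strip().lower())


def unigram_f1(response: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercase whitespace tokens."""
    resp_tokens = response.lower().split()
    ref_tokens = reference.lower().split()
    if not resp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Football", " football "))
    print(unigram_f1("football is popular", "football is the most popular sport"))
```

Scores like these are cheap and deterministic, which makes them useful alongside model-graded checks when a ground-truth reference is available.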
Pre-built Evaluations We Offer 📝
Eval | Description |
---|---|
Response Completeness | Grades whether the response has answered all the aspects of the question specified. |
Response Conciseness | Grades how concise the generated response is or if it has any additional irrelevant information for the question asked. |
Response Relevance | Grades how relevant the generated response was to the question specified. |
Response Validity | Grades if the response generated is valid or not. A response is considered to be valid if it contains any information. |
Response Consistency | Grades how consistent the response is with the question asked as well as with the context provided. |
Eval | Description |
---|---|
Context Relevance | Grades how relevant the context was to the question specified. |
Context Utilization | Grades how complete the generated response was for the question specified given the information provided in the context. |
Factual Accuracy | Grades whether the response generated is factually correct and grounded by the provided context. |
Context Conciseness | Evaluates the concise context cited from an original context for irrelevant information. |
Context Reranking | Evaluates how efficient the reranked context is compared to the original context. |
Eval | Description |
---|---|
Language Features | Grades the quality of the language in the response, covering aspects such as fluency, coherence, and grammar. |
Tonality | Grades whether the generated response matches the required persona's tone. |
Eval | Description |
---|---|
Code Hallucination | Grades whether the code present in the generated response is grounded by the context. |
Eval | Description |
---|---|
User Satisfaction | Grades how well the user's concerns are addressed and assesses their satisfaction based on provided conversation. |
Eval | Description |
---|---|
Custom Guideline | Allows you to specify a guideline and grades how well the LLM adheres to the provided guideline when giving a response. |
Custom Prompts | Allows you to create your own set of evaluations. |
Eval | Description |
---|---|
Response Matching | Compares and grades how well the response generated by the LLM aligns with the provided ground truth. |
Eval | Description |
---|---|
Prompt Injection | Grades whether the generated response is leaking any system prompt. |
Jailbreak Detection | Grades whether the user's prompt is an attempt to jailbreak (i.e. generate illegal or harmful responses). |
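Many of the checks above are model-graded: the evaluator LLM picks a grade, and that grade is turned into a number. The sketch below illustrates the idea with the three-way grade (including 'Unclear') mentioned in the key features. The 1.0 / 0.5 / 0.0 mapping, the averaging step, and the names `GRADE_TO_SCORE`, `grade_to_score`, and `aggregate` are assumptions made for illustration, not UpTrain's confirmed internals.

```python
from statistics import mean

# Assumed mapping from a three-way grade to a numeric score.
GRADE_TO_SCORE = {"yes": 1.0, "unclear": 0.5, "no": 0.0}


def grade_to_score(grade: str) -> float:
    """Convert a textual grade to a score, rejecting unknown grades."""
    key = grade.strip().lower()
    if key not in GRADE_TO_SCORE:
        raise ValueError(f"unexpected grade: {grade!r}")
    return GRADE_TO_SCORE[key]


def aggregate(grades: list[str]) -> float:
    """Average the scores of several grades, e.g. one grade per extracted fact."""
    if not grades:
        raise ValueError("no grades to aggregate")
    return mean(grade_to_score(g) for g in grades)


if __name__ == "__main__":
    print(aggregate(["Yes", "Unclear", "No"]))
```

An intermediate grade gives the evaluator a way to express uncertainty instead of being forced into a wrong Yes or No, which is one plausible reason a smaller model can still produce usable scores.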
Get started 🙌
Locally hosted Dashboard
The UpTrain dashboard is a web-based interface that allows you to evaluate your LLM applications. It is a self-hosted dashboard that runs on your local machine. You don't need to write any code to use the dashboard. You can use the dashboard to evaluate your LLM applications, view the results, and perform root cause analysis.
Before you start, ensure you have Docker installed on your machine. If not, you can install it from here.
The following commands will download the UpTrain dashboard and start it on your local machine.
# Clone the repository
git clone https://github.com/uptrain-ai/uptrain
cd uptrain
# Run UpTrain
bash run_uptrain.sh
Using the UpTrain package
If you are a developer and want to integrate UpTrain evaluations into your application, you can use the UpTrain package. This allows for a more programmatic way to evaluate your LLM applications.
Install the package through pip:
pip install uptrain
How to use UpTrain:
To run evaluations with the open-source version, provide your OpenAI API key:
from uptrain import EvalLLM, Evals
import json

OPENAI_API_KEY = "sk-***************"

data = [{
    'question': 'Which is the most popular global sport?',
    'context': "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people. Cricket is particularly popular in countries like India, Pakistan, Australia, and England. The ICC Cricket World Cup and Indian Premier League (IPL) have substantial viewership. The NBA has made basketball popular worldwide, especially in countries like the USA, Canada, China, and the Philippines. Major tennis tournaments like Wimbledon, the US Open, French Open, and Australian Open have large global audiences. Players like Roger Federer, Serena Williams, and Rafael Nadal have boosted the sport's popularity. Field Hockey is very popular in countries like India, Netherlands, and Australia. It has a considerable following in many parts of the world.",
    'response': 'Football is the most popular sport with around 4 billion followers worldwide'
}]

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_COMPLETENESS]
)

print(json.dumps(results, indent=3))
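Once you have results, a common next step is to pick out the rows that scored poorly so you can inspect them. The sketch below assumes that `results` is a list of dicts and that each numeric score sits under a key prefixed with `score_` (for example `score_factual_accuracy`). That key naming, the helper `low_scoring_rows`, and the sample rows are assumptions for illustration, so check the actual keys in your own output before relying on it.

```python
def low_scoring_rows(results, threshold=0.5):
    """Return (row_index, check_name, score) for every score below threshold."""
    flagged = []
    for index, row in enumerate(results):
        for key, value in row.items():
            # Assumed naming: numeric scores live under keys prefixed "score_".
            # Missing or non-numeric scores (e.g. None) are skipped.
            if key.startswith("score_") and isinstance(value, (int, float)):
                if value < threshold:
                    flagged.append((index, key[len("score_"):], value))
    return flagged


# Hypothetical rows for illustration only, not real UpTrain output.
sample_results = [
    {"question": "q1", "score_context_relevance": 1.0, "score_factual_accuracy": 0.3},
    {"question": "q2", "score_context_relevance": 0.0, "score_factual_accuracy": None},
]

if __name__ == "__main__":
    for index, check, score in low_scoring_rows(sample_results):
        print(f"row {index}: {check} = {score}")
```

The flagged rows are natural candidates for the root cause analysis described earlier.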
If you have any questions, please join our Slack community.
Speak directly with the maintainers of UpTrain by booking a call here.
Integrations 🤝
| Eval Frameworks | LLM Providers | LLM Packages | Serving frameworks | LLM Observability | Vector DBs |
|---|---|---|---|---|---|
| OpenAI Evals ✅ | GPT-3.5-turbo ✅ | Langchain 🔜 | HuggingFace ✅ | Langfuse 🔜 | Qdrant ✅ |
| EleutherAI LM Eval 🔜 | GPT-4 ✅ | Llama Index ✅ | Replicate ✅ | Helicone 🔜 | Pinecone 🔜 |
| BIG-Bench 🔜 | Claude ✅ | AutoGPT 🔜 | AnyScale ✅ | | Chroma ✅ |
| | Cohere ✅ | | Together ai 🔜 | | |
| | Llama2 ✅ | | Ollama 🔜 | | |
| | Mistral ✅ | | | | |
Resources 💡
Why we are building UpTrain 🤔
Having worked with ML and NLP models for the last 8 years, we were continuously frustrated by numerous hidden failures in our models, which led us to build UpTrain. UpTrain initially started as an ML observability tool with checks to identify regression in accuracy.
However, we soon realized that LLM developers face an even bigger problem: there is no good way to measure the accuracy of their LLM applications, let alone identify regression.
We also saw the release of OpenAI Evals, which proposed using LLMs to grade model responses. Furthermore, we gained the confidence to approach this after reading how Anthropic leverages RLAIF, and dived right into LLM evaluations research (we will soon release a repository of awesome evaluations research).
Today, UpTrain is our attempt to bring order to LLM chaos and contribute back to the community. While a majority of developers still rely on intuition and productionise prompt changes after reviewing a couple of cases, we have heard enough regression stories to believe that "evaluations and improvement" will be a key part of the LLM ecosystem as the space matures.
- Robust evaluations allow you to systematically experiment with different configurations and prevent regressions by helping you objectively select the best choice.
- They help you understand where your systems are going wrong, find the root cause(s), and fix them long before your end users complain and potentially churn.
- Evaluations like prompt injection and jailbreak detection are essential to maintain the safety and security of your LLM applications.
- Evaluations help you provide transparency and build trust with your end users, which is especially relevant if you are selling to enterprises.
Why open-source?
- We understand that there is no one-size-fits-all solution when it comes to evaluations. We increasingly see developers wanting to modify the evaluation prompt, the set of choices, the few-shot examples, and so on. We believe the best developer experience lies in open-source, instead of exposing 20 different parameters.
- Foster innovation: The field of LLM evaluations and LLM-as-a-judge is still nascent. We see exciting research happening almost daily, and being open-source gives us and our community the right platform to implement those techniques and innovate faster.
How You Can Help 🙏
We are continuously striving to enhance UpTrain, and there are several ways you can contribute:
- Notice any issues or areas for improvement: If you spot anything wrong or have ideas for enhancements, please create an issue on our GitHub repository.
- Contribute directly: If you see an issue you can fix or have code improvements to suggest, feel free to contribute directly to the repository.
- Request custom evaluations: If your application requires a tailored evaluation, let us know, and we'll add it to the repository.
- Integrate with your tools: Need integration with your existing tools? Reach out, and we'll work on it.
- Assistance with evaluations: If you need assistance with evaluations, post your query on our Slack channel, and we'll resolve it promptly.
- Show your support: Show your support by starring us ⭐ on GitHub to track our progress.
- Spread the word: If you like what we've built, give us a shoutout on Twitter!
Your contributions and support are greatly appreciated! Thank you for being a part of UpTrain's journey.
License 💻
This repo is published under the Apache 2.0 license, and we are committed to adding more functionality to the UpTrain open-source repo. We also have a managed version if you want a more hands-off experience. Please book a demo call here.
Provide feedback (Harsher the better 😉)
We are building UpTrain in public. Help us improve by giving your feedback here.
Contributors 🖥️
We welcome contributions to UpTrain. Please see our contribution guide for details.