# Raga LLM Hub

Raga AI | Documentation | Getting Started

Welcome to Raga LLM Eval, a comprehensive evaluation toolkit for Large Language Models (LLMs). This toolkit provides a suite of tests to evaluate various aspects of language model performance, including relevance, understanding, coherence, toxicity, and more.

## Installation

### Using pip

```sh
python -m venv venv
source venv/bin/activate
pip install raga-llm-eval
```


* `python -m venv venv` - Create a new Python environment.
* `source venv/bin/activate` - Activate the environment.
* `pip install raga-llm-eval` - Install the package.

### Using conda
* `conda create --name myenv` - Create a new Python environment.
* `conda activate myenv` - Activate the environment.
* `python -m pip install raga-llm-eval` - Install the package.
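
The same commands as a copy-pasteable block, mirroring the pip steps above:

```sh
conda create --name myenv
conda activate myenv
python -m pip install raga-llm-eval
```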



## Quick Tour
### Setting up
```py
from raga_llm_eval import RagaLLMEval, get_data

# Initialize with API key
evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "xxx"})
```

### List available tests

```py
# List available tests
evaluator.list_available_tests()
```

### Adding and Running Tests

#### Using Custom Data

```py
# Add tests with custom data
evaluator.add_test(
    test_names=["relevancy_test", "summarisation_test"],
    data={
        "prompt": ["How are you?", "How do you do?"],
        "context": ["You are a student, answering your teacher."],
        "response": ["I am fine. Thank you", "Doooo do do do doooo..."],
    },
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.6},
).run()

evaluator.print_results()
```

#### Using Provided Test Data

```py
# Add tests with provided test data
evaluator.add_test(
    test_names=["relevancy_test"],
    data=get_data("relevancy_test", num_samples=1),
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.6},
).run()

evaluator.print_results()
```

### Advanced Usage: Piping and Saving Results

The raga_llm_eval package supports a fluent interface, allowing you to chain methods together using a piping style. This approach can make your code more readable and concise. Additionally, you can save the evaluation results to a JSON file for further analysis or record-keeping. Below are examples demonstrating these capabilities.

#### Piping Method Calls

Piping allows you to chain multiple operations in a single statement. This can simplify your code, making it easier to read and maintain. Here's an example of how to use piping to add a test, run it, and print the results:

```py
# Method piping
evaluator.add_test(
    test_names=["relevancy_test", "summarisation_test"],
    data={
        "prompt": ["What is the capital of France?", "Explain quantum entanglement."],
        "context": ["You are a geography teacher.", "You are a physics professor explaining to a student."],
        "response": ["The capital of France is Paris.", "Quantum entanglement is a phenomenon where particles become interconnected..."],
    },
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.75},
).run()

evaluator.print_results()
```
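
The flow could in principle be collapsed into a single chained statement. This is a sketch only: it assumes `run()` returns the evaluator instance so that `print_results()` can be chained, which the snippets above do not confirm.

```py
# Sketch: assumes run() returns the evaluator, so print_results() can be chained
evaluator.add_test(
    test_names=["relevancy_test"],
    data=get_data("relevancy_test", num_samples=1),
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.75},
).run().print_results()
```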

#### Saving Results to a File

```py
# Adding a test, running it, printing, and saving the results to a JSON file
evaluator.add_test(
    test_names=["relevancy_test", "summarisation_test"],
    data={
        "prompt": ["What is the capital of France?", "Explain quantum entanglement."],
        "context": ["You are a geography teacher.", "You are a physics professor explaining to a student."],
        "response": ["The capital of France is Paris.", "Quantum entanglement is a phenomenon where particles become interconnected..."],
    },
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.75},
).run()

evaluator.print_results()
```

This will execute the tests, print the results to the console, and also save the results in a file named evaluation_results.json in your current working directory.
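
If your version of the package needs an explicit call to write the file, it might look like the sketch below; the method name `save_results` is an assumption, not taken from this page, so check the package documentation for the exact API.

```py
# Assumed method name -- verify against the package documentation
evaluator.save_results("evaluation_results.json")
```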

Explore these capabilities to get the most out of your language model evaluations with raga-llm-eval.

Happy Evaluating!

## Tests Supported

### Relevance & Understanding

In this suite of tests, we focus on the model's ability to provide relevant, accurate, and contextually appropriate responses. This includes evaluating the model's precision, recall, and overall understanding of the given context to generate relevant answers.

  1. Relevancy Test: Measures the relevance of the LLM's response to the input prompt.

  2. Contextual Precision Test: Evaluates if relevant nodes in context are ranked higher, resulting in a dictionary with precision score, reason, and details. Higher scores indicate more precise context alignment.

  3. Contextual Recall Test: Measures alignment of retrieval context with expected response, outputting a dictionary with recall score, reason, and details. Higher scores denote better recall.

  4. Contextual Relevancy Test: Assesses the overall relevance of context to the input prompt, providing a dictionary with relevancy score, reason, and details. Higher scores mean more relevant context.

  5. Hallucination Test: Determines the hallucination score of the model's response compared to the context, offering a dictionary with scores and details. Higher scores indicate more hallucinated responses.

  6. Faithfulness Test: Evaluates if the LLM response aligns with the retrieval context, producing a dictionary with a faithfulness score and details. Higher scores suggest more faithful responses.

  7. Consistency Test: Provides a score for the consistency of responses, with a dictionary containing scores and evaluation details. Higher scores indicate better consistency.

  8. Conciseness Test: Checks the conciseness of the LLM response, yielding a dictionary with a conciseness score and related information. Higher scores denote more concise responses.

  9. Coherence Test: Assesses the coherence of the LLM response, resulting in a dictionary with coherence scores and details. Higher scores suggest more coherent responses.

  10. Correctness Test: Evaluates the correctness of the LLM response, offering a dictionary with correctness scores and information. Higher scores indicate more correct responses.

  11. Summarization Test: Determines the quality of summaries generated by the LLM, providing a dictionary with summarization scores and details. Higher scores mean better summary quality.

  12. Grade Score Test: Provides a grade score indicating the education level required to understand the text, with a dictionary containing scores and details. Higher scores indicate a higher education level needed.

  13. Complexity Test: Offers a score for the complexity of the text, producing a dictionary with complexity scores and submetrics. Higher scores signify more complex texts.

  14. Readability Test: Provides a readability score, yielding a dictionary with scores and details. Higher scores indicate more readable texts.

  15. Maliciousness Test: Evaluates the maliciousness of prompts and responses, resulting in a dictionary with scores and evaluation details. Higher scores indicate more malicious content.

  16. Toxicity Test: Provides a score for the toxicity of model responses, offering a dictionary with toxicity scores. Higher scores suggest more toxic responses.

  17. Bias Test: Measures the bias score of model responses, yielding a dictionary with scores. Higher scores indicate more biased responses.

  18. Response Toxicity Test: Assesses the toxicity of model responses, providing a dictionary with toxicity scores. Higher scores suggest more toxic responses.

  19. Refusal Test: Evaluates the model's refusal similarity, offering a dictionary with refusal scores. Higher scores indicate a greater likelihood of refusal.

  20. Prompt Injection Test: Checks for injection issues in prompts, resulting in a dictionary with injection scores. Lower scores indicate better prompts.

  21. Coverage Test: Assesses whether all concepts are covered by model responses, providing a dictionary with coverage ratios. This test evaluates concept utilization.

  22. POS Test: Evaluates the accuracy of part-of-speech tagging in model responses, offering a dictionary with accuracy ratios. It checks for correct PoS tag usage.

  23. Length Test: Measures the number of words in generated responses, yielding a dictionary with length details. This test assesses response length appropriateness.

  24. Winner Test: Compares responses of two models or between a model and human annotation, providing a dictionary indicating which is better. It evaluates response quality.

  25. Overall Test: Compares the overall score of two models on a provided task, offering a dictionary with overall scores. This test evaluates model performance comprehensively.

  26. Sentiment Analysis Test: Provides a score for the sentiment of model responses, yielding a dictionary with sentiment scores. Higher scores indicate more positive responses.

  27. Generic Evaluation Test: Returns a score based on specific criteria, response, and context, offering a dictionary with evaluation scores. Higher scores indicate better response quality.

  28. Cosine Similarity Test: Provides a score for the similarity between the prompt and response, resulting in a dictionary with similarity scores. Higher scores indicate greater similarity.
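
As a rough sketch of how several tests from this list could be run in one batch: the snake_case identifiers below are guesses modeled on `relevancy_test` and `summarisation_test`, and `evaluator.list_available_tests()` is the reliable way to look up the exact names.

```py
from raga_llm_eval import RagaLLMEval

evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "xxx"})

# Test names are illustrative guesses; confirm with evaluator.list_available_tests()
evaluator.add_test(
    test_names=["toxicity_test", "bias_test", "coherence_test", "conciseness_test"],
    data={
        "prompt": ["Summarise the plot of Hamlet in two sentences."],
        "context": ["You are a literature teacher answering a student."],
        "response": ["Hamlet feigns madness while seeking revenge for his father's murder, and the play ends with the deaths of most of the Danish court."],
    },
    arguments={"model": "gpt-3.5-turbo-1106", "threshold": 0.5},
).run()

evaluator.print_results()
```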

Learn More
