Versatile Python library designed for evaluating the performance of large language models in Natural Language Processing (NLP) tasks. Developed by Sagacify.

Project description

🔮 Sagacify LLM Evaluation ML Library 🔮

Welcome to the Saga LLM Evaluation ML library, a versatile Python library designed for evaluating the performance of large language models in Natural Language Processing (NLP) tasks. Whether you’re developing language models, chatbots, or other NLP applications, our library provides a comprehensive suite of metrics to help you assess the quality of your language models.

The metrics fall into three main categories: embedding-based, language-model-based, and LLM-based metrics, complemented by a set of retrieval metrics. The library is built on top of several libraries, such as Hugging Face Transformers and LangChain, and adds further metrics and features. You can use the metrics individually or all at once through the Scorer provided by this library, depending on the availability of references, context, and other parameters.

Moreover, the Scorer function provides metafeatures that are extracted from the prompt, prediction, and knowledge via the Elemeta Library. This allows you to monitor the performance of your model based on the structure of the prompt, prediction, and knowledge.

Developed by Sagacify.

Available Metrics

Embedding-based Metrics:
- BERTScore: A metric that measures the similarity between model-generated text and human-generated references. It leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It’s a valuable tool for evaluating semantic content.
- MAUVE: Computes the divergence between the learned distributions from text generated by a text generation model and human-written text.
Language-Model-based Metrics:
- BLEURTScore: A learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pre-trained BERT model, employing another pre-training phase using synthetic data, and finally trained on WMT human annotations.
- Q-Squared: A reference-free metric that aims to evaluate the factual consistency of knowledge-grounded dialogue systems. The approach is based on automatic question generation and question answering. Specifically, it generates questions from the knowledge base and uses the generated questions to evaluate the factual consistency of the generated response.
LLM-based Metrics:
- SelfCheck-GPT (QA approach): A metric that evaluates the correctness of language model outputs by comparing the output to the typical distribution of the model outputs. It introduces a zero-shot approach to fact-check the response of black-box models and assess hallucination problems.
- G-Eval: A framework that uses LLMs with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. It has been tested on two generation tasks, text summarization and dialogue generation, and on many evaluation criteria. The task and the evaluation criteria may be changed depending on the application.
- GPT-Score: An evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. Experimental results on four text generation tasks, 22 evaluation aspects, and the corresponding 37 datasets show that this approach makes it possible to evaluate texts on the desired aspects through natural language instructions alone.
- Relevance: A metric that evaluates the relevance of the generated text to the user prompt. It uses another LLM to evaluate the relevance of the generated text.
- Correctness: A metric that evaluates the correctness of the generated text. It uses another LLM to evaluate the correctness of the generated text.
- Faithfulness: A metric that evaluates the faithfulness of the generated text. It uses another LLM to evaluate the faithfulness of the generated text.
- NegativeReject: A metric that evaluates the negative rejection of the generated text. It uses another LLM to evaluate the negative rejection of the generated text.
- HallucinationScore: A metric that evaluates the hallucination of the generated text. It uses another LLM to evaluate the hallucination of the generated text.
Retrieval Metrics:
- Accuracy: A metric that evaluates the accuracy of the retrieved information. It uses another LLM to evaluate the accuracy of the retrieved information.
- Relevance: A metric that evaluates the relevance of the retrieved information. It uses another LLM to evaluate the relevance of the retrieved information.
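
The cosine-similarity matching that BERTScore relies on can be illustrated with a minimal sketch. This is not the library's implementation: the function name `greedy_cosine_scores` is hypothetical, the token embeddings are assumed to be precomputed arrays (real BERTScore obtains them from a BERT model), and optional refinements such as IDF weighting are left out.

```python
import numpy as np


def greedy_cosine_scores(cand_emb, ref_emb):
    """Greedy token matching in the style of BERTScore (illustrative only).

    cand_emb: array of shape (n_candidate_tokens, dim)
    ref_emb:  array of shape (n_reference_tokens, dim)
    Returns (precision, recall, f1).
    """
    cand = np.asarray(cand_emb, dtype=float)
    ref = np.asarray(ref_emb, dtype=float)

    # Normalise each token vector so the dot product is the cosine similarity.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)

    sim = cand @ ref.T  # shape: (n_candidate_tokens, n_reference_tokens)

    # Precision: each candidate token is matched to its most similar reference token.
    precision = float(sim.max(axis=1).mean())
    # Recall: each reference token is matched to its most similar candidate token.
    recall = float(sim.max(axis=0).mean())

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumed values, two dimensions for readability).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_cosine_scores(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case only one of the two candidate tokens has a counterpart in the reference, so precision is lower than recall.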

By default, each of these metrics uses either ChatGPT or a quantized LLAMA model to evaluate the generated text, but you can choose which model to use for evaluation. See the Usage section for more information.

Feel free to contribute and make this library even more powerful!
We appreciate your support. 💻💪🏻

Project details

Release history

- 0.12.1 (this version): Oct 22, 2024
- 0.12.0: Oct 22, 2024
- 0.11.7: Oct 21, 2024
- 0.11.6: Sep 26, 2024
- 0.11.5: Sep 26, 2024
- 0.11.4: Sep 26, 2024
- 0.11.3: Sep 26, 2024
- 0.11.2: Sep 25, 2024
- 0.11.1: Sep 24, 2024
- 0.11.0: Sep 24, 2024
- 0.10.1: Jun 27, 2024

Download files

Source Distribution

saga_llm_evaluation-0.12.1.tar.gz (23.6 kB)

Uploaded Oct 22, 2024 Source

Built Distribution

saga_llm_evaluation-0.12.1-py3-none-any.whl (25.0 kB)

Uploaded Oct 22, 2024 Python 3

File details

Details for the file saga_llm_evaluation-0.12.1.tar.gz.

File metadata

Download URL: saga_llm_evaluation-0.12.1.tar.gz
Upload date: Oct 22, 2024
Size: 23.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.12 Linux/6.5.0-1025-azure

File hashes

Hashes for saga_llm_evaluation-0.12.1.tar.gz
Algorithm	Hash digest
SHA256	`e09b4cac9d58105f5b55bb10ba91e2f8fd35045d0846a0326d2f8aff710bec63`
MD5	`0e524cb865d2f90d6f045816f541f19c`
BLAKE2b-256	`7aebc4a8ecdaa0e899ec0a2424e16699c216127cfd57662aa32df636c4b4c4f8`
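
A downloaded file can be checked against the digests listed above. The following is a minimal sketch using only Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_hash` are illustrative and not part of saga-llm-evaluation, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # SHA256 digest published for the source distribution on this page.
    expected = "e09b4cac9d58105f5b55bb10ba91e2f8fd35045d0846a0326d2f8aff710bec63"
    # Assumed download location; adjust to wherever the file was saved.
    archive = Path("saga_llm_evaluation-0.12.1.tar.gz")
    if archive.exists():
        print("hash matches:", matches_published_hash(archive, expected))
    else:
        print(f"{archive} not found; download it first")
```

pip can perform a comparable check at install time through its hash-checking mode, where the expected digests are pinned in a requirements file.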

File details

Details for the file saga_llm_evaluation-0.12.1-py3-none-any.whl.

File metadata

Download URL: saga_llm_evaluation-0.12.1-py3-none-any.whl
Upload date: Oct 22, 2024
Size: 25.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.12 Linux/6.5.0-1025-azure

File hashes

Hashes for saga_llm_evaluation-0.12.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b86da7a888622b071272d7091163b4668ef2ac0d27829e59c51ee9a127834ea7`
MD5	`5fb991a01c83fb642a9d69a1270c8923`
BLAKE2b-256	`95b08ee3550f8422c31529b161d35cf7b8cce64c6bf2079659daae4849721447`

saga-llm-evaluation 0.12.1