
Falcon Evaluate is an open-source Python library designed to simplify the evaluation and validation of open-source LLMs such as Llama 2 and Mistral. It provides an easy-to-use toolkit for assessing the performance, bias, and general behavior of LLMs across a range of natural language understanding (NLU) tasks.

Project description


Falcon Evaluate

A Low-Code LLM-RAG Evaluation Solution


Installation | Quickstart

Falcon Evaluate - A Large Language Model (LLM) Validation Library

Overview

Falcon Evaluate is an open-source Python library that aims to revolutionize the LLM-RAG evaluation process by offering a low-code solution. Our goal is to make evaluation as seamless and efficient as possible, so you can focus on what truly matters. The library provides an easy-to-use toolkit for assessing the performance, bias, and general behavior of LLMs in various natural language understanding (NLU) tasks.

:shield: Installation

pip install falcon_evaluate -q

If you want to install from source:

git clone https://github.com/Praveengovianalytics/falcon_evaluate && cd falcon_evaluate
pip install -e .

:fire: Quickstart

Google Colab notebook

# Example usage

!pip install falcon_evaluate -q

from falcon_evaluate.fevaluate_results import ModelScoreSummary
from falcon_evaluate.fevaluate_plot import ModelPerformancePlotter
import pandas as pd
import nltk
nltk.download('punkt')

df = pd.DataFrame({
    'prompt': [
        "What is the capital of France?"
    ],
    'reference': [
        "The capital of France is Paris."
    ],
    'Model A': [
        "Paris is the capital of France.
    ],
    'Model B': [
        "Capital of France is Paris."
    ],
    'Model C': [
        "Capital of France was Paris."
    ],
})

model_score_summary = ModelScoreSummary(df)
result, agg_score_df = model_score_summary.execute_summary()
print(result)

ModelPerformancePlotter(agg_score_df).get_falcon_performance_quadrant()
Falcon Performance Quadrant

Note - The same model with different configuration settings can be plotted to qualify it for a specific use case.
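
Both result and agg_score_df are used above as pandas DataFrames (one is printed, the other is passed to ModelPerformancePlotter); assuming that holds, standard pandas I/O can be used to persist and inspect the outputs, for example:

# Assumes result and agg_score_df are pandas DataFrames, as their usage above suggests.
result.to_csv("falcon_evaluate_scores.csv", index=False)            # per-response scores
agg_score_df.to_csv("falcon_evaluate_agg_scores.csv", index=False)  # aggregated scores used for the quadrant plot

print(result.columns.tolist())  # see which metric columns were produced
print(agg_score_df.head())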

Model Evaluation Results

The following results show the evaluation of different models on a sample prompt. Various scoring metrics such as BLEU score, Jaccard similarity, Cosine similarity, and Semantic similarity are used to evaluate the models, and composite scores such as the Falcon Score are also calculated.

For more detail on the evaluation metrics, refer to the link below:

falcon-evaluate metrics in detail
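
As a rough illustration of two of these metrics (a minimal sketch, not necessarily the library's internal implementation), token-level Jaccard similarity and bag-of-words cosine similarity can be computed like this:

# Illustrative token-level Jaccard and bag-of-words cosine similarity (not the library's exact code).
import math
from collections import Counter

def tokens(text):
    return [t.strip(".,!?").lower() for t in text.split()]

def jaccard_similarity(reference, candidate):
    ref, cand = set(tokens(reference)), set(tokens(candidate))
    return len(ref & cand) / len(ref | cand)

def cosine_similarity(reference, candidate):
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    dot = sum(ref[t] * cand[t] for t in ref)
    norm = math.sqrt(sum(v * v for v in ref.values())) * math.sqrt(sum(v * v for v in cand.values()))
    return dot / norm if norm else 0.0

print(jaccard_similarity("The capital of France is Paris.", "Paris is the capital of France."))
print(cosine_similarity("The capital of France is Paris.", "Paris is the capital of France."))

Semantic similarity, by contrast, is typically computed with sentence embeddings, so paraphrases and word-order changes are handled better than in these surface-level measures.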

Evaluation Data

  • Prompt: What is the capital of France?
  • Reference: The capital of France is Paris.

Model A Evaluation

Readability and Complexity

  • ARI: 2.7
  • Flesch-Kincaid Grade Level: 2.9
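
ARI (Automated Readability Index) and Flesch-Kincaid Grade Level are standard readability formulas based on sentence length and word/character counts. A minimal sketch using the third-party textstat package (chosen here for illustration; it may not be what falcon_evaluate uses internally):

# Illustrative readability scoring with textstat (pip install textstat).
import textstat

response = "Paris is the capital of France."
print(textstat.automated_readability_index(response))  # ARI
print(textstat.flesch_kincaid_grade(response))         # Flesch-Kincaid Grade Level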

Language Modeling Performance

  • Perplexity: 112.17
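
Perplexity measures how surprised a reference language model is by the response (lower is better). A hedged sketch of one common way to compute it with Hugging Face Transformers, using GPT-2 purely for illustration (the reference model falcon_evaluate scores against is not specified here):

# Illustrative perplexity computation (pip install transformers torch); GPT-2 is an arbitrary choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Paris is the capital of France.", return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean per-token cross-entropy
print(torch.exp(loss).item())  # perplexity = exp(mean negative log-likelihood)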

Text Toxicity

  • Toxicity Level: 0.09

Text Similarity and Relevance

  • BLEU: 0.64
  • Cosine Similarity: 0.85
  • Semantic Similarity: 0.99
  • Jaccard Similarity: 0.71

Information Retrieval

  • Precision: 0.83
  • Recall: 0.71
  • F1-Score: 0.77
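
The F1-Score is the harmonic mean of precision and recall, which is consistent with the values reported above:

precision, recall = 0.83, 0.71
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.77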

Falcon Score (Model A)

Evaluation Category Metrics

Below are the computed metrics, grouped by evaluation category (a sketch of how such aggregates can be computed follows these listings):

Readability and Complexity

  • Arithmetic Mean: 1.65
  • Weighted Sum: 1.65
  • Geometric Mean: 1.59
  • Harmonic Mean: 1.53
  • T-Statistic: 2.12
  • P-Value: 0.28
  • F-Score: 0.00
  • Z-Score Normalization: [-1.00, 1.00]

Language Modeling Performance

  • Arithmetic Mean: 19.45
  • Weighted Sum: 19.45
  • Geometric Mean: 19.45
  • Harmonic Mean: 19.45
  • T-Statistic: NaN
  • P-Value: NaN
  • F-Score: 0.00
  • Z-Score Normalization: [NaN]

Text Toxicity

  • Arithmetic Mean: 0.046
  • Weighted Sum: 0.046
  • Geometric Mean: 0.046
  • Harmonic Mean: 0.046
  • T-Statistic: NaN
  • P-Value: NaN
  • F-Score: 0.00
  • Z-Score Normalization: [NaN]

Text Similarity and Relevance

  • Arithmetic Mean: 0.67
  • Weighted Sum: 0.67
  • Geometric Mean: 0.00
  • Harmonic Mean: 0.00
  • T-Statistic: 1.29
  • P-Value: 0.29
  • F-Score: 0.00
  • Z-Score Normalization: [-1.67, 0.82, 0.73, 0.11]

Information Retrieval

  • Arithmetic Mean: 0.77
  • Weighted Sum: 0.77
  • Geometric Mean: 0.77
  • Harmonic Mean: 0.77
  • T-Statistic: 11.23
  • P-Value: 0.01
  • F-Score: 0.77
  • Z-Score Normalization: [1.25, -1.19, -0.06]
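
These category-level aggregates are standard statistics over each category's metric values. A minimal sketch of how they can be computed with NumPy/SciPy; this is illustrative only and is not guaranteed to reproduce the exact values above, since the library's weights, inputs, and t-test baseline are not shown here:

# Illustrative aggregation of one category's metric scores (not the library's exact computation).
import numpy as np
from scipy import stats

scores = np.array([0.64, 0.85, 0.99, 0.71])  # e.g. BLEU, Cosine, Semantic, Jaccard

print(scores.mean())         # arithmetic mean
print(stats.gmean(scores))   # geometric mean
print(stats.hmean(scores))   # harmonic mean
print(stats.zscore(scores))  # z-score normalization
t_stat, p_value = stats.ttest_1samp(scores, popmean=0.5)  # 0.5 is an arbitrary illustrative baseline
print(t_stat, p_value)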

Model B Evaluation

Response: Capital of France is Paris.

Scores

{ "Readability and Complexity": { "ARI": 2.7, "Flesch-Kincaid Grade Level": 2.9 }, "Language Modeling Performance": { "Perplexity": 112.17 }, "Text Toxicity": { "Toxicity Level": 0.09 }, "Text Similarity and Relevance": { "BLEU": 0.64, "Cosine Similarity": 0.85, "Semantic Similarity": 0.99, "Jaccard Similarity": 0.71 }, "Information Retrieval": { "Precision": 0.83, "Recall": 0.71, "F1-Score": 0.77 } }

Falcon Score (Model B)

  • Arithmetic Mean: 0.7999
  • Weighted Sum: 0.7999
  • Geometric Mean: 0.7888
  • Harmonic Mean: 0.7781
  • T-Statistic: 0.903
  • P-Value: 0.4332
  • F-Score: 0.7692

Model C Evaluation

Response: Capital of France was Paris.

Scores

  • BLEU Score: 9.07e-155
  • Jaccard Similarity: 0.5714
  • Cosine Similarity: 0.5803
  • Semantic Similarity: 0.9881

Falcon Score (Model C)

  • Arithmetic Mean: 0.5350
  • Weighted Sum: 0.5350
  • Geometric Mean: 2.34e-39
  • Harmonic Mean: 3.63e-154
  • T-Statistic: 1.178
  • P-Value: 0.3237
  • F-Score: 0.6154

Note - The near-zero geometric and harmonic means are driven by Model C's near-zero BLEU score: both means collapse toward zero when any single component is close to zero.

Key Features

  1. Benchmarking: Falcon Evaluate provides a set of pre-defined benchmarking tasks commonly used for evaluating LLMs, including text completion, sentiment analysis, question answering, and more. Users can easily assess model performance on these tasks.

  2. Custom Evaluation: Users can define custom evaluation metrics and tasks tailored to their specific use cases. Falcon Evaluate provides the flexibility to create custom test suites and assess model behavior accordingly (see the sketch after this list).

  3. Interpretability: The library offers interpretability tools to help users understand why the model generates certain responses. This can aid in debugging and improving model performance.

  4. Scalability: Falcon Evaluate is designed to work with both small-scale and large-scale evaluations. It can be used for quick model assessments during development and for extensive evaluations in research or production settings.
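
Feature 2 above refers to custom evaluation metrics. The library's extension API is not shown in this README, but because the quickstart operates on a plain pandas DataFrame, a custom metric can be prototyped with pandas alone; the exact_match function below is a hypothetical example, not part of falcon_evaluate:

# Hypothetical custom metric applied directly to a quickstart-style DataFrame (not the library's API).
import pandas as pd

def exact_match(reference, candidate):
    # 1.0 if the candidate matches the reference exactly (ignoring case/whitespace), else 0.0
    return float(reference.strip().lower() == candidate.strip().lower())

df = pd.DataFrame({
    "prompt": ["What is the capital of France?"],
    "reference": ["The capital of France is Paris."],
    "Model A": ["Paris is the capital of France."],
})

df["Model A exact_match"] = [
    exact_match(ref, cand) for ref, cand in zip(df["reference"], df["Model A"])
]
print(df)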

Use Cases

  • Model Development: Falcon Evaluate can be used during the development phase to iteratively assess and improve the performance of LLMs.
  • Research: Researchers can leverage the library to conduct comprehensive evaluations and experiments with LLMs, contributing to advancements in the field.
  • Production Deployment: Falcon Evaluate can be integrated into NLP pipelines to monitor and validate model behavior in real-world applications.

Getting Started

To use Falcon Evaluate, you will need Python and dependencies such as TensorFlow, PyTorch, or Hugging Face Transformers. The library provides documentation and tutorials to help users get started quickly.

Community and Collaboration

Falcon Evaluate is an open-source project that encourages contributions from the community. Collaboration with researchers, developers, and NLP enthusiasts is encouraged to enhance the library's capabilities and address emerging challenges in language model validation.

Project Goals

The primary goals of Falcon Evaluate are to:

  • Facilitate the evaluation and validation of Language Models.
  • Promote transparency and fairness in AI by detecting and mitigating bias.
  • Provide an accessible and extensible toolkit for NLP practitioners and researchers.

Conclusion

Falcon Evaluate aims to empower the NLP community with a versatile and user-friendly library for evaluating and validating Language Models. By offering a comprehensive suite of evaluation tools, it seeks to enhance the transparency, robustness, and fairness of AI-powered natural language understanding systems.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Makes the project pip installable (pip install -e .) so falcon_evaluate can be imported
├── falcon_evaluate    <- Source code for use in this project.
│   ├── __init__.py    <- Makes falcon_evaluate a Python package
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io
