
A Python library for reference-based metrics to compare generated LLM text with ground truth text


GAICo: GenAI Results Comparator

GAICo is a Python library that provides evaluation metrics for comparing generated texts, particularly outputs from Large Language Models (LLMs), against reference or ground-truth texts.

View the documentation at ai4society.github.io/projects/GenAIResultsComparator/index.html.

Quick Start

GAICo makes it easy to evaluate and compare LLM outputs. For detailed, runnable examples, please refer to our Jupyter Notebooks in the examples/ folder:

  • quickstart.ipynb: A rapid, hands-on introduction to the Experiment sub-module.
  • example-1.ipynb: For fine-grained usage, this notebook focuses on comparing multiple model outputs using a single metric.
  • example-2.ipynb: For fine-grained usage, this notebook demonstrates evaluating a single model output across all available metrics.

Streamlined Workflow with Experiment

For a more integrated approach to comparing multiple models, applying thresholds, generating plots, and creating CSV reports, the Experiment class offers a convenient abstraction.

Quick Example

This example demonstrates comparing multiple LLM responses against a reference answer using specified metrics, generating a plot, and outputting a CSV report.

from gaico import Experiment

# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."

# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

# 2. Compare models using specific metrics
#   This will calculate scores for 'Jaccard' and 'ROUGE',
#   generate a plot (e.g., radar plot for multiple metrics/models),
#   and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35} # Optional: override default thresholds
)

# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)

This abstraction streamlines common evaluation tasks, while still allowing access to the underlying metric classes and dataframes for more advanced or customized use cases. More details in examples/quickstart.ipynb.

However, you might prefer to use the individual metric classes directly for more granular control or if you want to implement custom metrics. See the remaining notebooks in the examples subdirectory.
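As a starting point for that fine-grained route, here is a minimal sketch of direct metric usage. The class name and import path used here (gaico.metrics.JaccardSimilarity) are assumptions for illustration; consult the API documentation for the exact names provided by the library.

# A minimal sketch of direct metric usage; the import below is hypothetical.
from gaico.metrics import JaccardSimilarity  # assumed class name and path

metric = JaccardSimilarity()
score = metric.calculate(
    "Sorry, I am designed not to answer such a question.",                     # generated text
    "Sorry, I am unable to answer such a question as it is not appropriate.",  # reference text
)
print(score)  # a normalized score between 0 and 1, where 1 is a perfect match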

Example radar chart showing multiple metrics for a single LLM, generated by the examples/example-2.ipynb notebook.

Description

The library provides a set of metrics that take two text strings as input. Each metric produces a normalized score between 0 and 1, where 1 indicates a perfect match between the two texts.

Class Structure: All metrics are implemented as classes, so new metrics can be added easily. Every metric derives from the BaseMetric class defined in gaico/base.py.

Each metric class inherits from this base class and needs to implement just one required method: calculate().

This calculate() method takes two parameters:

  • generated_texts: Either a single string or an iterable of strings representing the texts generated by an LLM.
  • reference_texts: Either a single string or an iterable of strings representing the expected or reference texts.

If the inputs are iterables (lists, NumPy arrays, etc.), the method assumes a one-to-one mapping between the generated and reference texts: the first generated text corresponds to the first reference text, and so on.
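As a hedged sketch of how a new metric might be added, the snippet below subclasses BaseMetric and implements only calculate(). The import path (gaico.base) follows the file location mentioned above, and the single-string versus iterable handling shown is illustrative; the real base class may provide its own helpers for batching.

# Illustrative custom metric; assumes BaseMetric is importable from gaico/base.py.
from gaico.base import BaseMetric


class ExactMatch(BaseMetric):
    """Scores 1.0 when the generated text equals the reference text, else 0.0."""

    def calculate(self, generated_texts, reference_texts):
        # Single string inputs -> a single score
        if isinstance(generated_texts, str) and isinstance(reference_texts, str):
            return float(generated_texts.strip() == reference_texts.strip())
        # Iterable inputs -> one score per pair, relying on the one-to-one mapping
        return [
            float(g.strip() == r.strip())
            for g, r in zip(generated_texts, reference_texts)
        ]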

Note: While the library can be used to compare any two strings, its main purpose is to aid in comparing results from various LLMs.

Inspiration for the library and evaluation metrics was taken from Microsoft's article on evaluating LLM-generated content. In the article, Microsoft describes 3 categories of evaluation metrics: (1) Reference-based metrics, (2) Reference-free metrics, and (3) LLM-based metrics. The library currently supports reference-based metrics.

GAICo Overview

Overview of the workflow supported by the GAICo library


Features

  • Implements various metrics for text comparison:
    • N-gram-based metrics (BLEU, ROUGE, JS divergence)
    • Text similarity metrics (Jaccard, Cosine, Levenshtein, Sequence Matcher)
    • Semantic similarity metrics (BERTScore)
  • Supports batch processing for efficient computation (see the sketch after this list)
  • Optimized for different input types (lists, NumPy arrays, pandas Series)
  • Extendable architecture for easy addition of new metrics
  • Testing suite
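The snippet below sketches batch scoring with pandas Series inputs. As in the earlier sketch, the JaccardSimilarity class name and import path are assumptions for illustration; pandas is only needed if you pass Series rather than plain lists.

# Illustrative batch usage; the metric class and import path are hypothetical.
import pandas as pd
from gaico.metrics import JaccardSimilarity  # assumed class name and path

generated = pd.Series([
    "The cat sat on the mat.",
    "Paris is the capital of France.",
])
references = pd.Series([
    "A cat was sitting on the mat.",
    "The capital of France is Paris.",
])

metric = JaccardSimilarity()
# Inputs are matched one-to-one: the first generated text is scored against the
# first reference text, the second against the second, and so on.
scores = metric.calculate(generated, references)
print(scores)  # one normalized score (0 to 1) per generated/reference pair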

Installation

You can install GAICo directly from PyPI using pip:

pip install GAICo

To include optional dependencies for visualization features (matplotlib, seaborn), install with:

pip install GAICo[visualization]

For Developers (Installing from source)

If you want to contribute to GAICo or install it from source for development:

  1. Clone the repository:

    git clone https://github.com/ai4society/GenAIResultsComparator.git
    cd GenAIResultsComparator
    
  2. Set up a virtual environment and install dependencies:

    We recommend using UV for managing environments and dependencies.

    # Create a virtual environment (e.g., Python 3.12 recommended)
    uv venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development and visualization extras
    uv pip install -e ".[dev,visualization]"
    

    If you don't want to use uv, you can install the dependencies with the following commands:

    # Create a virtual environment (e.g., Python 3.12 recommended)
    python3 -m venv .venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development and visualization extras
    pip install -e ".[dev,visualization]"
    

    (Note: The dev extra includes dependencies for testing, linting, building, and documentation, as well as visualization dependencies.)

  3. Set up pre-commit hooks (optional but recommended for contributors):

    pre-commit install
    

Project Structure

The project structure is as follows:

.
├── README.md
├── LICENSE
├── .gitignore
├── uv.lock
├── pyproject.toml
├── .pre-commit-config.yaml
├── gaico/        # Contains the library code
├── examples/     # Contains example scripts
├── tests/        # Contains test scripts
└── docs/         # Contains documentation files

Code Style

We use pre-commit hooks to maintain code quality and consistency. The configuration for these hooks is in the .pre-commit-config.yaml file. These hooks run automatically on git commit, but you can also run them manually:

pre-commit run --all-files

Running Tests

Navigate to the project root in your terminal and run:

uv run pytest

Or, for more verbose output:

uv run pytest -v

To skip the slow BERTScore tests:

uv run pytest -m "not bertscore"

To run only the slow BERTScore tests:

uv run pytest -m bertscore

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/FeatureName)
  3. Commit your changes (git commit -m 'Add some FeatureName')
  4. Push to the branch (git push origin feature/FeatureName)
  5. Open a Pull Request

Please ensure that your code passes all tests and adheres to our code style guidelines (enforced by pre-commit hooks) before submitting a pull request.

Citation

If you find this project useful, please consider citing it in your work:

@software{AI4Society_GAICo_GenAI_Results,
  author = {Gupta, Nitin and Koppisetti, Pallav and Srivastava, Biplav},
  license = {MIT},
  title = {{GAICo: GenAI Results Comparator}},
  year = {2025},
  url = {https://github.com/ai4society/GenAIResultsComparator}
}

Acknowledgments

  • The library is developed by Nitin Gupta, Pallav Koppisetti, and Biplav Srivastava. Members of AI4Society contributed to this tool as part of ongoing discussions. Major contributors are credited.
  • This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions, feel free to reach out to us at ai4societyteam@gmail.com.
