GAICo: GenAI Results Comparator
GAICo is a Python library providing evaluation metrics to compare generated texts, particularly useful for outputs from Large Language Models (LLMs), often against reference or ground truth texts.
View the documentation at ai4society.github.io/projects/GenAIResultsComparator/index.html.
Quick Start
GAICo makes it easy to evaluate and compare LLM outputs. For detailed, runnable examples, please refer to our Jupyter Notebooks in the examples/ folder:
- quickstart.ipynb: Rapid hands-on with the Experiment sub-module.
- example-1.ipynb: For fine-grained usage, this notebook focuses on comparing multiple model outputs using a single metric.
- example-2.ipynb: For fine-grained usage, this notebook demonstrates evaluating a single model output across all available metrics.
Streamlined Workflow with Experiment
For a more integrated approach to comparing multiple models, applying thresholds, generating plots, and creating CSV reports, the Experiment class offers a convenient abstraction.
Quick Example
This example demonstrates comparing multiple LLM responses against a reference answer using specified metrics, generating a plot, and outputting a CSV report.
from gaico import Experiment
# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
"Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
"Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
"SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)
# 2. Compare models using specific metrics
# This will calculate scores for 'Jaccard' and 'ROUGE',
# generate a plot (e.g., radar plot for multiple metrics/models),
# and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35}  # Optional: override default thresholds
)
# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)
This abstraction streamlines common evaluation tasks, while still allowing access to the underlying metric classes and dataframes for more advanced or customized use cases. More details in examples/quickstart.ipynb.
However, you might prefer to use the individual metric classes directly for more granular control or if you want to implement custom metrics. See the remaining notebooks in the examples subdirectory.
Example Radar Chart generated by the examples/example-2.ipynb notebook.
Description
The library provides a set of metrics that take two text strings as input. Outputs are normalized to a scale of 0 to 1, where 1 indicates a perfect match between the two texts.
Class Structure: All metrics are implemented as classes and can be easily extended to add new ones. The hierarchy starts with the BaseMetric class in gaico/base.py.
Each metric class inherits from this base class and is implemented with just one required method: calculate().
This calculate() method takes two parameters:
- generated_texts: either a string or an Iterable of strings representing the texts generated by an LLM.
- reference_texts: either a string or an Iterable of strings representing the expected or reference texts.
If the inputs are Iterables (lists, Numpy arrays, etc.), then the method assumes that there exists a one-to-one mapping between the generated texts and reference texts, meaning that the first generated text corresponds to the first reference text, and so on.
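For fine-grained control, a metric class can be called directly. The sketch below is illustrative only: the class name JaccardSimilarity and the gaico.metrics import path are assumptions (this README does not list the exported class names), while the calculate() signature and the 0-to-1 score follow the description above.

# Illustrative sketch of direct metric usage.
# NOTE: "JaccardSimilarity" and the "gaico.metrics" import path are assumed
# for this example; check the API documentation for the actual class names.
from gaico.metrics import JaccardSimilarity  # assumed import path

metric = JaccardSimilarity()
score = metric.calculate(
    "Sorry, I am designed not to answer such a question.",  # generated text
    "Sorry, I am unable to answer such a question as it is not appropriate.",  # reference text
)
print(score)  # a single normalized score between 0 and 1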
Note: While the library can be used to compare any strings, its main purpose is to aid in comparing results from various LLMs.
Inspiration for the library and evaluation metrics was taken from Microsoft's article on evaluating LLM-generated content. In the article, Microsoft describes 3 categories of evaluation metrics: (1) Reference-based metrics, (2) Reference-free metrics, and (3) LLM-based metrics. The library currently supports reference-based metrics.
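Because every metric subclasses BaseMetric (in gaico/base.py) and only needs to provide calculate(), adding a new metric can follow the outline below. This is a sketch under those assumptions; the actual base class may define additional hooks, so consult gaico/base.py before implementing.

# Sketch of a custom metric, assuming BaseMetric is importable from gaico.base
# and that calculate() is the only method a subclass must implement.
from typing import Iterable, Union

from gaico.base import BaseMetric  # path as described above


class WordOverlap(BaseMetric):
    """Toy metric: fraction of reference words that also appear in the generated text."""

    def _score_pair(self, generated: str, reference: str) -> float:
        gen_words = set(generated.lower().split())
        ref_words = set(reference.lower().split())
        if not ref_words:
            return 0.0
        return len(gen_words & ref_words) / len(ref_words)

    def calculate(
        self,
        generated_texts: Union[str, Iterable[str]],
        reference_texts: Union[str, Iterable[str]],
    ):
        # Single strings -> a single score
        if isinstance(generated_texts, str) and isinstance(reference_texts, str):
            return self._score_pair(generated_texts, reference_texts)
        # Iterables -> score pairwise, assuming a one-to-one mapping
        return [self._score_pair(g, r) for g, r in zip(generated_texts, reference_texts)]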
Overview of the workflow supported by the GAICo library
Features
- Implements various metrics for text comparison:
- N-gram-based metrics (BLEU, ROUGE, JS divergence)
- Text similarity metrics (Jaccard, Cosine, Levenshtein, Sequence Matcher)
- Semantic similarity metrics (BERTScore)
- Supports batch processing for efficient computation
- Optimized for different input types (lists, numpy arrays, pandas Series); see the sketch after this list
- Extendable architecture for easy addition of new metrics
- Testing suite
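As a sketch of the batch-processing support listed above (again assuming a JaccardSimilarity class exported from gaico.metrics, which this README does not confirm), iterable inputs such as lists or pandas Series are scored pairwise:

# Sketch of batch scoring; class name and import path are assumed.
import pandas as pd

from gaico.metrics import JaccardSimilarity  # assumed import path

generated = pd.Series([
    "The cat sat on the mat.",
    "Paris is the capital of France.",
])
references = pd.Series([
    "A cat was sitting on the mat.",
    "The capital of France is Paris.",
])

metric = JaccardSimilarity()
# One-to-one mapping: the first generated text is compared with the first reference, and so on.
scores = metric.calculate(generated, references)
print(scores)  # one normalized score per pair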
Installation
You can install GAICo directly from PyPI using pip:
pip install GAICo
To include optional dependencies for visualization features (matplotlib, seaborn), install with:
pip install GAICo[visualization]
For Developers (Installing from source)
If you want to contribute to GAICo or install it from source for development:
- Clone the repository:

  git clone https://github.com/ai4society/GenAIResultsComparator.git
  cd GenAIResultsComparator

- Set up a virtual environment and install dependencies:

  We recommend using UV for managing environments and dependencies.

  # Create a virtual environment (e.g., Python 3.12 recommended)
  uv venv
  # Activate the environment
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  # Install the package in editable mode with development and visualization extras
  uv pip install -e ".[dev,visualization]"

  If you don't want to use uv, you can install the dependencies with the following commands:

  # Create a virtual environment (e.g., Python 3.12 recommended)
  python3 -m venv .venv
  # Activate the environment
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  # Install the package in editable mode with development and visualization extras
  pip install -e ".[dev,visualization]"

  (Note: The dev extra includes dependencies for testing, linting, building, and documentation, as well as visualization dependencies.)

- Set up pre-commit hooks (optional but recommended for contributors):

  pre-commit install
Project Structure
The project structure is as follows:
.
├── README.md
├── LICENSE
├── .gitignore
├── uv.lock
├── pyproject.toml
├── .pre-commit-config.yaml
├── gaico/ # Contains the library code
├── examples/ # Contains example scripts
├── tests/ # Contains test scripts
└── docs/ # Contains documentation files
Code Style
We use pre-commit hooks to maintain code quality and consistency. The configuration for these hooks is in the .pre-commit-config.yaml file. These hooks run automatically on git commit, but you can also run them manually:
pre-commit run --all-files
Running Tests
Navigate to the project root in your terminal and run:
uv run pytest
Or, for more verbose output:
uv run pytest -v
To skip the slow BERTScore tests:
uv run pytest -m "not bertscore"
To run only the slow BERTScore tests:
uv run pytest -m bertscore
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/FeatureName)
- Commit your changes (git commit -m 'Add some FeatureName')
- Push to the branch (git push origin feature/FeatureName)
- Open a Pull Request
Please ensure that your code passes all tests and adheres to our code style guidelines (enforced by pre-commit hooks) before submitting a pull request.
Citation
If you find this project useful, please consider citing it in your work:
@software{AI4Society_GAICo_GenAI_Results,
author = {{Nitin Gupta, Pallav Koppisetti, Biplav Srivastava}},
license = {MIT},
title = {{GAICo: GenAI Results Comparator}},
year = {2025},
url = {https://github.com/ai4society/GenAIResultsComparator}
}
Acknowledgments
- The library is developed by Nitin Gupta, Pallav Koppisetti, and Biplav Srivastava. Members of AI4Society contributed to this tool as part of ongoing discussions. Major contributors are credited.
- This library uses several open-source packages including NLTK, scikit-learn, and others. Special thanks to the creators and maintainers of the implemented metrics.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
If you have any questions, feel free to reach out to us at ai4societyteam@gmail.com.