GenAI Results Comparator (GAICo) is a Python library for comparing, analyzing, and visualizing outputs from Large Language Models (LLMs), often against a reference text, using an extensible range of metrics from the literature.

Project description

GAICo: GenAI Results Comparator

Repository: github.com/ai4society/GenAIResultsComparator

Documentation: ai4society.github.io/projects/GenAIResultsComparator

Overview

GenAI Results Comparator (GAICo) is a Python library for comparing, analyzing, and visualizing outputs from Large Language Models (LLMs). It offers an extensible range of metrics, including standard text similarity scores, specialized metrics for structured data like planning sequences and time-series, and multimedia metrics for image and audio.

At its core, the library provides a set of metrics for evaluating various types of outputs—from plain text strings to structured data like planning sequences and time-series, and multimedia content such as images and audio. While the Experiment class streamlines evaluation for text-based and structured string outputs, individual metric classes offer direct control for all data types, including binary or array-based multimedia. These metrics produce normalized scores (typically 0 to 1), where 1 indicates a perfect match, enabling robust analysis and visualization of LLM performance.
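
For cases the Experiment class does not cover, metric classes can be used directly. The following is a minimal sketch of direct metric usage; the JaccardSimilarity class name, its import path, and the calculate() method are assumptions based on this description, so verify them against the documentation:

# Direct metric usage (class name and import path are assumptions;
# check the GAICo documentation for the exact API).
from gaico.metrics import JaccardSimilarity

metric = JaccardSimilarity()

# Score a single candidate/reference pair; scores are normalized,
# with 1 indicating a perfect match.
score = metric.calculate(
    "Sorry, I am designed not to answer such a question.",
    "Sorry, I am unable to answer such a question as it is not appropriate.",
)
print(f"Jaccard similarity: {score:.2f}")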

Quickstart

GAICo's Experiment class offers a streamlined workflow for comparing multiple model outputs, applying thresholds, generating plots, and creating CSV reports.

Here's a quick example:

from gaico import Experiment

# Sample data from https://arxiv.org/abs/2504.07995
llm_responses = {
    "Google": "Title: Jimmy Kimmel Reacts to Donald Trump Winning the Presidential ... Snippet: Nov 6, 2024 ...",
    "Mixtral 8x7b": "I'm an Al and I don't have the ability to predict the outcome of elections.",
    "SafeChat": "Sorry, I am designed not to answer such a question.",
}
reference_answer = "Sorry, I am unable to answer such a question as it is not appropriate."
# Alternatively, if reference_answer is None, the response from the first model ("Google") will be used:
# reference_answer = None

# 1. Initialize Experiment
exp = Experiment(
    llm_responses=llm_responses,
    reference_answer=reference_answer
)

# 2. Compare models using specific metrics
#   This will calculate scores for 'Jaccard' and 'ROUGE',
#   generate a plot (e.g., radar plot for multiple metrics/models),
#   and save a CSV report.
results_df = exp.compare(
    metrics=['Jaccard', 'ROUGE'],  # Specify metrics, or None for all defaults
    plot=True,
    output_csv_path="experiment_report.csv",
    custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35} # Optional: override default thresholds
)

# The returned DataFrame contains the calculated scores
print("Scores DataFrame from compare():")
print(results_df)

# 3. Get a summary of results (e.g., mean scores and pass rates)
summary_df = exp.summarize(metrics=['Jaccard', 'ROUGE'], custom_thresholds={"Jaccard": 0.6, "ROUGE_rouge1": 0.35})
print("\nSummary DataFrame:")
print(summary_df)

For more detailed examples, please refer to our Jupyter Notebooks in the examples/ folder of the repository.

Features

  • Comprehensive Metric Library:
    • Textual Similarity: Jaccard, Cosine, Levenshtein, Sequence Matcher.
    • N-gram Based: BLEU, ROUGE, JS Divergence.
    • Semantic Similarity: BERTScore.
    • Structured Data: Specialized metrics for planning sequences (PlanningLCS, PlanningJaccard) and time-series data (TimeSeriesElementDiff, TimeSeriesDTW).
    • Multimedia: Metrics for image similarity (ImageSSIM, ImageAverageHash, ImageHistogramMatch) and audio quality (AudioSNRNormalized, AudioSpectrogramDistance).
  • Streamlined Evaluation Workflow:
    • A high-level Experiment class to easily compare multiple models, apply thresholds, generate plots, and create CSV reports.
  • Enhanced Reporting:
    • A summarize() method for quick, aggregated overviews of model performance, including mean scores and pass rates.
  • Dynamic Metric Registration:
    • Easily extend the Experiment class by registering your own custom BaseMetric implementations at runtime (see the sketch after this list).
  • Powerful Visualization:
    • Generate bar charts and radar plots to compare model performance using Matplotlib and Seaborn.
  • Efficient & Flexible:
    • Supports batch processing for efficient computation on datasets.
    • Optimized for various input types (lists, NumPy arrays, Pandas Series).
    • Easily extensible architecture for adding new custom metrics.
  • Robust and Reliable:
    • Includes a comprehensive test suite using Pytest.
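
To illustrate the extensibility points above, here is a minimal sketch of a custom metric registered with an Experiment. BaseMetric is named in this README, but its exact abstract interface and the registration method used below (register_custom_metric) are hypothetical; consult the documentation for the real API:

# Hypothetical custom-metric sketch; the BaseMetric interface and the
# registration call are assumptions - check the GAICo docs.
from gaico import Experiment
from gaico.metrics import BaseMetric  # assumed import path

class ExactMatch(BaseMetric):
    """Hypothetical metric: 1.0 on an exact string match, else 0.0."""
    def calculate(self, generated_text, reference_text, **kwargs):
        return 1.0 if generated_text == reference_text else 0.0

exp = Experiment(
    llm_responses={"ModelA": "hello world", "ModelB": "hello there"},
    reference_answer="hello world",
)
exp.register_custom_metric("ExactMatch", ExactMatch())  # hypothetical API
print(exp.compare(metrics=["ExactMatch"]))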

Installation

GAICo can be installed using pip.

  • Create and activate a virtual environment (e.g., named gaico-env):

      # For Python 3.10+
      python3 -m venv gaico-env
      source gaico-env/bin/activate  # On macOS/Linux
      # gaico-env\Scripts\activate   # On Windows
    
  • Install GAICo: Once your virtual environment is active, install GAICo using pip:

      pip install gaico
    

This installs the core GAICo library.

Using GAICo with Jupyter Notebooks/Lab

If you plan to use GAICo within Jupyter Notebooks or JupyterLab (recommended for exploring examples and interactive analysis), install them into the same activated virtual environment:

# (Ensure your 'gaico-env' is active)
pip install notebook  # For Jupyter Notebook
# OR
# pip install jupyterlab # For JupyterLab

Then, launch Jupyter from the same terminal where your virtual environment is active:

# (Ensure your 'gaico-env' is active)
jupyter notebook
# OR
# jupyter lab

New notebooks created in this session should automatically use the gaico-env Python environment. For troubleshooting kernel issues, please see our FAQ document.
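
A common fix for kernel issues (standard Jupyter practice, not specific to GAICo) is to register the virtual environment as a named kernel with ipykernel:

# (Ensure your 'gaico-env' is active)
pip install ipykernel
python -m ipykernel install --user --name gaico-env --display-name "Python (gaico-env)"

The environment then appears as "Python (gaico-env)" in Jupyter's kernel picker.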

Optional Installations

The default pip install gaico is lightweight. Some metrics require extra dependencies, which you can install as needed.

  • To include Audio metrics (requires SciPy and SoundFile):
    pip install 'gaico[audio]'
    
  • To include the BERTScore metric (which has larger dependencies like PyTorch):
    pip install 'gaico[bertscore]'
    
  • To include the CosineSimilarity metric (requires scikit-learn):
    pip install 'gaico[cosine]'
    
  • To include the JSDivergence metric (requires SciPy and NLTK):
    pip install 'gaico[jsd]'
    
  • To install with all optional features:
    pip install 'gaico[audio,bertscore,cosine,jsd]'
    

Tip: The dev extra, used for development installs, also includes all optional features.

Installation Size Comparison

The following table provides an estimated overview of the disk-space impact of the different installation options. Actual sizes vary with your operating system, Python version, and existing packages; the figures are intended primarily to illustrate the relative impact of the optional dependencies.

Note: Core dependencies include: levenshtein, matplotlib, numpy, pandas, rouge-score, and seaborn.

| Installation Command | Dependencies | Estimated Total Size Impact |
| --- | --- | --- |
| pip install gaico | Core | 215 MB |
| pip install 'gaico[audio]' | Core + scipy, soundfile | 330 MB |
| pip install 'gaico[bertscore]' | Core + bert-score (includes torch, transformers, etc.) | 800 MB |
| pip install 'gaico[cosine]' | Core + scikit-learn | 360 MB |
| pip install 'gaico[jsd]' | Core + scipy, nltk | 310 MB |
| pip install 'gaico[audio,jsd,cosine,bertscore]' | Core + all dependencies from above | 1.0 GB |

For Developers (Installing from source)

If you want to contribute to GAICo or install it from source for development:

  1. Clone the repository:

    git clone https://github.com/ai4society/GenAIResultsComparator.git
    cd GenAIResultsComparator
    
  2. Set up a virtual environment and install dependencies:

    We recommend using uv for fast environment and dependency management.

    # Create a virtual environment (Python 3.10-3.12 recommended)
    uv venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install in editable mode with all development dependencies
    uv pip install -e ".[dev]"
    

    If you prefer not to use uv, you can use pip:

    # Create a virtual environment (Python 3.10-3.12 recommended)
    python3 -m venv .venv
    # Activate the environment
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    # Install the package in editable mode with development extras
    pip install -e ".[dev]"
    

    The dev extra installs GAICo with all optional features, plus dependencies for testing, linting, and documentation.

  3. Set up pre-commit hooks (recommended for contributors):

    Pre-commit hooks help maintain code quality by running checks automatically before you commit.

    pre-commit install
    

Citation

If you find GAICo useful in your research or work, please consider citing it:

@software{AI4Society_GAICo_GenAI_Results,
  author = {Gupta, Nitin and Koppisetti, Pallav and Lakkaraju, Kausik and Srivastava, Biplav},
  license = {MIT},
  title = {{GAICo: GenAI Results Comparator}},
  year = {2025},
  url = {https://github.com/ai4society/GenAIResultsComparator}
}

Download files

Download the file for your platform.

Source Distribution

gaico-0.3.0.tar.gz (78.7 MB)

Built Distribution

gaico-0.3.0-py3-none-any.whl (50.9 kB)

File details

Details for the file gaico-0.3.0.tar.gz.

File metadata

  • Download URL: gaico-0.3.0.tar.gz
  • Size: 78.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.6

File hashes

Hashes for gaico-0.3.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 88af7844035a5d799dd027b40c62ca93d137703b441b479665cc002da17f2fa1 |
| MD5 | d614373eab7c0c71e21b455609485648 |
| BLAKE2b-256 | 77ea3b7457cd2f7a6a5c2a4ec496402c586927c9054aa54471ec3106f675005e |

File details

Details for the file gaico-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: gaico-0.3.0-py3-none-any.whl
  • Size: 50.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.6

File hashes

Hashes for gaico-0.3.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 9b384ff8143d4595dd2715a64624b4e0cf39e8ee426135becce4f76ba2251ac4 |
| MD5 | 36708123f0acd8c134aed2604849c7c2 |
| BLAKE2b-256 | 1c56e9f6235aaed1b713cac5f18788d94ad585248013ce07ea7402598e5328c4 |
