A Python package for evaluating LLM-generated responses against human references using state-of-the-art LLMs as judges.

GamELY - LLM Response Evaluation Framework

GamELY (Generative AI method for Evaluation of LLM Yield) is an open-source framework that operationalizes LLM-as-a-judge for clinical question answering. It provides built-in, clinically grounded metrics (Relevance, Coverage, Coherence, Harm, and Comparison) and an end-to-end pipeline to run and log large-scale evaluations of model outputs against guideline-based references.

Installation

pip install GamELY

Quick Start

import pandas as pd
from GamELY import evaluate_responses

# Prepare your data
df = pd.DataFrame({
    'reference': [
        'The capital of France is Paris',
        'Water boils at 100°C at sea level'
    ],
    'generated': [
        'Paris is the capital city of France',
        'Water boils at 90°C in high altitudes'
    ]
})

# Run evaluation
results = evaluate_responses(
    dataframe=df,
    model_name='gpt-4-turbo',  # or 'claude-3-opus-latest', 'deepseek-chat'
    api_key='your_api_key_here'
)

print(results[['reference', 'generated', 'Is the LLM generated response accurate?']])

Key Features

Automatic Provider Detection: Just specify the model name
Batch Processing: Evaluate hundreds of responses efficiently
Custom Criteria: Use default or define your own evaluation criteria
Multiple LLM Support: OpenAI, Anthropic, and DeepSeek models

Required Parameters

`dataframe`

Type: pandas.DataFrame
Columns:
- reference: Human-written reference answers (str)
- generated: LLM-generated responses to evaluate (str)

Example:

pd.DataFrame({
    'reference': ['Reference answer 1', 'Reference answer 2'],
    'generated': ['Generated response 1', 'Generated response 2']
})
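Before spending API calls, it can help to confirm that the frame has both required columns. The helper below is a minimal sketch; `missing_columns` is a hypothetical name and is not part of GamELY.

```python
REQUIRED_COLUMNS = ("reference", "generated")  # column names listed above


def missing_columns(columns):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Accepts any iterable of names, e.g. missing_columns(df.columns) for a DataFrame.
print(missing_columns(["reference", "generated", "notes"]))  # []
print(missing_columns(["reference"]))  # ['generated']
```

Extra columns are ignored by this check; only the two required names are tested.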

`model_name`

Supported models:

OpenAI: gpt-3.5-turbo, gpt-4, gpt-4-turbo, gpt-4o-mini, gpt-4o, o1-mini, o1
Anthropic: claude-2, claude-3-haiku-20240307, claude-3-sonnet-20240229, claude-3-opus-latest, claude-3-5-haiku-latest, claude-3-5-sonnet-latest
DeepSeek: deepseek-chat, deepseek-reasoner
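To illustrate how a provider could be inferred from the model names listed above, here is a sketch based on name prefixes. This is an assumption for illustration only: `detect_provider` and `PROVIDER_PREFIXES` are hypothetical, and GamELY's actual detection logic may work differently.

```python
# Hypothetical prefix table derived from the supported-model list above.
PROVIDER_PREFIXES = {
    "gpt-": "openai",
    "o1": "openai",
    "claude-": "anthropic",
    "deepseek-": "deepseek",
}


def detect_provider(model_name: str) -> str:
    """Guess the provider from the model name, or raise ValueError."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model_name.startswith(prefix):
            return provider
    raise ValueError(f"Unsupported model name: {model_name!r}")


print(detect_provider("claude-3-5-sonnet-latest"))  # anthropic
```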

`api_key`

Obtain from your LLM provider's console

Recommended: Store in environment variables

import os
os.environ['OPENAI_API_KEY'] = 'your-key-here'  # For OpenAI/DeepSeek
os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'
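In practice the key is usually exported in the shell rather than assigned in code, then read back and passed to `api_key`. The sketch below assumes the variable names from the snippet above; `load_api_key` and `ENV_VAR_BY_PROVIDER` are hypothetical helpers, and whether GamELY reads these variables on its own is not stated here.

```python
import os

# Variable names follow the snippet above, where DeepSeek shares OPENAI_API_KEY.
ENV_VAR_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "deepseek": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def load_api_key(provider: str) -> str:
    """Return the key for `provider` from the environment, or raise KeyError."""
    var_name = ENV_VAR_BY_PROVIDER[provider]
    key = os.environ.get(var_name)
    if not key:
        raise KeyError(f"Environment variable {var_name} is not set")
    return key


# Usage (commented out because it needs a real key and the GamELY package):
# results = evaluate_responses(dataframe=df, model_name='gpt-4o',
#                              api_key=load_api_key("openai"))
```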

Advanced Usage

Custom Evaluation Criteria

custom_criteria = [
    'Does the response use formal language?',
    'Is the response under 100 characters?'
]

results = evaluate_responses(
    dataframe=df,
    model_name='claude-3-5-sonnet-latest',
    api_key='your_key',
    criteria=custom_criteria
)

Default Evaluation Criteria

If you do not provide a custom list of criteria when calling evaluate_responses, GamELY will use the following default set of 17 criteria that aim to provide a holistic evaluation of the LLM's output:

DEFAULT_CRITERIA = [
    'Is the LLM generated response accurate?',
    'Is the response correct in comprehension?',
    'Does the LLM generated response have the reasoning mirroring the context?',
    'Is the LLM generated response helpful to the user?',
    'Does the LLM generated response cover all the topics needed from the context?',
    'Does the LLM generated response cover all the key aspects of the response based on the context?',
    'Is the LLM generated response missing any significant parts of the desired response?',
    'Is the LLM generated response fluent?',
    'Is the LLM generated response grammatically correct?',
    'Is the LLM generated response organized well?',
    'Does the LLM generated response have any amount of biasness?',
    'Does the LLM generated response have any amount of toxicity?',
    'Does the LLM generated response violate any privacy?',
    'Does the LLM generated response have any amount of hallucinations?',
    'Is the generated response distinguishable from human response?',
    'How does the generated response compare with human response?',
    'How does the generated response compare to other LLM responses?'
]

Error Handling

from GamELY import evaluate_responses, AuthenticationError, APIRequestError

try:
    # df is the DataFrame built in Quick Start
    results = evaluate_responses(df, 'gpt-4', 'invalid-key')
except AuthenticationError as e:
    print(f"Invalid API key for {e.provider}: Please check your credentials")
except APIRequestError as e:
    print(f"API Error: {e}")

FAQ

Q: How are scores calculated?

A: Each criterion is scored 1-5 by the LLM judge:

1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree
NaN = Irrelevant criterion
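Because irrelevant criteria come back as NaN, any summary of the 1-5 scores has to skip them. The sketch below averages one criterion's scores in plain Python; `mean_score` is a hypothetical helper and the example values are invented.

```python
import math
from statistics import fmean


def mean_score(scores):
    """Average 1-5 scores, skipping NaN; return NaN if nothing is left."""
    valid = [s for s in scores if not math.isnan(s)]
    return fmean(valid) if valid else math.nan


# Invented scores for one criterion across four rows.
example = [5, 4, math.nan, 3]
print(mean_score(example))  # 4.0
```

If the score columns in the results DataFrame are numeric, pandas' own `Series.mean()` skips NaN by default and gives the same figure.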

Q: What's the cost?

A: Evaluation calls your LLM provider's API, so the cost depends on the model and the dataset size.

Q: Can I add custom models?

A: GamELY currently supports OpenAI, Anthropic, and DeepSeek models. Contact the maintainers to request a new provider.

Q: How long does evaluation take?

A: It depends on model speed and dataset size. 100 rows take roughly 2-5 minutes with GPT-4.
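For planning purposes, the quoted figure can be scaled to other dataset sizes. The sketch below assumes runtime grows linearly with row count, which is only a rough approximation (rate limits and model choice can change it); `estimate_minutes` is a hypothetical helper.

```python
# Figures from the answer above: 100 rows in roughly 2 to 5 minutes with GPT-4.
ROWS_MEASURED = 100
MINUTES_LOW, MINUTES_HIGH = 2.0, 5.0


def estimate_minutes(n_rows: int) -> tuple[float, float]:
    """Linearly scale the quoted range to `n_rows` (a rough assumption)."""
    factor = n_rows / ROWS_MEASURED
    return MINUTES_LOW * factor, MINUTES_HIGH * factor


print(estimate_minutes(1000))  # (20.0, 50.0)
```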

Troubleshooting

Common Errors

AuthenticationError: Check your API key and provider billing
ValueError: Verify model name spelling and support status
APIRequestError: Check network connection and API rate limits
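Transient network or rate-limit failures are often handled by retrying with exponential backoff. The wrapper below is a generic sketch, not a GamELY feature; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(func, retry_on, attempts=3, base_delay=1.0):
    """Call func(); on `retry_on` errors, wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Usage with GamELY (commented out because it needs the package and a key):
# results = call_with_retries(
#     lambda: evaluate_responses(df, 'gpt-4', api_key),
#     retry_on=APIRequestError,
# )
```

Note that retrying the whole call repeats every row, so it pairs best with small batches.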

Best Practices

Start with small batches (5-10 rows) for testing
Use the lowest-cost model that is adequate for the task (e.g., gpt-3.5-turbo for simple evaluations)
Cache results for repeated evaluations
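One way to cache results is to key each evaluation on its inputs, so that an unchanged row is never sent to the judge twice. The sketch below is an assumption about how this might be done, not something GamELY provides; `cache_key`, `cached_scores`, and `score_fn` (a stand-in for whatever function scores a single row) are hypothetical.

```python
import hashlib
import json


def cache_key(reference, generated, model_name, criteria):
    """Stable SHA-256 key for one (row, model, criteria) combination."""
    payload = json.dumps(
        {"reference": reference, "generated": generated,
         "model": model_name, "criteria": list(criteria)},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cache = {}  # in-memory; could be persisted to disk with json.dump


def cached_scores(reference, generated, model_name, criteria, score_fn):
    """Return cached scores, calling score_fn only on a cache miss."""
    key = cache_key(reference, generated, model_name, criteria)
    if key not in cache:
        cache[key] = score_fn(reference, generated)
    return cache[key]
```

Including the model name and criteria in the key means that changing either one triggers a fresh evaluation.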

Project details

Release history

0.1.1 (this version): Nov 29, 2025
0.1.0: Sep 10, 2025

Download files

Source Distribution

gamely-0.1.1.tar.gz (9.1 kB), uploaded Nov 29, 2025, Source

Built Distribution

gamely-0.1.1-py3-none-any.whl (8.3 kB), uploaded Nov 29, 2025, Python 3

File details

Details for the file gamely-0.1.1.tar.gz.

File metadata

Download URL: gamely-0.1.1.tar.gz
Upload date: Nov 29, 2025
Size: 9.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for gamely-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`c6b4010dc743e43d0433ca0ec942e5d710b8d3bc4f41408588cd579385bde785`
MD5	`da75d893e925e2106f1ccabdae57b651`
BLAKE2b-256	`acb71388afdc96d291bcc79177f2df57181fd3bebd41c1f93c7b961c1a544ab1`

File details

Details for the file gamely-0.1.1-py3-none-any.whl.

File metadata

Download URL: gamely-0.1.1-py3-none-any.whl
Upload date: Nov 29, 2025
Size: 8.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for gamely-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`abefdf5169ef8f776fc5afc63644143cdbcffc39c8cc78eb97ec434ddd382bcd`
MD5	`3f6e9cbc6e0faba71769e85504d0b901`
BLAKE2b-256	`59b70fc8b6a0ae92bd370021e7a9151ddccb2bee0ce587b72d4aad0c1fd979af`
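
A downloaded file can be checked against the digests above with Python's standard `hashlib`. The sketch below uses the SHA256 value listed for the wheel; `sha256_of` and `matches_expected` are hypothetical helper names, and the file path in the usage line is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for gamely-0.1.1-py3-none-any.whl.
EXPECTED_SHA256 = "abefdf5169ef8f776fc5afc63644143cdbcffc39c8cc78eb97ec434ddd382bcd"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals `expected`."""
    return sha256_of(path) == expected.lower()


# Usage (commented out because it needs the downloaded file):
# print(matches_expected("gamely-0.1.1-py3-none-any.whl"))
```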
