Equator: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions.

These details have not been verified by PyPI

Project description

EQUATOR Evaluator

Overview

The EQUATOR Evaluator is a robust framework designed to systematically evaluate the factual accuracy and reasoning capabilities of large language models (LLMs). Unlike traditional evaluation methods, which often prioritize fluency over accuracy, this tool employs a deterministic scoring system that ensures precise and unbiased assessment of LLM-generated responses.

This repository implements the methodology described in the research paper "EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta" (Bernard et al., 2024). By leveraging vector databases and smaller, locally hosted LLMs, the LLM Evaluator bridges the gap between scalability and accuracy in automated assessments.

Study paper: arXiv study (arXiv:2501.00257)

Equator Framework

Key Features

Deterministic Scoring: Assigns binary scores (100% or 0%) based solely on factual correctness.
Vector Database Integration: Embeds open-ended questions and human-evaluated answers for semantic matching.
Automated Evaluation: Uses smaller LLMs to provide scalable and efficient assessments.
Bias Mitigation: Eliminates scoring biases related to linguistic fluency or persuasion.
Cost Efficiency: Optimizes token usage, significantly reducing operational costs for evaluation.

Why LLM Evaluator?

Traditional methods, like multiple-choice or human evaluation, fail to capture the nuanced reasoning and factual accuracy required in high-stakes domains such as medicine or law. The LLM Evaluator:

Focuses on factual correctness over linguistic style.
Reduces reliance on human evaluators by automating the grading process.
Provides insights into where LLMs fall short, enabling targeted improvements in model training.

Methodology

1. Deterministic Scoring Framework

The scoring framework evaluates LLM-generated answers against a vector database of human-evaluated responses. It follows these steps:

Embed Inputs: Convert questions and answers into vector embeddings using models like all-minilm.
Retrieve Closest Match: Identify the most semantically similar answer key using cosine similarity.
Binary Scoring: Assign 100% if the student’s answer matches the answer key; otherwise, 0%.
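The three steps above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the toy three-dimensional vectors stand in for real all-minilm embeddings, the names `retrieve_closest`, `matches_key`, and `binary_score` are hypothetical, and the normalized string comparison is only a stand-in for the evaluator LLM's judgment.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_closest(query_vec, answer_keys):
    """Return the stored entry whose embedding is most similar to the query."""
    return max(
        answer_keys,
        key=lambda entry: cosine_similarity(query_vec, entry["embedding"]),
    )


def matches_key(student_answer, answer_key):
    """Stand-in for the evaluator's judgment: normalized exact match."""
    return student_answer.strip().lower() == answer_key.strip().lower()


def binary_score(is_match):
    """Deterministic scoring: 100 for a match, 0 otherwise."""
    return 100 if is_match else 0


# Toy answer keys with made-up 3-d embeddings (assumption for illustration).
ANSWER_KEYS = [
    {"question": "q1", "answer_key": "4", "embedding": [1.0, 0.0, 0.0]},
    {"question": "q2", "answer_key": "blue", "embedding": [0.0, 1.0, 0.0]},
]

closest = retrieve_closest([0.9, 0.1, 0.0], ANSWER_KEYS)
score = binary_score(matches_key(" 4 ", closest["answer_key"]))
print(closest["question"], score)
```

There is no partial credit anywhere in this flow, which is what makes the score reproducible from run to run.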

2. Vector Database

The vector database, implemented with ChromaDB, stores embeddings of open-ended questions and their corresponding answer keys. This database serves as the single source of truth for evaluations.

3. Evaluator LLM

A smaller LLM (e.g., LLaMA 3.2B) acts as the evaluator, ensuring strict adherence to the scoring criteria while reducing computational overhead.
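One way to keep an evaluator strictly on the binary criteria is to ask it for a small JSON verdict and reject anything else. The sketch below assumes such a reply format; the prompt wording and the functions `build_evaluator_prompt` and `parse_verdict` are hypothetical and are not taken from the package.

```python
import json


def build_evaluator_prompt(question, answer_key, student_answer):
    """Assemble a grading prompt that requests a strict JSON verdict."""
    return (
        "You are a strict grader. Compare the student answer with the answer key.\n"
        f"Question: {question}\n"
        f"Answer key: {answer_key}\n"
        f"Student answer: {student_answer}\n"
        'Reply only with JSON: {"score": 100 or 0, "explanation": "..."}'
    )


def parse_verdict(raw_reply):
    """Map an evaluator reply to (score, explanation); malformed replies score 0."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return 0, "unparseable evaluator reply"
    if not isinstance(data, dict):
        return 0, "evaluator reply is not a JSON object"
    score = data.get("score")
    # Booleans compare equal to 0/1 in Python, so exclude them explicitly.
    if isinstance(score, bool) or score not in (0, 100):
        return 0, "evaluator returned a non-binary score"
    return score, str(data.get("explanation", ""))


print(parse_verdict('{"score": 100, "explanation": "matches the key"}'))
print(parse_verdict("I think it is mostly right"))
```

Treating every malformed reply as 0 is a conservative design choice: the grader can never award credit by accident.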

Details of features

We classify LLMs as evaluators and students. Evaluator LLMs grade the "student" models, in this case the SOTA models available on OpenRouter (276+). The “Evaluator vs. Student” matrix below also covers Groq → Ollama support.

Evaluator vs. Student Matrix

OpenRouter offers 293 models (from OpenAI and other providers), Groq offers 14, and Ollama offers 34,925 (from 148 families and 270 sizes).

Evaluator LLM	Student LLM	Support Status
Ollama (local)	OpenRouter	Currently supported
Ollama (local)	Groq	Currently supported
Ollama (local)	Ollama (local)	Currently supported
Groq	OpenRouter	Currently supported
Groq	Ollama (local)	Currently supported
Groq	Groq	Next release
OpenRouter	OpenRouter	Next release

The number of possible Evaluator-Student pairings follows from the model counts behind each supported combination. The calculation is broken down step by step below.

1. Understanding the Components

Evaluator LLMs:

Ollama (Local): 34,925 models
Groq: 14 models

Student LLMs:

OpenRouter: 293 models
Groq: 14 models
Ollama (Local): 34,925 models

Total Evaluator Models: 34,925 (Ollama) + 14 (Groq) = 34,939 Evaluators

Total Student Models: 293 (OpenRouter) + 14 (Groq) + 34,925 (Ollama) = 35,232 Students

2. Supported Evaluator-Student Combinations

Currently supported combinations are:

Ollama (Evaluator) → OpenRouter (Student)
Ollama (Evaluator) → Groq (Student)
Ollama (Evaluator) → Ollama (Student)
Groq (Evaluator) → OpenRouter (Student)
Groq (Evaluator) → Ollama (Student)

Unsupported (Next Release):

Groq (Evaluator) → Groq (Student)
OpenRouter (Evaluator) → OpenRouter (Student)

3. Calculating the Number of Combinations

A. Current Support

Ollama Evaluator Combinations:
- With OpenRouter Students:
  34,925 Evaluators × 293 Students = 10,233,025 combinations
- With Groq Students:
  34,925 Evaluators × 14 Students = 488,950 combinations
- With Ollama Students:
  34,925 Evaluators × 34,925 Students = 1,219,755,625 combinations
Subtotal for Ollama Evaluators:
10,233,025 + 488,950 + 1,219,755,625 = 1,230,477,600 combinations
Groq Evaluator Combinations:
- With OpenRouter Students:
  14 Evaluators × 293 Students = 4,102 combinations
- With Ollama Students:
  14 Evaluators × 34,925 Students = 488,950 combinations
Subtotal for Groq Evaluators:
4,102 + 488,950 = 493,052 combinations

Total Current Combinations:
1,230,477,600 (Ollama) + 493,052 (Groq) = 1,230,970,652 combinations

B. Future Support (Next Release)

Groq Evaluator → Groq Student:
14 Evaluators × 14 Students = 196 combinations
OpenRouter Evaluator → OpenRouter Student:
293 Evaluators × 293 Students = 85,849 combinations

Total Future Combinations:
196 + 85,849 = 86,045 combinations

4. Grand Total of Possible Evaluator-Student Combinations

Currently Supported: ~1,230,971,000 combinations
With Next Release: ~1,231,057,000 combinations

Note: These two figures are rounded to the nearest thousand (exact values: 1,230,970,652 and 1,231,056,697).

5. Summary

Total Supported Combinations (Current):
~1.23 Billion Evaluator-Student Pairs
Additional Combinations (Next Release):
~86,045 Evaluator-Student Pairs
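The pair counts can be recomputed directly from the per-provider model counts listed above. The snippet is a small arithmetic check; the dictionary and list names are chosen here for illustration only.

```python
# Model counts per provider, as stated in this section.
COUNTS = {"ollama": 34_925, "groq": 14, "openrouter": 293}

# (evaluator, student) provider combinations.
CURRENT = [
    ("ollama", "openrouter"),
    ("ollama", "groq"),
    ("ollama", "ollama"),
    ("groq", "openrouter"),
    ("groq", "ollama"),
]
NEXT_RELEASE = [("groq", "groq"), ("openrouter", "openrouter")]


def count_pairs(combos):
    """Sum evaluator_count * student_count over the given combinations."""
    return sum(COUNTS[evaluator] * COUNTS[student] for evaluator, student in combos)


current_total = count_pairs(CURRENT)
next_total = count_pairs(NEXT_RELEASE)

print(f"Current:      {current_total:,}")               # 1,230,970,652
print(f"Next release: {next_total:,}")                  # 86,045
print(f"Grand total:  {current_total + next_total:,}")  # 1,231,056,697
```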

6. Implications for Testing

With over 1.23 billion possible Evaluator-Student pairs currently supported, comprehensive testing would involve an extensive and potentially resource-intensive process. Here's how you might approach it:

A. Prioritization Strategies:

Model Importance: Focus on evaluating high-impact or frequently used models first.
Diversity: Ensure a diverse range of model families and sizes are tested to cover different capabilities and use cases.
Incremental Testing: Start with a subset of combinations and gradually expand.

B. Automation and Parallelization:

Utilize automated testing frameworks to handle large-scale evaluations.
Leverage parallel processing to distribute the workload across multiple machines or instances.
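On a single machine, the parallelization idea can be sketched with Python's standard `concurrent.futures` module. The function `evaluate_pair` is a hypothetical placeholder for whatever actually runs one Evaluator-Student evaluation; it is not an API of this package.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_pair(pair):
    """Placeholder for one evaluation run (hypothetical; replace with real work)."""
    evaluator, student = pair
    return {"evaluator": evaluator, "student": student, "status": "done"}


def run_parallel(pairs, max_workers=4):
    """Evaluate pairs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_pair, pairs))


pairs = [("ollama-a", "openrouter-x"), ("ollama-a", "groq-y"), ("groq-b", "openrouter-x")]
results = run_parallel(pairs)
print(results)
```

Threads suit this workload when each evaluation mostly waits on network or model responses; CPU-bound work would call for processes or multiple machines instead.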

C. Sampling Techniques:

Instead of exhaustively testing all combinations, use statistical sampling methods to select representative pairs for evaluation.
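A simple random sample of pairs can be drawn without materializing the full cross product. The sketch below is one possible approach, assuming the evaluator and student model names are available as lists; `sample_pairs` is a hypothetical helper, and the fixed seed keeps the selection reproducible.

```python
import random


def sample_pairs(evaluators, students, k, seed=0):
    """Draw k distinct (evaluator, student) pairs uniformly at random."""
    rng = random.Random(seed)
    total = len(evaluators) * len(students)
    k = min(k, total)
    # Sample flat indices, then map each index back to a pair.
    indices = rng.sample(range(total), k)
    return [
        (evaluators[i // len(students)], students[i % len(students)])
        for i in indices
    ]


evaluators = [f"evaluator-{n}" for n in range(5)]
students = [f"student-{n}" for n in range(20)]
chosen = sample_pairs(evaluators, students, k=10, seed=42)
print(chosen)
```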

D. Continuous Integration:

Implement continuous testing pipelines that automatically evaluate new combinations as models are added or updated.

7. Recommendations

Given the sheer volume of possible combinations, it's crucial to implement a strategic testing plan:

Define Testing Objectives: Clearly outline what you aim to achieve with each test (e.g., performance benchmarks, compatibility checks).
Allocate Resources: Ensure you have the necessary computational resources to handle large-scale testing.
Monitor and Iterate: Continuously monitor testing outcomes and refine your strategies based on findings and evolving requirements.

By adopting a structured and prioritized approach, you can effectively manage the extensive testing landscape and ensure robust evaluation of your LLM combinations.

Key Points

Evaluator LLMs (the “grader”)
- Ollama (local).
- Groq.
- More evaluators planned for future releases.
Student LLMs (the “respondent”)
- OpenRouter (276+ models: OpenAI, Anthropic, etc.).
- Groq.
- Ollama (local).
- More students planned for future releases.
Current Highlights
- Ollama can evaluate answers from OpenRouter, Groq, or Ollama itself.
- Groq can evaluate answers from OpenRouter or Ollama; Groq → Groq is planned for the next release.
- Ongoing development will expand these capabilities even further.

Use this chart as a quick reference for which LLM can serve as the evaluator versus which can serve as the student. We will be testing an OpenRouter-to-OpenRouter implementation in our next release.

Installation

Install the Package
Install the equator-qa package directly from PyPI:
```
pip install equator-qa
```
Set Up the Environment
- Rename copy-to.env to .env in your working directory.
- Add the necessary API keys to the .env file.
- Example:
```
OPENROUTER_KEY="sk-xxx"
GROQ_API_KEY="gsk_xxx"
```
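Before running an evaluation, it can help to confirm that both keys are present in the .env file. The check below is a stand-alone sketch using only the standard library; it is not part of the package, and `parse_env_text` and `missing_keys` are hypothetical helper names.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENROUTER_KEY", "GROQ_API_KEY")


def parse_env_text(text):
    """Parse simple KEY="value" lines; skips blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


env_path = Path(".env")
if env_path.exists():
    found = parse_env_text(env_path.read_text(encoding="utf-8"))
    print("Missing keys:", missing_keys(found))
else:
    print("No .env file found in the working directory.")
```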

Optional: Set Up a Virtual Environment
It is recommended to use a virtual environment to avoid conflicts with other Python packages.

On Windows
```
python -m venv .venv
.venv\Scripts\activate
pip install equator-qa
deactivate
```

On Linux/MacOS
```
python3 -m venv .venv
source .venv/bin/activate
pip install equator-qa
deactivate
```

keepVectorDB Setting

Set keepVectorDB to true in your configuration if you’ve already input data and want to avoid re-importing it.
The evaluator uses linguistic_benchmark.json as the source of truth for grading. Customize this file for your use case.

Directory Structure

Hard-code the date in the configuration to maintain a consistent directory structure if needed.



Usage

Running the Program

Launch Jupyter Notebook
```
jupyter notebook
```
Open the Notebook
Navigate to main.ipynb in your browser.
Execute the Cells
Follow the notebook's sequence to:
- Load the dataset.
- Embed questions and answers.
- Run the evaluator.
- Generate results and visualizations.

Viewing Results

Results, including scores and explanations, are saved in the specified output directory as JSON files. Each entry includes:

Question
Model-generated answer
Score
Explanation for the score
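Since every score is either 100 or 0, a result file reduces to a simple accuracy figure. The sketch below assumes each entry is a JSON object with the keys `question`, `answer`, `score`, and `explanation`; these key names are guesses based on the list above, so adjust them to the files the notebook actually writes.

```python
import json


def summarize(entries):
    """Count binary scores and compute accuracy over result entries."""
    scores = [entry["score"] for entry in entries]
    correct = sum(1 for score in scores if score == 100)
    accuracy = correct / len(scores) if scores else 0.0
    return {"total": len(scores), "correct": correct, "accuracy": accuracy}


# Inline sample standing in for the contents of one result file (made-up data).
sample = json.loads("""
[
  {"question": "q1", "answer": "4", "score": 100, "explanation": "matches the key"},
  {"question": "q2", "answer": "red", "score": 0, "explanation": "key says blue"}
]
""")

print(summarize(sample))
```

For a real run, replace the inline sample with `json.load` on a file from your output directory.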

Example Dataset

The repository includes two datasets to test the reasoning capabilities of LLMs:

Default Dataset:
- The file linguistic_benchmark.json contains open-ended questions across various categories, such as puzzles, spatial reasoning, and logic. This smaller dataset is ideal for quick tests or debugging.
Extended Dataset:
- A larger dataset is also included and must be renamed before use.


Why We Keep Our Dataset Private

Our research examines the performance of large language models (LLMs) across state-of-the-art (SOTA) benchmarks, and we aim to maintain statistically significant evaluation results. If we were to release our full dataset publicly, there is a risk that future models could be trained or fine-tuned on our test items, which would compromise the fairness and meaningfulness of our benchmark. By keeping these data private, we ensure that our comparisons remain valid and our results accurately reflect model performance under unbiased test conditions.

Although our primary focus is maintaining a statistically significant and unbiased dataset for testing AI performance in QA reasoning and logic, we understand that different industries, such as law, medicine, or finance, have unique needs. Our linguistic_benchmark.json file can be extended to include domain-specific prompts and example responses. This approach allows you to evaluate how well AI models perform in your specialized context without compromising the integrity of our core benchmarking methodology. By adding your own questions, you can preserve our standardized evaluation framework while tailoring the tests to your field’s specific challenges. We aim to maintain current benchmark results for EQUATOR at equator.github.io.
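Extending the benchmark with a domain-specific item could look like the sketch below. The entry schema (`index`, `category`, `question`, `human_answer`) is an assumption made for illustration and may not match the actual layout of linguistic_benchmark.json, so check the shipped file before adapting it. The helper `add_question` is hypothetical.

```python
import json


def add_question(benchmark, category, question, human_answer):
    """Return a new list with one extra entry; the input list is not modified."""
    next_index = max((item["index"] for item in benchmark), default=0) + 1
    entry = {
        "index": next_index,
        "category": category,
        "question": question,
        "human_answer": human_answer,
    }
    return benchmark + [entry]


# Made-up starting content standing in for the loaded benchmark file.
benchmark = [
    {
        "index": 1,
        "category": "Puzzle",
        "question": "Example question",
        "human_answer": "Example answer",
    }
]

extended = add_question(
    benchmark,
    category="Medicine",
    question="Example domain-specific question",
    human_answer="Example reference answer",
)
print(json.dumps(extended[-1], indent=2))
```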

Contributions

Authors

Raymond Bernard (Independent Researcher)
Shaina Raza, Ph.D. (Vector Institute)
Subhabrata Das, PhD (JP Morgan Chase)
Rahul Murugan (Columbia University)

Future Work

Expand the vector database to include more diverse datasets.
Optimize the embedding and retrieval process for larger-scale deployments.
Investigate additional scoring criteria for complex reasoning tasks.

Acknowledgment: We extend our gratitude to James Huckle for inspiring our work.
We have incorporated elements from https://github.com/autogenai/easy-problems-that-llms-get-wrong.
Our approach advances the field by simplifying the benchmarking process through our capability to score open-ended questions effectively.
Rather than benchmarking multiple models across disparate APIs, we leverage OpenRouter.ai's unified API, using the OpenAI SDK, which provides access to over 270 models for comprehensive benchmarking.

Citation

If you use this framework in your research, please cite:

@article {bernard2024equator,
  title        = {{EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. \# v1.0.0-beta}},
  author       = {Bernard, Raymond and Raza, Shaina and Das, Subhabrata and Murugan, Rahul},
  year         = {2024},
  eprint       = {2501.00257},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  note         = {MSC classes: 68T20; ACM classes: I.2.7; I.2.6; H.3.3},
  howpublished = {arXiv preprint arXiv:2501.00257 [cs.CL]},
  doi          = {10.48550/arXiv.2501.00257},
}

Project details

Release history

0.0.7 (Jan 30, 2025)
0.0.6 (Jan 20, 2025)
0.0.5 (Jan 15, 2025)
0.0.4 (Jan 12, 2025)
0.0.3 (Jan 11, 2025)
0.0.2 (Jan 11, 2025)
0.0.1 (Jan 11, 2025), this version

Download files


Source Distribution

equator_qa-0.0.1.tar.gz (231.2 kB)

Uploaded Jan 11, 2025 Source

Built Distribution


equator_qa-0.0.1-py3-none-any.whl (226.1 kB)

Uploaded Jan 11, 2025 Python 3

File details

Details for the file equator_qa-0.0.1.tar.gz.

File metadata

Download URL: equator_qa-0.0.1.tar.gz
Upload date: Jan 11, 2025
Size: 231.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for equator_qa-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`3f2e3acf7ae2ea0ad84b7a6afe7b7b327a775b63601c8d538c8592362eeaa220`
MD5	`2dd3c89772c364570ca5192376ad0820`
BLAKE2b-256	`0ba13ccdd3e02e75f7e8379a0544865134186896b13ec97fab822a6dde4d2d18`


File details

Details for the file equator_qa-0.0.1-py3-none-any.whl.

File metadata

Download URL: equator_qa-0.0.1-py3-none-any.whl
Upload date: Jan 11, 2025
Size: 226.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for equator_qa-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7c1408d3306bc8a71ac089046e706081819c73132193b80bf3ccf8847fb9088`
MD5	`e75269a355eea7c2ab05d17eccdc1c93`
BLAKE2b-256	`020a78c6de7a005a9541ea824ecc8a94b59312bc5bd906ae9f9eb877ce9a1964`


equator-qa 0.0.1
