
AITER Metric


aiter-metric


AITER is an evaluation metric for large language models (LLMs) focused on veracity.

It measures how factually consistent a generated answer is with respect to a reference text, inspired by the HTER (Human-targeted Translation Edit Rate) approach.

Method description

For each input request (the user question), we consider:

  • Hypothesis: the model’s answer to the request
  • Reference: an ideal answer written by a human expert
  • Context: all the useful supporting information that may help with corrections

The pipeline runs in three stages:

  1. Off-topic filtering (LLM): Using the context, an LLM removes irrelevant or out-of-scope fragments from the hypothesis, producing a filtered hypothesis that only keeps content actually addressing the request.

  2. Correction (LLM): Using the reference and context, an LLM produces a corrected hypothesis that minimally edits the hypothesis to make it factually consistent.

  3. Edit-distance scoring (TER): We compute TER (Translation Edit Rate) to quantify both veracity and tendency to digress:

    • TER(hypothesis → corrected_hypothesis) — how much the original answer must change to be factually correct.
    • TER(hypothesis → filtered_hypothesis) — how much unrelated content had to be removed.
    • TER(filtered_hypothesis → corrected_hypothesis) — how much of the relevant content is factually correct.

Lower TER means fewer edits are needed.
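To make the scoring stage concrete, here is a minimal sketch of word-level TER, computed as edit distance over reference length. Real TER (and the implementation this package relies on) also counts block shifts and applies tokenization; this simplified version handles only insertions, deletions, and substitutions, and the function name `word_ter` is illustrative, not part of the aiter API.

```python
def word_ter(hypothesis: str, reference: str) -> float:
    """Word-level edit distance normalized by reference length.

    A simplified TER: insertions, deletions, and substitutions only
    (real TER additionally counts block shifts).
    """
    hyp, ref = hypothesis.split(), reference.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(hyp)][len(ref)] / max(len(ref), 1)

print(word_ter("the cat sat", "the cat sat"))  # → 0.0 (no edits needed)
print(word_ter("the dog sat", "the cat sat"))  # one substitution out of three words
```

An unchanged answer scores 0.0; replacing one word in a three-word reference scores 1/3, matching the "lower is better" reading above.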

Installation

Install from PyPI:

pip install aiter-metric

Or from source:

git clone https://github.com/dieuantoine/aiter-metric.git
cd aiter-metric
pip install -e .

Setup

This package requires access to LLM APIs from Mistral and Gemini. Before using the metric, you must set up your API keys so that the package can query these models.

You can export your keys as environment variables (recommended):

export MISTRAL_API_KEY="your_mistral_api_key"
export GEMINI_API_KEY="your_gemini_api_key"

Alternatively, you can use a .env file or pass the key when initializing the metric.
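If you prefer to set the keys from Python rather than the shell, a minimal equivalent looks like this (placeholder values shown; replace them with your real keys, or load them from a .env file):

```python
import os

# Set the API keys in the current process only if they are not
# already defined in the environment. The variable names follow
# the shell example above.
os.environ.setdefault("MISTRAL_API_KEY", "your_mistral_api_key")
os.environ.setdefault("GEMINI_API_KEY", "your_gemini_api_key")

print("MISTRAL_API_KEY" in os.environ)  # → True
```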

Quick Start

This package exposes a single entry point:

from aiter import Scorer

Inputs and Outputs

Dataframe

Scorer expects a pandas DataFrame with the following columns:

  • request_id — unique identifier of the example
  • request — the user question / prompt given to the conversational agent
  • reference — the ideal human-written answer
  • context — additional information to support correction (can be empty if none)
  • hypothesis — the model’s answer to evaluate
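Since Scorer expects exactly these columns, a small pre-flight check can fail fast before any API calls are made. This helper is hypothetical (not part of the aiter API); it simply mirrors the column list above:

```python
import pandas as pd

# Columns required by Scorer, as listed above.
REQUIRED_COLUMNS = ["request_id", "request", "reference", "context", "hypothesis"]

def validate_input(df: pd.DataFrame) -> None:
    """Raise ValueError if any column Scorer expects is missing."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"input DataFrame is missing columns: {missing}")

# A frame with the right schema passes silently.
validate_input(pd.DataFrame(columns=REQUIRED_COLUMNS))
```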

Version/config dictionary

You must also pass a version dictionary to select the method and language:

  • CODE_VERSION: "1", "2", or "3" (recommended: "3")
  • LANG: language of your data ("en", "fr")
  • REFORMULATION_MODEL: the Gemini or Mistral model name to use for filtering/correction (e.g., "gemini-2.5-pro" or "mistral-medium-latest")

The method get_available_models() returns a list of all supported Gemini and Mistral model identifiers available for use in the REFORMULATION_MODEL parameter.

French ("fr") is available for all code versions ("1", "2", and "3"), while English ("en") is currently supported only for version 3.
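The version/language constraints above can be checked locally before constructing a Scorer. The names `SUPPORTED_LANGS` and `check_version` below are illustrative, not part of the aiter API; the table simply encodes the support matrix stated above:

```python
# Language support per code version, as described above:
# "fr" for all versions, "en" only for version "3".
SUPPORTED_LANGS = {"1": {"fr"}, "2": {"fr"}, "3": {"fr", "en"}}

def check_version(version: dict) -> None:
    """Raise ValueError if LANG is unsupported for CODE_VERSION."""
    code, lang = version["CODE_VERSION"], version["LANG"]
    if lang not in SUPPORTED_LANGS.get(code, set()):
        raise ValueError(
            f"LANG={lang!r} is not supported for CODE_VERSION={code!r}"
        )

check_version({
    "CODE_VERSION": "3",
    "LANG": "en",
    "REFORMULATION_MODEL": "gemini-2.5-pro",
})  # passes: English is supported for version 3
```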

Output

After calling the methods reformulation() and scoring(), the results attribute holds a dictionary of mean scores aggregated over all evaluated examples, and the df attribute holds a pandas DataFrame aligned with your input, enriched with columns that describe the processing stages and scores:

  • filtered_hypothesis — the hypothesis after off-topic or irrelevant content has been removed (produced by the filtering LLM)
  • corrected_hypothesis — the minimally edited version of the hypothesis, made factually consistent with the reference and context
  • cor_score — correction score: TER(filtered_hypothesis → corrected_hypothesis)
  • ot_score — off-topic score: TER(hypothesis → filtered_hypothesis)
  • score — global score: TER(hypothesis → corrected_hypothesis)
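The aggregation behind the results dictionary can be sketched as a plain column-wise mean over these score columns. The values below are invented for illustration; on a real run they would come from scorer.df:

```python
import pandas as pd

# Illustrative output frame with the score columns described above.
out = pd.DataFrame({
    "cor_score": [0.20, 0.40],
    "ot_score":  [0.00, 0.10],
    "score":     [0.25, 0.50],
})

# Mean of each score column over all evaluated examples.
means = out[["cor_score", "ot_score", "score"]].mean()
print(means["score"])  # → 0.375
```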
Example

import os
import pandas as pd
from aiter import Scorer

df = pd.DataFrame([
    {
        "request_id": "001",
        "request": "Where is the Eiffel Tower located?",
        "reference": "The Eiffel Tower is located in Paris, France.",
        "context": "The Eiffel Tower is a landmark in Paris, inaugurated in 1889.",
        "hypothesis": "The Eiffel Tower is in Berlin."
    }
])

version = {
    "CODE_VERSION": "3",
    "LANG": "en",
    "REFORMULATION_MODEL": "gemini-2.5-pro"
}

scorer = Scorer(
    df,
    version,
    # api_key="YOUR_API_KEY"  # pass directly if keys are not set as environment variables
)

scorer.reformulation()
scorer.scoring()

print(scorer.df.head())

Repository Structure

aiter/
├── config/         # Configuration files and default settings for the APIs
├── llm_api/        # Wrappers and utilities to interact with external LLM APIs
├── metric/         # Core implementation of the metric
├── utils/          # Utility functions
└── prompts/        # Prompt templates

License

Distributed under the MIT license. See LICENSE for more details.

Authors

Antoine Dieu @ALT-EDIC
