Continuous evaluation for retrieval-based LLM pipelines.

Continuous Evaluation for retrieval-based LLM pipelines

continuous-eval is an open-source package created for the scientific and practical evaluation of LLM application pipelines. Currently, it focuses on retrieval-augmented generation (RAG) pipelines.

Why another eval package?

Good LLM evaluation should help developers reliably identify weaknesses in the pipeline, inform what actions to take, and accelerate development from prototype to production. Ideally, LLM evaluation would run as part of the CI/CD pipeline like any other software test, but this remains challenging today because:

Human evaluation is trustworthy but not scalable

  • Eyeballing can only be done on a small dataset, and it has to be repeated for any pipeline update
  • User feedback is spotty and lacks granularity

Using LLMs to evaluate LLMs is expensive, slow and difficult to trust

  • Can be very costly and slow to run at scale
  • Can be biased towards certain answers and often doesn’t align well with human evaluation

How is continuous-eval different?

  • Comprehensive RAG Metric Library: mix and match Deterministic, Semantic and LLM-based metrics.

  • Trustworthy Ensemble Metrics: easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.

  • Cheaper and Faster Evaluation: our hybrid pipeline slashes cost by up to 15x compared to pure LLM-based metrics, and reduces eval time on large datasets from hours to minutes.

Installation

This code is provided as a Python package. To install it, run the following command:

python3 -m pip install continuous-eval

If you want to install from source:

git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras

Getting Started

Prerequisites

To run the LLM-based metrics, set OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY and/or GEMINI_API_KEY) in a .env file.
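Before running LLM-based metrics, it can help to confirm that the keys are actually visible to the process. The snippet below is a minimal sketch, not part of continuous-eval: the helper `missing_keys` is hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os

# Key names are taken from the prerequisites above.
REQUIRED_KEYS = ["OPENAI_API_KEY"]
OPTIONAL_KEYS = ["ANTHROPIC_API_KEY", "GEMINI_API_KEY"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report anything that would prevent the LLM-based metrics from running.
missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing required keys:", ", ".join(missing))
```

The optional keys can be checked the same way with `missing_keys(OPTIONAL_KEYS)`.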

Usage

from continuous_eval.metrics import PrecisionRecallF1, RougeChunkMatch

datum = {
    "question": "What is the capital of France?",
    "retrieved_contexts": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_contexts": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

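# Score the retrieved contexts against the ground-truth contexts,
# using ROUGE-based matching to decide whether two chunks match
metric = PrecisionRecallF1(RougeChunkMatch())
print(metric.calculate(**datum))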

To run over a dataset, you can use one of the evaluator classes:

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.evaluators import RetrievalEvaluator
from continuous_eval.metrics import PrecisionRecallF1, RankedRetrievalMetrics

# Build a dataset from a list of dictionaries containing question/answer/context/etc.,
# or download one of the examples
dataset = example_data_downloader("retrieval")
# Set up the evaluator
evaluator = RetrievalEvaluator(
    dataset=dataset,
    metrics=[
        PrecisionRecallF1(),
        RankedRetrievalMetrics(),
    ],
)
# Run the eval!
evaluator.run(k=2, batch_size=1)
# Peek at the results
print(evaluator.aggregated_results)
# Save the results for future use
evaluator.save("retrieval_evaluator_results.jsonl")
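The saved file can be inspected later without re-running the evaluation. The sketch below assumes, based only on the .jsonl extension, that the file holds one JSON object per line; `load_jsonl` is a hypothetical helper and not part of continuous-eval.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines.

    Assumption: each non-empty line is one standalone JSON object.
    """
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Example usage, once the evaluator has written its results:
# rows = load_jsonl("retrieval_evaluator_results.jsonl")
# print(len(rows), "records loaded")
```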

For generation you can instead use the GenerationEvaluator.

Metrics

Retrieval-based metrics

Deterministic

  • PrecisionRecallF1: Rank-agnostic metrics including Precision, Recall, and F1 of Retrieved Contexts
  • RankedRetrievalMetrics: Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) of retrieved contexts
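To make the difference between rank-agnostic and rank-aware scores concrete, here is a minimal sketch of the textbook definitions for a single query. It is not the library's implementation: it assumes exact-match, binary relevance and unique retrieved items, whereas continuous-eval decides matches through a chunk-matching strategy. All function names are illustrative. MAP and MRR are the means of `average_precision` and `reciprocal_rank` across queries.

```python
import math


def precision_recall_f1(retrieved, relevant):
    """Rank-agnostic scores from exact matches between retrieved and relevant items."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant_set = set(relevant)
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved, relevant):
    """Sum of precision@k at each relevant hit, divided by the number of relevant items."""
    relevant_set = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set) if relevant_set else 0.0


def ndcg(retrieved, relevant):
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(retrieved, start=1)
        if item in relevant_set
    )
    ideal_hits = min(len(relevant_set), len(retrieved))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two rankings with the same items: rank-agnostic scores agree, rank-aware ones do not.
relevant = ["a", "c"]
for ranking in (["a", "b", "c"], ["b", "c", "a"]):
    print(ranking, precision_recall_f1(ranking, relevant), average_precision(ranking, relevant))
```

Both rankings contain the same relevant chunks, so precision, recall, and F1 are identical, while average precision drops when the relevant chunks appear lower in the list.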

LLM-based

  • LLMBasedContextPrecision: Precision and Mean Average Precision (MAP) based on context relevancy classified by LLM
  • LLMBasedContextCoverage: Proportion of statements in the ground truth answer that can be attributed to the Retrieved Contexts, calculated by LLM

Generation metrics

Deterministic

  • DeterministicAnswerCorrectness: Includes Token Overlap (Precision, Recall, F1), ROUGE-L (Precision, Recall, F1), and BLEU score of Generated Answer vs. Ground Truth Answer
  • DeterministicFaithfulness: Proportion of sentences in Answer that can be matched to Retrieved Contexts using ROUGE-L precision, Token Overlap precision and BLEU score
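As an illustration of the token-overlap part of these metrics, the sketch below computes precision, recall, and F1 over shared tokens. It is only an approximation of the idea: lowercasing and whitespace tokenization are assumptions made here, and the library's own tokenization and normalization may differ. The function name `token_overlap` is hypothetical.

```python
from collections import Counter


def token_overlap(answer, ground_truth):
    """Token-level precision, recall, and F1 of an answer against a reference.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    precision = overlap / len(answer_tokens) if answer_tokens else 0.0
    recall = overlap / len(truth_tokens) if truth_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A verbose answer covers the reference fully (high recall) but dilutes precision.
print(token_overlap("Paris is the capital", "Paris"))
```

The same function applied sentence by sentence against the retrieved contexts gives a rough feel for how a precision-based faithfulness check could behave.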

Semantic

  • DebertaAnswerScores: Entailment and contradiction scores between the Generated Answer and Ground Truth Answer
  • BertAnswerRelevance: Similarity score based on the BERT model between the Generated Answer and Question
  • BertAnswerSimilarity: Similarity score based on the BERT model between the Generated Answer and Ground Truth Answer

LLM-based

  • LLMBasedFaithfulness: Binary classifications of whether the statements in the Generated Answer can be attributed to the Retrieved Contexts by LLM
  • LLMBasedAnswerCorrectness: Score (1-5) of the Generated Answer based on the Question and Ground Truth Answer, calculated by LLM

Resources

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.

Open Analytics

We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement. You can see exactly what we track in the telemetry code.

To disable usage tracking, set the CONTINUOUS_EVAL_DO_NOT_TRACK flag to true.
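One way to set the flag from Python is sketched below. It assumes the flag is read from the process environment and that setting it before the package is imported is sufficient; neither detail is confirmed here, so check the telemetry code if in doubt.

```python
import os

# Assumption: the package reads this flag from the environment, so it is set
# before any continuous_eval import.
os.environ["CONTINUOUS_EVAL_DO_NOT_TRACK"] = "true"

# import continuous_eval  # import the package only after the flag is set
```

Exporting the variable in the shell or adding it to .env should have the same effect under the same assumption.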
