
Open-Source Evaluation for GenAI Application Pipelines.

Project description


Overview

continuous-eval is an open-source package created for granular and holistic evaluation of GenAI application pipelines.

How is continuous-eval different?

  • Modularized Evaluation: Measure each module in the pipeline with tailored metrics.

  • Comprehensive Metric Library: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.

  • Leverage User Feedback in Evaluation: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.

  • Synthetic Dataset Generation: Generate large-scale synthetic datasets to test your pipeline.

Getting Started

This code is provided as a PyPI package. To install it, run the following command:

python3 -m pip install continuous-eval

If you want to install from source:

git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras

To run LLM-based metrics, the package requires at least one LLM API key to be set in .env. Take a look at the example env file .env.example.
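
For instance, a minimal .env could look like the sketch below; the variable names are common provider keys used here as assumptions, so check .env.example for the exact names your providers need:

# Only one provider key is required
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."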

Run a single metric

Here's how to run a single metric on a datum; see the documentation for the full list of available metrics.

from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

metric = PrecisionRecallF1()

print(metric(**datum))

Off-the-shelf Metrics

Module          | Category      | Metrics
----------------|---------------|--------
Retrieval       | Deterministic | PrecisionRecallF1, RankedRetrievalMetrics
Retrieval       | LLM-based     | LLMBasedContextPrecision, LLMBasedContextCoverage
Text Generation | Deterministic | DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability
Text Generation | Semantic      | DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity
Text Generation | LLM-based     | LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency
Classification  | Deterministic | ClassificationAccuracy
Code Generation | Deterministic | CodeStringMatch, PythonASTSimilarity
Code Generation | LLM-based     | LLMBasedCodeGeneration
Agent Tools     | Deterministic | ToolSelectionAccuracy
Custom          |               | Define your own metrics

To define your own metrics, you only need to extend the Metric class and implement the __call__ method. Optional methods are batch (if it is possible to implement optimizations for batch processing) and aggregate (to aggregate metric results over multiple samples). A minimal sketch follows.
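
Here is a sketch of a custom metric. The continuous_eval.metrics.base import path, the dict-of-scores return value, and the aggregate signature mirror the built-in metrics but are assumptions to verify against the installed version:

from continuous_eval.metrics.base import Metric


class AnswerLengthRatio(Metric):
    """Hypothetical metric: ratio of the answer length to the shortest ground-truth answer."""

    def __call__(self, answer: str, ground_truths: list, **kwargs):
        shortest = min(len(gt) for gt in ground_truths)
        return {"answer_length_ratio": len(answer) / max(shortest, 1)}

    def aggregate(self, results: list):
        # Optional: average the per-sample scores over a dataset
        values = [r["answer_length_ratio"] for r in results]
        return {"answer_length_ratio": sum(values) / len(values)}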

Run evaluation on pipeline modules

Define modules in your pipeline and select corresponding metrics.

from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness, FleschKincaidReadability
from typing import List, Dict

dataset = Dataset("dataset_folder")

# Simple 3-step RAG pipeline with Retriever->Reranker->Generation
retriever = Module(
    name="Retriever",
    input=dataset.question,
    output=List[str],
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

reranker = Module(
    name="reranker",
    input=retriever,
    output=List[Dict[str, str]],
    eval=[
        RankedRetrievalMetrics().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
    ],
)

pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
print(pipeline.graph_repr()) # optional: visualize the pipeline

Now you can run the evaluation on your pipeline:

eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]  # get the question or any other field
    # run your pipeline ...
    eval_manager.next_sample()

To log the results, call the eval_manager.log method with the module name and its output, for example:

eval_manager.log("answer_generator", response)

The evaluation manager also offers the following; an end-to-end sketch combining these calls follows the list.

  • eval_manager.run_metrics() to run all the metrics defined in the pipeline
  • eval_manager.run_tests() to run the tests defined in the pipeline (see the documentation for more details)
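
Putting the pieces together, here is a sketch of a full run over the dataset. The eval_manager import path and the set_pipeline call are assumptions to verify against the installed version, and retrieve, rerank, and generate stand in for your own pipeline code:

from continuous_eval.eval.manager import eval_manager

eval_manager.set_pipeline(pipeline)  # assumed API: register the pipeline defined above
eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]
    chunks = retrieve(q)              # your retriever
    eval_manager.log("Retriever", chunks)
    reranked = rerank(q, chunks)      # your reranker
    eval_manager.log("reranker", reranked)
    response = generate(q, reranked)  # your generator
    eval_manager.log("answer_generator", response)
    eval_manager.next_sample()

eval_manager.run_metrics()  # compute every metric defined in the pipeline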

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Open Analytics

We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement. You can take a look at exactly what we track in the telemetry code.

To disable usage tracking, set the CONTINUOUS_EVAL_DO_NOT_TRACK environment variable to true.
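
For example, in a POSIX shell:

export CONTINUOUS_EVAL_DO_NOT_TRACK=true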

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

continuous_eval-0.3.0.tar.gz (39.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

continuous_eval-0.3.0-py3-none-any.whl (48.1 kB)

Uploaded Python 3

File details

Details for the file continuous_eval-0.3.0.tar.gz.

File metadata

  • Download URL: continuous_eval-0.3.0.tar.gz
  • Upload date:
  • Size: 39.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.3.0

File hashes

Hashes for continuous_eval-0.3.0.tar.gz

Algorithm   | Hash digest
------------|------------
SHA256      | 83462e378bffa08f43b36868ac0ec246f0ec10c4871cd97d483236f4bd9a4e38
MD5         | 7ce16e1d0fd29bcb252e3c53e40132d9
BLAKE2b-256 | d10d6dbb9c79cf28595f9489ea07a104d4829d13e1b2c1fcabcba367b8424d30


File details

Details for the file continuous_eval-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: continuous_eval-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 48.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.3.0

File hashes

Hashes for continuous_eval-0.3.0-py3-none-any.whl

Algorithm   | Hash digest
------------|------------
SHA256      | 7f8e1661f6269d3a776f44b4bfbc28311cec0de3fd3e94dab68725c73aad704a
MD5         | 37f231374823e7334551d5af041478bc
BLAKE2b-256 | d8747c56514c48c46468f276cdce1e59529cf65abb4271ac233585a92fe542c0

