Open-Source Evaluation for GenAI Application Pipelines.

Overview

continuous-eval is an open-source package created for granular and holistic evaluation of GenAI application pipelines.

How is continuous-eval different?

  • Modularized Evaluation: Measure each module in the pipeline with tailored metrics.

  • Comprehensive Metric Library: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.

  • Leverage User Feedback in Evaluation: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.

  • Synthetic Dataset Generation: Generate large-scale synthetic datasets to test your pipeline.

Getting Started

This code is provided as a PyPI package. To install it, run the following command:

python3 -m pip install continuous-eval

If you want to install from source:

git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras

To run LLM-based metrics, the code requires at least one of the LLM API keys in .env. Take a look at the example env file .env.example.
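Before running LLM-based metrics it can help to confirm that at least one key is actually set. The following is a minimal sketch using only the standard library; the key names in CANDIDATE_KEYS and the helper functions are hypothetical and are not part of continuous-eval. The names the package really reads are the ones listed in .env.example.

```python
import os

# Hypothetical key names for illustration only; the names the package
# actually reads are the ones listed in .env.example.
CANDIDATE_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_keys(env, candidates=CANDIDATE_KEYS):
    """Return the candidate key names that have a non-empty value in env."""
    return [name for name in candidates if env.get(name)]


if __name__ == "__main__":
    env = dict(os.environ)
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            env.update(parse_env_text(handle.read()))
    found = configured_keys(env)
    print("LLM keys found:", found or "none")
```

If the list comes back empty, only the Deterministic and Semantic metrics are expected to work.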

Run a single metric

Here's how to run a single metric on a datum. The available metrics are listed in the next section.

from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

metric = PrecisionRecallF1()

print(metric(**datum))
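To make the quantities behind a metric like PrecisionRecallF1 concrete, here is a small standalone sketch of chunk-level precision, recall and F1. It is an illustration, not the library's implementation: it assumes exact matching on normalized text, and the function name and the output keys are hypothetical. The package's own matching strategy and result keys may differ.

```python
def precision_recall_f1(retrieved_context, ground_truth_context):
    """Chunk-level precision, recall and F1 with exact (normalized) matching."""

    def normalize(chunk):
        # Lowercase and collapse whitespace so trivial differences don't matter.
        return " ".join(chunk.lower().split())

    retrieved = {normalize(c) for c in retrieved_context}
    relevant = {normalize(c) for c in ground_truth_context}
    hits = len(retrieved & relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = precision_recall_f1(
    retrieved_context=[
        "Paris is the capital of France.",
        "Lyon is a major city in France.",
    ],
    ground_truth_context=["Paris is the capital of France."],
)
print(scores)
```

Here one of the two retrieved chunks is relevant (precision 0.5) and the single ground-truth chunk is found (recall 1.0). With exact matching, a retrieved chunk that merely contains the ground-truth sentence would not count as a hit, which is one reason a real metric may use a more lenient matching rule.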

Available Metrics

Module           Category       Metrics
Retrieval        Deterministic  PrecisionRecallF1, RankedRetrievalMetrics
                 LLM-based      LLMBasedContextPrecision, LLMBasedContextCoverage
Text Generation  Deterministic  DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability
                 Semantic       DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity
                 LLM-based      LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency
Classification   Deterministic  ClassificationAccuracy
Code Generation  Deterministic  CodeStringMatch, PythonASTSimilarity
                 LLM-based      LLMBasedCodeGeneration
Agent Tools      Deterministic  ToolSelectionAccuracy
Custom                          Define your own metrics

To define your own metrics, you only need to extend the Metric class and implement the __call__ method. Two further methods are optional: batch (if batch processing can be optimized) and aggregate (to aggregate metric results over multiple samples).
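The shape of a custom metric can be sketched as follows. So that the example runs without the package installed, it uses a local stand-in for the Metric base class; the stand-in's default batch and aggregate behavior, their signatures, and the AnswerWordCount metric itself are assumptions made for illustration. In real use you would subclass the package's own Metric class instead.

```python
from statistics import mean


class Metric:
    """Local stand-in for the package's Metric base class (assumed interface)."""

    def __call__(self, **kwargs):
        raise NotImplementedError

    def batch(self, items):
        # Default behavior: evaluate one sample at a time.
        return [self(**item) for item in items]

    def aggregate(self, results):
        # Default behavior: average every numeric field across samples.
        keys = results[0].keys() if results else []
        return {key: mean(r[key] for r in results) for key in keys}


class AnswerWordCount(Metric):
    """Hypothetical metric: counts words and flags answers over a limit."""

    def __init__(self, max_words=50):
        self.max_words = max_words

    def __call__(self, answer, **kwargs):
        count = len(answer.split())
        return {
            "word_count": count,
            "within_limit": float(count <= self.max_words),
        }


metric = AnswerWordCount(max_words=3)
results = metric.batch(
    [{"answer": "Paris"}, {"answer": "The capital is Paris, France"}]
)
print(results)
print(metric.aggregate(results))
```

Accepting **kwargs in __call__ lets the same datum dictionary be passed to several metrics, each of which picks out only the fields it needs.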

Run evaluation on pipeline modules

Define modules in your pipeline and select corresponding metrics.

from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
from continuous_eval.metrics.generation.text import (
    DeterministicAnswerCorrectness,
    FleschKincaidReadability,  # used by the answer_generator module below
)
from typing import List, Dict

dataset = Dataset("dataset_folder")

# Simple 3-step RAG pipeline with Retriever->Reranker->Generation
retriever = Module(
    name="Retriever",
    input=dataset.question,
    output=List[str],
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

reranker = Module(
    name="reranker",
    input=retriever,
    output=List[Dict[str, str]],
    eval=[
        RankedRetrievalMetrics().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=reranker,
    output=str,
    eval=[
        FleschKincaidReadability().use(answer=ModuleOutput()),
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
    ],
)

pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
print(pipeline.graph_repr()) # optional: visualize the pipeline

Now you can run the evaluation on your pipeline:

# eval_manager is the package's evaluation manager; it must already be
# imported and configured with the pipeline defined above.
eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]  # get the question or any other field
    # run your pipeline ...
    eval_manager.next_sample()

To log the results you just need to call the eval_manager.log method with the module name and the output, for example:

eval_manager.log("answer_generator", response)

The evaluation manager also offers:

  • eval_manager.run_metrics() to run all the metrics defined in the pipeline
  • eval_manager.run_tests() to run the tests defined in the pipeline (see the documentation for more details)
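The run loop and the log calls fit together roughly as in the sketch below. To keep it runnable on its own, it uses a small stand-in class that only mimics the method names mentioned in this section (start_run, is_running, curr_sample, log, next_sample); it is not the package's evaluation manager, and its internals are assumptions. The retrieve, rerank and generate functions are hypothetical placeholders for your own pipeline steps.

```python
class StandInEvalManager:
    """Minimal stand-in exposing the method names used above; not the real class."""

    def __init__(self, samples):
        self._samples = list(samples)
        self._index = None
        self.logged = []

    def start_run(self):
        self._index = 0
        self.logged = [{} for _ in self._samples]

    def is_running(self):
        return self._index is not None and self._index < len(self._samples)

    @property
    def curr_sample(self):
        return self._samples[self._index] if self.is_running() else None

    def log(self, module_name, output):
        # Store each module's output under the current sample.
        self.logged[self._index][module_name] = output

    def next_sample(self):
        self._index += 1


CORPUS = ["Lyon is a major city in France.", "Paris is the capital of France."]


def retrieve(question):  # hypothetical retriever: returns every document
    return list(CORPUS)


def rerank(question, docs):  # hypothetical reranker: order by word overlap
    words = set(question.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().split())))


def generate(question, docs):  # hypothetical generator: echo the top document
    return docs[0]


eval_manager = StandInEvalManager([{"question": "What is the capital of France?"}])
eval_manager.start_run()
while eval_manager.is_running():
    if eval_manager.curr_sample is None:
        break
    q = eval_manager.curr_sample["question"]
    docs = retrieve(q)
    eval_manager.log("Retriever", docs)  # module names match the pipeline definition
    ranked = rerank(q, docs)
    eval_manager.log("reranker", ranked)
    response = generate(q, ranked)
    eval_manager.log("answer_generator", response)
    eval_manager.next_sample()

print(eval_manager.logged)
```

The point of the pattern is that every module's output is logged under the name given in its Module definition before moving to the next sample, so that per-module metrics can be computed afterwards.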

Synthetic Data Generation

Ground truth data, or reference data, is important for evaluation because it offers a comprehensive and consistent measurement of system performance. However, manually curating such a golden dataset is often costly and time-consuming. We have created a synthetic data pipeline that generates custom user interaction data for a variety of use cases such as RAG, agents, and copilots. The generated data can serve as a starting point for a golden dataset for evaluation or for training purposes. For an example with Coding Agents, try out this demo: Synthetic Data Demo

Resources

License

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.

Open Analytics

We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement. You can see exactly what we track in the telemetry code.

To disable usage tracking, set the CONTINUOUS_EVAL_DO_NOT_TRACK flag to true.
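One way to set the flag from Python is shown below. This sketch assumes the flag is read from the process environment and that it must be set before the package is imported; neither detail is stated here, so check the telemetry code if in doubt.

```python
import os

# Assumption: the flag is an environment variable read when the package loads,
# so it is set before any continuous_eval import.
os.environ["CONTINUOUS_EVAL_DO_NOT_TRACK"] = "true"

# import continuous_eval  # import the package only after the flag is set
```

Setting the variable in your shell profile or in .env should have the same effect under the same assumption.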
