Open-Source Evaluation for GenAI Applications.

Data-Driven Evaluation for LLM-Powered Applications

Overview

continuous-eval is an open-source package created for data-driven evaluation of LLM-powered applications.

How is continuous-eval different?

  • Modularized Evaluation: Measure each module in the pipeline with tailored metrics.

  • Comprehensive Metric Library: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.

  • Probabilistic Evaluation: Evaluate your pipeline with probabilistic metrics.

Getting Started

This code is provided as a PyPI package. To install it, run the following command:

python3 -m pip install continuous-eval

If you want to install from source:

git clone https://github.com/relari-ai/continuous-eval.git && cd continuous-eval
poetry install --all-extras

To run LLM-based metrics, you need at least one LLM API key set in a .env file. Take a look at the example env file .env.example.
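
For instance, a minimal .env could contain a single provider key. The variable name below follows the common provider convention and is only illustrative; check .env.example for the exact keys the package reads:

# .env
OPENAI_API_KEY="sk-..."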

Run a single metric

Here's how you run a single metric on a datum. See the documentation for the full list of available metrics.

from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "question": "What is the capital of France?",
    "retrieved_context": [
        "Paris is the capital of France and its largest city.",
        "Lyon is a major city in France.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
    "answer": "Paris",
    "ground_truths": ["Paris"],
}

metric = PrecisionRecallF1()

print(metric(**datum))
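
The call returns a dictionary with the scores computed for this datum (here, precision, recall and F1 over the retrieved contexts).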

Run an evaluation

If you want to run an evaluation on a dataset, you can use the EvaluationRunner class.

from time import perf_counter

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
from continuous_eval.eval.tests import GreaterOrEqualThan
from continuous_eval.metrics.retrieval import (
    PrecisionRecallF1,
    RankedRetrievalMetrics,
)


def main():
    # Let's download the retrieval dataset example
    dataset = example_data_downloader("retrieval")

    # Setup evaluation pipeline (i.e., dataset, metrics and tests)
    pipeline = SingleModulePipeline(
        dataset=dataset,
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
            RankedRetrievalMetrics().use(
                retrieved_context=dataset.retrieved_contexts,
                ground_truth_context=dataset.ground_truth_contexts,
            ),
        ],
        tests=[
            GreaterOrEqualThan(
                test_name="Recall", metric_name="context_recall", min_value=0.8
            ),
        ],
    )

    # Start the evaluation manager and run the metrics (and tests)
    tic = perf_counter()
    runner = EvaluationRunner(pipeline)
    eval_results = runner.evaluate()
    toc = perf_counter()
    print("Evaluation results:")
    print(eval_results.aggregate())
    print(f"Elapsed time: {toc - tic:.2f} seconds\n")

    print("Running tests...")
    test_results = runner.test(eval_results)
    print(test_results)


if __name__ == "__main__":
    # It is important to run this script in a new process to avoid
    # multiprocessing issues
    main()
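
In this example, runner.evaluate() computes the metrics over the whole dataset, and runner.test(eval_results) then checks the GreaterOrEqualThan test, i.e., whether the context_recall scores meet the 0.8 threshold.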

Run evaluation on a pipeline (modular evaluation)

Sometimes the system is composed of multiple modules, each with its own metrics and tests. Continuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.

from typing import Any, Dict, List

from continuous_eval.data_downloader import example_data_downloader
from continuous_eval.eval import (
    Dataset,
    EvaluationRunner,
    Module,
    ModuleOutput,
    Pipeline,
)
from continuous_eval.eval.result_types import PipelineResults
from continuous_eval.metrics.generation.text import AnswerCorrectness
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics


def page_content(docs: List[Dict[str, Any]]) -> List[str]:
    # Extract the content of the retrieved documents from the pipeline results
    return [doc["page_content"] for doc in docs]


def main():
    dataset: Dataset = example_data_downloader("graham_essays/small/dataset")
    results: Dict = example_data_downloader("graham_essays/small/results")

    # Simple 3-step RAG pipeline with Retriever->Reranker->Generation
    retriever = Module(
        name="retriever",
        input=dataset.question,
        output=List[str],
        eval=[
            PrecisionRecallF1().use(
                retrieved_context=ModuleOutput(page_content),  # specify how to extract what we need (i.e., page_content)
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    reranker = Module(
        name="reranker",
        input=retriever,
        output=List[Dict[str, str]],
        eval=[
            RankedRetrievalMetrics().use(
                retrieved_context=ModuleOutput(page_content),
                ground_truth_context=dataset.ground_truth_context,
            ),
        ],
    )

    llm = Module(
        name="llm",
        input=reranker,
        output=str,
        eval=[
            AnswerCorrectness().use(
                question=dataset.question,
                answer=ModuleOutput(),
                ground_truth_answers=dataset.ground_truth_answers,
            ),
        ],
    )

    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
    print(pipeline.graph_repr())  # visualize the pipeline in Mermaid format

    runner = EvaluationRunner(pipeline)
    eval_results = runner.evaluate(PipelineResults.from_dict(results))
    print(eval_results.aggregate())


if __name__ == "__main__":
    main()

Note: it is important to wrap your code in a main function (with the if __name__ == "__main__": guard) to make sure the parallelization works properly.

Custom Metrics

There are several ways to create custom metrics, see the Custom Metrics section in the docs.

The simplest way is to leverage the CustomMetric class to create an LLM-as-a-Judge metric.

from continuous_eval.metrics.base.metric import Arg, Field
from continuous_eval.metrics.custom import CustomMetric
from typing import List

criteria = "Check that the generated answer does not contain PII or other sensitive information."
rubric = """Use the following rubric to assign a score to the answer based on its conciseness:
- Yes: The answer contains PII or other sensitive information.
- No: The answer does not contain PII or other sensitive information.
"""

metric = CustomMetric(
    name="PIICheck",
    criteria=criteria,
    rubric=rubric,
    arguments={"answer": Arg(type=str, description="The answer to evaluate.")},
    response_format={
        "reasoning": Field(
            type=str,
            description="The reasoning for the score given to the answer",
        ),
        "score": Field(
            type=str, description="The score of the answer: Yes or No"
        ),
        "identifies": Field(
            type=List[str],
            description="The PII or other sensitive information identified in the answer",
        ),
    },
)

# Let's calculate the metric for the first datum
print(metric(answer="John Doe resides at 123 Main Street, Springfield."))
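The output follows the response_format defined above: the judge's reasoning, a Yes/No score, and the list of PII (if any) identified in the answer.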

💡 Contributing

Interested in contributing? See our Contribution Guide for more details.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Open Analytics

We monitor basic anonymous usage statistics to understand our users' preferences, inform new features, and identify areas that might need improvement. You can take a look at exactly what we track in the telemetry code.

To disable usage tracking, set the CONTINUOUS_EVAL_DO_NOT_TRACK flag to true.
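
For example, assuming the flag is read from your environment, you could set it in your shell before running your code:

export CONTINUOUS_EVAL_DO_NOT_TRACK=true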
