FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.

FlexEval LLM Evals

Documentation: https://digitalharborfoundation.github.io/FlexEval

Additional details about FlexEval can be found in our paper at the Educational Data Mining 2024 conference.

Usage

Basic usage:

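import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config

# Conversations to score, stored as JSONL (one JSON object per line).
data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]
# Request a single function metric: Flesch reading ease.
eval = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))
config = Config(clear_tables=True)
# Bundle the data sources, the metrics, and the output database into one run.
eval_run = EvalRun(
    data_sources=data_sources,
    database_path="eval_results.db",
    eval=eval,
    config=config,
)
# Execute the run; metric values are written to eval_results.db.
flexeval.run(eval_run)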

This example computes Flesch reading ease for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called eval_results.db.
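For readers unfamiliar with the metric, the sketch below applies the standard Flesch reading ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), to a single string. It is an illustration only: the syllable counter is a crude vowel-group heuristic, the function names are hypothetical, and FlexEval's own `flesch_reading_ease` function may tokenize and count differently, so its scores need not match.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels (assumption, not FlexEval's method)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease_sketch(text: str) -> float:
    """Apply the standard Flesch reading ease formula to a piece of text."""
    # Treat runs of ., ! or ? as sentence boundaries and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
```

Higher scores indicate text that is easier to read; short sentences made of short words score highest.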

See additional usage examples in the vignettes.

Installation

FlexEval is on PyPI as python-flexeval. See the Installation section in the Getting Started guide.

Using pip:

pip install python-flexeval
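To confirm that the install succeeded, one option is to ask Python's standard `importlib.metadata` for the installed version of the distribution. This is a minimal sketch; the helper name `installed_version` is hypothetical and not part of FlexEval.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "python-flexeval") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

Calling `installed_version()` returns a version string when `python-flexeval` is present in the current environment, and `None` otherwise.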

Basic functionality

FlexEval is designed to be "batteries included" for many basic use cases. It supports the following out of the box:

  • scoring historical conversations, which is useful for monitoring live systems
  • scoring LLMs:
    • LLMs hosted locally and served via an endpoint, for example with LM Studio
    • LLMs reachable over the network through a REST endpoint
    • any OpenAI LLM
  • a set of useful rubrics
  • a set of useful Python functions

Evaluation results are saved in an SQLite database. See the Metric Analysis vignette for a sample analysis demonstrating the structure and utility of the data saved by FlexEval.
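Before working through the vignette, a quick way to see what a run produced is to list the tables in the results database along with their row counts. The sketch below uses only Python's standard `sqlite3` module and makes no assumptions about FlexEval's table names or schema; the helper name `summarize_tables` is hypothetical.

```python
import os
import sqlite3
from contextlib import closing


def summarize_tables(db_path: str) -> dict:
    """Map each table name in an SQLite file to its row count."""
    # sqlite3.connect would silently create a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote each table name so unusual identifiers do not break the query.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
```

For the basic usage example, `summarize_tables("eval_results.db")` would return a dictionary of whatever tables that run created.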

Read more in the Getting Started guide.

Cite this work

If this work is useful to you, please cite our EDM 2024 paper:

S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, & John Whitmer. (2024). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. Proceedings of the 17th International Conference on Educational Data Mining, 903-908. Atlanta, Georgia, USA, July 2024. https://doi.org/10.5281/zenodo.12729993

Development

Pull requests are welcome. Feel free to contribute:

  • New rubrics or functions
  • Bug fixes
  • New features

See DEVELOPMENT.md.
