Tools for evaluating large language models.

Note: This project is under development. The API may undergo major changes between versions, so we recommend checking the CHANGELOG for any breaking changes before upgrading.

EvalSense: LLM Evaluation

About

EvalSense is a framework for systematic evaluation of large language models (LLMs) on open-ended generation tasks, with a particular focus on bespoke, domain-specific evaluations. Some of its key features include:

  • Broad model support. Out-of-the-box compatibility with a wide range of local and API-based model providers, including Ollama, Hugging Face, vLLM, OpenAI, Anthropic and others.
  • Evaluation guidance. An interactive evaluation guide and automated meta-evaluation tools assist in selecting the most appropriate evaluation methods for a specific use-case, including the use of perturbed data to assess method effectiveness.
  • Interactive UI. A web-based interface enables rapid experimentation with different evaluation workflows without requiring any code.
  • Advanced evaluation methods. EvalSense incorporates recent LLM-as-a-Judge and hybrid evaluation approaches, such as G-Eval and QAGS, while also supporting more traditional metrics like BERTScore and ROUGE.
  • Efficient execution. Intelligent experiment scheduling and resource management minimise computational overhead for local models. For remote APIs, EvalSense uses asynchronous parallel calls to maximise throughput.
  • Modularity and extensibility. Key components and evaluation methods can be used independently or replaced with user-defined implementations.
  • Comprehensive logging. All key aspects of evaluation are recorded in machine-readable logs, including model parameters, prompts, model outputs, evaluation results, and other metadata.

More information about EvalSense can be found on its homepage and in its documentation.

Note: Only public or fake data are shared in this repository.

Project Structure

  • The main code for the EvalSense Python package can be found under evalsense/.
  • The accompanying documentation is available in the docs/ folder.
  • Code for the interactive LLM evaluation guide is located under guide/.
  • Jupyter notebooks with the evaluation experiments and examples are located under notebooks/.

Getting Started

Installation

You can install the project using pip by running the following command:

pip install evalsense

This will install the latest released version of the package from PyPI without any optional dependencies.

Depending on your use-case, you may want to install additional dependencies from the following groups:

  • webui: For using the interactive web UI.
  • jupyter: For running experiments in Jupyter notebooks (only needed if you don't already have the necessary libraries installed).
  • transformers: For using models and metrics requiring the Hugging Face Transformers library.
  • vllm: For using models and metrics requiring vLLM.
  • interactive: For using EvalSense with interactive UI features (currently includes webui and jupyter).
  • local: For installing all local model dependencies (currently includes transformers and vllm).
  • all: For installing all optional dependencies.

For example, if you want to install EvalSense with all optional dependencies, you can run:

pip install "evalsense[all]"

If you want to use EvalSense with the interactive features (interactive) and Hugging Face Transformers (transformers), you can run:

pip install "evalsense[interactive,transformers]"

and similarly for other combinations.
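Before running an evaluation, it can help to confirm which optional backends are actually importable in the current environment. The sketch below is illustrative and not part of EvalSense: the helper names `installed_extras` and `missing_extras_command` are hypothetical, and it only covers the `transformers` and `vllm` extras, whose underlying libraries are imported as `transformers` and `vllm`.

```python
from importlib.util import find_spec

# Maps an EvalSense extra to the import name of the library behind it.
# Only the two extras with well-known top-level module names are listed;
# this mapping is an illustration, not read from the package metadata.
OPTIONAL_MODULES = {"transformers": "transformers", "vllm": "vllm"}


def installed_extras(modules=OPTIONAL_MODULES):
    """Return {extra: True/False} depending on whether its module is importable."""
    # find_spec returns None for a top-level module that cannot be found.
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


def missing_extras_command(status):
    """Build the pip command that would install the missing extras, or None."""
    missing = sorted(extra for extra, available in status.items() if not available)
    if not missing:
        return None
    joined = ",".join(missing)
    return f'pip install "evalsense[{joined}]"'
```

Calling `missing_extras_command(installed_extras())` returns the install command for whichever of the two extras is absent, or `None` if both libraries are already available.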

Installation for Development

To install the project for local development, you can follow the steps below:

To clone the repo:

git clone git@github.com:nhsengland/evalsense.git

To set up the Python environment for the project:

  • Install uv if it's not installed already
  • uv sync --all-extras
  • source .venv/bin/activate
  • pre-commit install

Note that the code is formatted with ruff and type-checked by pyright in standard type checking mode. For the best development experience, we recommend enabling the corresponding extensions in your preferred code editor.

To set up the Node environment for the LLM evaluation guide (located under guide/):

  • Install node if it's not installed already
  • Change to the guide/ directory (cd guide)
  • npm install
  • npm run start to run the development server

See also the separate README.md for the guide.

Programmatic Usage

For examples illustrating the usage of EvalSense, please check the notebooks under the notebooks/ folder:

  • The Demo notebook illustrates a basic application of EvalSense to the ACI-Bench dataset.
  • The Experiments notebook illustrates more thorough experiments on the same dataset, involving a larger number of evaluators and models.
  • The Meta-Evaluation notebook focuses on meta-evaluation on synthetically perturbed data, where the goal is to identify the most reliable evaluation methods rather than the best-performing models.

Web-Based UI

To use the interactive web-based UI implemented in EvalSense, simply run

evalsense webui

after installing the package and its dependencies. Note that you need to install EvalSense with the webui extra (pip install "evalsense[webui]") or an extra that includes it before running this command.
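If the web UI needs to be started from a Python script, one option is to wrap the command shown above. This is a minimal sketch rather than an EvalSense API: the `launch_webui` helper is hypothetical, and it assumes only that installing the package puts an `evalsense` executable on the PATH, as the command above implies.

```python
import shutil
import subprocess


def launch_webui(executable="evalsense"):
    """Run `<executable> webui` and return its exit code.

    Raises FileNotFoundError if the executable is not on the PATH, which
    usually means the package is not installed in the active environment.
    """
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(
            f"'{executable}' was not found on PATH; "
            'try: pip install "evalsense[webui]"'
        )
    # Blocks until the web UI process exits.
    return subprocess.run([path, "webui"], check=False).returncode
```

Checking for the executable first gives a clearer error than a failed subprocess call when the wrong virtual environment is active.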

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b amazing-feature)
  3. Commit your Changes (git commit -m 'Add some amazing feature')
  4. Push to the Branch (git push origin amazing-feature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidance.

License

Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.

See LICENSE for more information.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Contact

This project is currently maintained by @adamdejl. If you have any questions, suggestions for new features or want to report a bug, please open an issue. For security concerns, please file a private vulnerability report.

To find out more about NHS England Data Science, visit our project website or get in touch at datascience@nhs.net.

Acknowledgements

We thank the Inspect AI development team for their work on the Inspect AI library, which serves as a basis for EvalSense.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

evalsense-0.1.5.tar.gz (57.2 kB)

Uploaded Source

Built Distribution

evalsense-0.1.5-py3-none-any.whl (82.4 kB)

Uploaded Python 3

File details

Details for the file evalsense-0.1.5.tar.gz.

File metadata

  • Download URL: evalsense-0.1.5.tar.gz
  • Upload date:
  • Size: 57.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.7.3

File hashes

Hashes for evalsense-0.1.5.tar.gz
Algorithm Hash digest
SHA256 056967452b37f372ba7f1591b16a6f060908fa3f0b5513738bffd633430344be
MD5 c2fafb8d63bb21ec82c73b4cd0146418
BLAKE2b-256 c8b57942983df035dd84e4bc6cb889b476090f7b3690293720c3449fbb5a5622

File details

Details for the file evalsense-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: evalsense-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 82.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.7.3

File hashes

Hashes for evalsense-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 194fc4c61381c6dc80fa579f1d97c300013a6e6e5f0a34628136f6fb8c669938
MD5 5cae3d5d2bd74a2db0aa4d7c8dfdf102
BLAKE2b-256 0de3e5b64e23e8314d4e5152777a10424c8e5802eb2c8f0fcedba83bc8fd0ef1
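The SHA256 digests listed above can be used to check that a downloaded file is intact. The following sketch uses only the Python standard library; the helper names `sha256_of` and `matches_published` are hypothetical, and the file path is assumed to point at a file you have already downloaded.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published("evalsense-0.1.5-py3-none-any.whl", "194fc4c61381c6dc80fa579f1d97c300013a6e6e5f0a34628136f6fb8c669938")` should return `True` for an unmodified copy of the wheel. pip can also enforce digests itself through its hash-checking mode (`--require-hashes` with a requirements file).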
