Sequential evaluator for LLM trajectories

E-valuator

Code for the paper "E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing". We build a sequential evaluator that converts any black-box verifier/agent system into one with statistical guarantees. At deployment time, our system can flag and terminate agent trajectories that are likely to be unsuccessful, using nothing but the verifier's (black-box) scores.
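To make the idea of sequential flagging concrete, here is a minimal, generic sketch of a likelihood-ratio sequential test. It is not the package's API and not necessarily the paper's exact procedure: the function name `sequential_flag`, the `likelihood_ratio` callable, and the toy score densities are all assumptions made for illustration. The null hypothesis is "the trajectory is successful". If the ratio used is the true ratio of failure to success score densities, the running product is a nonnegative martingale under the null, so by Ville's inequality the chance of wrongly flagging a successful trajectory is at most alpha.

```python
import math
from typing import Callable, Iterable, Optional


def sequential_flag(
    scores: Iterable[float],
    likelihood_ratio: Callable[[float], float],
    alpha: float = 0.05,
) -> Optional[int]:
    """Return the 1-based step at which the trajectory is flagged, or None.

    Hypothetical helper for illustration only (not part of e-valuator).
    likelihood_ratio(s) is assumed to return p_fail(s) / p_success(s).
    """
    log_threshold = math.log(1.0 / alpha)  # flag once the product reaches 1/alpha
    log_e = 0.0  # log of the running product, which starts at 1
    for step, score in enumerate(scores, start=1):
        log_e += math.log(likelihood_ratio(score))
        if log_e >= log_threshold:
            return step  # terminate the trajectory early
    return None  # never flagged


def toy_ratio(score: float) -> float:
    """Toy assumption: scores in (0, 1) with density 2s under success
    and 2(1 - s) under failure, so the ratio is (1 - s) / s."""
    return (1.0 - score) / score


flagged_at = sequential_flag([0.2, 0.1, 0.2], toy_ratio, alpha=0.05)
never_flagged = sequential_flag([0.8, 0.9, 0.7], toy_ratio, alpha=0.05)
```

With these toy densities, low verifier scores multiply the running product quickly, so the first trajectory crosses the 1/alpha threshold after two steps, while the high-scoring trajectory is never flagged. In practice the score densities are unknown and must be estimated from calibration data, which is the part the package and the demo notebooks address.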

Install

To start, please install our package:

pip install e-valuator

Quick start

Once installed, import the evaluator with from evaluator import EValuator. We provide two demo notebooks, each with a corresponding dataset: demos/notebooks/hotpot_example.ipynb (dataset: data/hotpotqa_cleaned_w_scores.csv) and demos/notebooks/math_example_tokens.ipynb (dataset: data/math_cleaned_w_scores.csv).

These notebooks show the required input data format and the evaluation pipeline.
