Skip to main content

Sequential evaluator for LLM trajectories

Project description

E-valuator

Code for paper E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing. We build a sequential evaluator that can convert any black-box verifier/agent system into one with statistical guarantees. At deployment time, our system can flag and terminate agent trajectories that are likely to be unsuccessful without access to anything but a verifier's (black-box) scores.

Install

To start, please install our package:

pip install e-valuator

Quick start

We provide two demo notebooks (and corresponding datasets) in demos/notebooks/hotpot_example.ipynb (corresponding dataset in data/hotpotqa_w_scores_compressed.csv.gz) and demos/notebooks/math_example_tokens.ipynb (corresponding dataset in data/math_w_scores_compressed.csv.gz). These notebooks provide examples of the input data format required and evaluation pipeline. In general, the workflow for e-valuator consists of three parts:

  1. Collect agent trajectories and verifier scores. We provide an example collection script in demos/collect_verifier_scores/collect_math_example.py. The trajectories and scores used to calibrate e-valuator must be stored in a csv file (or similar) with (at least) four columns: (1) uq_problem_idx, a unique identifier for each trajectory, (2) step_idx (or num_steps), indicating the step count of the trajectory thus far, (3) judge_probability, indicating the verifier score for that particular step (could also be real-valued, doesn't have to be in 0-1), and (4) solved, a binary indicator column indicating whether the agent successfully solved the problem or not. The columns need not use exactly these names, but if you use a different naming system, you'll need to mark them appropriately upon initialization of e-valuator.

  2. Fit density ratio estimates on calibration set. Using the calibration set collected in step (1), we'll fit a stepwise density ratio estimator with a binary classifier.

  3. Apply learned density ratio estimates at test time. We can then apply our learned density ratios from step (2) at test time for online and sequential monitoring of agent trajectories. In particular, given a (black-box) agent's trajectory and corresponding stepwise verifier scores, we can apply e-valuator to terminate poor trajectories.

To start e-valuator:

import evaluator as e_val
ev = e_val.EValuator(
    model_type="logistic",   
    mt_variant="both",    ## "split" is finite time/empirical version of e-valuator. "anytime" is the anytime-valid version. "both" adds both the finite-time and empirical versions as columns to the test set.
    alphas=[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5], ## by default, alphas = [0.05] unless otherwise specified, as we do here
    # delta=0.1, ## only used for split version, confidence level in false alarm guarantee. default is 0.1 (meaning 90% confidence in false alarm guarantee)
    delta=0.05,
)

You now have an e-valuator object that will fit the density ratio estimate using logistic regression. It will find valid thresholds for both the anytime-valid variant and empirical versions of e-valuator, using the list of $\alpha$ values provided. $\delta$ indicates the confidence in the false alarm guarantee. Specifically, $\delta=0.05$ indicates 95% confidence in the guarantee (see Proposition 3 of the paper for details).

You'll then need to fit e-valuator on a calibration dataframe that has the columns we described above:

ev.fit(cal_df)

To then apply it at test-time, you can run:

test_df_with_evals = ev.apply(test_df)

demos/collect_verifier_scores/collect_math_example.py provides an example token-saving experiment with early terminations. It also visualizes $M_t$ sequences for both successful and unsuccessful trajectories.

Online Evaluation/Monitoring of Agents

E-valuator assumes black-box access to the verifier/agent system. As such, we do note provide code to directly intervene in any particular agent/verifier system. To deploy e-valuator online, we recommend updating the test_df after each agent action/verifier score:

## assume ev.fit(cal_df) has been called before
x = 0
while agent_not_done:
    ## pseudocode to get a verifier score and add to test df
    verifier_score = verify(partial_traj_step_x)
    test_df.loc[len(test_df)] = ['problem_new', verifier_score1, x]
    test_df_scored = ev.apply(test_df)

    if test_df_scored['anytime_eval'] > 1/alpha:
        terminate
    else:
        x += 1

This is inefficient, and if there's interest, we can add better support for online evaluation/monitoring of agents with e-valuator.

Citation

If you use this code, please cite our work.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

e_valuator-0.1.1.tar.gz (9.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

e_valuator-0.1.1-py3-none-any.whl (7.9 kB view details)

Uploaded Python 3

File details

Details for the file e_valuator-0.1.1.tar.gz.

File metadata

  • Download URL: e_valuator-0.1.1.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for e_valuator-0.1.1.tar.gz
Algorithm Hash digest
SHA256 0809746968f10fae6b18610af3a7cf380d8d0a8c70210571cebe4379a5f16d43
MD5 93a0279a2cf7a3b09cad5ed85abf2df5
BLAKE2b-256 7b69d0fc24718fef166dd23807152f625a1babb8aa22a7fb7e514755195bdc0b

See more details on using hashes here.

File details

Details for the file e_valuator-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: e_valuator-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 7.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for e_valuator-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0ead7eba633f7b1596aee3da90a456acba8c6dca283122adf6171941d8582ae1
MD5 73dedc08aa261746e0e8ccb98a3ffe96
BLAKE2b-256 f80b4e4e9776b1ff6bf3953687effe04f23d5952cf437173bf7cc9a8d234dea6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page