LLM Application Debug/Eval UI on top of AIConfig

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

LLM application debugger / evaluation UI built on top of AIConfig

Quickstart

1. Install the package: pip3 install lm-debug-eval-ui
2. Create an evaluation_module.py file containing the functions needed to run evaluation (see the Evaluation Module section below).
3. Create an aiconfig_model_registry.py parser module that registers the model parser to use for prompt iteration.
4. Create an aiconfig file (e.g. eval_prompt_template.aiconfig.yaml) containing the prompts to iterate on and to use within the application / eval code.
5. Set the following variables in bash:

evaluation_module_path=<path_to_evaluation_module>
parsers_path=<path_to_model_parser_module>
aiconfig_path=<path_to_aiconfig_file>

6. Run the eval/debug UI with:

aiconfig eval --aiconfig-path=$aiconfig_path --parsers-module-path=$parsers_path --evaluation-module-path=$evaluation_module_path

NOTE: By default, the script looks for ./evaluation_module.py and ./aiconfig_model_registry.py as the evaluation module and parsers module, so those arguments can be omitted if the files already exist at those paths.
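The default lookup described in the note can be pictured with a small sketch. This is not the package's actual code: the helper name resolve_module_path is hypothetical, and the sketch only assumes that an explicitly passed path wins and that the default file in the working directory is used otherwise.

```python
from pathlib import Path
from typing import Optional


def resolve_module_path(cli_value: Optional[str], default_name: str, cwd: Path) -> Path:
    """Return the explicit path if one was passed, else the default file in cwd.

    Hypothetical helper for illustration; not part of lm-debug-eval-ui.
    """
    if cli_value:
        # An explicitly passed argument always takes precedence.
        return Path(cli_value)
    candidate = cwd / default_name
    if not candidate.exists():
        raise FileNotFoundError(
            f"{default_name} not found in {cwd}; pass the path explicitly"
        )
    return candidate
```

Under these assumptions, calling resolve_module_path(None, "evaluation_module.py", Path.cwd()) returns ./evaluation_module.py when that file exists and raises otherwise.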

UI Overview

The UI is a single page web application with two tabs: "Evaluation" and "Prompt Iteration".

Evaluation

The evaluation tab is where evaluation sets (think batch runs) are created and analyzed.

Evaluation Sets

The first table shows the sets that have been created. Create a new evaluation set by clicking the 'Create Evaluation Set' button and stepping through the creation flow:

Give the set a name that explains what is being evaluated.
Specify the path to the prompt config (aiconfig) to use for the evaluation set. The evaluation code should be hooked up (via the python-aiconfig package) to resolve prompts from the config.

Data Selection: Select the desired data (i.e. paper paths) from 'Available Data' on the left and click > to move it to the 'Selected Data' section on the right. This is the source data (papers) used by the evaluation (opgee_cli) runs for the evaluation set. Click 'Next Step' once the desired data is selected.

Specific Data Configuration: For each data source selected in the Data Selection step, optionally specify a ground truth file path to use for the evaluation. If no ground truth is specified, the raw evaluation results are produced without any metrics comparing them to the ground truth.

General Data Configuration: Specify the general configuration to use for every evaluation call (for all data sources). Each opgee run for the papers in the evaluation set uses the arguments specified in the general data configuration.
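How the three configuration steps combine can be sketched as follows. The RunSpec type and build_run_specs function are hypothetical names introduced for illustration, and the sketch assumes one run per selected data source, an optional ground truth per source, and the same general arguments for every run.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RunSpec:
    """One evaluation run for one data source (hypothetical structure)."""
    data_path: str
    ground_truth_path: Optional[str]  # None means: raw results only, no metrics
    args: Dict[str, Any]


def build_run_specs(
    selected_data: List[str],
    ground_truths: Dict[str, str],
    general_config: Dict[str, Any],
) -> List[RunSpec]:
    """Pair each selected data source with its optional ground truth and the shared args."""
    specs = []
    for path in selected_data:
        # Copy the general config so runs cannot mutate each other's arguments.
        specs.append(RunSpec(path, ground_truths.get(path), dict(general_config)))
    return specs
```

A source that has no entry in ground_truths ends up with ground_truth_path set to None, which corresponds to the case where only raw results are produced.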

Evaluation Set Results

Clicking a row in the Evaluation Sets table will load the table of results for that evaluation set. The results show all relevant metrics for each data source in the evaluation set. For now, these metrics are obtained from eval_matrix.csv and eval_matrix_report.txt.

Evaluation Result Details

Clicking a row in the Evaluation Set Results table will load the table of details for the specific result. This will include the raw extracted values for the data source (paper), with cells colored based on TP/FP/TN/FN compared to the ground truth file.
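The cell coloring can be illustrated with a minimal classification sketch. The exact rules used by the UI are not documented here, so the definitions below are assumptions: a missing value is represented as None, a matching extracted value is a TP, an extracted value that is wrong or has no ground truth counterpart is an FP, a missed ground truth value is an FN, and a field absent from both is a TN. The function names are hypothetical.

```python
from typing import Dict, Optional


def classify_cell(extracted: Optional[str], truth: Optional[str]) -> str:
    """Classify one extracted value against its ground truth (assumed rules)."""
    if extracted is None and truth is None:
        return "TN"  # nothing expected, nothing extracted
    if extracted is None:
        return "FN"  # a value was expected but not extracted
    if truth is None or extracted != truth:
        return "FP"  # a value was extracted but is unexpected or wrong
    return "TP"      # extracted value matches the ground truth


def classify_row(
    extracted: Dict[str, Optional[str]],
    truth: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Classify every field that appears in either the extraction or the ground truth."""
    fields = sorted(set(extracted) | set(truth))
    return {f: classify_cell(extracted.get(f), truth.get(f)) for f in fields}
```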

Prompt Iteration

The prompt iteration tab provides an editor for iterating on prompt templates to use in the application / evaluation. The editor opens the aiconfig file specified in the aiconfig eval command.

At the top of the page, select a data source (preprocessed paper) and specify the configuration of how it will be used for running each prompt. Other configuration settings (such as 'model') can be specified in the global model settings or the model settings section of each cell (cell-level overrides global).
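The override rule (cell-level settings win over global settings) amounts to a dictionary merge. The sketch below is illustrative only: effective_settings is a hypothetical helper, and the setting names in the usage note are made up.

```python
from typing import Any, Dict


def effective_settings(
    global_settings: Dict[str, Any],
    cell_settings: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge settings so that any key set on the cell overrides the global value."""
    # Later entries win in a dict unpacking merge, so cell settings go last.
    return {**global_settings, **cell_settings}
```

For example, with global settings {"model": "model-a", "temperature": 0} and cell settings {"model": "model-b"}, the cell runs with model "model-b" and temperature 0.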

We have created an ask_llm model parser and an associated model parser registry file. With these in place, running a prompt runs the ask_llm script with that prompt text, plus arguments derived from the prompt context data and model settings.

Name a prompt cell and write the prompt in the input, then run it to see the ask_llm output. Iterate on the prompt text until it produces the desired output. The prompt can then be referenced in the application/evaluation code using AIConfig:

# Initialize AIConfig from the aiconfig file somewhere in the evaluation/application code:
from aiconfig import AIConfigRuntime

aiconfig = AIConfigRuntime.load(<path_to_aiconfig>)

# ... In the prompt template function in the evaluation/application code (must be async).
# `parameters` is a dict of values to substitute into the prompt template.
await aiconfig.resolve(<prompt_name>, parameters)

Evaluation Module

The application runs on a Flask server that is integrated with the evaluation module (default: evaluation_module.py), which interfaces with the evaluation logic (the opgee script). Each function in the evaluation module is associated with some aspect of the UI and determines how the data is retrieved (or created) with respect to the underlying system. The evaluation_module.py we have created retrieves data from the existing result folder structure and creates evaluation sets via async runs of the opgee_cli script.
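To give a feel for the shape of such a module, here is a minimal sketch. The function names, signatures, and return shapes are all hypothetical (the interface the UI actually expects is not documented in this section), and an in-memory dictionary stands in for the result folder structure and the opgee_cli runs.

```python
from typing import Any, Dict, List

# In-memory stand-in for the result folder structure (assumption for this sketch).
_EVALUATION_SETS: Dict[str, Dict[str, Any]] = {}


def create_evaluation_set(name: str, aiconfig_path: str, data_paths: List[str]) -> Dict[str, Any]:
    """Backs the 'Create Evaluation Set' flow; a real module would start the runs here."""
    if name in _EVALUATION_SETS:
        raise ValueError(f"evaluation set {name!r} already exists")
    evaluation_set = {
        "name": name,
        "aiconfig_path": aiconfig_path,
        # One placeholder result per data source until its run finishes.
        "results": [{"data_path": p, "status": "pending"} for p in data_paths],
    }
    _EVALUATION_SETS[name] = evaluation_set
    return evaluation_set


def list_evaluation_sets() -> List[Dict[str, Any]]:
    """Backs the Evaluation Sets table."""
    return list(_EVALUATION_SETS.values())


def get_evaluation_set_results(name: str) -> List[Dict[str, Any]]:
    """Backs the Evaluation Set Results table for one set."""
    return _EVALUATION_SETS[name]["results"]
```

The point of the sketch is the one-function-per-UI-aspect layout: each table or flow in the UI asks the module for its data, so swapping the storage or the evaluation script only changes the function bodies.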

Release history

0.0.6 (Apr 9, 2024, this version)
0.0.5 (Apr 9, 2024)
0.0.4 (Apr 8, 2024)
0.0.3 (Apr 5, 2024)
0.0.2 (Apr 5, 2024)
0.0.1 (Apr 4, 2024)
