LLM Application Debug/Eval UI on top of AIConfig
Quickstart
pip3 install lm-debug-eval-ui
- Create an evaluation_module.py file containing the necessary functions to run evaluation (see the Evaluation Module section below)
- Create an aiconfig_model_registry.py parser module which registers the model parser to use for prompt iteration
- Create an aiconfig file (e.g. eval_prompt_template.aiconfig.yaml) containing the prompts to iterate on and to use within the application / eval code
- Set the following variables in bash:
evaluation_module_path=<path_to_evaluation_module>
parsers_path=<path_to_model_parser_module>
aiconfig_path=<path_to_aiconfig_file>
- Run the eval/debug UI with:
aiconfig eval --aiconfig-path=$aiconfig_path --parsers-module-path=$parsers_path --evaluation-module-path=$evaluation_module_path
NOTE: By default, the script looks for ./evaluation_module.py and ./aiconfig_model_registry.py as the evaluation module and parsers module, so there is no need to pass those arguments if the files already exist at those paths.
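The defaulting behaviour described in the note can be sketched as a small helper that builds the `aiconfig eval` command line. This is only an illustration: `build_eval_command` and its `cwd` parameter are hypothetical names, and the check that the default files exist is an assumption about how one might guard the call, not a description of what the script does internally. Only the flag names come from the command shown above.

```python
from pathlib import Path

# Default file names taken from the NOTE above.
DEFAULT_EVALUATION_MODULE = "evaluation_module.py"
DEFAULT_PARSERS_MODULE = "aiconfig_model_registry.py"


def build_eval_command(aiconfig_path, parsers_path=None,
                       evaluation_module_path=None, cwd="."):
    """Return the argv list for `aiconfig eval` (hypothetical helper).

    A flag is omitted when its path is not given, in which case the
    default file is expected to exist in `cwd`.
    """
    base = Path(cwd)
    command = ["aiconfig", "eval", f"--aiconfig-path={aiconfig_path}"]

    optional_args = [
        ("--parsers-module-path", parsers_path, DEFAULT_PARSERS_MODULE),
        ("--evaluation-module-path", evaluation_module_path,
         DEFAULT_EVALUATION_MODULE),
    ]
    for flag, value, default_name in optional_args:
        if value is not None:
            command.append(f"{flag}={value}")
        elif not (base / default_name).exists():
            # Assumption: fail early rather than let the script fail later.
            raise FileNotFoundError(
                f"{default_name} not found in {base}; pass {flag} explicitly"
            )
    return command
```

The returned list can be passed to `subprocess.run(...)` to start the UI from Python instead of from a shell.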
UI Overview
The UI is a single page web application with two tabs: "Evaluation" and "Prompt Iteration".
Evaluation
The evaluation tab is where evaluation sets (think batch runs) are created and analyzed.
Evaluation Sets
The first table shows the sets that have been created. Create a new evaluation set by clicking the 'Create Evaluation Set' button and stepping through the creation flow:
- Give the set a name to explain what is being evaluated
- Specify the path to the prompt config (aiconfig) to use for the evaluation set. The evaluation code should be hooked up (via python-aiconfig package) to resolve prompts from the config
- Data Selection: Select the desired data (i.e. paper paths) from the 'Available Data' on the left and click the > to move it to the 'Selected Data' section on the right. These will be the source data (papers) used by the evaluation (opgee_cli) runs for the evaluation set. Click 'Next Step' once the desired data is selected.
- Specific Data Configuration: For each of the data sources selected in the Data Selection step, optionally specify a ground truth file path to use for the evaluation. If no ground truth is specified, the raw evaluation results are obtained without any metrics comparing them to a ground truth.
- General Data Configuration: Specify the general configuration to be used for every evaluation call (for all data sources). Each opgee run for the papers in the evaluation set will use the arguments specified in the general data configuration.
Evaluation Set Results
Clicking a row in the Evaluation Sets table will load the table of results for that evaluation set. The results show all relevant metrics for each data source in the evaluation set. To start, these metrics are obtained from eval_matrix.csv and eval_matrix_report.txt.
Evaluation Result Details
Clicking a row in the Evaluation Set Results table will load the table of details for the specific result. This will include the raw extracted values for the data source (paper), with cells colored based on TP/FP/TN/FN compared to the ground truth file.
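The TP/FP/TN/FN colouring can be illustrated with a minimal sketch of how one extracted value might be compared against its ground truth value. The function name `classify_cell` and the exact rules are assumptions made for illustration; in particular, counting a value that disagrees with the ground truth as FP is one common convention, and the actual evaluation code may define the categories differently.

```python
from typing import Optional


def _present(value: Optional[str]) -> bool:
    """A value counts as present when it is not None and not blank."""
    return value is not None and value.strip() != ""


def classify_cell(extracted: Optional[str], truth: Optional[str]) -> str:
    """Classify one extracted value against the ground truth (assumed rules).

    TP: a value was extracted and matches the ground truth.
    FP: a value was extracted but the ground truth is empty or different.
    FN: nothing was extracted although the ground truth has a value.
    TN: nothing was extracted and the ground truth is empty.
    """
    if _present(extracted) and _present(truth):
        return "TP" if extracted.strip() == truth.strip() else "FP"
    if _present(extracted):
        return "FP"
    if _present(truth):
        return "FN"
    return "TN"
```

A UI layer could then map each returned label to a cell colour.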
Prompt Iteration
The prompt iteration tab provides an editor for iterating on prompt templates to use in the application / evaluation. The editor opens the aiconfig file specified in the aiconfig eval script call.
At the top of the page, select a data source (preprocessed paper) and specify the configuration of how it will be used for running each prompt. Other configuration settings (such as 'model') can be specified in the global model settings or the model settings section of each cell (cell-level overrides global).
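The "cell-level overrides global" rule can be pictured as a simple dictionary merge. This is a sketch only: `effective_model_settings` is a hypothetical helper, the `temperature` key is an illustrative setting not taken from this document, and a shallow merge is an assumption about how the override is applied.

```python
def effective_model_settings(global_settings, cell_settings=None):
    """Merge settings so that cell-level values win over global ones.

    Assumption: a shallow merge, so a nested value set at cell level
    replaces the whole global value for that key.
    """
    merged = dict(global_settings)      # start from the global settings
    merged.update(cell_settings or {})  # cell-level keys override
    return merged


global_settings = {"model": "model-a", "temperature": 0.0}
cell_settings = {"model": "model-b"}
print(effective_model_settings(global_settings, cell_settings))
```

Here the cell runs with `model-b` while still inheriting the global `temperature`.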
We have created an ask_llm model parser and associated model parser registry file to use so that running a prompt will run the ask_llm script with that prompt text and arguments associated with the prompt context data and model settings.
Name a prompt cell and write the prompt in the input, then run it to see the ask_llm output. Iterate on the prompt text until it produces the desired output. The prompt can then be referenced in the application/evaluation code using AIConfig:
# Initialize AIConfig from the aiconfig file somewhere in the evaluation/application code:
from aiconfig import AIConfigRuntime
aiconfig = AIConfigRuntime.load(<path_to_aiconfig>)
# ... In the prompt template function in the evaluation/application code
# (resolve is a coroutine, so the calling function must be async):
resolved_prompt = await aiconfig.resolve(<prompt_name>, parameters)
Evaluation Module
The application runs on a Flask server which is integrated with the evaluation module (default evaluation_module.py) for interfacing with the evaluation logic (opgee script). Each function in the evaluation module is associated with some aspect of the UI and determines how the data is retrieved (or created) with respect to the underlying system. The evaluation_module.py we have created retrieves data from the existing result folder structure and creates evaluation sets via async runs of the opgee_cli script.
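As a rough illustration of what "async runs" of a CLI script could look like, the sketch below launches one subprocess per data source with asyncio and gathers the results. Everything here is hypothetical: the function names, the result dictionary, the concurrency limit, and the command itself are assumptions, since the actual opgee_cli arguments and the functions expected in the evaluation module are not specified in this document.

```python
import asyncio
import sys


async def run_one(data_source, command_prefix):
    """Run `command_prefix + [data_source]` and capture its output."""
    process = await asyncio.create_subprocess_exec(
        *command_prefix, data_source,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    return {
        "data_source": data_source,
        "returncode": process.returncode,
        "stdout": stdout.decode().strip(),
        "stderr": stderr.decode().strip(),
    }


async def run_evaluation_set(data_sources, command_prefix, max_concurrent=4):
    """Run the command for every data source, at most `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(data_source):
        async with semaphore:
            return await run_one(data_source, command_prefix)

    # gather() returns results in the same order as `data_sources`.
    return await asyncio.gather(*(guarded(ds) for ds in data_sources))


if __name__ == "__main__":
    # Stand-in command that just echoes its argument; replace with the real CLI.
    echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    for result in asyncio.run(run_evaluation_set(["paper_a", "paper_b"], echo)):
        print(result["data_source"], result["returncode"], result["stdout"])
```

Bounding concurrency with a semaphore keeps a large evaluation set from starting every run at once.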