LLM Application Debug/Eval UI on top of AIConfig

LLM application debugger / evaluation UI built on top of AIConfig

Quickstart

  1. pip3 install lm-debug-eval-ui
  2. Create an evaluation_module.py file containing the necessary functions to run evaluation (see Evaluation Module section below)
  3. Create an aiconfig_model_registry.py parser module which registers the model parser to use for prompt iteration
  4. Create an aiconfig file (e.g. eval_prompt_template.aiconfig.yaml) containing the prompts to iterate on and to use within the application / eval code
  5. Set the following variables in bash:
     evaluation_module_path=<path_to_evaluation_module>
     parsers_path=<path_to_model_parser_module>
     aiconfig_path=<path_to_aiconfig_file>
  6. Run the eval/debug UI with:
     aiconfig eval --aiconfig-path=$aiconfig_path --parsers-module-path=$parsers_path --evaluation-module-path=$evaluation_module_path

NOTE: By default, the script looks for ./evaluation_module.py and ./aiconfig_model_registry.py as the evaluation module and parsers module, so those two arguments can be omitted if the files already exist in the current directory.
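
The fallback described in the note can be pictured with the sketch below. It is not the package's actual implementation: the flag names and default file names come from this README, while build_arg_parser and resolve_paths are hypothetical helpers written only for illustration.

```python
import argparse
from pathlib import Path
from typing import Dict, List

# Default file names taken from the note above.
DEFAULT_EVALUATION_MODULE = Path("./evaluation_module.py")
DEFAULT_PARSERS_MODULE = Path("./aiconfig_model_registry.py")


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three flags shown in the Quickstart."""
    parser = argparse.ArgumentParser(prog="aiconfig eval")
    parser.add_argument("--aiconfig-path", required=True)
    parser.add_argument("--parsers-module-path", default=None)
    parser.add_argument("--evaluation-module-path", default=None)
    return parser


def resolve_paths(argv: List[str]) -> Dict[str, Path]:
    """Return the paths to use, falling back to the defaults for omitted flags."""
    args = build_arg_parser().parse_args(argv)
    return {
        "aiconfig": Path(args.aiconfig_path),
        "parsers_module": Path(args.parsers_module_path or DEFAULT_PARSERS_MODULE),
        "evaluation_module": Path(
            args.evaluation_module_path or DEFAULT_EVALUATION_MODULE
        ),
    }
```

With only --aiconfig-path supplied, both module paths resolve to the default files in the current directory; passing either flag explicitly replaces the corresponding default.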

UI Overview

The UI is a single-page web application with two tabs: "Evaluation" and "Prompt Iteration".

Evaluation

The evaluation tab is where evaluation sets (think batch runs) are created and analyzed.

Evaluation Sets

The first table shows the sets that have been created. Create a new evaluation set by clicking the 'Create Evaluation Set' button and stepping through the creation flow:

  • Give the set a name that explains what is being evaluated.
  • Specify the path to the prompt config (aiconfig) to use for the evaluation set. The evaluation code should be hooked up (via the python-aiconfig package) to resolve prompts from the config.

Data Selection: Select the desired data (i.e. paper paths) from 'Available Data' on the left and click > to move it to the 'Selected Data' section on the right. This will be the source data (papers) used by the evaluation (opgee_cli) runs for the evaluation set. Click 'Next Step' once the desired data is selected.

Specific Data Configuration: For each data source selected in the Data Selection step, optionally specify a ground truth file path to use for the evaluation. If no ground truth is specified, the raw evaluation results are obtained without any metrics comparing them to the ground truth.

General Data Configuration: Specify the general configuration to use for every evaluation call (for all data sources). Each opgee run for the papers in the evaluation set uses the arguments specified here.
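
As a rough sketch of how the three steps of the creation flow fit together, the snippet below builds one run configuration per selected data source. The function name build_run_configs and the dictionary keys are hypothetical and chosen for illustration; the real evaluation module may represent this differently.

```python
from typing import Any, Dict, List


def build_run_configs(
    selected_data: List[str],
    ground_truth_paths: Dict[str, str],
    general_config: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Combine the three creation-flow steps into one config per data source."""
    runs = []
    for data_path in selected_data:
        runs.append(
            {
                "data_path": data_path,
                # Optional: None means raw results only, no comparison metrics.
                "ground_truth_path": ground_truth_paths.get(data_path),
                # Copy so that runs do not share one mutable dict.
                "arguments": dict(general_config),
            }
        )
    return runs
```

Every run receives the same general arguments, while the ground truth path stays per data source and may be absent.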

Evaluation Set Results

Clicking a row in the Evaluation Sets table loads the table of results for that evaluation set. The results show all relevant metrics for each data source in the evaluation set. For now, these metrics are read from eval_matrix.csv and eval_matrix_report.txt.

Evaluation Result Details

Clicking a row in the Evaluation Set Results table loads the table of details for that specific result. This includes the raw extracted values for the data source (paper), with each cell colored according to whether it is a true positive, false positive, true negative, or false negative (TP/FP/TN/FN) when compared to the ground truth file.
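
The README does not spell out how each cell is assigned to TP/FP/TN/FN, so the sketch below uses one plausible rule as an assumption: a missing value on both sides is a TN, a missing extraction is an FN, a matching value is a TP, and anything else is an FP. The function names and the field names in the usage note are hypothetical.

```python
from typing import Dict, Optional


def classify(extracted: Optional[str], truth: Optional[str]) -> str:
    """Label one extracted value against its ground-truth value (assumed rule)."""
    if extracted is None and truth is None:
        return "TN"  # nothing extracted, nothing expected
    if extracted is None:
        return "FN"  # a value was expected but not extracted
    if truth is None or extracted != truth:
        return "FP"  # extracted something unexpected or incorrect
    return "TP"  # extracted value matches the ground truth


def classify_all(
    extracted: Dict[str, Optional[str]],
    truth: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Classify every field that appears in either mapping."""
    fields = sorted(set(extracted) | set(truth))
    return {name: classify(extracted.get(name), truth.get(name)) for name in fields}
```

For example, classify_all({"field_a": "12", "field_b": "7"}, {"field_a": "12", "field_b": "9", "field_c": "3"}) labels field_a as TP, field_b as FP, and field_c as FN under this rule.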

Prompt Iteration

The prompt iteration tab provides an editor for iterating on the prompt templates used in the application / evaluation. The editor opens the aiconfig file specified in the aiconfig eval command.

At the top of the page, select a data source (preprocessed paper) and specify the configuration of how it will be used for running each prompt. Other configuration settings (such as 'model') can be specified in the global model settings or the model settings section of each cell (cell-level overrides global).
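
The precedence rule above (cell-level settings override global ones) amounts to a dictionary merge. The following is a minimal sketch under that reading; effective_settings and the example setting values are hypothetical and not part of the package.

```python
from typing import Any, Dict


def effective_settings(
    global_settings: Dict[str, Any],
    cell_settings: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge settings so that cell-level values win over global ones."""
    # Later entries in a dict display overwrite earlier ones with the same key.
    return {**global_settings, **cell_settings}
```

A cell that sets only 'model' therefore keeps every other global setting unchanged.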

We have created an ask_llm model parser and an associated model parser registry file. With these, running a prompt runs the ask_llm script with that prompt's text, passing arguments derived from the prompt context data and model settings.

Name a prompt cell and write the prompt in the input, then run it to see the ask_llm output. Iterate on the prompt text until it produces the desired output. The prompt can then be referenced in the application/evaluation code using AIConfig:

from aiconfig import AIConfigRuntime

# TODO: Initialize AIConfig from the aiconfig file somewhere in evaluation/application code:
aiconfig = AIConfigRuntime.load("<path_to_aiconfig>")

# ... In the prompt template function in evaluation/application code.
# `parameters` is a dict of values for the prompt's template parameters.
resolved_prompt = await aiconfig.resolve("<prompt_name>", parameters)

Evaluation Module

The application runs on a Flask server that is integrated with the evaluation module (evaluation_module.py by default), which interfaces with the evaluation logic (the opgee script). Each function in the evaluation module is associated with some aspect of the UI and determines how the data is retrieved (or created) in the underlying system. The evaluation_module.py we have created retrieves data from the existing result folder structure and creates evaluation sets via async runs of the opgee_cli script.
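
Because the evaluation module is supplied as a file path, a server needs some way to import it at startup. The sketch below shows one standard way to do that with importlib; it is an assumption about the mechanism rather than the package's actual code, and load_module_from_path is a hypothetical helper.

```python
import importlib.util
from types import ModuleType


def load_module_from_path(
    path: str, module_name: str = "evaluation_module"
) -> ModuleType:
    """Import a Python source file from an arbitrary path and return the module."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load a module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    # Execute the file's top-level code so its functions become attributes.
    spec.loader.exec_module(module)
    return module
```

The server could then look up the functions it expects on the returned module, for instance with getattr, and report a clear error if one is missing.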
