A CLI-based micro-framework for LLM evals
LIME - Micro-framework for Evals
A homebrewed language model eval tool: specifically, a CLI pipeline to:
- Parse question/answer datasets in markdown format
- Evaluate the language models on these datasets
  - Inference Services:
    - OpenAI API
    - Locally running LlamaCpp models
    - Custom PipeLine (CPL) apps
- Grade or re-grade the completions
- Aggregate / Summarize / Compare the results.
Gallery of QA Repositories:
- Hello QA, which has different experiments using lime to demonstrate useful functionality and patterns.
- Wordle dataset, which uses multiple-choice questions about rules/strategy/reasoning for the game.
TODO - insert a diagram
Value Proposition
TODO - insert the value prop around the complexity-spectrum of how evals are performed.
Invoking Sub Commands
Lime is a command line tool with three main sub-commands: `eval`, `agg`, and `grade`. Each of these sub-commands has its own set of arguments and options. To see the full list of options for each sub-command, run `lime <sub-command> --help`.
There are three main actions that can be taken with this tool:
- `lime eval`: run a specified model on a sheet or directory of sheets, create output(s).
- `lime agg`: aggregate and compare the results of model run(s).
- `lime grade`: update the grading and/or ground_truth of a sheet.
In addition there are supplementary commands:
- `lime init`: create a template config or an example dataset.
- `lime check`: print info on version, parameters, configs, secrets, etc.
Run Models on Question Sheets - `lime eval <input> [args]`:
lime eval
[<input>] # input-sheet or globs
[ -m <model_name>] # model name
[ -v <verbose_int>] # verbose level, can use -v / -vv style, default 0
[ -y / --dry_run ] # dry run, don't write output
[ --debug] # if set, print full stack trace on exception
Runs a specified model on a specified sheet (or directory of sheets) and creates an output file in the directory of the input sheet. If a directory is given as the input, one output file is created per input sheet. Grading is applied after the model completions are collected.
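For example, a typical invocation might look like the sketch below; the sheet filename is illustrative (any markdown file with the `input-` prefix works) and `gpt-3.5-turbo` is just one of the models mentioned elsewhere in this README:

```bash
# run one sheet against one model, with extra logging
lime eval ./input-common-sense-1.md -m gpt-3.5-turbo -vv

# preview the run without writing an output file
lime eval ./input-common-sense-1.md -m gpt-3.5-turbo --dry_run
```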
Aggregate and Compare Model Runs - `lime agg <input_dir> [args]`:
lime agg
<input_glob> # glob pattern for input json files
[-v <verbose_int>] # verbose level (prints extra info to stderr)
[--md]                  # output optimized for display in a file
[--terminal]            # output optimized for display in a terminal
[--no-format] # no justification or formatting applied to tables
[--completions] # table of formatted completions
[--discrepancies] # rows that have diff in grading
[--discrepancies-full] # rows that have diff in grading with respective completions
Basic: Generates summary tables of aggregation and comparison for all output-*.json files found in the supplied input directory. Outputs this data in markdown format (from pandas) to stdout. Redirect stdout to a file to save the output, e.g. `lime agg ./data/outputs/ > ./data/outputs/agg-1.md`.
Formatting / Style: The tool should auto-detect whether the output is going to a terminal or a file and format appropriately, but you can also specify this manually with the `--md` or `--terminal` flags. When piping into `less`, use the `--terminal` flag to get the best formatting. Add the `--no-format` flag to always get the full output without formatting.
Filtering with Globs: We can use globs to filter the input files, e.g. `lime agg ./aggfiles/*gpt-3.5*` to aggregate all output files that include this string in their filename.
Report Types: By default, the output is a summary of the model runs, but you can also specify the `--completions`, `--discrepancies`, or `--discrepancies-full` reports as args here.
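Putting a few of these flags together, a couple of illustrative combinations (the `./aggfiles/` directory is just the convention suggested in the Quickstart section below):

```bash
# page through grading discrepancies, formatted for the terminal
lime agg ./aggfiles/ --terminal --discrepancies | less

# save a markdown report of completions for the gpt-3.5 runs only
lime agg ./aggfiles/*gpt-3.5* --md --completions > agg-gpt35.md
```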
Grade (or re-grade) the Output of a Model Run - `lime grade <output> [args]`:
lime grade
<output> # output json file to grade
[-i <input_file.md>] # input file to pull [updated] ground-truth from
[-w ] # write changes; leave off for dry run
[-v ] # verbose boolean
[-l ] # "liberal grading" option
Takes as required input the path to an output json file and updates the grading field of each question therein.
Optionally, if an input-sheet is supplied with the `-i` flag, the ground_truth field of questions can also be updated, e.g. when ground_truth was initially ill-specified or needs to be revised.
By default this is a dry run; use the `-w` flag to write the changes to the output file.
Init Config or Sample Dataset - `lime init <init_type> [args]`:
lime init
<init_type> # `config` or `dataset`
[ --usr ] # create config in home directory
[ --simple] # create `simple` dataset
Adds template files:
- `config`: a `.lime/config.yaml` in the current working directory, unless the `--usr` flag is specified, in which case it is written to the home directory along with a `secrets.env` file for holding API keys.
- `dataset`: several files representing input sheets which can be used to test the tool. Currently the only option is `--simple`, which creates two sheets with two questions.
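For example, a fresh setup might run the commands below; the locations follow the description above (the exact home-directory layout is an assumption):

```bash
# user-level config plus a secrets.env for API keys (written to the home directory)
lime init config --usr

# or a workspace-level config at ./.lime/config.yaml
lime init config

# drop the example dataset into the current directory
lime init dataset --simple
```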
Check Versions, Config, etc - `lime check [args]`:
lime check
[ --dataset ]  # NotImplemented - should check which datasets are available.
Collects and prints information about current versions, config, secrets, etc. Useful for checking the tool is configured appropriately:
- Which settings the current working directory loads via the workspace config file.
- Which local models and APIs are available.
Quickstart
Set up the package:
git clone https://github.com/sutt/lime
cd lime
pip install -e .
Hello World
Head into a clean directory, outside of lime, and run the following:
lime init dataset --simple
set OPENAI_API_KEY=sk-...
lime eval . -v
lime agg .
Further
lime grade output-common-sense-1-gpt-3.5-turbo-aaff.json -l
lime grade output-common-sense-1-gpt-3.5-turbo-aaff.json -l -w
lime grade output-common-sense-2-gpt-3.5-turbo-aaff.json -i input-common-sense-2.md
lime grade output-common-sense-2-gpt-3.5-turbo-aaff.json -i input-common-sense-2.md -w
lime agg . > agg1.md
lime check
lime init config --workspace
# edit the .lime/config.yaml file
Obviously, with the long constructed filenames, tab completion in the terminal is vital to usability. To keep tab completion effective, several best practices are used (see the sketch after this list):
- Keep the number of input files in the directory low.
- Periodically move output files into a centralized repo, e.g. an `./aggfiles/` directory. This allows you to run `agg` commands on all the most recent test runs without polluting your experiment reporting with older, irrelevant data.
- To build queries out of the `aggfiles` directory, use globs, e.g. `lime agg ./aggfiles/*gpt-3.5*` to aggregate all output files that include this string in their filename.
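A sketch of that housekeeping loop, with illustrative names (`output-*.json` is the pattern of files written by `lime eval`):

```bash
# stash the latest run results in a central place
mkdir -p ./aggfiles
mv output-*.json ./aggfiles/

# aggregate across everything collected so far
lime agg ./aggfiles/ > ./aggfiles/agg-latest.md
```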
Running Tests
# pytest is required; if not already installed, it can be installed with:
pip install lime[dev]
# from root directory...
# run default tests
pytest -vv tests/
# without specifying the tests/ directory, you won't pick up pytest.ini
# and thus won't get the `slow` tag registered to exclude those tests by default
# run slow tests - these are where local llm models are run
pytest -vv -m slow tests/
Custom PipeLine - CPL Apps
TODO - add more info on Custom PipeLine
Outputs
Sample outputs of the program:
Output of main program run:
{
"sheet": {
"type": "sheet",
"name": "Rule-QA-1",
"sub_sections": [
{
"type": "question",
"name": "question",
"text": "Below is a question about the game Wordle. Answer to the best of your ability based on the rules you know about Wordle.\n<EVAL-ENDCHAR>\n\n"
},
{
"type": "meta",
"name": "meta",
"text": "- answer_type: mutliple-choice\n- answer suggested length: 10\n\n",
"data": {
"answer_type": "mutliple-choice",
"answer_suggested_length": "10"
}
}
],
"run_id": "2047",
"model_name": "gpt-3.5-turbo",
"meta_data": {
"answer_type": "mutliple-choice",
"answer_suggested_length": "10"
},
"question": "Below is a question about the game Wordle. Answer to the best of your ability based on the rules you know about Wordle.\n<EVAL-ENDCHAR>\n\n"
},
"questions": [
{
"name": "Num-Guesses-1",
"meta_data": {
"answer_type": "mutliple-choice",
"answer_suggested_length": "10"
},
"ground_truth": "B) 6",
"question": "Below is a question about the game Wordle. Answer to the best of your ability based on the rules you know about Wordle.\nQ: What is the maximum number of guesses you can make in a game of Wordle?\nA) 5\nB) 6\nC) 7\nD) No limit\n<EVAL-ENDCHAR1>\n\n",
"completion": "A) 5",
"error": null,
"model_name": "gpt-3.5-turbo",
"eval_time": 1.4130847454071045,
"grade": false
},
{
"name": "Num-Letter-1",
"meta_data": {
"answer_type": "mutliple-choice",
"answer_suggested_length": "10"
Aggregate Output
Leaderboard: `{input_sheet, model}` on `pct_correct`
input_name | model_name | num_questions | pct_correct |
---|---|---|---|
What Shows Up | gpt-4 | 1 | 1 |
What Shows Up | gpt-3.5-turbo | 1 | 0 |
Simulated-Missing-1 | gpt-4 | 10 | 0.5 |
Simulated-Missing-1 | gpt-3.5-turbo | 20 | 0.3 |
Simulated-Missing-1 | llama_13b_chat | 10 | 0.1 |
Rule-QA-1 | gpt-4 | 8 | 1 |
Rule-QA-1 | gpt-3.5-turbo | 16 | 0.38 |
Rule-QA-1 | llama_13b_chat | 8 | 0.38 |
JSON-state-reasoning-1 | gpt-4 | 10 | 0.7 |
JSON-state-reasoning-1 | gpt-3.5-turbo | 19 | 0.53 |
JSON-state-reasoning-1 | llama_13b_chat | 9 | 0.11 |
Runs: `{input_sheet, model}` on number of `run_id`'s
input_name | model_name | run_id |
---|---|---|
JSON-state-reasoning-1 | gpt-3.5-turbo | 2 |
Rule-QA-1 | gpt-3.5-turbo | 2 |
Simulated-Missing-1 | gpt-3.5-turbo | 2 |
JSON-state-reasoning-1 | gpt-4 | 1 |
JSON-state-reasoning-1 | llama_13b_chat | 1 |
Rule-QA-1 | gpt-4 | 1 |
Rule-QA-1 | llama_13b_chat | 1 |
Simulated-Missing-1 | gpt-4 | 1 |
Simulated-Missing-1 | llama_13b_chat | 1 |
What Shows Up | gpt-3.5-turbo | 1 |
What Shows Up | gpt-4 | 1 |
All Questions: list of all question names by sheet
input_name | name |
---|---|
JSON-state-reasoning-1 | Reason-Current-Turn-Num |
JSON-state-reasoning-1 | Reason-Letters-Guessed |
JSON-state-reasoning-1 | Reason-Letters-Guessed-2 |
JSON-state-reasoning-1 | Reason-Win |
JSON-state-reasoning-1 | Reason-Win-2 |
JSON-state-reasoning-1 | Reason-Win-3 |
JSON-state-reasoning-1 | Reason-Words-Guessed |
JSON-state-reasoning-1 | Reason-Words-Guessed-2 |
Rule-QA-1 | Mechanics-Basic-Reasoning-1 |
Rule-QA-1 | Mechanics-Basic-Reasoning-2 |
Rule-QA-1 | Mechanics-Guess-Valid-Word |
Rule-QA-1 | Mechanics-Multiletter-1 |
Rule-QA-1 | Num-Guesses-1 |
Rule-QA-1 | Num-Letter-1 |
Rule-QA-1 | Terminology-Absent-1 |
Rule-QA-1 | Terminology-Present-1 |
Simulated-Missing-1 | Simulated-Missing-0 |
Simulated-Missing-1 | Simulated-Missing-1 |
Simulated-Missing-1 | Simulated-Missing-2 |
Simulated-Missing-1 | Simulated-Missing-3 |
Simulated-Missing-1 | Simulated-Missing-4 |
Simulated-Missing-1 | Simulated-Missing-5 |
Simulated-Missing-1 | Simulated-Missing-6 |
Simulated-Missing-1 | Simulated-Missing-7 |
Simulated-Missing-1 | Simulated-Missing-8 |
Simulated-Missing-1 | Simulated-Missing-9 |
What Shows Up | Q-1 |
Quicklaunch
Install the package dependencies:
pip install -r requirements.txt
Add your OpenAI API key to your environment variables:
OPENAI_API_KEY=sk-...
Get an eval dataset to run against, and run the commands outlined above.
Markdown Question Sheets
These question sheets are the core data structure that this tool is built around.
Why Markdown?
Markdown is used as a compromise between ease of editing and the flexibility of json/yaml. Tradeoffs we're looking to hit:
- accessible to no-code users
- ease of version control diffing + cli tools (no html, no database, but not json)
- good enough control over formatting, whitespace, unicode, etc
- allows semi-structured format
Parsing Scheme
Parsing is dictated by a schema file: a yaml file that specifies the markdown headers to look for and how to parse them. The default schema is `./data/md-schema.yaml`, and you can specify your own schema file with the `-s` flag on the main script.
The question sheet itself is, by convention, named `input-name-version-.md`, with the `input-` prefix being important for the file to be picked up when parsing a directory. It is broken into two section types: `sheet` and `question`. There is one `sheet` section for the top-level sheet metadata and system prompt, which then cascades onto the individual `question` objects. `meta` data is key/value pairs about the question, e.g. is it a multiple choice question? what is the suggested max_tokens for the answer? etc.
sheet:
  md_header: 1
  children:
    md_header: 4
    options:
      - meta
      - question
question:
  md_header: 2
  children:
    md_header: 4
    options:
      - meta
      - answer
      - question
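To experiment with a modified schema, something like the sketch below could work; this assumes you are in the repo where `./data/md-schema.yaml` lives, and that the `-s` flag is accepted by the `eval` sub-command (the flag is documented above only as being on "the main script"):

```bash
# copy the default schema, edit it, then point the parser at your copy
cp ./data/md-schema.yaml ./my-schema.yaml
lime eval ./input-common-sense-1.md -s ./my-schema.yaml
```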
End Character Token - `<EVAL-ENDCHAR>` or `|EVAL-ENDCHAR|`: use these to signal the end of a question or answer section and to trim trailing space (or to deliberately allow line breaks), since by default the text field will keep accumulating `\n` characters until the next markdown section is found.
Example
In this example:
- the `sheet` level has:
  - `info`: for scratchpad notes/comments.
  - `meta`: specifying key/value pairs about all the questions on this sheet.
  - `question`: for the system prompt.
- the `question` level has multiple objects with:
  - `meta`: nothing specified here, since it cascades from the sheet level
  - `answer`: for the answer to the question (using `<EVAL-ENDCHAR>`)
  - `question`: for the question text.
# JSON-state-reasoning-1
#### info
Written 11.26.23
Revised 12.17.23
Goal: Use programming style representations of the game state to answer questions about the game.
#### meta
- answer_type: mutliple-choice
- answer suggested length: 10
#### question
Below is the state of a wordle game. Use the output of state information to answer the question.
A B O U T
S A L E S
F L A M E
□ □ □ □ □
□ □ □ □ □
□ □ □ □ □
```json
[
[['absent', 'a'], ['present', 'b'], ['correct', 'o'], ['absent', 'u'], ['absent', 't']],
[['present', 's'], ['absent', 'a'], ['absent', 'l'], ['absent', 'e'], ['correct', 's']],
[['absent', 'f'], ['absent', 'l'], ['absent', 'a'], ['absent', 'm'], ['absent', 'e']],
[['empty', ''], ['empty', ''], ['empty', ''], ['empty', ''], ['empty', '']],
[['empty', ''], ['empty', ''], ['empty', ''], ['empty', ''], ['empty', '']],
[['empty', ''], ['empty', ''], ['empty', ''], ['empty', ''], ['empty', '']]
]
```

## Reason-Win

#### meta

#### answer
B) No

#### question
Based on the output, has the player won yet? A) Yes B) No

## Reason-Win-2

#### meta

#### answer
B) No

#### question
Based on the output, do we know the secret word? A) Yes B) No
(...continued)