streamline LLM evaluation

Nutcracker - Large Model Evaluations and Experiments

Like LM-Eval-Harness, but without the PyTorch madness. Bring your own API, plus straightforward data management.

https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce

Installation

Install Nutcracker

git clone https://github.com/brucewlee/nutcracker
pip install -e nutcracker

Download Nutcracker DB

git clone https://github.com/brucewlee/nutcracker-db

See the Nutcracker DB README for the full list of implemented tasks.

QuickStart

Case Study: Evaluate (Any) LLM API on TruthfulQA (Script)

STEP 1: Define Model

Define a simple model class with a "respond(self, user_prompt)" function.
We will use OpenAI here, but any API can be evaluated as long as the class has a "respond(self, user_prompt)" function that returns the LLM response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.).

from openai import OpenAI
import os, logging, sys
logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""  # set your OpenAI API key here
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": f"{user_prompt}"}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
                break
            except KeyboardInterrupt:
                sys.exit()
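            except Exception as error:
                # any other failure (timeout, rate limit, network error) triggers a retry
                print(f"Request failed ({error}), retrying...")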
        return response_data
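
The retry loop above keeps trying until a response arrives, which can spin forever if the API key is wrong or the service is down. Because Nutcracker only needs an object with a "respond(self, user_prompt)" function, one option is to wrap any model in a small retrying layer. The sketch below is an illustration under assumptions, not part of Nutcracker: RetryingModel, FlakyEcho, max_attempts, base_delay, and sleep are all hypothetical names, and FlakyEcho stands in for a real API so the example runs offline.

```python
import time


class RetryingModel:
    """Wraps any object exposing respond(user_prompt) and retries failed calls."""

    def __init__(self, model, max_attempts=5, base_delay=1.0, sleep=time.sleep):
        self.model = model
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so the wrapper can be exercised without waiting

    def respond(self, user_prompt):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.model.respond(user_prompt)
            except Exception as error:
                if attempt == self.max_attempts:
                    raise  # give up and surface the last error
                delay = self.base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ...
                print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
                self.sleep(delay)


class FlakyEcho:
    """Hypothetical stand-in model: fails twice, then echoes the prompt."""

    def __init__(self):
        self.calls = 0

    def respond(self, user_prompt):
        self.calls += 1
        if self.calls <= 2:
            raise TimeoutError("simulated timeout")
        return f"echo: {user_prompt}"


model = RetryingModel(FlakyEcho(), max_attempts=5, base_delay=0.01)
print(model.respond("Is the sky blue?"))
```

The wrapped object still exposes "respond(self, user_prompt)", so it should be usable anywhere the ChatGPT class above is, with the difference that a persistent failure raises instead of looping.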

STEP 2: Run Evaluation

from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 for demo
truthfulqa.sample(20, in_place=True)

# running this experiment updates each instance's model_response property in truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation updates each instance's response_correct property in truthfulqa data object with evaluations
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))
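
To make the response_correct property concrete, here is a minimal sketch of how an accuracy figure can be tallied from per-instance flags. It assumes response_correct holds a boolean for each instance, which the script above suggests but does not confirm. ScoredInstance and accuracy are hypothetical names, and this is not how generate_report is implemented.

```python
from dataclasses import dataclass


@dataclass
class ScoredInstance:
    # Hypothetical stand-in for an instance that has already been evaluated
    user_prompt: str
    response_correct: bool


def accuracy(instances):
    """Fraction of instances whose response_correct flag is True."""
    if not instances:
        raise ValueError("cannot compute accuracy of an empty collection")
    correct = sum(1 for item in instances if item.response_correct)
    return correct / len(instances)


scored = [
    ScoredInstance("Q1", True),
    ScoredInstance("Q2", False),
    ScoredInstance("Q3", True),
    ScoredInstance("Q4", True),
]
print(f"accuracy: {accuracy(scored):.2%}")
```

With the 20-instance demo sample used above, each instance moves accuracy by 5 percentage points, so treat small-sample numbers as a smoke test rather than a measurement.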

Case Study: Task vs. Pile? Evaluating LLaMA on MMLU (Script)

STEP 1: Understand the basis of Nutcracker

Despite the field's long history of model evaluation, my understanding is that we have not reached a clear consensus on what a "benchmark" is (Is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?).
Instead of using the word benchmark, Nutcracker divides its data structure into Instance, Task, and Pile (See blog post: HERE).
Nutcracker DB is constructed at the Task level, but you can load multiple Tasks together at the Pile level.
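
One way to picture the three levels is as nested containers: a Pile holds Tasks, and a Task holds Instances. The sketch below is a mental model only, not Nutcracker's actual classes. The attribute names user_prompt, correct_options, model_response, and response_correct come from the scripts in this README; the name, instances, and tasks fields and the flat indexing behavior are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    user_prompt: str
    correct_options: List[str]
    model_response: Optional[str] = None      # filled in by an experiment run
    response_correct: Optional[bool] = None   # filled in by an evaluator


@dataclass
class Task:
    name: str
    instances: List[Instance] = field(default_factory=list)

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, index):
        return self.instances[index]


@dataclass
class Pile:
    name: str
    tasks: List[Task] = field(default_factory=list)

    def __len__(self):
        return sum(len(task) for task in self.tasks)

    def __getitem__(self, index):
        # Flat indexing: walk the tasks in order until the index falls inside one
        if index < 0:
            index += len(self)
        if index < 0:
            raise IndexError("pile index out of range")
        for task in self.tasks:
            if index < len(task):
                return task[index]
            index -= len(task)
        raise IndexError("pile index out of range")


algebra = Task("algebra", [Instance("2 + 2 = ?", ["B"]), Instance("3 * 3 = ?", ["D"])])
history = Task("history", [Instance("Who wrote the Iliad?", ["A"])])
pile = Pile("demo-pile", [algebra, history])

print(len(pile))
print(pile[2].user_prompt)
```

Under this model, code that loops over len(data) and reads data[i] works the same whether data is a single Task or a Pile of many Tasks, which matches how the scripts in this README treat both.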

STEP 2: Define Model

Since we've tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: HERE)

import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept" : "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json" 
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        output = self.query({
            # Llama 2 chat format: the user turn is closed with [/INST]
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt} [/INST]",
        })
        return output[0]['generated_text']
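
The respond function above indexes output[0]['generated_text'] directly, so an unexpected payload surfaces as a bare KeyError or TypeError in the middle of a long run. A small validation helper can make such failures easier to diagnose. This is a sketch under assumptions: extract_generated_text is a hypothetical helper, the success shape (a list of dicts with a "generated_text" key) is inferred from the code above, and the failure shape (a dict with an "error" key) is an assumption about the endpoint that you should check against your own deployment.

```python
def extract_generated_text(output):
    """Return the generated text from a parsed JSON payload, or raise a descriptive error."""
    # Assumed failure shape: {"error": "..."}
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"endpoint returned an error: {output['error']}")
    # Assumed success shape: [{"generated_text": "..."}]
    if (
        isinstance(output, list)
        and output
        and isinstance(output[0], dict)
        and "generated_text" in output[0]
    ):
        return output[0]["generated_text"]
    raise ValueError(f"unexpected payload shape: {output!r}")


print(extract_generated_text([{"generated_text": "The answer is B."}]))
```

In the LLaMA class, the last line of respond could then become `return extract_generated_text(output)`.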

STEP 3: Load Data

from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu','nutcracker-db/db')

STEP 4: Run Experiment (Retrieve Model Responses)

Running the experiment updates each instance's model_response attribute within the data object, which is the mmlu Pile in this case.
You can save the data object at any step of the evaluation. Let's save it this time so that we don't have to repeat the API requests if anything goes wrong later.

from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')

You can load the saved file and check how the model responded.

loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
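
The save-then-reload steps above generalize to a "load if a checkpoint exists, otherwise run and save" pattern, which avoids paying for the same API requests twice. The sketch below shows the pattern with Python's pickle module. load_or_run and expensive_experiment are hypothetical names, and the use of pickle is an assumption suggested by the .pkl extension, not a statement about how save_to_file works internally.

```python
import os
import pickle
import tempfile


def load_or_run(path, run):
    """Return the cached result at path, or call run() and cache its result."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = run()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_experiment():
    calls.append(1)  # stands in for a round of paid API requests
    return {"model_response": ["B", "A", "D"]}


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "responses.pkl")
    first = load_or_run(checkpoint, expensive_experiment)   # runs and saves
    second = load_or_run(checkpoint, expensive_experiment)  # loads from disk

print(first == second, len(calls))
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.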

STEP 5: Run Evaluation

LLMs often don’t respond with immediately recognizable letters like A, B, C, or D.
Therefore, Nutcracker supports an intent-matching feature (requires an OpenAI API key) that parses the model response and maps it to a discrete label, but let’s disable that for now and proceed with our evaluation.
We recommend using intent-matching for almost all use cases. We will publish detailed research on this later.

from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
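
To see why free-form responses are hard to score without intent-matching, consider a strict pattern-based matcher. The sketch below is for illustration only and is not Nutcracker's implementation, with or without intent-matching. extract_option_letter is a hypothetical helper that accepts a response only when exactly one standalone capital letter from A to D appears in it.

```python
import re

# A capital letter A-D that stands alone as a word, e.g. "B", "(C)", "D)"
_OPTION = re.compile(r"\b([A-D])\b")


def extract_option_letter(response):
    """Return the single option letter found in response, or None if missing or ambiguous."""
    letters = set(_OPTION.findall(response))
    return letters.pop() if len(letters) == 1 else None


responses = [
    "B",
    "The answer is (C).",
    "Answer: D) Paris",
    "I think it's the second option.",
    "A banana is a fruit, so the answer is B.",
]
for response in responses:
    print(repr(response), "->", extract_option_letter(response))
```

The last two responses come back as None: one never names a letter, and the other contains a stray "A" that makes the match ambiguous. Both are clear to a human reader, which is the gap that intent-matching is meant to close.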

https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9

nutcracker-py 0.0.1a26
