
streamline LLM evaluation

Project description

Nutcracker - Large Model Evaluation

Like LM-Eval-Harness, but without the PyTorch madness. Use this to evaluate LLMs served through APIs.

https://github.com/brucewlee/nutcracker/assets/54278520/151403fc-217c-486c-8de6-489af25789ce


Installation

Route 1. PyPI

Install Nutcracker

pip install nutcracker-py

Download Nutcracker DB

git clone https://github.com/evaluation-tools/nutcracker-db

Route 2. GitHub

Install Nutcracker

git clone https://github.com/evaluation-tools/nutcracker
pip install -e nutcracker

Download Nutcracker DB

git clone https://github.com/evaluation-tools/nutcracker-db

Check the list of all implemented tasks on the Nutcracker DB README page.
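
If you cloned both repositories into the same directory, a quick sanity check confirms that the package imports and the DB is reachable. This is a minimal sketch that reuses only the loader call from the QuickStart below:

# assumes nutcracker-db was cloned next to your working directory
from nutcracker.data import Task

task = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')
print(len(task))  # prints the number of instances in the task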


QuickStart

Case Study: Evaluate (Any) LLM API on TruthfulQA (Script)

STEP 1: Define Model
  • Define a simple model class with a "respond(self, user_prompt)" method.
  • We will use OpenAI here, but any API can be evaluated as long as a "respond(self, user_prompt)" method exists that returns the LLM's response as a string. Get creative (Hugging Face API, Anthropic API, Replicate API, Ollama, etc.); a minimal stub illustrating the contract follows this list.
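As a sketch of that contract (EchoModel and its canned reply are hypothetical placeholders, not part of Nutcracker's API), any object shaped like this can be handed to Schema:

class EchoModel:
    # hypothetical stand-in that always answers "A"; swap respond() for a real API call
    def respond(self, user_prompt):
        return "A"

The OpenAI-backed version below follows the same pattern.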
from openai import OpenAI
import os, logging, sys

logging.basicConfig(level=logging.INFO)
logging.getLogger('httpx').setLevel(logging.CRITICAL)
os.environ["OPENAI_API_KEY"] = ""
client = OpenAI()

class ChatGPT:
    def __init__(self):
        self.model = "gpt-3.5-turbo"

    def respond(self, user_prompt):
        # retry until a response arrives; Ctrl+C exits cleanly
        response_data = None
        while response_data is None:
            try:
                completion = client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "user", "content": user_prompt}
                    ],
                    timeout=15,
                )
                response_data = completion.choices[0].message.content
            except KeyboardInterrupt:
                sys.exit()
            except Exception:
                print("Request failed or timed out, retrying...")
        return response_data
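
The same pattern extends to locally served models. Below is a hedged sketch against Ollama's /api/generate endpoint; the URL, model name, and "response" field come from Ollama's documented HTTP API and have nothing to do with Nutcracker itself:

import requests

class LocalOllama:
    def __init__(self):
        # assumes an Ollama server is running locally and the llama3 model has been pulled
        self.api_url = "http://localhost:11434/api/generate"
        self.model = "llama3"

    def respond(self, user_prompt):
        response = requests.post(
            self.api_url,
            json={"model": self.model, "prompt": user_prompt, "stream": False},
            timeout=60,
        )
        return response.json()["response"]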
STEP 2: Run Evaluation
from nutcracker.data import Task, Pile
from nutcracker.runs import Schema
from nutcracker.evaluator import MCQEvaluator, generate_report

# this db_directory value should work off-the-shelf if you cloned both repositories in the same directory
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')

# sample 20 for demo
truthfulqa.sample(20, in_place=True)

# running this experiment updates each instance's model_response property in truthfulqa data object with ChatGPT responses
experiment = Schema(model=ChatGPT(), data=truthfulqa)
experiment.run()

# running this evaluation updates each instance's response_correct property in truthfulqa data object with evaluations
evaluation = MCQEvaluator(data=truthfulqa)
evaluation.run()

for i in range(len(truthfulqa)):
    print(truthfulqa[i].user_prompt)
    print(truthfulqa[i].model_response)
    print(truthfulqa[i].correct_options)
    print(truthfulqa[i].response_correct)
    print()

print(generate_report(truthfulqa, save_path='accuracy_report.txt'))

Case Study: Task vs. Pile? Evaluating LLaMA on MMLU (Script)

STEP 1: Understand the basis of Nutcracker
  • Despite the field's long history of model evaluation, my understanding is that we have not reached a clear consensus on what a "benchmark" is (Is MMLU a "benchmark"? Is the Hugging Face Open LLM Leaderboard a "benchmark"?).
  • Instead of using the word benchmark, Nutcracker divides its data structures into Instance, Task, and Pile (see blog post: HERE).
  • Nutcracker DB is constructed at the Task level, but you can call multiple Tasks together at the Pile level; a short sketch after this list contrasts the two.
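
As a small illustration using only the loader calls that appear elsewhere on this page: a Task is a single dataset, while a Pile bundles several Tasks (the mmlu Pile, for example, groups the MMLU subject tasks):

from nutcracker.data import Task, Pile

# one dataset = one Task
truthfulqa = Task.load_from_db(task_name='truthfulqa-mc1', db_directory='nutcracker-db/db')
print(len(truthfulqa))

# many Tasks called together = one Pile
mmlu = Pile.load_from_db('mmlu', 'nutcracker-db/db')
print(len(mmlu))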

STEP 2: Define Model
  • Since we've already tried the OpenAI API above, let's now try a Hugging Face Inference Endpoint. Most open-source models are accessible through this option. (See blog post: HERE)
import requests

class LLaMA:
    def __init__(self):
        self.API_URL = "https://xxxx.us-east-1.aws.endpoints.huggingface.cloud"

    def query(self, payload):
        headers = {
            "Accept": "application/json",
            "Authorization": "Bearer hf_XXXXX",
            "Content-Type": "application/json"
        }
        response = requests.post(self.API_URL, headers=headers, json=payload)
        return response.json()

    def respond(self, user_prompt):
        # Llama-2 chat format: system prompt inside <<SYS>> tags, user turn closed with [/INST]
        output = self.query({
            "inputs": f"<s>[INST] <<SYS>> You are a helpful assistant. You keep your answers short. <</SYS>> {user_prompt} [/INST]",
        })
        return output[0]['generated_text']
STEP 3: Load Data
from nutcracker.data import Pile
import logging
logging.basicConfig(level=logging.INFO)

mmlu = Pile.load_from_db('mmlu','nutcracker-db/db')
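
Loading the full Pile pulls in every MMLU task, so it is worth checking how many instances you now have before requesting responses (len() on a Pile is used the same way later in this script):

print(len(mmlu))  # total number of instances across the whole Pile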
STEP 4: Run Experiment (Retrieve Model Responses)
  • Running the experiment updates each instance's model_response attribute within the data object, which is the mmlu Pile in this case.
  • You can save the data object at any step of the evaluation. Let's save it this time so we don't have to re-send API requests if anything goes wrong.
from nutcracker.runs import Schema
mmlu.sample(n=1000, in_place=True)

experiment = Schema(model=LLaMA(), data=mmlu)
experiment.run()
mmlu.save_to_file('mmlu-llama.pkl')
  • You can load and check how the model responded.
loaded_mmlu = Pile.load_from_file('mmlu-llama.pkl')
for i in range(len(loaded_mmlu)):
    print("\n\n\n---\n")
    print("Prompt:")
    print(loaded_mmlu[i].user_prompt)
    print("\nResponses:")
    print(loaded_mmlu[i].model_response)
STEP 5: Run Evaluation
  • LLMs often don’t respond with immediately recognizable option letters like A, B, C, or D.
  • Therefore, Nutcracker supports an intent-matching feature (requires an OpenAI API key) that parses the model response and maps it to a discrete label, but let’s disable it for now and proceed with the evaluation.
  • We recommend using intent-matching for almost all use cases. We will publish detailed research on this later. A hedged sketch of the intent-matching variant follows the code below.
from nutcracker.evaluator import MCQEvaluator, generate_report
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=True)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report.txt'))
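
To turn intent-matching back on, the sketch below assumes that omitting disable_intent_matching (or passing False) enables the feature and that the evaluator reads the OpenAI key from the environment; both points are assumptions layered on top of what is shown above, not documented behavior:

import os
os.environ["OPENAI_API_KEY"] = ""  # assumption: intent-matching reads the key from the environment

# assumption: disable_intent_matching=False (or omitting the flag) enables intent-matching
evaluation = MCQEvaluator(data=loaded_mmlu, disable_intent_matching=False)
evaluation.run()
print(generate_report(loaded_mmlu, save_path='accuracy_report_intent.txt'))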

https://github.com/brucewlee/nutcracker/assets/54278520/6deb5362-fd48-470e-9964-c794425811d9


Tutorials



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nutcracker_py-0.0.2a2.tar.gz (26.6 kB)

Uploaded Source

Built Distribution

nutcracker_py-0.0.2a2-py3-none-any.whl (55.1 kB)

Uploaded Python 3

File details

Details for the file nutcracker_py-0.0.2a2.tar.gz.

File metadata

  • Download URL: nutcracker_py-0.0.2a2.tar.gz
  • Upload date:
  • Size: 26.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for nutcracker_py-0.0.2a2.tar.gz
  • SHA256: 74faf99f0d19500288a9a646e65fcde4a7e94c0c4f7fb0965fb3b371ef7e9d1e
  • MD5: 683e7a3f377f62eeda6ed1d6b92bd239
  • BLAKE2b-256: bd100f95381f9528fa9c07f7c9aa3b15ddd3d316d7adf1fa44f1a863e5cb9c21

See more details on using hashes here.

File details

Details for the file nutcracker_py-0.0.2a2-py3-none-any.whl.

File metadata

File hashes

Hashes for nutcracker_py-0.0.2a2-py3-none-any.whl
  • SHA256: 3e3cedd73d423ccc499e462cc810f904679f41d7eaa22e658b39b86d6d019b9e
  • MD5: 3a29bd5a5ef15ee9b99a145fbbfa4e23
  • BLAKE2b-256: 877d8e115ddfa5623bd03f10f5313990c834e52451ed1c5db48188de81e8acaf

See more details on using hashes here.
