Package to conduct model equality testing for black-box language model APIs

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Model Equality Testing: Which Model Is This API Serving?

Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). To cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model. Any of these changes the output distribution, often without users being notified. How can we detect if an API has changed for our particular task using only sample access?

We formalize this problem as Model Equality Testing, a two-sample testing problem where the user collects samples from the API and a reference distribution, and conducts a statistical test to see if the two distributions are the same. Unlike current approaches that simply compare numbers on standard benchmarks, this approach is specific to a user’s distribution of task prompts, is applicable to tasks without automated evaluation metrics, and can be more powerful at distinguishing distributions.
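
One way to write this testing problem down is sketched here. The notation is an assumption chosen for illustration: P stands for the reference distribution, Q for the distribution served by the API, and each sample z is a (prompt, completion) pair.

```latex
H_0 : P = Q
\quad \text{versus} \quad
H_1 : P \neq Q,
\qquad
\text{given } z_1, \dots, z_n \sim P
\ \text{ and } \
z'_1, \dots, z'_{n'} \sim Q .
```

The user never sees P or Q directly; the test must decide between the two hypotheses from the two finite samples alone.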

To enable users to test APIs on their own tasks, we open-source a Python package here. Additionally, to encourage future research into this problem, we also release a dataset of 1 million LLM completions that can be used to learn / evaluate more powerful tests.

Installation

To run Model Equality Testing on your own samples, we recommend using pip to install the package:

pip install model-equality-testing

The package provides functions to run the tests discussed in the paper on your samples. This includes functions to compute test statistics and simulate p-values.

import numpy as np

########## Example data ###############
sampled_prompts_1 = np.array([0, 1, 0]) # integers representing which prompt was selected
corresponding_completions_1 = [
    "...a time to be born and a time to die",
    "'Laughter,' I said, 'is madness.'",
    "...a time to weep and a time to laugh",
] # corresponding completions
sampled_prompts_2 = np.array([0, 0, 1]) # integers representing which prompt was selected
corresponding_completions_2 = [
    "...a time to mourn and a time to dance",
    "...a time to embrace and a time to refrain from embracing",
    "I said to myself, 'Come now, I will test you'",
] # corresponding completions

######### Testing code ################
# Tokenize the string completions as unicode codepoints
# and pad both completion arrays to a shared maximum length of 200 chars
# NOTE: pad_to_length is used below; we assume it is exported from the same utils module
from model_equality_testing.utils import tokenize_unicode, pad_to_length
corresponding_completions_1 = tokenize_unicode(corresponding_completions_1)
corresponding_completions_1 = pad_to_length(corresponding_completions_1, L=200)
corresponding_completions_2 = tokenize_unicode(corresponding_completions_2)
corresponding_completions_2 = pad_to_length(corresponding_completions_2, L=200)


# Wrap these as CompletionSample objects
# m is the total number of prompts supported by the distribution
from model_equality_testing.distribution import CompletionSample

sample1 = CompletionSample(prompts=sampled_prompts_1, completions=corresponding_completions_1, m=2)
sample2 = CompletionSample(prompts=sampled_prompts_2, completions=corresponding_completions_2, m=2)

from model_equality_testing.algorithm import run_two_sample_test

# Run the two-sample test
pvalue, test_statistic = run_two_sample_test(
    sample1,
    sample2,
    pvalue_type="permutation_pvalue", # use the permutation procedure to compute the p-value
    stat_type="mmd_hamming", # use the MMD with Hamming kernel as the test statistic
    b=100, # number of permutations
)
print(f"p-value: {pvalue}, test statistic: {test_statistic}")
print("Should we reject P = Q?", pvalue < 0.05)
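
For intuition about what a permutation test with an MMD statistic and a Hamming kernel computes, here is a minimal NumPy-only sketch. It is not the package's implementation: the helper names (to_codepoints, hamming_gram, mmd_hamming, permutation_pvalue) are hypothetical, the padding value is an arbitrary choice, the estimator is the simple biased one, and prompts are ignored so that only completions are compared. The package's estimator and normalization may differ.

```python
import numpy as np


def to_codepoints(strings, L, pad=-1):
    """Encode strings as Unicode code points, truncated or padded to length L."""
    out = np.full((len(strings), L), pad, dtype=np.int64)
    for i, s in enumerate(strings):
        codes = [ord(ch) for ch in s[:L]]
        out[i, : len(codes)] = codes
    return out


def hamming_gram(a, b):
    """K[i, j] = fraction of positions where a[i] and b[j] hold the same token."""
    return (a[:, None, :] == b[None, :, :]).mean(axis=-1)


def mmd_hamming(x, y):
    """Biased (V-statistic) estimate of the squared MMD under the kernel above."""
    return float(
        hamming_gram(x, x).mean()
        + hamming_gram(y, y).mean()
        - 2 * hamming_gram(x, y).mean()
    )


def permutation_pvalue(x, y, b=100, seed=0):
    """Compare the observed statistic with statistics from b random relabelings."""
    rng = np.random.default_rng(seed)
    observed = mmd_hamming(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(b):
        idx = rng.permutation(len(pooled))
        permuted = mmd_hamming(pooled[idx[:n]], pooled[idx[n:]])
        exceed += int(permuted >= observed)
    # add-one correction keeps the p-value strictly positive
    return (1 + exceed) / (1 + b), observed


x = to_codepoints(["...a time to be born and a time to die",
                   "...a time to weep and a time to laugh"], L=200)
y = to_codepoints(["...a time to mourn and a time to dance",
                   "I said to myself, 'Come now, I will test you'"], L=200)
pvalue, stat = permutation_pvalue(x, y, b=100)
print(f"p-value: {pvalue}, test statistic: {stat}")
```

Under the null hypothesis the two samples are exchangeable, so shuffling which completion belongs to which sample should produce statistics comparable to the observed one. A small p-value means the observed split is unusually extreme among the shuffles.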

Dataset

To enable future research on better tests for Model Equality Testing, we release a dataset of LLM completions, including samples used in the paper experiments. At a high level, this dataset includes 1.6M completion samples collected across 5 language models, each served by various sources (e.g., local inference in fp32 and int8 precisions, and inference API providers such as amazon and azure). These completions are collected for a fixed set of 540 prompts. For 100 of these prompts (the "dev set"), we additionally collect logprobs for each completion under the fp32 model.

The data (and a spreadsheet documenting its contents) are hosted as a zip file and can be found via the project homepage. For convenience, we provide a function in the model-equality-testing package to automatically download and unzip the dataset.

# make sure to first install gdown 
# ! pip install gdown
from model_equality_testing.dataset import download_dataset
download_dataset(root_dir="./data") # will download to ./data

Once downloaded, you can load the dataset using the function load_distribution, which returns a DistributionFromDataset object.

# load a distribution object representing the joint distribution
# where prompts come from Wikipedia (Ru) with prompt ids 0, 3, 10
# and Wikipedia (De) with prompt id 5
# and completions come from meta-llama/Meta-Llama-3-8B-Instruct
from model_equality_testing.dataset import load_distribution
p = load_distribution(
    model="meta-llama/Meta-Llama-3-8B-Instruct", # model
    prompt_ids={"wikipedia_ru": [0, 3, 10], "wikipedia_de": [5]}, # prompts
    L=1000, # number of characters to pad / truncate to
    source="fp32", # or replace with 'nf4', 'int8', 'amazon', etc.
    load_in_unicode=True, # instead of tokens
    root_dir="./data",
)

Release history

This version

0.0.2

Oct 24, 2024

Download files

Source Distribution

model_equality_testing-0.0.2.tar.gz (18.7 kB)

Uploaded Oct 24, 2024 Source

Built Distribution

model_equality_testing-0.0.2-py3-none-any.whl (21.5 kB)

Uploaded Oct 24, 2024 Python 3

File details

Details for the file model_equality_testing-0.0.2.tar.gz.

File metadata

Download URL: model_equality_testing-0.0.2.tar.gz
Upload date: Oct 24, 2024
Size: 18.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for model_equality_testing-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`e848ad82feccaf464cedb23c63cb02291df93b14e223309ff2a780e60ad711f2`
MD5	`4754be69694633b3edd09c88de4ce3c4`
BLAKE2b-256	`5d5cc708cd36b8212b083223994eaee27997119fdf840d175a03b2dd0422225b`

File details

Details for the file model_equality_testing-0.0.2-py3-none-any.whl.

File metadata

Download URL: model_equality_testing-0.0.2-py3-none-any.whl
Upload date: Oct 24, 2024
Size: 21.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for model_equality_testing-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9a09bd513d5b92b544b6cbdca67765797e1708306a05d2683285b28f6ff12010`
MD5	`17e3b366e42210ee3d097eb0361f5b4e`
BLAKE2b-256	`7156cb4bf648863f5078baf1c602278e8cfa8990c609f8166e53f8ca0adb96cd`
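
The SHA256 digests listed above can be compared against a file after download. The following is a minimal sketch using only the Python standard library; the helper names sha256_of_file and matches_published_hash are hypothetical, and the file path in the commented usage line assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example (assumes the source archive is in the current directory):
# matches_published_hash(
#     "model_equality_testing-0.0.2.tar.gz",
#     "e848ad82feccaf464cedb23c63cb02291df93b14e223309ff2a780e60ad711f2",
# )
```

A mismatch means the file differs from the one that was uploaded and should not be installed.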
