Bring Your Own Data! Self-Supervised Evaluation for Large Language Models

This is the official code for the paper "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models". If you have any questions, feel free to email njain17@umd.edu.

About

To complement conventional evaluation, we propose a framework for self-supervised model evaluation. In this framework, metrics are defined as invariances and sensitivities that can be checked in a self-supervised fashion using interventions based only on the model in question rather than external labels. Self-supervised evaluation pipelines are dataset-agnostic, and so they can be utilized over larger corpora of evaluation data than conventional metrics, or even directly in production systems to monitor day-to-day performance. In this work, we develop this framework, discuss desiderata for such metrics, and provide a number of case studies for self-supervised metrics: knowledge capability, toxicity detection, long-range (context), word-order, and tokenization sensitivities. By developing these new metrics, we hope to provide a more comprehensive and nuanced understanding of the strengths and limitations of LLMs.
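
As a rough illustration of the idea, the sketch below measures a word-order sensitivity using a toy bigram model as a stand-in for an LLM. It is not the BYOD implementation: the function names (make_bigram_scorer, swap_adjacent_words, sensitivity), the toy corpus, and the use of average negative log-likelihood as the score are all assumptions made for this example. The point it shows is that the metric needs only the model's own scores before and after an intervention, with no external labels.

```python
import math
import random
from collections import Counter
from typing import Callable, List


def make_bigram_scorer(corpus: List[str]) -> Callable[[str], float]:
    """Fit a toy add-one-smoothed bigram model (a stand-in for an LLM).

    The returned function maps a text to its average negative
    log-likelihood per token under that toy model.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words

    def score(text: str) -> float:
        tokens = ["<s>"] + text.lower().split()
        nll = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
            nll -= math.log(prob)
        return nll / max(len(tokens) - 1, 1)

    return score


def swap_adjacent_words(text: str, rng: random.Random) -> str:
    """Intervention: swap one randomly chosen pair of neighbouring words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def sensitivity(
    score: Callable[[str], float],
    intervene: Callable[[str, random.Random], str],
    texts: List[str],
    seed: int = 0,
) -> float:
    """Mean change in the model's score caused by the intervention."""
    rng = random.Random(seed)
    deltas = [score(intervene(t, rng)) - score(t) for t in texts]
    return sum(deltas) / len(deltas)


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat saw the dog",
]
score = make_bigram_scorer(corpus)
word_order_sensitivity = sensitivity(score, swap_adjacent_words, corpus)
print("word-order sensitivity:", word_order_sensitivity)
```

A positive value here means the toy model finds the perturbed text less likely than the original. An invariance test has the same shape, except that a good model is expected to produce a change close to zero.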

Installation

Run pip install byod to install the package from PyPI, or install from source with pip install git+https://github.com/neelsjain/BYOD/.

Dependencies

  • transformers==4.28.1
  • scipy==1.10.1
  • torch==2.0.0
  • datasets==2.11.0
  • nltk==3.8.1
  • apache_beam==2.48.0

Python 3.8 or higher is recommended.
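
If you want to confirm that your environment matches these pins, a small check along the following lines may help. This is a sketch and not part of the BYOD package: the helper names (installed_version, check_pins, python_ok) are made up for this example, and it relies only on the standard-library importlib.metadata module. Distribution names are taken from the list above; if a lookup by the underscore name fails in your environment, trying the hyphenated name (apache-beam) is a reasonable next step.

```python
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the dependency list above.
PINNED: Dict[str, str] = {
    "transformers": "4.28.1",
    "scipy": "1.10.1",
    "torch": "2.0.0",
    "datasets": "2.11.0",
    "nltk": "3.8.1",
    "apache_beam": "2.48.0",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str], bool]]:
    """Map each package to (wanted, found, matches)."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        # Ignore local build suffixes such as "+cu117" when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report


def python_ok() -> bool:
    """True if the interpreter meets the recommended minimum (3.8)."""
    return sys.version_info >= (3, 8)


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    for name, (wanted, found, matches) in check_pins(PINNED).items():
        status = "ok" if matches else "mismatch"
        print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A mismatch does not necessarily mean the package will fail to run; it only signals that the environment differs from the versions listed above.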

Usage

See run_model.sh for examples of how to evaluate a model. As an example, we provide scripts, named run_[metric].py, that run Hugging Face models against metrics computed on Wikipedia data.

Note that only Hugging Face models are currently supported.

You can also use the metrics directly, given your own model, tokenizer, and dataset, like so:

import BYOD

long_range_sensitivity = BYOD.lrs_metric(model, data, tokenizer)
negation_knowledge = BYOD.negation_metric(model, data, tokenizer)
tokenization_robustness = BYOD.tokenization_metric(model, data, tokenizer)
toxicity_proxy = BYOD.toxicity_metric(model, data, tokenizer)
word_order_sensitivity = BYOD.word_order_metric(model, data, tokenizer)

Suggestions and Pull Requests are welcome!

Everything can be better! If you have suggestions for improving the codebase or the invariance/sensitivity tests, feel free to reach out!
