Skip to main content

Mood-Bench: a multi-domain out-of-distribution safety benchmark for LLMs.

Project description

MOOD Bench

Paper Dataset Leaderboard
License: MIT Python 3.12+

A multi-domain out-of-distribution safety benchmark for LLMs.

Table of Contents

Introduction

MOOD Bench (Misalignment out-of-distribution benchmark) measures how well safety monitors generalize beyond the data they were trained on. MOOD is built to evaluate monitors that are trained on a specific restricted training split. Models are embedded as part of monitoring pipelines, and each pipeline is calibrated on a small set of in-distribution conversations (helpful, harmless, function-calling) and then evaluated on diverse out-of-distribution unsafe behaviors — jailbreaking, sycophancy, scheming, insecure code, controlling responses, missing or inappropriate function calls, etc.

Installation

Installing from PyPI

pip install mood-bench

# With vLLM support (recommended for the instruction-tuned pipeline)
pip install mood-bench[vllm]

Building from source

Clone the repo:

git clone https://github.com/Dylan102938/mood-bench.git
cd mood-bench

Using uv (recommended):

uv venv
uv sync

Or with pip:

pip install -e ".[vllm]"

Verifying installation

Run the CLI without arguments to confirm the mood entry point is on your path:

mood --help

To run the unit tests:

uv run pytest tests/test_cli.py tests/test_analysis.py

End-to-end tests that pull real adapters and require a GPU live under tests/e2e/:

uv run pytest tests/e2e/ -m gpu

Usage

Running the built-in pipelines

The mood CLI exposes one subcommand per built-in pipeline under mood bench and a separate mood analyze for post-hoc analysis.

Guard model — a sequence-classification model with a binary safe/unsafe head, optionally with a LoRA adapter. The mood-bench-released guard adapters output a safety score (higher = more safe), so pass --predict-safe to invert it for AUROC:

mood bench guard \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-guard \
    --output-dir results/gemma-2-2b-guard \
    --batch-size 8 \
    --max-length 2048 \
    --predict-safe

Perplexity — token-level negative log-likelihood under a causal LM, with optional LoRA adapter merged on top:

mood bench perplexity \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-causal-lm \
    --output-dir results/perplexity \
    --batch-size 8 \
    --max-length 2048

Mahalanobis distance — fits a Gaussian on safe in-distribution hidden states and scores each test sample by its distance from that distribution. Stats are cached under --stats-cache-dir so subsequent runs are fast:

mood bench mahalanobis \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-guard \
    --pooling cls \
    --stats-cache-dir mahalanobis-stats/ \
    --output-dir results/mahalanobis \
    --batch-size 4 \
    --max-length 2048

Instruction-tuned judge — an instruction-tuned LLM asked to score each sample. Uses vLLM if installed, falls back to transformers:

mood bench instruction-tuned \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --grading-type alignment \
    --num-few-shot 3 \
    --output-dir results/instruction-tuned

Every mood bench subcommand accepts a common set of flags (--use-mini for a quick sanity-check subset, --domains to evaluate a subset of misaligned settings, --no-figures to skip plots, -v for verbose output, etc.). Run mood bench <pipeline> --help for the full list.

Each run writes a versioned directory under --output-dir containing results.jsonl (per-sample scores), analysis.json (group-level AUROC and TPR@FPR), and per-group score_hist.png / auroc.png figures.

Analyzing pre-scored results

mood analyze consumes one or more results.jsonl files (the format produced by mood bench) and re-runs the metric / figure step, optionally aggregating across multiple results datasets.

Single run:

mood analyze results/guard/results.jsonl --output-dir reports/guard

Combining multiple monitors with an aggregator (min / mean / lambda):

mood analyze \
    results/guard/results.jsonl \
    results/perplexity/results.jsonl \
    --aggregator lambda \
    --anchor-index 0 \
    --output-dir reports/guard+ppl

For an ensemble of identical-architecture guard runs, take the min:

mood analyze \
    results/guard-particle-0/results.jsonl \
    results/guard-particle-1/results.jsonl \
    results/guard-particle-2/results.jsonl \
    --aggregator min \
    --output-dir reports/guard-ensemble

Running your own pipelines

The CLI is a thin wrapper around the mood_bench Python API. To plug in your own monitors, implement the Pipeline protocol — any callable that maps a list of conversation strings to a (scores, metadata) tuple — and hand it to mood_bench():

from mood_bench import mood_bench, GuardModelPipeline, load_tokenizer
from transformers import AutoModelForSequenceClassification

tokenizer = load_tokenizer("mood-bench/gemma-2-2b-guard")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-2-2b", dtype="bfloat16"
)

results, report = mood_bench(
    pipelines=GuardModelPipeline(model, tokenizer, unsafe_label_index=1),
    output_dir="results/my-guard",
    eval_batch_size=8,
    max_length=2048,
    predict_safe=True,
)
print(report["groups"]["overall"])

mood_bench(...) returns the scored Dataset and a metrics report dict. Passing output_dir=None skips disk writes and returns everything in-memory.

For multi-pipeline runs, pass a list of pipelines plus an Aggregator:

from mood_bench import MinAggregate, mood_bench

mood_bench(
    pipelines=[guard_a, guard_b, guard_c],
    aggregator=MinAggregate(),
    output_dir="results/guard-ensemble",
)

The [examples/](examples/) directory contains complete, self-contained scripts that you can copy and adapt:

Script What it shows
[examples/guard.py](examples/guard.py) Minimal single-pipeline run with a guard model + LoRA adapter.
[examples/guard_ensemble.py](examples/guard_ensemble.py) Loading five guard adapters in sequence and aggregating with MinAggregate.
[examples/mixture_guard_perplexity.py](examples/mixture_guard_perplexity.py) Combining a guard classifier with a perplexity scorer via LambdaAggregate, including per-pipeline predict_safe orientation.
[examples/analysis.py](examples/analysis.py) Standalone re-analysis script — equivalent to mood analyze but easier to fork for custom metrics.

To re-run analysis from Python without re-scoring:

from datasets import load_dataset
from mood_bench import mood_bench_analysis

ds = load_dataset("json", data_files="results/guard/results.jsonl", split="train")
scored_ds, analysis_report = mood_bench_analysis(results=ds, output_path="reports/guard")

Code structure overview

The package is laid out as follows:

mood_bench/
├── core.py          # mood_bench() and mood_bench_analysis() entry points
├── data.py          # EvalDataset enum, load_mood_dataset(), domain constants
├── aggregator.py    # MinAggregate, MeanAggregate, LambdaAggregate
├── metrics.py       # tpr_at_fpr, ROC + score-histogram plotters
├── tokenize.py      # load_tokenizer() and chat-template rendering
├── pipeline/
│   ├── base.py              # Pipeline protocol
│   ├── guard.py             # GuardModelPipeline
│   ├── perplexity.py        # PerplexityPipeline
│   ├── mahalanobis.py       # MahalanobisPipeline + get_stats_for_model
│   └── instruction_tuned.py # InstructionTunedPipeline (vLLM/HF backends)
└── cli/
    ├── __init__.py          # `mood` entry point
    ├── _common.py           # shared CLI flags
    ├── guard.py             # `mood bench guard`
    ├── perplexity.py        # `mood bench perplexity`
    ├── mahalanobis.py       # `mood bench mahalanobis`
    ├── instruction_tuned.py # `mood bench instruction-tuned`
    └── analyze.py           # `mood analyze`

examples/         # Standalone Python scripts demonstrating the library API
tests/
├── test_cli.py        # Fast CLI tests with stub pipelines
├── test_analysis.py   # Aggregator + analysis tests
└── e2e/               # GPU-only end-to-end tests against real adapters

Further issues and questions

If you run into a problem or have a question, please contact Dylan Feng at dfeng102938@berkeley.edu.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mood_bench-1.0.0.tar.gz (204.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mood_bench-1.0.0-py3-none-any.whl (39.0 kB view details)

Uploaded Python 3

File details

Details for the file mood_bench-1.0.0.tar.gz.

File metadata

  • Download URL: mood_bench-1.0.0.tar.gz
  • Upload date:
  • Size: 204.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mood_bench-1.0.0.tar.gz
Algorithm Hash digest
SHA256 cc2ddde1eb340877057054ceafad5beeb28a0fd99fa65e1a360797a537145431
MD5 d3a04ed58a120a39cb76d3aeea10bf69
BLAKE2b-256 c41a942125d8da166a24312403f1c55753441be973a18a963fa2ea1bd01736d0

See more details on using hashes here.

Provenance

The following attestation bundles were made for mood_bench-1.0.0.tar.gz:

Publisher: publish.yml on Dylan102938/mood-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file mood_bench-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: mood_bench-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 39.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mood_bench-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8c2b2e59534a92c59c1b37a97e97b02fb5a7538a724cc004ac3a84872fe27f67
MD5 454fb6486c56317a1976ca83acde35af
BLAKE2b-256 479fcee00b3d84c105a6be99f03bbc8220736baa3b31c152a963320655d7a24e

See more details on using hashes here.

Provenance

The following attestation bundles were made for mood_bench-1.0.0-py3-none-any.whl:

Publisher: publish.yml on Dylan102938/mood-bench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page