Mood-Bench: a multi-domain out-of-distribution safety benchmark for LLMs.
Project description
MOOD Bench
A multi-domain out-of-distribution safety benchmark for LLMs.
Table of Contents
Introduction
MOOD Bench (Misalignment out-of-distribution benchmark) measures how well safety monitors generalize beyond the data they were trained on. MOOD is built to evaluate monitors that are trained on a specific restricted training split. Models are embedded as part of monitoring pipelines, and each pipeline is calibrated on a small set of in-distribution conversations (helpful, harmless, function-calling) and then evaluated on diverse out-of-distribution unsafe behaviors — jailbreaking, sycophancy, scheming, insecure code, controlling responses, missing or inappropriate function calls, etc.
Installation
Installing from PyPI
pip install mood-bench
# With vLLM support (recommended for the instruction-tuned pipeline)
pip install mood-bench[vllm]
Building from source
Clone the repo:
git clone https://github.com/Dylan102938/mood-bench.git
cd mood-bench
Using uv (recommended):
uv venv
uv sync
Or with pip:
pip install -e ".[vllm]"
Verifying installation
Run the CLI without arguments to confirm the mood entry point is on your path:
mood --help
To run the unit tests:
uv run pytest tests/test_cli.py tests/test_analysis.py
End-to-end tests that pull real adapters and require a GPU live under tests/e2e/:
uv run pytest tests/e2e/ -m gpu
Usage
Running the built-in pipelines
The mood CLI exposes one subcommand per built-in pipeline under mood bench and a separate mood analyze for post-hoc analysis.
Guard model — a sequence-classification model with a binary safe/unsafe head, optionally with a LoRA adapter. The mood-bench-released guard adapters output a safety score (higher = more safe), so pass --predict-safe to invert it for AUROC:
mood bench guard \
--model-id google/gemma-2-2b \
--adapter-id mood-bench/gemma-2-2b-guard \
--output-dir results/gemma-2-2b-guard \
--batch-size 8 \
--max-length 2048 \
--predict-safe
Perplexity — token-level negative log-likelihood under a causal LM, with optional LoRA adapter merged on top:
mood bench perplexity \
--model-id google/gemma-2-2b \
--adapter-id mood-bench/gemma-2-2b-causal-lm \
--output-dir results/perplexity \
--batch-size 8 \
--max-length 2048
Mahalanobis distance — fits a Gaussian on safe in-distribution hidden states and scores each test sample by its distance from that distribution. Stats are cached under --stats-cache-dir so subsequent runs are fast:
mood bench mahalanobis \
--model-id google/gemma-2-2b \
--adapter-id mood-bench/gemma-2-2b-guard \
--pooling cls \
--stats-cache-dir mahalanobis-stats/ \
--output-dir results/mahalanobis \
--batch-size 4 \
--max-length 2048
Instruction-tuned judge — an instruction-tuned LLM asked to score each sample. Uses vLLM if installed, falls back to transformers:
mood bench instruction-tuned \
--model-id meta-llama/Meta-Llama-3-8B-Instruct \
--grading-type alignment \
--num-few-shot 3 \
--output-dir results/instruction-tuned
Every mood bench subcommand accepts a common set of flags (--use-mini for a quick sanity-check subset, --domains to evaluate a subset of misaligned settings, --no-figures to skip plots, -v for verbose output, etc.). Run mood bench <pipeline> --help for the full list.
Each run writes a versioned directory under --output-dir containing results.jsonl (per-sample scores), analysis.json (group-level AUROC and TPR@FPR), and per-group score_hist.png / auroc.png figures.
Analyzing pre-scored results
mood analyze consumes one or more results.jsonl files (the format produced by mood bench) and re-runs the metric / figure step, optionally aggregating across multiple results datasets.
Single run:
mood analyze results/guard/results.jsonl --output-dir reports/guard
Combining multiple monitors with an aggregator (min / mean / lambda):
mood analyze \
results/guard/results.jsonl \
results/perplexity/results.jsonl \
--aggregator lambda \
--anchor-index 0 \
--output-dir reports/guard+ppl
For an ensemble of identical-architecture guard runs, take the min:
mood analyze \
results/guard-particle-0/results.jsonl \
results/guard-particle-1/results.jsonl \
results/guard-particle-2/results.jsonl \
--aggregator min \
--output-dir reports/guard-ensemble
Running your own pipelines
The CLI is a thin wrapper around the mood_bench Python API. To plug in your own monitors, implement the Pipeline protocol — any callable that maps a list of conversation strings to a (scores, metadata) tuple — and hand it to mood_bench():
from mood_bench import mood_bench, GuardModelPipeline, load_tokenizer
from transformers import AutoModelForSequenceClassification
tokenizer = load_tokenizer("mood-bench/gemma-2-2b-guard")
model = AutoModelForSequenceClassification.from_pretrained(
"google/gemma-2-2b", dtype="bfloat16"
)
results, report = mood_bench(
pipelines=GuardModelPipeline(model, tokenizer, unsafe_label_index=1),
output_dir="results/my-guard",
eval_batch_size=8,
max_length=2048,
predict_safe=True,
)
print(report["groups"]["overall"])
mood_bench(...) returns the scored Dataset and a metrics report dict. Passing output_dir=None skips disk writes and returns everything in-memory.
For multi-pipeline runs, pass a list of pipelines plus an Aggregator:
from mood_bench import MinAggregate, mood_bench
mood_bench(
pipelines=[guard_a, guard_b, guard_c],
aggregator=MinAggregate(),
output_dir="results/guard-ensemble",
)
The [examples/](examples/) directory contains complete, self-contained scripts that you can copy and adapt:
| Script | What it shows |
|---|---|
[examples/guard.py](examples/guard.py) |
Minimal single-pipeline run with a guard model + LoRA adapter. |
[examples/guard_ensemble.py](examples/guard_ensemble.py) |
Loading five guard adapters in sequence and aggregating with MinAggregate. |
[examples/mixture_guard_perplexity.py](examples/mixture_guard_perplexity.py) |
Combining a guard classifier with a perplexity scorer via LambdaAggregate, including per-pipeline predict_safe orientation. |
[examples/analysis.py](examples/analysis.py) |
Standalone re-analysis script — equivalent to mood analyze but easier to fork for custom metrics. |
To re-run analysis from Python without re-scoring:
from datasets import load_dataset
from mood_bench import mood_bench_analysis
ds = load_dataset("json", data_files="results/guard/results.jsonl", split="train")
scored_ds, analysis_report = mood_bench_analysis(results=ds, output_path="reports/guard")
Code structure overview
The package is laid out as follows:
mood_bench/
├── core.py # mood_bench() and mood_bench_analysis() entry points
├── data.py # EvalDataset enum, load_mood_dataset(), domain constants
├── aggregator.py # MinAggregate, MeanAggregate, LambdaAggregate
├── metrics.py # tpr_at_fpr, ROC + score-histogram plotters
├── tokenize.py # load_tokenizer() and chat-template rendering
├── pipeline/
│ ├── base.py # Pipeline protocol
│ ├── guard.py # GuardModelPipeline
│ ├── perplexity.py # PerplexityPipeline
│ ├── mahalanobis.py # MahalanobisPipeline + get_stats_for_model
│ └── instruction_tuned.py # InstructionTunedPipeline (vLLM/HF backends)
└── cli/
├── __init__.py # `mood` entry point
├── _common.py # shared CLI flags
├── guard.py # `mood bench guard`
├── perplexity.py # `mood bench perplexity`
├── mahalanobis.py # `mood bench mahalanobis`
├── instruction_tuned.py # `mood bench instruction-tuned`
└── analyze.py # `mood analyze`
examples/ # Standalone Python scripts demonstrating the library API
tests/
├── test_cli.py # Fast CLI tests with stub pipelines
├── test_analysis.py # Aggregator + analysis tests
└── e2e/ # GPU-only end-to-end tests against real adapters
Further issues and questions
If you run into a problem or have a question, please contact Dylan Feng at dfeng102938@berkeley.edu.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mood_bench-0.0.1.tar.gz.
File metadata
- Download URL: mood_bench-0.0.1.tar.gz
- Upload date:
- Size: 204.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b75b9a06f5d93fdc7f0d18de76797feaa2297b69325ec2ed33e72eccc77ce0b3
|
|
| MD5 |
22d5a2051ac9bbbf66debbe5b47f5908
|
|
| BLAKE2b-256 |
a78c2ea9d31c0370b6178798a8e01aa481ab506efc369c6f9b9b49353c7a3b85
|
Provenance
The following attestation bundles were made for mood_bench-0.0.1.tar.gz:
Publisher:
publish.yml on Dylan102938/mood-bench
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
mood_bench-0.0.1.tar.gz -
Subject digest:
b75b9a06f5d93fdc7f0d18de76797feaa2297b69325ec2ed33e72eccc77ce0b3 - Sigstore transparency entry: 1600777166
- Sigstore integration time:
-
Permalink:
Dylan102938/mood-bench@cbe4dc99cc107d95df45cabd97a17b4da2558522 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Dylan102938
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@cbe4dc99cc107d95df45cabd97a17b4da2558522 -
Trigger Event:
push
-
Statement type:
File details
Details for the file mood_bench-0.0.1-py3-none-any.whl.
File metadata
- Download URL: mood_bench-0.0.1-py3-none-any.whl
- Upload date:
- Size: 38.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
76fc6a05da8e78a06dd7f5eedff102cb5f9d940231d2bf5f6bc8be264a77f841
|
|
| MD5 |
ca521907e66e9cb7601a9b7606e46a6e
|
|
| BLAKE2b-256 |
a8910e7b777c404e540e85880f571015cc814dffd4f7e23cf7c9ca245f17062b
|
Provenance
The following attestation bundles were made for mood_bench-0.0.1-py3-none-any.whl:
Publisher:
publish.yml on Dylan102938/mood-bench
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
mood_bench-0.0.1-py3-none-any.whl -
Subject digest:
76fc6a05da8e78a06dd7f5eedff102cb5f9d940231d2bf5f6bc8be264a77f841 - Sigstore transparency entry: 1600777386
- Sigstore integration time:
-
Permalink:
Dylan102938/mood-bench@cbe4dc99cc107d95df45cabd97a17b4da2558522 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Dylan102938
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@cbe4dc99cc107d95df45cabd97a17b4da2558522 -
Trigger Event:
push
-
Statement type: