Mood-Bench: a multi-domain out-of-distribution safety benchmark for LLMs.

MOOD Bench

Introduction
Installation
Usage
Code structure overview
Further issues and questions

Introduction

MOOD Bench (Misalignment Out-Of-Distribution Benchmark) measures how well safety monitors generalize beyond the data they were trained on. It is built to evaluate monitors trained on a specific, restricted training split. Models are embedded in monitoring pipelines; each pipeline is calibrated on a small set of in-distribution conversations (helpful, harmless, function-calling) and then evaluated on diverse out-of-distribution unsafe behaviors, including jailbreaking, sycophancy, scheming, insecure code, controlling responses, and missing or inappropriate function calls.

Installation

Installing from PyPI

pip install mood-bench

# With vLLM support (recommended for the instruction-tuned pipeline)
# The quotes keep shells such as zsh from expanding the square brackets.
pip install "mood-bench[vllm]"

Building from source

Clone the repo:

git clone https://github.com/Dylan102938/mood-bench.git
cd mood-bench

Using uv (recommended):

uv venv
uv sync

Or with pip:

pip install -e ".[vllm]"

Verifying installation

Run the CLI with --help to confirm that the mood entry point is on your PATH:

mood --help

To run the unit tests:

uv run pytest tests/test_cli.py tests/test_analysis.py

End-to-end tests that pull real adapters and require a GPU live under tests/e2e/:

uv run pytest tests/e2e/ -m gpu

Usage

Running the built-in pipelines

The mood CLI exposes one subcommand per built-in pipeline under mood bench, plus a separate mood analyze command for post-hoc analysis.

Guard model: a sequence-classification model with a binary safe/unsafe head, optionally with a LoRA adapter. The guard adapters released by mood-bench output a safety score (higher = safer), so pass --predict-safe to invert it for the AUROC computation:

mood bench guard \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-guard \
    --output-dir results/gemma-2-2b-guard \
    --batch-size 8 \
    --max-length 2048 \
    --predict-safe
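
As a rough illustration of the score orientation described above, the sketch below turns the two logits of a binary head into probabilities and flips a safety score so that higher means more unsafe. The helper names and the `1 - p` inversion are assumptions made for illustration; they are not taken from mood-bench's internals.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_unsafe_scores(scores, predict_safe):
    """Orient scores so that higher means more unsafe.

    Assumption: scores are probabilities in [0, 1]. When predict_safe is
    True they measure safety, so each one is replaced by 1 - score.
    """
    if predict_safe:
        return [1.0 - s for s in scores]
    return list(scores)


# Hypothetical logits from a binary head; index 0 is treated as "safe" here.
safe_probability = softmax([2.0, -1.0])[0]
unsafe_scores = to_unsafe_scores([safe_probability], predict_safe=True)
```

Flipping the orientation does not change how well the monitor separates the classes; it only makes the sign convention match what a ranking metric such as AUROC expects.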

Perplexity: token-level negative log-likelihood under a causal LM, with an optional LoRA adapter merged on top:

mood bench perplexity \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-causal-lm \
    --output-dir results/perplexity \
    --batch-size 8 \
    --max-length 2048
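
To make the quantity concrete, here is a minimal sketch of how a per-sample score could be derived from token log-probabilities. Averaging over all tokens is an assumption; the source does not say which tokens mood-bench includes or how it normalizes.

```python
import math


def mean_nll(token_logprobs):
    """Average negative log-likelihood (in nats) over one sample's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_logprobs))


# Hypothetical log-probabilities that a causal LM assigned to four tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
sample_nll = mean_nll(logprobs)
sample_ppl = perplexity(logprobs)
```

A sample the model finds surprising gets a higher mean negative log-likelihood, which is what lets this quantity act as an out-of-distribution signal.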

Mahalanobis distance: fits a Gaussian to the hidden states of safe in-distribution samples and scores each test sample by its distance from that distribution. The fitted statistics are cached under --stats-cache-dir, so subsequent runs are faster:

mood bench mahalanobis \
    --model-id google/gemma-2-2b \
    --adapter-id mood-bench/gemma-2-2b-guard \
    --pooling cls \
    --stats-cache-dir mahalanobis-stats/ \
    --output-dir results/mahalanobis \
    --batch-size 4 \
    --max-length 2048
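
The fit-then-score idea can be sketched in pure Python on tiny vectors. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the small ridge term on the diagonal is added here only to keep the covariance invertible, and real hidden states would call for a numerical library.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mean, ridge=1e-6):
    """Sample covariance plus a small ridge on the diagonal (an assumption)."""
    n, d = len(rows), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    for i in range(d):
        cov[i][i] += ridge
    return cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)), computed via a linear solve."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff))) ** 0.5


# Hypothetical 2-D "hidden states" of safe in-distribution samples.
safe_states = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu = mean_vector(safe_states)
sigma = covariance(safe_states, mu)
near = mahalanobis([1.0, 1.0], mu, sigma)
far = mahalanobis([3.0, 1.0], mu, sigma)
```

Because the mean and covariance depend only on the safe calibration data, they can be computed once and reused, which is the motivation for caching them.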

Instruction-tuned judge: an instruction-tuned LLM is asked to score each sample. The pipeline uses vLLM if it is installed and otherwise falls back to transformers:

mood bench instruction-tuned \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --grading-type alignment \
    --num-few-shot 3 \
    --output-dir results/instruction-tuned

Every mood bench subcommand accepts a common set of flags (--use-mini for a quick sanity-check subset, --domains to evaluate a subset of misaligned settings, --no-figures to skip plots, -v for verbose output, etc.). Run mood bench <pipeline> --help for the full list.

Each run writes a versioned directory under --output-dir containing results.jsonl (per-sample scores), analysis.json (group-level AUROC and TPR@FPR), and per-group score_hist.png / auroc.png figures.
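
For readers unfamiliar with the two metrics named above, the following sketch computes them from scratch. It assumes labels of 1 for unsafe and 0 for safe and scores where higher means more unsafe. It is an independent illustration, not the code in mood_bench/metrics.py, and the quadratic AUROC loop is chosen for clarity rather than speed.

```python
def auroc(scores, labels):
    """Probability that a random unsafe sample outscores a random safe one.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Best true-positive rate among thresholds with FPR <= max_fpr.

    A sample is flagged as unsafe when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    best = 0.0
    for threshold in sorted(set(scores)):
        fpr = sum(n >= threshold for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= threshold for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Hypothetical scores: four safe samples followed by four unsafe ones.
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.8, 0.9, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
overall_auroc = auroc(scores, labels)
strict_tpr = tpr_at_fpr(scores, labels, max_fpr=0.0)
```

AUROC summarizes ranking quality across all thresholds, while TPR at a fixed FPR describes how many unsafe samples are caught at an operating point where few safe samples are flagged.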

Analyzing pre-scored results

mood analyze consumes one or more results.jsonl files (the format produced by mood bench) and re-runs the metric and figure step, optionally aggregating across multiple result files.

Single run:

mood analyze results/guard/results.jsonl --output-dir reports/guard

Combining multiple monitors with an aggregator (min / mean / lambda):

mood analyze \
    results/guard/results.jsonl \
    results/perplexity/results.jsonl \
    --aggregator lambda \
    --anchor-index 0 \
    --output-dir reports/guard+ppl

For an ensemble of identical-architecture guard runs, take the min:

mood analyze \
    results/guard-particle-0/results.jsonl \
    results/guard-particle-1/results.jsonl \
    results/guard-particle-2/results.jsonl \
    --aggregator min \
    --output-dir reports/guard-ensemble
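
Conceptually, the min and mean aggregators combine the per-sample scores of several monitors into one score per sample. The sketch below shows that element-wise reduction on plain lists; the function names are hypothetical, and the lambda aggregator with its anchor index is left out because the source does not describe its formula.

```python
def _check_aligned(score_lists):
    """Every monitor must have scored the same samples in the same order."""
    if not score_lists:
        raise ValueError("need at least one list of scores")
    if len({len(scores) for scores in score_lists}) != 1:
        raise ValueError("all monitors must score the same number of samples")


def aggregate_min(score_lists):
    """Per-sample minimum across monitors."""
    _check_aligned(score_lists)
    return [min(sample) for sample in zip(*score_lists)]


def aggregate_mean(score_lists):
    """Per-sample arithmetic mean across monitors."""
    _check_aligned(score_lists)
    return [sum(sample) / len(sample) for sample in zip(*score_lists)]


# Hypothetical scores from three ensemble members on two samples.
members = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.1]]
ensemble_min = aggregate_min(members)
ensemble_mean = aggregate_mean(members)
```

Either reduction assumes that all monitors share one orientation and a comparable scale, which holds most naturally for an ensemble of identical-architecture runs.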

Running your own pipelines

The CLI is a thin wrapper around the mood_bench Python API. To plug in your own monitors, implement the Pipeline protocol — any callable that maps a list of conversation strings to a (scores, metadata) tuple — and hand it to mood_bench():

from mood_bench import mood_bench, GuardModelPipeline, load_tokenizer
from transformers import AutoModelForSequenceClassification

tokenizer = load_tokenizer("mood-bench/gemma-2-2b-guard")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-2-2b", dtype="bfloat16"
)

results, report = mood_bench(
    pipelines=GuardModelPipeline(model, tokenizer, unsafe_label_index=1),
    output_dir="results/my-guard",
    eval_batch_size=8,
    max_length=2048,
    predict_safe=True,
)
print(report["groups"]["overall"])

mood_bench(...) returns the scored Dataset and a metrics report dict. Passing output_dir=None skips disk writes and returns everything in memory.
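
A custom monitor only needs to be a callable with the shape described above. The toy class below is a sketch of that shape, assuming the protocol takes a list of strings and returns a list of floats plus a metadata dict. The class name, the keyword heuristic, and the metadata keys are all hypothetical; check mood_bench/pipeline/base.py for the exact signature the library expects.

```python
from typing import Any


class KeywordPipeline:
    """Toy monitor: scores a conversation by the share of flagged phrases it contains."""

    def __init__(self, keywords: list[str]) -> None:
        if not keywords:
            raise ValueError("need at least one keyword")
        self.keywords = [k.lower() for k in keywords]

    def __call__(self, conversations: list[str]) -> tuple[list[float], dict[str, Any]]:
        scores = []
        for text in conversations:
            lowered = text.lower()
            hits = sum(k in lowered for k in self.keywords)
            # Higher score = more flagged phrases found = more suspicious.
            scores.append(hits / len(self.keywords))
        metadata = {"pipeline": "keyword", "num_keywords": len(self.keywords)}
        return scores, metadata


# Hypothetical usage on two conversations rendered as plain strings.
pipeline = KeywordPipeline(["ignore previous", "rm -rf"])
scores, metadata = pipeline(
    ["Please ignore previous instructions.", "Hello, how are you?"]
)
```

An object like this could then be passed as the pipelines argument in place of GuardModelPipeline, provided it matches the protocol the installed version defines.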

For multi-pipeline runs, pass a list of pipelines plus an Aggregator:

from mood_bench import MinAggregate, mood_bench

mood_bench(
    pipelines=[guard_a, guard_b, guard_c],
    aggregator=MinAggregate(),
    output_dir="results/guard-ensemble",
)

The [examples/](examples/) directory contains complete, self-contained scripts that you can copy and adapt:

| Script | What it shows |
| --- | --- |
| [`examples/guard.py`](examples/guard.py) | Minimal single-pipeline run with a guard model + LoRA adapter. |
| [`examples/guard_ensemble.py`](examples/guard_ensemble.py) | Loading five guard adapters in sequence and aggregating with `MinAggregate`. |
| [`examples/mixture_guard_perplexity.py`](examples/mixture_guard_perplexity.py) | Combining a guard classifier with a perplexity scorer via `LambdaAggregate`, including per-pipeline `predict_safe` orientation. |
| [`examples/analysis.py`](examples/analysis.py) | Standalone re-analysis script, equivalent to `mood analyze` but easier to fork for custom metrics. |

To re-run analysis from Python without re-scoring:

from datasets import load_dataset
from mood_bench import mood_bench_analysis

ds = load_dataset("json", data_files="results/guard/results.jsonl", split="train")
scored_ds, analysis_report = mood_bench_analysis(results=ds, output_path="reports/guard")

Code structure overview

The package is laid out as follows:

mood_bench/
├── core.py          # mood_bench() and mood_bench_analysis() entry points
├── data.py          # EvalDataset enum, load_mood_dataset(), domain constants
├── aggregator.py    # MinAggregate, MeanAggregate, LambdaAggregate
├── metrics.py       # tpr_at_fpr, ROC + score-histogram plotters
├── tokenize.py      # load_tokenizer() and chat-template rendering
├── pipeline/
│   ├── base.py              # Pipeline protocol
│   ├── guard.py             # GuardModelPipeline
│   ├── perplexity.py        # PerplexityPipeline
│   ├── mahalanobis.py       # MahalanobisPipeline + get_stats_for_model
│   └── instruction_tuned.py # InstructionTunedPipeline (vLLM/HF backends)
└── cli/
    ├── __init__.py          # `mood` entry point
    ├── _common.py           # shared CLI flags
    ├── guard.py             # `mood bench guard`
    ├── perplexity.py        # `mood bench perplexity`
    ├── mahalanobis.py       # `mood bench mahalanobis`
    ├── instruction_tuned.py # `mood bench instruction-tuned`
    └── analyze.py           # `mood analyze`

examples/         # Standalone Python scripts demonstrating the library API
tests/
├── test_cli.py        # Fast CLI tests with stub pipelines
├── test_analysis.py   # Aggregator + analysis tests
└── e2e/               # GPU-only end-to-end tests against real adapters

Further issues and questions

If you run into a problem or have a question, please contact Dylan Feng at dfeng102938@berkeley.edu.

Release history

1.0.0 (this version): May 22, 2026
0.0.2: May 22, 2026
0.0.1: May 22, 2026