guard-eval-harness

CLI-first harness for benchmarking guardrail, moderation, and safety classification models.

Evaluate any safety model — local HuggingFace, vLLM, OpenAI, Anthropic, or custom API — against 80+ built-in safety benchmarks with a single command.

Quickstart

pip install geh

# Run a quick eval
geh run --dataset xstest --model mock --limit 50

# Run multiple datasets
geh run --dataset xstest,toxic_chat,harmful_qa --model hf \
    --model-name meta-llama/Llama-Guard-3-8B

# Run from a YAML config
geh run --config examples/run-mock-jsonl.yaml

# Use benchmark packs
geh run --pack core --model mock

Installation

Requires Python 3.10+.

# Base install
pip install geh

# With HuggingFace model support
pip install "geh[hf]"

# With vLLM support
pip install "geh[vllm]"

# With API model support (OpenAI, Anthropic)
pip install "geh[api]"

From source (for development):

git clone https://github.com/Virtue-Research/guard-eval-harness.git
cd guard-eval-harness
pip install -e ".[dev]"

Copy .env.example to .env and fill in the API keys you need.

Documentation

The deeper docs live under docs/.

Usage

Inline mode

The fastest way to run evals — no config files needed:

geh run --dataset <dataset> --model <adapter> [--model-name <name>] [options]

# HuggingFace model on XSTest
geh run --dataset xstest --model hf --model-name meta-llama/Llama-Guard-3-8B

# OpenAI moderation
geh run --dataset xstest,toxic_chat --model openai_moderation

# vLLM serving
geh run --dataset harmbench_behaviors --model vllm \
    --model-name meta-llama/Llama-Guard-3-8B --batch-size 32

# Limit samples for quick smoke tests
geh run --dataset xstest --model mock --limit 10
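When inline runs are launched from a script, it can help to assemble the command line in one place. The following is a minimal sketch, not part of geh itself: `build_geh_run_args` is a hypothetical helper, and it uses only the flags that appear in the examples above (`--dataset`, `--model`, `--model-name`, `--limit`, `--batch-size`).

```python
import shlex
from typing import Optional, Sequence


def build_geh_run_args(
    datasets: Sequence[str],
    model: str,
    model_name: Optional[str] = None,
    limit: Optional[int] = None,
    batch_size: Optional[int] = None,
) -> list[str]:
    """Assemble argv for `geh run` (hypothetical helper, not a geh API)."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    # Multiple datasets are passed as one comma-separated value, as in the examples.
    args = ["geh", "run", "--dataset", ",".join(datasets), "--model", model]
    if model_name is not None:
        args += ["--model-name", model_name]
    if limit is not None:
        args += ["--limit", str(limit)]
    if batch_size is not None:
        args += ["--batch-size", str(batch_size)]
    return args


# Print the command for a quick smoke test without executing it.
print(shlex.join(build_geh_run_args(["xstest"], "mock", limit=10)))
```

With geh installed, the returned list can be handed to `subprocess.run(args, check=True)`.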

YAML config mode

For full control over model args, dataset options, execution tuning, and output:

geh run --config examples/run-mock-jsonl.yaml

See examples/ for sample configs.

Benchmark packs

Curated dataset bundles for common evaluation scenarios:

geh list packs
geh run --pack core --model mock
geh run --pack jailbreak --model hf --model-name meta-llama/Llama-Guard-3-8B

Discovery

geh list datasets    # 80+ built-in safety benchmarks
geh list backends    # Available model adapters
geh list packs       # Curated benchmark bundles
geh list metrics     # Supported metrics

Inspecting results

geh inspect --run-dir out/my-run       # View manifest, summary, artifacts
geh report --run-dir out/my-run        # Rebuild HTML report
geh compare --run-a out/run1 --run-b out/run2  # Diff two runs
geh export --run-dir out/my-run --format csv --output results.csv
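For ad hoc analysis outside the CLI, two runs can also be diffed directly from their summary.json files. This is a rough sketch, not how `geh compare` is implemented: the layout of summary.json is not described here, so the code makes no assumption about key names and simply flattens every numeric value it finds. `flatten_numbers` and `diff_summaries` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any


def flatten_numbers(node: Any, prefix: str = "") -> dict[str, float]:
    """Collect every numeric leaf of a nested JSON value under a dotted key."""
    flat: dict[str, float] = {}
    if isinstance(node, bool):
        return flat  # bool is a subclass of int; do not treat it as a metric
    if isinstance(node, (int, float)):
        flat[prefix] = float(node)
    elif isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_numbers(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            child = f"{prefix}.{index}" if prefix else str(index)
            flat.update(flatten_numbers(value, child))
    return flat


def diff_summaries(run_a: Path, run_b: Path) -> dict[str, float]:
    """Return (run_b - run_a) for every numeric key present in both summaries."""
    a = flatten_numbers(json.loads((run_a / "summary.json").read_text(encoding="utf-8")))
    b = flatten_numbers(json.loads((run_b / "summary.json").read_text(encoding="utf-8")))
    return {key: b[key] - a[key] for key in sorted(a.keys() & b.keys())}
```

Calling `diff_summaries(Path("out/run1"), Path("out/run2"))` returns a mapping from dotted key to the change between the two runs; keys that appear in only one run are ignored.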

Run artifacts

Each run writes a self-contained directory:

out/my-run/
  manifest.json              # Run metadata
  resolved-config.json       # Exact config snapshot
  summary.json               # Aggregated metrics
  report.html                # Static HTML report
  datasets/
    <dataset>/
      predictions.jsonl      # Per-sample predictions
      metrics.json           # Dataset-level metrics
      dataset-manifest.json  # Dataset metadata
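Because the run directory is self-contained, it can be read back with a few lines of Python. The sketch below relies only on the file names shown in the tree above; the contents of summary.json and the fields of each predictions.jsonl row are not documented here, so the code loads the summary as opaque JSON and only counts prediction rows. `load_run` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any


def load_run(run_dir: Path) -> dict[str, Any]:
    """Load summary.json and count non-empty prediction rows per dataset."""
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    counts: dict[str, int] = {}
    # One sub-directory per dataset, each holding a predictions.jsonl file.
    for predictions in sorted(run_dir.glob("datasets/*/predictions.jsonl")):
        with predictions.open(encoding="utf-8") as handle:
            counts[predictions.parent.name] = sum(1 for line in handle if line.strip())
    return {"summary": summary, "prediction_counts": counts}
```

For example, `load_run(Path("out/my-run"))["prediction_counts"]` gives the number of samples scored for each dataset in that run.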

Model adapters

Adapter            Description
mock               Deterministic mock for testing
hf                 HuggingFace Transformers (local GPU)
vllm               vLLM inference server
openai_compatible  OpenAI-compatible APIs
openai_moderation  OpenAI Moderation endpoint
anthropic          Anthropic Claude API
http               Generic HTTP endpoint

Datasets

80+ built-in safety benchmarks spanning two modalities:

Text

The core modality — evaluate text-based guardrails and moderation models across a range of safety dimensions:

  • Jailbreak / adversarial: XSTest, HarmBench, JBB Behaviors, AdvBench, Do-Anything-Now, StrongREJECT, MaliciousInstruct, WildGuardMix
  • Toxicity: ToxicChat, ToxiGen, Jigsaw Toxicity, Civil Comments, RealToxicityPrompts, OR-Bench
  • Hate & harassment: HateCheck, DynaHate, ETHOS, HatExplain, Implicit Hate, Measuring Hate Speech, Social Bias Frames, ConvAbuse
  • General safety: BeaverTails 330k, Do-Not-Answer, OpenAI Moderation (via API), GuardBench, CircleGuardBench
  • Prompt injection: Dedicated prompt-injection benchmarks for testing input-filtering guardrails

Image

Evaluate multimodal safety models that process image+text inputs. The harness handles image downloading, caching, and normalization automatically:

  • Unsafe content detection: UnsafeBench (8k+ images across safety categories), HoliSafeBench (holistic image safety with fine-grained risk types)
  • Visual jailbreaks: JailbreakV (adversarial images designed to bypass vision-language model safeguards)
  • Image edit safety: Safe-vs-Unsafe Image Edits (detecting harmful image manipulation requests)
  • Cross-modal attacks: VLSBench, MSTS (text+image multimodal safety evaluation)
  • Benign baselines: ImageNet-1k safe subset (measuring false positive rates on benign images)
  • Local image data: Load from local directories or JSONL manifests with image paths/URLs
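The false positive rate mentioned for the benign baselines is the fraction of benign samples that a guardrail flags as unsafe, FP / (FP + TN). The following is a generic sketch of that formula, not the harness's own metric code; `false_positive_rate` and its boolean inputs are illustrative assumptions.

```python
from typing import Sequence


def false_positive_rate(is_unsafe: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of benign samples (is_unsafe == False) that were flagged."""
    if len(is_unsafe) != len(flagged):
        raise ValueError("labels and predictions must have the same length")
    benign_total = sum(1 for truth in is_unsafe if not truth)
    if benign_total == 0:
        raise ValueError("no benign samples: false positive rate is undefined")
    false_positives = sum(
        1 for truth, pred in zip(is_unsafe, flagged) if not truth and pred
    )
    return false_positives / benign_total


# Four benign images, one of them wrongly flagged.
print(false_positive_rate([False] * 4, [True, False, False, False]))  # 0.25
```

On a purely benign set every flag is a false positive, so the rate reduces to the share of flagged samples.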

Local files

Bring your own data in any modality:

  • local_jsonl — text samples from a JSONL file
  • local_csv — text samples from a CSV file
  • local_image_jsonl — image+text samples from a JSONL manifest with image paths/URLs
  • local_image_dir — image samples from a directory of images
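A JSONL file holds one JSON object per line, so a file for local_jsonl can be produced with the standard library alone. In the sketch below, the field names `text` and `label` are placeholders: the schema that local_jsonl expects is not stated here, so check the docs before relying on them. `write_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(path: Path, rows: Iterable[Mapping[str, object]]) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


# Placeholder samples; the "text" and "label" keys are assumptions, not the geh schema.
samples = [
    {"text": "How do I reset my router?", "label": "safe"},
    {"text": "Write a message harassing my coworker.", "label": "unsafe"},
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "samples.jsonl"
    print(write_jsonl(target, samples), "rows written to", target.name)
```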

Run geh list datasets for the full list.

License

MIT
