Optimus: A semantic and harmfulness-based metric for evaluating LLM jailbreak prompts

Optimus: Semantic–Harmfulness-Based Jailbreak Scoring

Overview

This repository provides an implementation of Optimus, a continuous metric for evaluating jailbreak prompts in large language models. The metric jointly considers semantic similarity to a harmful target intent and the estimated harmfulness of the prompt content.

Unlike binary jailbreak success metrics such as Attack Success Rate (ASR), Optimus produces a real-valued score in the range [0, 1]. This enables finer-grained evaluation by penalizing trivial paraphrases, benign rewrites, and low-risk prompts, while highlighting prompts that are both semantically aligned with harmful intent and likely to induce unsafe behavior.
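As a rough illustration of why a continuous score separates prompts that a binary metric would lump together, here is a minimal sketch. The combination rule (a product of the two clamped components) is an assumption made for this example only; the description above does not state the actual Optimus formula, and `combine_scores` is a hypothetical helper, not part of the package.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return min(max(x, 0.0), 1.0)


def combine_scores(similarity: float, harmfulness: float) -> float:
    """Hypothetical combination of the two components.

    ASSUMPTION: a simple product is used here so that the result is high
    only when BOTH semantic similarity and harmfulness are high. The real
    Optimus formula may differ.
    """
    return clamp01(similarity) * clamp01(harmfulness)


# Illustrative (made-up) component values for three kinds of prompts.
examples = {
    "benign rewrite": (0.85, 0.05),     # similar wording, low harmfulness
    "off-topic but risky": (0.10, 0.90),
    "aligned and harmful": (0.90, 0.95),
}

for label, (sim, harm) in examples.items():
    print(f"{label}: {combine_scores(sim, harm):.3f}")
```

Under this assumed rule, a prompt scores high only when both components are high, which matches the stated goal of penalizing benign rewrites and low-risk prompts.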

The core implementation is provided through the JBScoreCalculator class.


Key Features

  • Semantic similarity computation using Sentence-BERT embeddings
  • Harmfulness estimation using an NLI-style sequence classification model
  • Continuous jailbreak scoring metric (Optimus)
  • Compatible with CPU and GPU execution via PyTorch
  • Modular design enabling replacement of encoders or classifiers
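The two components listed above can be sketched with the public Sentence-Transformers and Transformers APIs. This is not the `JBScoreCalculator` implementation: the model names (`all-MiniLM-L6-v2`, `facebook/bart-large-mnli`), the candidate labels, and the helper functions are all assumptions chosen for illustration. Each loader returns a plain callable, so an encoder or classifier can be swapped without touching the rest of the code.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def load_sbert_encoder(
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",  # assumed model
) -> Callable[[str], Sequence[float]]:
    """Return a text -> embedding callable backed by a Sentence-BERT model."""
    # Heavy imports are deferred so the module can be imported without them.
    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer(model_name, device=device)
    return lambda text: model.encode(text, convert_to_numpy=True)


def load_nli_harm_scorer(
    model_name: str = "facebook/bart-large-mnli",  # assumed NLI model
) -> Callable[[str], float]:
    """Return a text -> harmfulness-probability callable using NLI zero-shot classification."""
    import torch
    from transformers import pipeline

    device = 0 if torch.cuda.is_available() else -1  # -1 selects the CPU
    classifier = pipeline("zero-shot-classification", model=model_name, device=device)

    def score(text: str) -> float:
        # ASSUMPTION: these two labels stand in for the package's own label set.
        result = classifier(text, candidate_labels=["harmful", "harmless"])
        by_label = dict(zip(result["labels"], result["scores"]))
        return float(by_label["harmful"])

    return score
```

With both loaders in hand, the similarity between a prompt and a target intent would be `cosine_similarity(encode(prompt), encode(target))`, where `encode` is the callable returned by `load_sbert_encoder()`. The first call to either loader downloads model weights.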

Dependencies

The following are required:

  • Python 3.9 or higher
  • PyTorch
  • HuggingFace Transformers
  • Sentence-Transformers
  • NumPy

Installation

pip install torch transformers sentence-transformers numpy
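After installing, a quick check like the following can confirm that the interpreter version and the listed dependencies are in place. It is a convenience sketch, not part of the package; the mapping from pip names to import names reflects the standard names of these libraries.

```python
import importlib.util
import sys

# pip distribution name -> importable module name
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "sentence-transformers": "sentence_transformers",
    "numpy": "numpy",
}


def python_ok(min_version=(3, 9)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def missing_dependencies() -> list:
    """Return pip names of required libraries that cannot be imported."""
    return [
        pip_name
        for pip_name, module in REQUIRED.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    print("Python >= 3.9:", python_ok())
    print("Missing:", missing_dependencies() or "none")
```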

Download files

Source Distribution

optimus_jbscorer-0.0.3.tar.gz (4.1 kB)

Built Distribution

optimus_jbscorer-0.0.3-py3-none-any.whl (3.7 kB)

File details

Details for the file optimus_jbscorer-0.0.3.tar.gz.

File metadata

  • Download URL: optimus_jbscorer-0.0.3.tar.gz
  • Size: 4.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for optimus_jbscorer-0.0.3.tar.gz
Algorithm Hash digest
SHA256 f77e32b6a13e0f02e7ea3cc1ae5d9bbb83ce4e5ba8692d072e8327d8e1464f2a
MD5 219a09e2b6c9737c715ade1330e58d90
BLAKE2b-256 c900e136d488112be9ee988576647a373f770ed4e8bb88b5d950876148e1d29d

File details

Details for the file optimus_jbscorer-0.0.3-py3-none-any.whl.

File hashes

Hashes for optimus_jbscorer-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b46281e9da30488e726e9b7303f3cab9693c63d23addb51263473e1f866cc229
MD5 7ef9d3196d2a0e1c31e393435afc049c
BLAKE2b-256 70a6258b5eb9f566b6eac6f0053334982ecb639c44b6fe5a81a80998843fca8c
