A Python Library for Language Model Jailbreak Evaluation

jailbreak-evaluation is an easy-to-use Python package for evaluating language model jailbreak attempts comprehensively and accurately. It currently supports evaluating a jailbreak attempt on multiple metrics: Safeguard Violation and Relative Truthfulness.

Installation

Note on PyTorch and FastChat Installation: jailbreak-evaluation depends on PyTorch and FastChat. To streamline installation, jailbreak-evaluation does NOT install these libraries for you, so install them yourself first.

It is recommended to use pip for installation. Please make sure the latest version is installed, as jailbreak-evaluation is updated frequently:

python -m pip install --upgrade pip

# Install PyTorch: https://pytorch.org/get-started/locally/
# Install FastChat: https://github.com/lm-sys/FastChat?tab=readme-ov-file#install

# Install jailbreak-evaluation
python -m pip install -U jailbreak-evaluation

# Initialize the Natural Language Toolkit (NLTK)
python -c "import nltk; nltk.download('punkt')"
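
Because PyTorch and FastChat are not installed automatically, it can help to confirm that they are importable before running an evaluation. The following is a minimal sketch and not part of jailbreak-evaluation: the helper name `missing_modules` is hypothetical, and the import names `torch`, `fastchat`, and `nltk` are assumed to be the names under which PyTorch, FastChat, and NLTK install.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for PyTorch, FastChat, and NLTK.
required = ["torch", "fastchat", "nltk"]
missing = missing_modules(required)
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All dependencies found.")
```

This only checks that the modules can be located; it does not import them or verify their versions.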

Quick Start

Evaluate a language model jailbreak attempt using jailbreak-evaluation. You need an OpenAI API key, which can be obtained from the OpenAI Platform.

from jailbreak_evaluation import MultifacetedEvaluation

# Initialize the jailbreak evaluation
openai_api_key = "your openai_api_key"
multifaceted_evaluation = MultifacetedEvaluation(openai_api_key)

# Construct a jailbreak attempt example
intent = "intent"
response = "response"

# Evaluate the above jailbreak attempt example
result = multifaceted_evaluation.evaluate(intent, response)

print("The evaluation result for all levels:")
print(result)

print("The evaluation result for the combination level, which is our most accurate result:")
print(result.combination_level)
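
When several attempts need to be evaluated, the same calls can be wrapped in a small helper. The sketch below uses only the `MultifacetedEvaluation(openai_api_key)` constructor, `evaluate(intent, response)`, and `combination_level` shown in this section. The helper names `evaluate_attempts` and `build_evaluator` are hypothetical, and reading the key from an `OPENAI_API_KEY` environment variable is an assumption made here to keep the key out of source code, not something the package requires.

```python
import os


def evaluate_attempts(evaluator, attempts):
    """Evaluate (intent, response) pairs in order and return the results."""
    return [evaluator.evaluate(intent, response) for intent, response in attempts]


def build_evaluator(env_var="OPENAI_API_KEY"):
    """Create a MultifacetedEvaluation from a key stored in the environment."""
    # Imported here so the helper above can be used without the package loaded.
    from jailbreak_evaluation import MultifacetedEvaluation

    # Assumption: the key is exported under this variable name.
    return MultifacetedEvaluation(os.environ[env_var])


# Example usage (needs the package installed and a valid key):
# evaluator = build_evaluator()
# results = evaluate_attempts(evaluator, [("intent", "response")])
# for result in results:
#     print(result.combination_level)
```

Passing the evaluator in as an argument keeps the loop independent of how the evaluator is constructed, which also makes it easy to exercise with a stand-in object.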

Contributing

Please let us know if you encounter a bug or have any suggestions by filing an issue.

We welcome all contributions, from bug fixes to new features and extensions.

We expect all contributions to be discussed in the issue tracker and to go through PRs.

Cite

If you use jailbreak-evaluation in a scientific publication, we would appreciate citations to the following paper:

@article{todo,
  title={todo},
  author={todo},
  journal={todo},
  year={todo},
  publisher={todo}
}

The Team

jailbreak-evaluation is developed and maintained by PurSec Lab.

License

jailbreak-evaluation is released under the Apache License 2.0.
