Adversarial Training and SFT for Bot Safety Models

Blade2Blade

Evil Sharpens Good.

Developed By: The LAION-AI Safety Team

|Quick Start | Installation | More about blade2blade |

Quick Start

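from blade2blade import Blade2Blade

# Load the safety model by its model ID.
blade = Blade2Blade("shahules786/blade2blade-t5-base")

# Wrap the user message in the |prompter| ... |endoftext| markers.
prompt = "|prompter| I'm here to test your blade|endoftext|"

# Run the model on the prompt and show its prediction.
print(blade.predict(prompt))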
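
The Quick Start prompt wraps the user message in `|prompter|` and `|endoftext|` markers. A minimal sketch of a helper that builds such a string is shown here. The function name `format_prompt` and the two constants are hypothetical and not part of the blade2blade API, and the exact spacing is inferred only from the single example above.

```python
# Hypothetical helper, not part of blade2blade. The marker strings and the
# spacing are copied from the one prompt shown in the Quick Start.
PROMPTER_TOKEN = "|prompter|"
END_TOKEN = "|endoftext|"


def format_prompt(user_text: str) -> str:
    """Wrap a single user message in the prompter/end-of-text markers."""
    return f"{PROMPTER_TOKEN} {user_text.strip()}{END_TOKEN}"


prompt = format_prompt("I'm here to test your blade")
print(prompt)
```

The resulting string can be passed to `blade.predict` in place of the hand-written prompt.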

Installation

  • With PyPI
pip install blade2blade
  • From source
git clone https://github.com/LAION-AI/blade2blade.git
cd blade2blade
pip install -e .
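
After either installation route, one way to confirm that the package is visible to the current interpreter is to query its installed version. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package: str = "blade2blade"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when blade2blade is installed, otherwise None.
print(installed_version())
```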

What is Blade2Blade?

Blade2Blade is a system that performs fully automated redteaming and blueteaming on any chat or classification model. Using RL, we tune an adversarial user prompter to attack other models by generating user prompts that elicit dangerous responses from the attacked model. The attacked model is then optimized against the adversarial user prompter, which performs the automated blueteaming.
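
The attack-then-harden cycle described above can be illustrated with a toy sketch. This is not the blade2blade implementation: the attacker is a simple epsilon-greedy bandit over a fixed list of placeholder prompts (a stand-in for an RL-tuned prompter), the target is a dictionary of robustness scores (a stand-in for a real model), and every name and parameter is hypothetical.

```python
import random

# Hypothetical placeholder prompts; a real red-team model would generate text.
CANDIDATE_PROMPTS = ["attack-a", "attack-b", "attack-c"]


def run_toy_loop(rounds=60, epsilon=0.2, patch_step=0.5, seed=0):
    """Alternate a bandit attacker (red) with a patching defender (blue)."""
    rng = random.Random(seed)
    # Defender state: 0.0 means fully vulnerable, 1.0 means fully patched.
    robustness = {p: 0.0 for p in CANDIDATE_PROMPTS}
    # Attacker state: running estimate of each prompt's success rate.
    value = {p: 0.0 for p in CANDIDATE_PROMPTS}
    pulls = {p: 0 for p in CANDIDATE_PROMPTS}
    successes = []

    for _ in range(rounds):
        # Red team: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            prompt = rng.choice(CANDIDATE_PROMPTS)
        else:
            prompt = max(CANDIDATE_PROMPTS, key=lambda p: value[p])

        # Toy target: responds unsafely until the prompt is fully patched.
        unsafe = robustness[prompt] < 1.0
        reward = 1.0 if unsafe else 0.0
        successes.append(unsafe)

        # Red-team update: incremental mean of the rewards seen so far.
        pulls[prompt] += 1
        value[prompt] += (reward - value[prompt]) / pulls[prompt]

        # Blue team: harden the target against the attack that just worked.
        if unsafe:
            robustness[prompt] = min(1.0, robustness[prompt] + patch_step)

    return robustness, successes


robustness, successes = run_toy_loop()
print("successful attacks:", sum(successes), "of", len(successes))
print("final robustness:", robustness)
```

In this toy setting each prompt can succeed only until it is fully patched, so the number of successful attacks is bounded no matter how long the loop runs. A real system would replace the bandit with a learned prompter and the robustness table with gradient updates to the attacked model.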

The following diagram shows an example of a Blade2Blade-type system that attacks GoodT5 (the blueteam model). GoodT5 is a model designed to predict Rules of Thumb and safety labels. EvilT5 is a model designed to predict user prompts from the Rules of Thumb and safety labels given to it.

[image]
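
GoodT5 and EvilT5 are described as solving inverse tasks, so one labelled example can in principle supply a training pair for each. The sketch below shows that idea only. The `SafetyExample` fields, the label text, and the `" | "` separator are assumptions made for illustration and do not reflect the actual blade2blade data format.

```python
from dataclasses import dataclass


@dataclass
class SafetyExample:
    prompt: str          # user message
    label: str           # safety label (hypothetical label name)
    rule_of_thumb: str   # Rule of Thumb attached to the prompt


def blue_pair(ex: SafetyExample):
    """(input, target) for a GoodT5-style model: prompt -> label and RoT."""
    return ex.prompt, f"{ex.label} | {ex.rule_of_thumb}"


def red_pair(ex: SafetyExample):
    """(input, target) for an EvilT5-style model: label and RoT -> prompt."""
    source, target = blue_pair(ex)
    return target, source


example = SafetyExample(
    prompt="placeholder user message",
    label="needs caution",
    rule_of_thumb="placeholder rule of thumb",
)
print(blue_pair(example))
print(red_pair(example))
```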

What is redteaming and blueteaming?

Both red teams and blue teams strive to increase the security of a system, but they go about it in different ways. A red team simulates an attacker by looking for weaknesses and trying to get past a system's defences. A blue team responds when an incident occurs and defends against the attack.

What is the final goal of Blade2Blade?

We want to make an easy-to-use package that adds automated redteaming and blueteaming to any existing training loop. By lowering the requirements for redteaming and blueteaming, we hope that companies and individuals will include them in their systems and create safer LLMs.
