Skip to main content

Package to test Prompt Injection Against OpenAI's ChatGPT, Google's Gemini and Azure Open AI

Project description

Prompt Injection Benchmarking

The mother of all prompt injection benchmarking repositories, this is the one you have been waiting for (or soon will be)

Analysing ChatGPT-4 and Gemini Pro Jailbreak Detection (CyberPunk mode)

This repository contains Python code to analyze the Hugging Face Jailbreak dataset against OpenAI's ChatGPT-4 an Gemini Pro models. The code sends prompts from the dataset, processes and tabulates results.

What is a Prompt Jailbreak attack?

The term 'Jailbreak' is used to describe a situation where a language model is manipulated to generate text that is harmful, inappropriate, or otherwise violates community guidelines. This can include generating hate speech, misinformation, or other harmful content.

Let me tell you a story: Once I tried to Jailbreak my iPhone to side load apps and it worked until a reboot, then I had a paper weigh in the format of an iPhone

In the world of LLM the term jailbreak is used very loosely. Some attacks are just ridiculous and silly, others are serious.

Anxious to check the results?

The table below shows the total injection attack prompts according to the data set and the number of detected attacks by ChatGPT-4 and Gemini Pro.

Latest Run

Prompts GPT-4 Gemini Azure OpenAI Azure OpenAI w/ Jailbreak Risk Detection
139
Detected 131 49 134
Not Attack TBD TBD
Missed Attack TBD TBD

Previous runs

Prompts GPT-4 Gemini Azure OpenAI Azure OpenAI w/ Jailbreak Risk Detection
139
Detected 133 56
Not Attack TBD TBD
Missed Attack TBD TBD

Important Details

  • Prompts can be blocked by built-in or explicit safety filters. Therefore the code becomes nuanced and to gather proper results you need to catch certain exceptions (see code)

Next steps (soon)

  • Test with Azure Open AI w/wo Jailbreak Risk Detection
  • Tabulate missed attacks (attack not caught by the model) vs. not attack (the model explicitly did not consider an attack)
    • Code has support, need to properly test it
  • Add more models to the analysis.

Requirements

To run this code, you need the following:

  • Python 3
  • OpenAI Python library
  • Hugging Face datasets library
  • A valid OpenAI API key
  • A valid Google API key
  • A valid Azure Open AI endpoint and all the config that comes with it

If you do not have all these keys the code will skip that LLM, no need to worry.

Setup

  1. Clone this repository:
git clone https://github.com/BenderScript/PromptInjectionBench.git
cd PromptInjectionBench
  1. Create a .env

Create a .env file in the project root directory that contains your OpenAI API, Azure and Google Keys. If you do not have all these keys the code will skip that LLM, no need to worry.

OPENAI_API_KEY=your key>
#
GOOGLE_API_KEY=<your key>
# Azure
AZURE_OPENAI_API_KEY=your key>
AZURE_MODEL_NAME=gpt-4
AZURE_OPENAI_ENDPOINT=your endpoint>
AZURE_OPENAI_API_VERSION=<your api verion, normally 2023-12-01-preview>
AZURE_OPENAI_DEPLOYMENT=<your deployment name. This is the name you gave when deploying the model>
  1. Install the required libraries using pip3 and the provided requirements.txt file:
pip3 install -r requirements.txt

Running the Analysis

To analyze the Jailbreak dataset with ChatGPT-4 and Gemini Pro, execute the following Python script:

uvicorn prompt_injection_bench.server:prompt_bench_app --reload --port 9002

If Everything goes well, you should see the following page at http://127.0.0.1:9001

Landing page

This script loads the dataset, iterates through prompts, sends them to ChatGPT-4, and detects potential injection attacks in the generated responses.

Testing

See the demo below where the App checks a prompt with a malicious URL and injection.

Demo

Code Structure

  • server.py: The main script that loads the dataset, drives the test, and performs the analysis.
  • gemini* : The gemini pro code
  • openai* : The openai code

Skipping "Benign" Prompts

In the interest of time, the code skips prompts labeled as "benign." This helps focus the analysis on potentially harmful prompts where injection attacks might occur.

Tabulating Attacks vs. Detected Attacks

The script tabulates the total number of generated texts and the number of texts where potential injection attacks are detected. It counts these attacks based on the presence of specific keywords such as "I'm sorry," "apologize," or "developed by OpenAI" in the generated text.

Results

The script will print information about each prompt, including its type, the generated text from ChatGPT-4, and whether an injection attack was detected based on the presence of specific keywords.

License

This code is provided under the Apache License 2.0. If you liked this code, cite it and let me know.


For more information about OpenAI's GPT-4 model and the Hugging Face Jailbreak dataset, please refer to the official documentation and sources:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prompt_injection_bench-0.1.8.tar.gz (13.1 MB view details)

Uploaded Source

Built Distribution

prompt_injection_bench-0.1.8-py3-none-any.whl (13.1 MB view details)

Uploaded Python 3

File details

Details for the file prompt_injection_bench-0.1.8.tar.gz.

File metadata

  • Download URL: prompt_injection_bench-0.1.8.tar.gz
  • Upload date:
  • Size: 13.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.3.0

File hashes

Hashes for prompt_injection_bench-0.1.8.tar.gz
Algorithm Hash digest
SHA256 7d9623b181213d0ac217d05900e134b75391341c11132a030eb7d6d2e4d3e49e
MD5 7b25c6e9a1d78ed84bde9e02662b03e5
BLAKE2b-256 8b5220e4f61d0a2da103761f36c7903abaffc92fd6e00c2e2cd1e815442427fa

See more details on using hashes here.

File details

Details for the file prompt_injection_bench-0.1.8-py3-none-any.whl.

File metadata

File hashes

Hashes for prompt_injection_bench-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 cfd6702df8186de6c1fdd85cc49755c5a6054439ddb930f56973186b8931d6f7
MD5 bf5bf26f2dd515ea5f6ac901ad5a5168
BLAKE2b-256 f9d3d5055d9b726447434e079883c6cfa95b5b21f615c5a87905db169f11fecb

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page