Package to test Prompt Injection Against OpenAI's ChatGPT
Project description
ChatGPT-4 for Jailbreak Dataset Analysis
This repository contains Python code to analyze the Hugging Face Jailbreak dataset using OpenAI's ChatGPT-4 model. The code sends prompts from the dataset to ChatGPT-4 to evaluate responses and detect potential prompt injection attacks.
Anxious to check the results?
As of January 10th 2024, GPT-4 detect all jailbreaks on this dataset. Moreover, the word jailbreak is used very loosely, some jailbreak prompts are of no consequence.
Requirements
To run this code, you need the following:
- Python 3
- OpenAI Python library
- Hugging Face datasets library
- A valid OpenAI API key
Install the required libraries using pip3 and the provided requirements.txt file:
pip3 install -r requirements.txt
Setup
- Clone this repository:
git clone https://github.com/BenderScript/PromptInjectionBench.git
cd PromptInjectionBench
- Create a
.env
file in the project directory and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
Running the Analysis
To analyze the Jailbreak dataset with ChatGPT-4, execute the following Python script:
python3 main.py
This script loads the dataset, iterates through prompts, sends them to ChatGPT-4, and detects potential injection attacks in the generated responses.
Code Structure
main.py
: The main script that loads the dataset, defines functions to interact with ChatGPT-4, and performs the analysis.
Skipping "Benign" Prompts
In the interest of time, the code skips prompts labeled as "benign." This helps focus the analysis on potentially harmful prompts where injection attacks might occur.
Tabulating Attacks vs. Detected Attacks
The script tabulates the total number of generated texts and the number of texts where potential injection attacks are detected. It counts these attacks based on the presence of specific keywords such as "I'm sorry," "apologize," or "developed by OpenAI" in the generated text.
Results
The script will print information about each prompt, including its type, the generated text from ChatGPT-4, and whether an injection attack was detected based on the presence of specific keywords.
License
This code is provided under the Apache License 2.0. Feel free to use and modify it as needed.
This analysis is provided as a reference and demonstration of using OpenAI's ChatGPT-4 model for evaluating prompt injection attacks in text datasets.
For more information about OpenAI's GPT-4 model and the Hugging Face Jailbreak dataset, please refer to the official documentation and sources:
These explanations added to the README.md should help users understand why "benign" prompts are skipped and how the code tabulates attacks vs. detected attacks.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file prompt_injection_bench-0.1.2.tar.gz
.
File metadata
- Download URL: prompt_injection_bench-0.1.2.tar.gz
- Upload date:
- Size: 6.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bc6f2ba668bbd413152502474467016ed3d91263d7da21da0bd16c623cfa9fa6 |
|
MD5 | 9f29012b71df4ec9ea4bf65be98b56f7 |
|
BLAKE2b-256 | e6276914c905911b0f1b67d7919a305185200fe419eac1bbc22223365bf3da3d |
File details
Details for the file prompt_injection_bench-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: prompt_injection_bench-0.1.2-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d5753edc7430a79b74747287ea93378fa74d9c2f8e18ed05f9eeb5586f8d1663 |
|
MD5 | dbf46aa04d957b54dafebbc8881774d2 |
|
BLAKE2b-256 | 6f7d55aac56b6c22d70e2d9314ef52d026ec24e882aefc53b1109422f2eb547c |