Skip to main content

Package to test Prompt Injection Against OpenAI's ChatGPT

Project description

ChatGPT-4 and Gemini Pro for Jailbreak Dataset Analysis (CyberPunk mode)

This repository contains Python code to analyze the Hugging Face Jailbreak dataset using OpenAI's ChatGPT-4 model. The code sends prompts from the dataset to ChatGPT-4 to evaluate responses and detect potential prompt injection attacks.

Anxious to check the results?

As of January 10th 2024, GPT-4 detect all jailbreaks on this dataset. Moreover, the word jailbreak is used very loosely, some jailbreak prompts are of no consequence.

Requirements

To run this code, you need the following:

  • Python 3
  • OpenAI Python library
  • Hugging Face datasets library
  • A valid OpenAI API key
  • A valid Google API key

Install the required libraries using pip3 and the provided requirements.txt file:

pip3 install -r requirements.txt

Setup

  1. Clone this repository:
git clone https://github.com/BenderScript/PromptInjectionBench.git
cd PromptInjectionBench
  1. Create a .env file in the project directory and add your OpenAI API and Google Keys:
OPENAI_API_KEY=your_api_key_here
GOOGLE_API_KEY=your_google_api_key_here

## Running the Analysis

To analyze the Jailbreak dataset with ChatGPT-4 and Gemini Pro, execute the following Python script:

```bash
uvicorn prompt_injection_bench.server:prompt_bench_app --reload --port 9002

If Everything goes well, you should see the following page at http://127.0.0.1:9001

Landing page

This script loads the dataset, iterates through prompts, sends them to ChatGPT-4, and detects potential injection attacks in the generated responses.

Testing

See the demo below where the App checks a prompt with a malicious URL and injection.

Demo

Code Structure

  • main.py: The main script that loads the dataset, defines functions to interact with ChatGPT-4, and performs the analysis.

Skipping "Benign" Prompts

In the interest of time, the code skips prompts labeled as "benign." This helps focus the analysis on potentially harmful prompts where injection attacks might occur.

Tabulating Attacks vs. Detected Attacks

The script tabulates the total number of generated texts and the number of texts where potential injection attacks are detected. It counts these attacks based on the presence of specific keywords such as "I'm sorry," "apologize," or "developed by OpenAI" in the generated text.

Results

The script will print information about each prompt, including its type, the generated text from ChatGPT-4, and whether an injection attack was detected based on the presence of specific keywords.

License

This code is provided under the Apache License 2.0. Feel free to use and modify it as needed.


This analysis is provided as a reference and demonstration of using OpenAI's ChatGPT-4 model for evaluating prompt injection attacks in text datasets.

For more information about OpenAI's GPT-4 model and the Hugging Face Jailbreak dataset, please refer to the official documentation and sources:


These explanations added to the README.md should help users understand why "benign" prompts are skipped and how the code tabulates attacks vs. detected attacks.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prompt_injection_bench-0.1.6.tar.gz (13.1 MB view details)

Uploaded Source

Built Distribution

prompt_injection_bench-0.1.6-py3-none-any.whl (13.1 MB view details)

Uploaded Python 3

File details

Details for the file prompt_injection_bench-0.1.6.tar.gz.

File metadata

  • Download URL: prompt_injection_bench-0.1.6.tar.gz
  • Upload date:
  • Size: 13.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.7 Darwin/23.2.0

File hashes

Hashes for prompt_injection_bench-0.1.6.tar.gz
Algorithm Hash digest
SHA256 e6ef1248130b04fba53bdcb5aaa5f06729358852d60c2a5856a74dafa89b4dd0
MD5 9aa548979dd4079735415a8897db4873
BLAKE2b-256 0410b85e04f0413cbfe7f8bc5c61dccea6141788bc1b931072151f1b1b9c276d

See more details on using hashes here.

File details

Details for the file prompt_injection_bench-0.1.6-py3-none-any.whl.

File metadata

File hashes

Hashes for prompt_injection_bench-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 34196bb163a740eb2e1639235925d90134027de47e146e7bb264880851ab2be3
MD5 69ae59b9141b3777553b6aa13662b8ae
BLAKE2b-256 fa94f2bb3267213faac9451782c560f4a98bb067af9175486adf0ce606def6f8

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page