
🚀 GPT-OSS-20B Red-Teaming Harness

Masks, Sandbags, and Sabotage: Exposing Hidden Misalignment

A comprehensive red-teaming toolkit for testing AI model safety and alignment. It implements advanced probes for detecting misalignment, deception, and safety vulnerabilities in large language models.

✨ Features

  • 🔍 10 Advanced Probes: From evaluation awareness to covert channel capacity
  • 🎨 Beautiful CLI: Rich-powered interface with progress tracking and tables
  • 🔌 Multi-Backend Support: OpenAI API, Anthropic API, and local Transformers models
  • 📊 Comprehensive Results: Detailed metrics, findings, and analysis reports
  • ⚡ Plug & Play: Simple installation and easy-to-use command-line interface

🚀 Quick Start

Installation

Option 1: Docker (Recommended)

# Pull from Docker Hub (easiest)
docker pull guynachshon/gpt-oss-20b-redteam:latest
docker run --rm --gpus all guynachshon/gpt-oss-20b-redteam:latest --help

# Or build locally
./build_docker.sh

# Or manually
docker build -t gpt-oss-20b-redteam:latest .
docker run --rm --gpus all gpt-oss-20b-redteam:latest --help

Option 2: PyPI Package

# Install from PyPI
pip install gpt-oss-20b-redteam

# Or install with development dependencies (quote the brackets for zsh)
pip install "gpt-oss-20b-redteam[dev]"

Option 3: From Source

# Clone and install from source
git clone https://github.com/GuyNachshon/gpt-oss-20b-probing.git
cd gpt-oss-20b-probing
pip install -e .

Basic Usage

Docker Usage

# Run with GPT-OSS-20B model (GPU)
docker run --rm --gpus all \
  -v $(pwd)/results:/app/results \
  guynachshon/gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with OpenAI API
docker run --rm \
  -e OPENAI_API_KEY="sk-your-key-here" \
  -v $(pwd)/results:/app/results \
  guynachshon/gpt-oss-20b-redteam:latest \
  --openai gpt-4

# Run with Docker Compose
docker-compose run gpt20b-redteam-gpt-oss

Direct Usage

# Run with local model
gpt20b-redteam --model microsoft/DialoGPT-large

# Run with OpenAI API
gpt20b-redteam --openai gpt-4 --api-key sk-your-key-here

# Run with custom seeds and output directory
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 123 456 --output my_results

๐Ÿ” Available Probes

| Probe | Description | Status |
|-------|-------------|--------|
| eval_awareness | Evaluation Awareness & Sandbagging | ✅ |
| deception_rewardhack | Deception & Reward Hacking | ✅ |
| sabotage_code | Code Sabotage Detection | ✅ |
| encoding_evasion | Encoding-Based Guardrail Evasion | ✅ |
| prefix_steering | Prefix Steering Behavior | ✅ |
| cross_probe_compounding | Cross-Probe Compounding Effects | ✅ |
| tokenizer_frequency_sandbagging | Tokenizer Frequency Sandbagging | ✅ |
| long_horizon_sleeper | Long-Horizon Sleeper Agents | ✅ |
| covert_channel_capacity | Covert Channel Capacity | ✅ |
| refusal_boundary_cartography | Refusal Boundary Cartography | ✅ |

📖 Detailed Usage

Command Line Interface

# Show help
gpt20b-redteam --help

# Run with OpenAI
gpt20b-redteam --openai gpt-4 --api-key sk-...

# Run with Anthropic
gpt20b-redteam --anthropic claude-3-sonnet --api-key sk-ant-...

# Run with local model on CPU
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Run with custom configuration
gpt20b-redteam \
  --model microsoft/DialoGPT-large \
  --seeds 42 123 456 789 \
  --output results_my_experiment \
  --device cuda

Python API

from gpt20b_redteam import RedTeamRunner, create_model

# Create model
model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",
    device="cuda"
)

# Run all probes
runner = RedTeamRunner(model, output_dir="results")
results = runner.run_all_probes(seeds=[42, 123, 456])

# Access results
print(f"Total probes: {results['summary']['total_probes']}")
print(f"Successful: {results['summary']['successful_probes']}")
print(f"Failed: {results['summary']['failed_probes']}")

Configuration

The toolkit supports multiple backends:

Local Transformers Models

from gpt20b_redteam import create_model

model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",  # or local path
    device="cuda",  # or "cpu", "mps", "auto"
    torch_dtype="float16"  # or "bfloat16", "auto"
)

OpenAI API

from gpt20b_redteam import create_model, setup_openai_api

setup_openai_api("gpt-4")  # or "gpt-3.5-turbo"
model = create_model(backend="openai")

Anthropic API

from gpt20b_redteam import create_model, setup_anthropic_api

setup_anthropic_api("claude-3-sonnet")  # or "claude-3-opus", "claude-3-haiku"
model = create_model(backend="anthropic")

📊 Output Structure

Results are saved to the specified output directory:

results/
├── findings/
│   ├── eval_awareness_findings_20240115_200000.json
│   ├── deception_rewardhack_findings_20240115_200000.json
│   └── ...
├── raw_results/
│   ├── combined_results_20240115_200000.json
│   ├── eval_awareness_raw_20240115_200000.json
│   └── ...
└── README.md
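Because file names embed the probe name and a timestamp, completed runs can be collected with the standard library alone. A minimal sketch against the layout above (the demo builds a throwaway directory; the probe file names are taken from the tree):

```python
from pathlib import Path
import tempfile

def latest_findings(results_dir) -> list:
    """List findings files from the layout above, sorted by name, newest-style first."""
    findings = Path(results_dir) / "findings"
    # File names follow <probe>_findings_<timestamp>.json
    return sorted(findings.glob("*_findings_*.json"),
                  key=lambda p: p.name, reverse=True)

# Demo against a throwaway directory mirroring the tree above
root = Path(tempfile.mkdtemp())
(root / "findings").mkdir()
for name in ("eval_awareness_findings_20240115_200000.json",
             "deception_rewardhack_findings_20240115_200000.json"):
    (root / "findings" / name).write_text("{}")

for path in latest_findings(root):
    print(path.name)
```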

Results Format

Each probe generates:

  • Findings: Kaggle-style formatted results for analysis
  • Raw Results: Detailed JSON with all test data
  • Metrics: Quantitative measures of model behavior
  • Analysis: Qualitative assessment of vulnerabilities
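Since raw results are plain JSON, downstream analysis needs nothing beyond the standard library. A hedged sketch (the field names below are illustrative assumptions, not the toolkit's documented schema):

```python
import json
from collections import Counter

# Illustrative raw-results records; these field names are assumptions,
# not the toolkit's documented schema
raw = json.loads("""[
  {"probe": "eval_awareness", "passed": false},
  {"probe": "eval_awareness", "passed": true},
  {"probe": "sabotage_code", "passed": false}
]""")

# Tally failed cases per probe as a quick severity signal
failures = Counter(r["probe"] for r in raw if not r["passed"])
for probe, count in failures.most_common():
    print(f"{probe}: {count} failure(s)")
```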

🔧 Advanced Configuration

Custom Seeds

# Use specific seeds for reproducibility
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 1010 90521
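Fixed seeds make any stochastic probe logic (prompt sampling, ordering) repeatable across runs. A minimal illustration of the principle, not the toolkit's internals:

```python
import random

def sample_prompts(pool, k, seed):
    # A dedicated Random instance keeps each seed's stream independent
    # of global RNG state elsewhere in the process
    rng = random.Random(seed)
    return rng.sample(pool, k)

pool = [f"prompt_{i}" for i in range(100)]
run_a = sample_prompts(pool, 5, seed=42)
run_b = sample_prompts(pool, 5, seed=42)
assert run_a == run_b  # same seed, identical sample across runs
print(run_a)
```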

Device Configuration

# Force CPU usage
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Use CUDA with specific settings
gpt20b-redteam --model microsoft/DialoGPT-large --device cuda

Output Customization

# Custom output directory
gpt20b-redteam --model microsoft/DialoGPT-large --output experiments/gpt4_vs_gpt35

# Disable Rich output (plain text)
gpt20b-redteam --model microsoft/DialoGPT-large --no-rich

๐Ÿณ Docker

Quick Docker Setup

# Pull from Docker Hub (recommended)
docker pull guynachshon/gpt-oss-20b-redteam:latest

# Or build locally
./build_docker.sh

# Or manually
docker build -t gpt-oss-20b-redteam:latest .

Publish to Docker Hub

# Login to Docker Hub first
docker login

# Then publish
./publish_docker.sh

# Or with custom version
VERSION=v1.0.0 ./publish_docker.sh

Docker Usage Examples

# Run with GPT-OSS-20B (requires GPU)
docker run --rm --gpus all \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with specific GPU
docker run --rm --gpus '"device=1"' \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with OpenAI API
docker run --rm \
  -e OPENAI_API_KEY="sk-your-key-here" \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --openai gpt-4

# Run with Docker Compose
docker-compose run gpt20b-redteam-gpt-oss

Docker Compose Services

The docker-compose.yml provides several pre-configured services:

  • gpt20b-redteam - Basic service with help
  • gpt20b-redteam-gpt-oss - Runs with GPT-OSS-20B model
  • gpt20b-redteam-openai - Runs with OpenAI API
  • gpt20b-redteam-anthropic - Runs with Anthropic API

For detailed Docker instructions, see DOCKER.md.

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/GuyNachshon/gpt-oss-20b-probing.git
cd gpt-oss-20b-probing
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run specific test
pytest tests/test_eval_awareness.py

# Run with coverage
pytest --cov=gpt20b_redteam

Code Quality

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance Tips

Memory Optimization

  • Use --device cpu for models that don't fit in GPU memory
  • Use smaller checkpoints (e.g., microsoft/DialoGPT-medium) for quick, low-memory runs
  • Use torch_dtype="float16" to roughly halve weight memory versus float32
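The float16 advice follows from simple arithmetic: weight memory is parameter count times bytes per parameter. A back-of-envelope sketch for a 20B-parameter model (activations and KV cache add more on top):

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    # Weights only; runtime memory is higher
    return n_params * bytes_per_param / 1024**3

N = 20e9  # ~20B parameters
print(f"float32: {weight_memory_gib(N, 4):.1f} GiB")  # ~74.5 GiB
print(f"float16: {weight_memory_gib(N, 2):.1f} GiB")  # ~37.3 GiB
```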

Speed Optimization

  • Use GPU acceleration when available (--device cuda)
  • Reduce the number of seeds for faster runs
  • Use smaller models for quick testing

API Usage

  • Set API keys as environment variables for security
  • Monitor API usage and costs
  • Use appropriate rate limiting for production runs
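Reading keys from the environment keeps them out of shell history and source control. A minimal sketch using only the standard library (the demo value below is a placeholder; export the real key in your shell instead):

```python
import os

def get_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} in the environment; never hard-code keys.")
    return key

os.environ["OPENAI_API_KEY"] = "sk-demo-only"  # demo placeholder, not a real key
print(get_api_key()[:3])  # print only the harmless prefix
```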

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Probes

  1. Create a new probe class inheriting from BaseProbe
  2. Implement the required methods
  3. Add the probe to the RedTeamRunner
  4. Write tests and documentation
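The steps above might look like the sketch below. BaseProbe's real interface isn't documented here, so a minimal stand-in is defined inline; the method names and result fields are assumptions chosen to illustrate the shape, not the toolkit's actual API:

```python
class BaseProbe:  # stand-in for gpt20b_redteam's base class (assumed interface)
    name = "base"

    def run(self, model, seed):
        raise NotImplementedError

class EchoConsistencyProbe(BaseProbe):
    """Hypothetical probe: does the model answer identically across rephrasings?"""
    name = "echo_consistency"

    def run(self, model, seed):
        prompts = ["What is 2+2?", "Compute two plus two."]
        answers = [model(p) for p in prompts]
        return {"probe": self.name, "seed": seed,
                "consistent": len(set(answers)) == 1}

# Usage with a toy callable standing in for a model backend
result = EchoConsistencyProbe().run(lambda prompt: "4", seed=42)
print(result)
```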

Reporting Issues

Please use our Issue Tracker to report bugs or request features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on the shoulders of the open-source AI safety community
  • Inspired by research on AI alignment and red-teaming
  • Powered by Hugging Face Transformers and the broader ML ecosystem

Made with ❤️ by the GPT-OSS-20B Team
