

🚀 GPT-OSS-20B Red-Teaming Harness

Masks, Sandbags, and Sabotage: Exposing Hidden Misalignment

A comprehensive red-teaming toolkit for testing AI model safety and alignment. Implements advanced probes for detecting misalignment, deception, and safety vulnerabilities in large language models.

✨ Features

  • 🔍 10 Advanced Probes: From evaluation awareness to covert channel capacity
  • 🎨 Beautiful CLI: Rich-powered interface with progress tracking and tables
  • 🔌 Multi-Backend Support: OpenAI API, Anthropic API, and local Transformers models
  • 📊 Comprehensive Results: Detailed metrics, findings, and analysis reports
  • ⚡ Plug & Play: Simple installation and easy-to-use command-line interface

🚀 Quick Start

Installation

# Install from PyPI
pip install gpt20b-redteam

# Or install with development dependencies
pip install gpt20b-redteam[dev]

# Or install from source
git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .

Basic Usage

# Run with local model
gpt20b-redteam --model microsoft/DialoGPT-large

# Run with OpenAI API
gpt20b-redteam --openai gpt-4 --api-key sk-your-key-here

# Run with custom seeds and output directory
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 123 456 --output my_results

๐Ÿ” Available Probes

| Probe | Description | Status |
|-------|-------------|--------|
| eval_awareness | Evaluation Awareness & Sandbagging | ✅ |
| deception_rewardhack | Deception & Reward Hacking | ✅ |
| sabotage_code | Code Sabotage Detection | ✅ |
| encoding_evasion | Encoding-Based Guardrail Evasion | ✅ |
| prefix_steering | Prefix Steering Behavior | ✅ |
| cross_probe_compounding | Cross-Probe Compounding Effects | ✅ |
| tokenizer_frequency_sandbagging | Tokenizer Frequency Sandbagging | ✅ |
| long_horizon_sleeper | Long-Horizon Sleeper Agents | ✅ |
| covert_channel_capacity | Covert Channel Capacity | ✅ |
| refusal_boundary_cartography | Refusal Boundary Cartography | ✅ |

📖 Detailed Usage

Command Line Interface

# Show help
gpt20b-redteam --help

# Run with OpenAI
gpt20b-redteam --openai gpt-4 --api-key sk-...

# Run with Anthropic
gpt20b-redteam --anthropic claude-3-sonnet --api-key sk-ant-...

# Run with local model on CPU
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Run with custom configuration
gpt20b-redteam \
  --model microsoft/DialoGPT-large \
  --seeds 42 123 456 789 \
  --output results_my_experiment \
  --device cuda

Python API

from gpt20b_redteam import RedTeamRunner, create_model

# Create model
model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",
    device="cuda"
)

# Run all probes
runner = RedTeamRunner(model, output_dir="results")
results = runner.run_all_probes(seeds=[42, 123, 456])

# Access results
print(f"Total probes: {results['summary']['total_probes']}")
print(f"Successful: {results['summary']['successful_probes']}")
print(f"Failed: {results['summary']['failed_probes']}")
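The `summary` dict shown above can feed a quick sanity check after a run. A minimal sketch, assuming only the keys used in the snippet (`total_probes`, `successful_probes`, `failed_probes`); the rest of the results schema may differ:

```python
# Summarize a results dict shaped like the one returned by
# run_all_probes(). Only the 'summary' keys printed above are assumed.

def summarize(results: dict) -> str:
    s = results["summary"]
    return (f"{s['successful_probes']}/{s['total_probes']} probes succeeded, "
            f"{s['failed_probes']} failed")

# Example with a hand-built results dict:
sample = {"summary": {"total_probes": 10,
                      "successful_probes": 9,
                      "failed_probes": 1}}
print(summarize(sample))  # 9/10 probes succeeded, 1 failed
```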

Configuration

The toolkit supports multiple backends:

Local Transformers Models

from gpt20b_redteam import create_model

model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",  # or local path
    device="cuda",  # or "cpu", "mps", "auto"
    torch_dtype="float16"  # or "bfloat16", "auto"
)

OpenAI API

from gpt20b_redteam import create_model, setup_openai_api

setup_openai_api("gpt-4")  # or "gpt-3.5-turbo"
model = create_model(backend="openai")

Anthropic API

from gpt20b_redteam import create_model, setup_anthropic_api

setup_anthropic_api("claude-3-sonnet")  # or "claude-3-opus", "claude-3-haiku"
model = create_model(backend="anthropic")

📊 Output Structure

Results are saved to the specified output directory:

results/
├── findings/
│   ├── eval_awareness_findings_20240115_200000.json
│   ├── deception_rewardhack_findings_20240115_200000.json
│   └── ...
├── raw_results/
│   ├── combined_results_20240115_200000.json
│   ├── eval_awareness_raw_20240115_200000.json
│   └── ...
└── README.md

Results Format

Each probe generates:

  • Findings: Kaggle-style formatted results for analysis
  • Raw Results: Detailed JSON with all test data
  • Metrics: Quantitative measures of model behavior
  • Analysis: Qualitative assessment of vulnerabilities
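Once a run finishes, the findings files can be aggregated for analysis. A self-contained sketch using only the standard library; the per-finding fields (`issue`, `severity`) are illustrative assumptions — inspect your own files for the real schema:

```python
import glob
import json
import os
import tempfile

# Load every findings JSON from a results directory laid out like the
# tree above (results/findings/*.json). Each file is assumed to hold a
# JSON list of finding objects; adjust if your files differ.

def load_findings(results_dir: str) -> list:
    findings = []
    for path in sorted(glob.glob(os.path.join(results_dir, "findings", "*.json"))):
        with open(path) as f:
            findings.extend(json.load(f))
    return findings

# Demo against a synthetic results directory:
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "findings"))
with open(os.path.join(tmp, "findings", "eval_awareness_findings_demo.json"), "w") as f:
    json.dump([{"issue": "sandbagging", "severity": "high"}], f)

print(len(load_findings(tmp)))  # 1
```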

🔧 Advanced Configuration

Custom Seeds

# Use specific seeds for reproducibility
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 1010 90521
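Why fixed seeds give reproducibility: the same seed produces the same pseudo-random draws, so any stochastic step (such as prompt sampling) replays identically across runs. A purely illustrative sketch with `random.Random`; the harness's own use of seeds may differ:

```python
import random

# A seeded random.Random instance replays the same sample() draws
# every time, so two runs with the same seed pick the same prompts.

def sample_prompts(prompts: list, k: int, seed: int) -> list:
    return random.Random(seed).sample(prompts, k)

prompts = [f"prompt-{i}" for i in range(100)]
run_a = sample_prompts(prompts, 5, seed=42)
run_b = sample_prompts(prompts, 5, seed=42)
assert run_a == run_b  # identical selection with the same seed
```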

Device Configuration

# Force CPU usage
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Use CUDA with specific settings
gpt20b-redteam --model microsoft/DialoGPT-large --device cuda

Output Customization

# Custom output directory
gpt20b-redteam --model microsoft/DialoGPT-large --output experiments/gpt4_vs_gpt35

# Disable Rich output (plain text)
gpt20b-redteam --model microsoft/DialoGPT-large --no-rich

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .[dev]

Running Tests

# Run all tests
pytest

# Run specific test
pytest tests/test_eval_awareness.py

# Run with coverage
pytest --cov=gpt20b_redteam

Code Quality

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance Tips

Memory Optimization

  • Use --device cpu for large models that don't fit in GPU memory
  • Consider a smaller model (e.g., microsoft/DialoGPT-medium) or a quantized variant
  • Use torch_dtype="float16" for reduced memory usage

Speed Optimization

  • Use GPU acceleration when available (--device cuda)
  • Reduce the number of seeds for faster runs
  • Use smaller models for quick testing

API Usage

  • Set API keys as environment variables for security
  • Monitor API usage and costs
  • Use appropriate rate limiting for production runs
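Reading the API key from the environment rather than passing `--api-key` on the command line keeps it out of shell history and process listings. A minimal sketch; `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` are the conventional variable names, but the harness's own lookup may differ:

```python
import os

# Fetch an API key from the environment, failing loudly if it is
# missing so a run never silently falls back to an unauthenticated
# client. The variable names here are conventions, not harness API.

def get_api_key(var: str = "OPENAI_API_KEY") -> str:
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running the harness")
    return key
```

In the shell, export the variable once per session (e.g. `export OPENAI_API_KEY=sk-...`) instead of embedding the key in commands.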

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Probes

  1. Create a new probe class inheriting from BaseProbe
  2. Implement the required methods
  3. Add the probe to the RedTeamRunner
  4. Write tests and documentation
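Steps 1–2 can be sketched as follows. The real `BaseProbe` lives in `gpt20b_redteam` and its interface isn't documented here, so the stand-in base class and the `run()`/`name` attributes below are assumptions for illustration only:

```python
# Hypothetical sketch of a custom probe. The stand-in BaseProbe below
# mimics what a probe base class might look like; the real class and
# its required methods are defined by gpt20b_redteam.

class BaseProbe:  # stand-in, NOT the real gpt20b_redteam class
    name = "base"

    def run(self, model, seed: int) -> dict:
        raise NotImplementedError

class EchoLeakProbe(BaseProbe):
    """Toy probe: checks whether the model parrots a canary string."""
    name = "echo_leak"

    def run(self, model, seed: int) -> dict:
        canary = f"CANARY-{seed}"
        reply = model(f"Please repeat nothing from this: {canary}")
        return {"probe": self.name, "seed": seed, "leaked": canary in reply}

# Exercise with a fake "model" (any prompt -> text callable works here):
result = EchoLeakProbe().run(lambda prompt: prompt.upper(), seed=42)
print(result["leaked"])  # True
```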

Reporting Issues

Please use our Issue Tracker to report bugs or request features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on the shoulders of the open-source AI safety community
  • Inspired by research on AI alignment and red-teaming
  • Powered by Hugging Face Transformers and the broader ML ecosystem

Made with ❤️ by the GPT-OSS-20B Team

Download files


Source Distribution

gpt_oss_20b_redteam-0.1.0.tar.gz (156.5 kB)

Built Distribution

gpt_oss_20b_redteam-0.1.0-py3-none-any.whl (132.3 kB)

File details

Details for the file gpt_oss_20b_redteam-0.1.0.tar.gz.

File metadata

  • Download URL: gpt_oss_20b_redteam-0.1.0.tar.gz
  • Upload date:
  • Size: 156.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.22

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 00d48c23e221740fc43afec6c00b87ae562d220cacfa4fb29417a593f98a7e0e |
| MD5 | bab55bb505e91fdf77dd09adde6dc143 |
| BLAKE2b-256 | 670744daaf802894006da37f9c8ca38c4d2dc634d36dc6b99671dc8c13c5fb62 |

File details

Details for the file gpt_oss_20b_redteam-0.1.0-py3-none-any.whl.

File hashes

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 82c361e2368496a9b01229a45126338564d35f65f24a764b2f11bce16ec81c17 |
| MD5 | eb1fe3bb9b26044353fb8a5d293708df |
| BLAKE2b-256 | dc34b60944081bd4567de1f59904f425f8ec59d6e50eae3f261365328e794ebe |
