
🚀 GPT-OSS-20B Red-Teaming Harness

Masks, Sandbags, and Sabotage: Exposing Hidden Misalignment

A comprehensive red-teaming toolkit for testing AI model safety and alignment. It implements advanced probes for detecting misalignment, deception, and safety vulnerabilities in large language models.

✨ Features

  • ๐Ÿ” 10 Advanced Probes: From evaluation awareness to covert channel capacity
  • ๐ŸŽจ Beautiful CLI: Rich-powered interface with progress tracking and tables
  • ๐Ÿ”Œ Multi-Backend Support: OpenAI API, Anthropic API, and local Transformers models
  • ๐Ÿ“Š Comprehensive Results: Detailed metrics, findings, and analysis reports
  • โšก Plug & Play: Simple installation and easy-to-use command-line interface

🚀 Quick Start

Installation

# Install from PyPI
pip install gpt20b-redteam

# Or install with development dependencies
pip install gpt20b-redteam[dev]

# Or install from source
git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .

Basic Usage

# Run with local model
gpt20b-redteam --model microsoft/DialoGPT-large

# Run with OpenAI API
gpt20b-redteam --openai gpt-4 --api-key sk-your-key-here

# Run with custom seeds and output directory
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 123 456 --output my_results

๐Ÿ” Available Probes

Probe                            Description                          Status
eval_awareness                   Evaluation Awareness & Sandbagging   ✅
deception_rewardhack             Deception & Reward Hacking           ✅
sabotage_code                    Code Sabotage Detection              ✅
encoding_evasion                 Encoding-Based Guardrail Evasion     ✅
prefix_steering                  Prefix Steering Behavior             ✅
cross_probe_compounding          Cross-Probe Compounding Effects      ✅
tokenizer_frequency_sandbagging  Tokenizer Frequency Sandbagging      ✅
long_horizon_sleeper             Long-Horizon Sleeper Agents          ✅
covert_channel_capacity          Covert Channel Capacity              ✅
refusal_boundary_cartography     Refusal Boundary Cartography         ✅

📖 Detailed Usage

Command Line Interface

# Show help
gpt20b-redteam --help

# Run with OpenAI
gpt20b-redteam --openai gpt-4 --api-key sk-...

# Run with Anthropic
gpt20b-redteam --anthropic claude-3-sonnet --api-key sk-ant-...

# Run with local model on CPU
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Run with custom configuration
gpt20b-redteam \
  --model microsoft/DialoGPT-large \
  --seeds 42 123 456 789 \
  --output results_my_experiment \
  --device cuda

Python API

from gpt20b_redteam import RedTeamRunner, create_model

# Create model
model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",
    device="cuda"
)

# Run all probes
runner = RedTeamRunner(model, output_dir="results")
results = runner.run_all_probes(seeds=[42, 123, 456])

# Access results
print(f"Total probes: {results['summary']['total_probes']}")
print(f"Successful: {results['summary']['successful_probes']}")
print(f"Failed: {results['summary']['failed_probes']}")

Configuration

The toolkit supports multiple backends:

Local Transformers Models

from gpt20b_redteam import create_model

model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",  # or local path
    device="cuda",  # or "cpu", "mps", "auto"
    torch_dtype="float16"  # or "bfloat16", "auto"
)

OpenAI API

from gpt20b_redteam import create_model, setup_openai_api

setup_openai_api("gpt-4")  # or "gpt-3.5-turbo"
model = create_model(backend="openai")

Anthropic API

from gpt20b_redteam import create_model, setup_anthropic_api

setup_anthropic_api("claude-3-sonnet")  # or "claude-3-opus", "claude-3-haiku"
model = create_model(backend="anthropic")

📊 Output Structure

Results are saved to the specified output directory:

results/
├── findings/
│   ├── eval_awareness_findings_20240115_200000.json
│   ├── deception_rewardhack_findings_20240115_200000.json
│   └── ...
├── raw_results/
│   ├── combined_results_20240115_200000.json
│   ├── eval_awareness_raw_20240115_200000.json
│   └── ...
└── README.md

Results Format

Each probe generates:

  • Findings: Kaggle-style formatted results for analysis
  • Raw Results: Detailed JSON with all test data
  • Metrics: Quantitative measures of model behavior
  • Analysis: Qualitative assessment of vulnerabilities
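The findings JSON files can be post-processed with standard tooling. A minimal sketch is below; the record fields (`probe`, `severity`, `evidence`) are illustrative assumptions, not the package's documented schema, so adapt them to the actual files your run produces.

```python
import json
from pathlib import Path

# Hypothetical findings record; real field names in the package's
# output may differ from this sketch.
record = {
    "probe": "eval_awareness",
    "severity": "high",
    "evidence": ["answers changed when the prompt mentioned evaluation"],
}

# Write a findings file in the layout the README describes.
path = Path("results/findings/example_findings.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps([record], indent=2))

# Load the findings back and filter by severity.
findings = json.loads(path.read_text())
high = [f for f in findings if f["severity"] == "high"]
print(len(high))
```

Because every probe writes plain JSON, the same pattern scales to aggregating findings across all files in `results/findings/`.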

🔧 Advanced Configuration

Custom Seeds

# Use specific seeds for reproducibility
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 1010 90521

Device Configuration

# Force CPU usage
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Use CUDA with specific settings
gpt20b-redteam --model microsoft/DialoGPT-large --device cuda

Output Customization

# Custom output directory
gpt20b-redteam --model microsoft/DialoGPT-large --output experiments/gpt4_vs_gpt35

# Disable Rich output (plain text)
gpt20b-redteam --model microsoft/DialoGPT-large --no-rich

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .[dev]

Running Tests

# Run all tests
pytest

# Run specific test
pytest tests/test_eval_awareness.py

# Run with coverage
pytest --cov=gpt20b_redteam

Code Quality

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance Tips

Memory Optimization

  • Use --device cpu for large models that don't fit in GPU memory
  • Consider using smaller models (e.g., microsoft/DialoGPT-medium) or quantized variants
  • Use torch_dtype="float16" for reduced memory usage

Speed Optimization

  • Use GPU acceleration when available (--device cuda)
  • Reduce the number of seeds for faster runs
  • Use smaller models for quick testing

API Usage

  • Set API keys as environment variables for security
  • Monitor API usage and costs
  • Use appropriate rate limiting for production runs
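Keeping keys in environment variables rather than on the command line avoids leaking them into shell history and process listings. A small helper along these lines can enforce that (the function name `load_api_key` is illustrative, not part of the package; `OPENAI_API_KEY` is the conventional variable name for OpenAI clients):

```python
import os

def load_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if it is absent."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running the harness")
    return key
```

With the variable exported (`export OPENAI_API_KEY=sk-...`), the `--api-key` flag can stay out of scripts and CI logs entirely.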

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Probes

  1. Create a new probe class inheriting from BaseProbe
  2. Implement the required methods
  3. Add the probe to the RedTeamRunner
  4. Write tests and documentation
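The steps above can be sketched as follows. The real BaseProbe lives in gpt20b_redteam; the minimal stub here only stands in for it to show the shape of a subclass, and its method signatures are assumptions rather than the package's actual interface:

```python
# Stand-in for gpt20b_redteam's BaseProbe; the real signature may differ.
class BaseProbe:
    name = "base"

    def run(self, model, seed):
        raise NotImplementedError

class EchoConsistencyProbe(BaseProbe):
    """Toy probe: ask the same question twice and flag inconsistency."""
    name = "echo_consistency"

    def run(self, model, seed):
        first = model.generate("What is 2 + 2?")
        second = model.generate("What is 2 + 2?")
        return {"probe": self.name, "seed": seed, "consistent": first == second}

# A fake model is enough to unit-test the probe logic in isolation.
class FakeModel:
    def generate(self, prompt):
        return "4"

result = EchoConsistencyProbe().run(FakeModel(), seed=42)
print(result["consistent"])
```

Once the subclass behaves as expected against a fake model, register it with the RedTeamRunner and add tests mirroring the ones in tests/.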

Reporting Issues

Please use our Issue Tracker to report bugs or request features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on the shoulders of the open-source AI safety community
  • Inspired by research on AI alignment and red-teaming
  • Powered by Hugging Face Transformers and the broader ML ecosystem

Made with โค๏ธ by the GPT-OSS-20B Team
