
🚀 GPT-OSS-20B Red-Teaming Harness

Masks, Sandbags, and Sabotage: Exposing Hidden Misalignment

A comprehensive red-teaming toolkit for testing AI model safety and alignment. It implements advanced probes for detecting misalignment, deception, and safety vulnerabilities in large language models.

✨ Features

  • ๐Ÿ” 10 Advanced Probes: From evaluation awareness to covert channel capacity
  • ๐ŸŽจ Beautiful CLI: Rich-powered interface with progress tracking and tables
  • ๐Ÿ”Œ Multi-Backend Support: OpenAI API, Anthropic API, and local Transformers models
  • ๐Ÿ“Š Comprehensive Results: Detailed metrics, findings, and analysis reports
  • โšก Plug & Play: Simple installation and easy-to-use command-line interface

🚀 Quick Start

Installation

# Install from PyPI
pip install gpt20b-redteam

# Or install with development dependencies
pip install gpt20b-redteam[dev]

# Or install from source
git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .

Basic Usage

# Run with local model
gpt20b-redteam --model microsoft/DialoGPT-large

# Run with OpenAI API
gpt20b-redteam --openai gpt-4 --api-key sk-your-key-here

# Run with custom seeds and output directory
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 123 456 --output my_results

๐Ÿ” Available Probes

Probe                            Description                          Status
eval_awareness                   Evaluation Awareness & Sandbagging   ✅
deception_rewardhack             Deception & Reward Hacking           ✅
sabotage_code                    Code Sabotage Detection              ✅
encoding_evasion                 Encoding-Based Guardrail Evasion     ✅
prefix_steering                  Prefix Steering Behavior             ✅
cross_probe_compounding          Cross-Probe Compounding Effects      ✅
tokenizer_frequency_sandbagging  Tokenizer Frequency Sandbagging      ✅
long_horizon_sleeper             Long-Horizon Sleeper Agents          ✅
covert_channel_capacity          Covert Channel Capacity              ✅
refusal_boundary_cartography     Refusal Boundary Cartography         ✅

📖 Detailed Usage

Command Line Interface

# Show help
gpt20b-redteam --help

# Run with OpenAI
gpt20b-redteam --openai gpt-4 --api-key sk-...

# Run with Anthropic
gpt20b-redteam --anthropic claude-3-sonnet --api-key sk-ant-...

# Run with local model on CPU
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Run with custom configuration
gpt20b-redteam \
  --model microsoft/DialoGPT-large \
  --seeds 42 123 456 789 \
  --output results_my_experiment \
  --device cuda

Python API

from gpt20b_redteam import RedTeamRunner, create_model

# Create model
model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",
    device="cuda"
)

# Run all probes
runner = RedTeamRunner(model, output_dir="results")
results = runner.run_all_probes(seeds=[42, 123, 456])

# Access results
print(f"Total probes: {results['summary']['total_probes']}")
print(f"Successful: {results['summary']['successful_probes']}")
print(f"Failed: {results['summary']['failed_probes']}")

Configuration

The toolkit supports multiple backends:

Local Transformers Models

from gpt20b_redteam import create_model

model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",  # or local path
    device="cuda",  # or "cpu", "mps", "auto"
    torch_dtype="float16"  # or "bfloat16", "auto"
)

OpenAI API

from gpt20b_redteam import create_model, setup_openai_api

setup_openai_api("gpt-4")  # or "gpt-3.5-turbo"
model = create_model(backend="openai")

Anthropic API

from gpt20b_redteam import create_model, setup_anthropic_api

setup_anthropic_api("claude-3-sonnet")  # or "claude-3-opus", "claude-3-haiku"
model = create_model(backend="anthropic")

📊 Output Structure

Results are saved to the specified output directory:

results/
├── findings/
│   ├── eval_awareness_findings_20240115_200000.json
│   ├── deception_rewardhack_findings_20240115_200000.json
│   └── ...
├── raw_results/
│   ├── combined_results_20240115_200000.json
│   ├── eval_awareness_raw_20240115_200000.json
│   └── ...
└── README.md

Results Format

Each probe generates:

  • Findings: Kaggle-style formatted results for analysis
  • Raw Results: Detailed JSON with all test data
  • Metrics: Quantitative measures of model behavior
  • Analysis: Qualitative assessment of vulnerabilities
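The findings JSON files can be post-processed with standard tooling. A minimal sketch is below; the record fields (`probe`, `severity`, `evidence`) are illustrative assumptions, not the package's documented schema, so adapt them to the actual files your run produces.

```python
import json
from pathlib import Path

# Hypothetical findings record; real field names in the package's
# output may differ from this sketch.
record = {
    "probe": "eval_awareness",
    "severity": "high",
    "evidence": ["answers changed when the prompt mentioned evaluation"],
}

# Write a findings file in the layout the README describes.
path = Path("results/findings/example_findings.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps([record], indent=2))

# Load the findings back and filter by severity.
findings = json.loads(path.read_text())
high = [f for f in findings if f["severity"] == "high"]
print(len(high))
```

Because every probe writes plain JSON, the same pattern scales to aggregating findings across all files in `results/findings/`.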

🔧 Advanced Configuration

Custom Seeds

# Use specific seeds for reproducibility
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 1010 90521

Device Configuration

# Force CPU usage
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Use CUDA with specific settings
gpt20b-redteam --model microsoft/DialoGPT-large --device cuda

Output Customization

# Custom output directory
gpt20b-redteam --model microsoft/DialoGPT-large --output experiments/gpt4_vs_gpt35

# Disable Rich output (plain text)
gpt20b-redteam --model microsoft/DialoGPT-large --no-rich

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/gpt-oss-20b/red-teaming.git
cd red-teaming
pip install -e .[dev]

Running Tests

# Run all tests
pytest

# Run specific test
pytest tests/test_eval_awareness.py

# Run with coverage
pytest --cov=gpt20b_redteam

Code Quality

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

📈 Performance Tips

Memory Optimization

  • Use --device cpu for large models that don't fit in GPU memory
  • Consider using smaller models (e.g., microsoft/DialoGPT-medium) or quantized variants
  • Use torch_dtype="float16" for reduced memory usage

Speed Optimization

  • Use GPU acceleration when available (--device cuda)
  • Reduce the number of seeds for faster runs
  • Use smaller models for quick testing

API Usage

  • Set API keys as environment variables for security
  • Monitor API usage and costs
  • Use appropriate rate limiting for production runs
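Keeping keys in environment variables rather than on the command line avoids leaking them into shell history and process listings. A small helper along these lines can enforce that (the function name `load_api_key` is illustrative, not part of the package; `OPENAI_API_KEY` is the conventional variable name for OpenAI clients):

```python
import os

def load_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Fetch an API key from the environment, failing loudly if it is absent."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running the harness")
    return key
```

With the variable exported (`export OPENAI_API_KEY=sk-...`), the `--api-key` flag can stay out of scripts and CI logs entirely.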

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

Adding New Probes

  1. Create a new probe class inheriting from BaseProbe
  2. Implement the required methods
  3. Add the probe to the RedTeamRunner
  4. Write tests and documentation
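The steps above can be sketched as follows. The real BaseProbe lives in gpt20b_redteam; the minimal stub here only stands in for it to show the shape of a subclass, and its method signatures are assumptions rather than the package's actual interface:

```python
# Stand-in for gpt20b_redteam's BaseProbe; the real signature may differ.
class BaseProbe:
    name = "base"

    def run(self, model, seed):
        raise NotImplementedError

class EchoConsistencyProbe(BaseProbe):
    """Toy probe: ask the same question twice and flag inconsistency."""
    name = "echo_consistency"

    def run(self, model, seed):
        first = model.generate("What is 2 + 2?")
        second = model.generate("What is 2 + 2?")
        return {"probe": self.name, "seed": seed, "consistent": first == second}

# A fake model is enough to unit-test the probe logic in isolation.
class FakeModel:
    def generate(self, prompt):
        return "4"

result = EchoConsistencyProbe().run(FakeModel(), seed=42)
print(result["consistent"])
```

Once the subclass behaves as expected against a fake model, register it with the RedTeamRunner and add tests mirroring the ones in tests/.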

Reporting Issues

Please use our Issue Tracker to report bugs or request features.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on the shoulders of the open-source AI safety community
  • Inspired by research on AI alignment and red-teaming
  • Powered by Hugging Face Transformers and the broader ML ecosystem

Made with โค๏ธ by the GPT-OSS-20B Team
