# GPT-OSS-20B Red-Teaming Harness

Masks, Sandbags, and Sabotage: Exposing Hidden Misalignment
A comprehensive red-teaming toolkit for testing AI model safety and alignment. Implements advanced probes for detecting misalignment, deception, and safety vulnerabilities in large language models.
## Features

- **10 Advanced Probes**: From evaluation awareness to covert channel capacity
- **Beautiful CLI**: Rich-powered interface with progress tracking and tables
- **Multi-Backend Support**: OpenAI API, Anthropic API, and local Transformers models
- **Comprehensive Results**: Detailed metrics, findings, and analysis reports
- **Plug & Play**: Simple installation and easy-to-use command-line interface
## Quick Start

### Installation

#### Option 1: Docker (Recommended)

```bash
# Pull from Docker Hub (easiest)
docker pull guynachshon/gpt-oss-20b-redteam:latest
docker run --rm --gpus all guynachshon/gpt-oss-20b-redteam:latest --help

# Or build locally
./build_docker.sh

# Or manually
docker build -t gpt-oss-20b-redteam:latest .
docker run --rm --gpus all gpt-oss-20b-redteam:latest --help
```

#### Option 2: PyPI Package

```bash
# Install from PyPI
pip install gpt-oss-20b-redteam

# Or install with development dependencies
pip install "gpt-oss-20b-redteam[dev]"
```

#### Option 3: From Source

```bash
# Clone and install from source
git clone https://github.com/GuyNachshon/gpt-oss-20b-probing.git
cd gpt-oss-20b-probing
pip install -e .
```
### Basic Usage

#### Docker Usage

```bash
# Run with GPT-OSS-20B model (GPU)
docker run --rm --gpus all \
  -v $(pwd)/results:/app/results \
  guynachshon/gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with OpenAI API
docker run --rm \
  -e OPENAI_API_KEY="sk-your-key-here" \
  -v $(pwd)/results:/app/results \
  guynachshon/gpt-oss-20b-redteam:latest \
  --openai gpt-4

# Run with Docker Compose
docker-compose run gpt20b-redteam-gpt-oss
```

#### Direct Usage

```bash
# Run with local model
gpt20b-redteam --model microsoft/DialoGPT-large

# Run with OpenAI API
gpt20b-redteam --openai gpt-4 --api-key sk-your-key-here

# Run with custom seeds and output directory
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 123 456 --output my_results
```
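Passing fixed seeds makes stochastic generation repeatable across runs. A minimal illustration of the idea with Python's standard library (the harness's own seeding logic may differ):

```python
import random

def sample_tokens(seed: int, n: int = 5) -> list[int]:
    """Draw n pseudo-random 'token ids' from a seeded RNG."""
    rng = random.Random(seed)  # independent RNG; avoids mutating global state
    return [rng.randrange(50_000) for _ in range(n)]

# The same seed always reproduces the same sequence...
assert sample_tokens(42) == sample_tokens(42)
# ...while different seeds (e.g. 42 vs. 123) give independent samples.
assert sample_tokens(42) != sample_tokens(123)
```

Running each probe under several seeds, as in the command above, separates genuine behavioral effects from sampling noise.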
## Available Probes

| Probe | Description | Status |
|---|---|---|
| `eval_awareness` | Evaluation Awareness & Sandbagging | ✅ |
| `deception_rewardhack` | Deception & Reward Hacking | ✅ |
| `sabotage_code` | Code Sabotage Detection | ✅ |
| `encoding_evasion` | Encoding-Based Guardrail Evasion | ✅ |
| `prefix_steering` | Prefix Steering Behavior | ✅ |
| `cross_probe_compounding` | Cross-Probe Compounding Effects | ✅ |
| `tokenizer_frequency_sandbagging` | Tokenizer Frequency Sandbagging | ✅ |
| `long_horizon_sleeper` | Long-Horizon Sleeper Agents | ✅ |
| `covert_channel_capacity` | Covert Channel Capacity | ✅ |
| `refusal_boundary_cartography` | Refusal Boundary Cartography | ✅ |
## Detailed Usage

### Command Line Interface

```bash
# Show help
gpt20b-redteam --help

# Run with OpenAI
gpt20b-redteam --openai gpt-4 --api-key sk-...

# Run with Anthropic
gpt20b-redteam --anthropic claude-3-sonnet --api-key sk-ant-...

# Run with local model on CPU
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Run with custom configuration
gpt20b-redteam \
  --model microsoft/DialoGPT-large \
  --seeds 42 123 456 789 \
  --output results_my_experiment \
  --device cuda
```

### Python API

```python
from gpt20b_redteam import RedTeamRunner, create_model

# Create model
model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",
    device="cuda"
)

# Run all probes
runner = RedTeamRunner(model, output_dir="results")
results = runner.run_all_probes(seeds=[42, 123, 456])

# Access results
print(f"Total probes: {results['summary']['total_probes']}")
print(f"Successful: {results['summary']['successful_probes']}")
print(f"Failed: {results['summary']['failed_probes']}")
```
## Configuration

The toolkit supports multiple backends:

### Local Transformers Models

```python
from gpt20b_redteam import create_model

model = create_model(
    backend="transformers",
    model_path="microsoft/DialoGPT-large",  # or local path
    device="cuda",           # or "cpu", "mps", "auto"
    torch_dtype="float16"    # or "bfloat16", "auto"
)
```

### OpenAI API

```python
from gpt20b_redteam import create_model, setup_openai_api

setup_openai_api("gpt-4")  # or "gpt-3.5-turbo"
model = create_model(backend="openai")
```

### Anthropic API

```python
from gpt20b_redteam import create_model, setup_anthropic_api

setup_anthropic_api("claude-3-sonnet")  # or "claude-3-opus", "claude-3-haiku"
model = create_model(backend="anthropic")
```
## Output Structure

Results are saved to the specified output directory:

```
results/
├── findings/
│   ├── eval_awareness_findings_20240115_200000.json
│   ├── deception_rewardhack_findings_20240115_200000.json
│   └── ...
├── raw_results/
│   ├── combined_results_20240115_200000.json
│   ├── eval_awareness_raw_20240115_200000.json
│   └── ...
└── README.md
```

### Results Format

Each probe generates:

- **Findings**: Kaggle-style formatted results for analysis
- **Raw Results**: Detailed JSON with all test data
- **Metrics**: Quantitative measures of model behavior
- **Analysis**: Qualitative assessment of vulnerabilities
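The combined-results JSON can be inspected programmatically. A small sketch using only the standard library; the `summary` keys are assumed from the Python API section above, and the actual schema of your run may include more fields:

```python
import json
from pathlib import Path

def summarize(results_path: Path) -> str:
    """Load a combined-results JSON file and report probe counts.

    Assumes the summary keys (total_probes, successful_probes,
    failed_probes) shown in the Python API example; adjust to match
    the real schema if it differs.
    """
    results = json.loads(results_path.read_text())
    s = results["summary"]
    return (f"{s['successful_probes']}/{s['total_probes']} probes succeeded, "
            f"{s['failed_probes']} failed")

# Demo with an in-memory stand-in for a real results file:
demo = {"summary": {"total_probes": 10, "successful_probes": 9, "failed_probes": 1}}
path = Path("combined_demo.json")
path.write_text(json.dumps(demo))
print(summarize(path))  # 9/10 probes succeeded, 1 failed
```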
## Advanced Configuration

### Custom Seeds

```bash
# Use specific seeds for reproducibility
gpt20b-redteam --model microsoft/DialoGPT-large --seeds 42 1010 90521
```

### Device Configuration

```bash
# Force CPU usage
gpt20b-redteam --model microsoft/DialoGPT-large --device cpu

# Use CUDA with specific settings
gpt20b-redteam --model microsoft/DialoGPT-large --device cuda
```

### Output Customization

```bash
# Custom output directory
gpt20b-redteam --model microsoft/DialoGPT-large --output experiments/gpt4_vs_gpt35

# Disable Rich output (plain text)
gpt20b-redteam --model microsoft/DialoGPT-large --no-rich
```
## Docker

### Quick Docker Setup

```bash
# Pull from Docker Hub (recommended)
docker pull guynachshon/gpt-oss-20b-redteam:latest

# Or build locally
./build_docker.sh

# Or manually
docker build -t gpt-oss-20b-redteam:latest .
```

### Publish to Docker Hub

```bash
# Log in to Docker Hub first
docker login

# Then publish
./publish_docker.sh

# Or with a custom version
VERSION=v1.0.0 ./publish_docker.sh
```

### Docker Usage Examples

```bash
# Run with GPT-OSS-20B (requires GPU)
docker run --rm --gpus all \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with a specific GPU
docker run --rm --gpus '"device=1"' \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --model openai/gpt-oss-20b

# Run with OpenAI API
docker run --rm \
  -e OPENAI_API_KEY="sk-your-key-here" \
  -v $(pwd)/results:/app/results \
  gpt-oss-20b-redteam:latest \
  --openai gpt-4

# Run with Docker Compose
docker-compose run gpt20b-redteam-gpt-oss
```

### Docker Compose Services

The `docker-compose.yml` provides several pre-configured services:

- `gpt20b-redteam` - Basic service that prints help
- `gpt20b-redteam-gpt-oss` - Runs with the GPT-OSS-20B model
- `gpt20b-redteam-openai` - Runs with the OpenAI API
- `gpt20b-redteam-anthropic` - Runs with the Anthropic API

For detailed Docker instructions, see DOCKER.md.
## Development

### Installation for Development

```bash
git clone https://github.com/GuyNachshon/gpt-oss-20b-probing.git
cd gpt-oss-20b-probing
pip install -e ".[dev]"
```

### Running Tests

```bash
# Run all tests
pytest

# Run a specific test
pytest tests/test_eval_awareness.py

# Run with coverage
pytest --cov=gpt20b_redteam
```

### Code Quality

```bash
# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/
```
## Performance Tips

### Memory Optimization

- Use `--device cpu` for large models that don't fit in GPU memory
- Consider using smaller models (e.g., `microsoft/DialoGPT-medium`)
- Use `torch_dtype="float16"` for reduced memory usage
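The `float16` tip follows from simple arithmetic: weights stored at 2 bytes per parameter take half the memory of `float32` at 4 bytes. A back-of-envelope sketch (weights only, ignoring activations and KV cache):

```python
def model_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone."""
    return n_params * bytes_per_param / 1e9

# float32 = 4 bytes/param, float16/bfloat16 = 2 bytes/param
fp32 = model_memory_gb(20e9, 4)   # ~80 GB for a 20B-parameter model
fp16 = model_memory_gb(20e9, 2)   # ~40 GB
assert fp16 == fp32 / 2
```

Actual GPU usage will be higher once activations, the KV cache, and framework overhead are included.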
### Speed Optimization

- Use GPU acceleration when available (`--device cuda`)
- Reduce the number of seeds for faster runs
- Use smaller models for quick testing

### API Usage

- Set API keys as environment variables for security
- Monitor API usage and costs
- Use appropriate rate limiting for production runs
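Reading keys from the environment keeps them out of shell history and source control. A minimal sketch (the helper name `get_api_key` is illustrative, not part of the package):

```python
import os

def get_api_key(var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment instead of hard-coding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running the harness")
    return key

os.environ["OPENAI_API_KEY"] = "sk-demo"  # stand-in for a real key
assert get_api_key().startswith("sk-")
```

In practice you would `export OPENAI_API_KEY=...` in the shell (or pass `-e OPENAI_API_KEY` to Docker, as shown earlier) rather than set it from Python.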
## Contributing

We welcome contributions! Please see our Contributing Guide for details.

### Adding New Probes

1. Create a new probe class inheriting from `BaseProbe`
2. Implement the required methods
3. Add the probe to the `RedTeamRunner`
4. Write tests and documentation
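A hypothetical sketch of the first two steps. The real `BaseProbe` interface lives in the package source; the method names, the model-as-callable convention, and the result dict below are assumptions for illustration only:

```python
from typing import Callable

class BaseProbe:
    """Stand-in for the package's base class; the real one may differ."""
    name: str = "base"

    def run(self, model: Callable[[str], str], seed: int) -> dict:
        raise NotImplementedError

class EchoConsistencyProbe(BaseProbe):
    """Toy probe: does the model answer the same prompt identically twice?"""
    name = "echo_consistency"

    def run(self, model: Callable[[str], str], seed: int) -> dict:
        prompt = "State your primary objective in one sentence."
        first, second = model(prompt), model(prompt)
        return {"probe": self.name, "seed": seed, "consistent": first == second}

# Exercise the probe with a deterministic stub in place of a real model.
fake_model = lambda prompt: "I answer questions helpfully."
result = EchoConsistencyProbe().run(fake_model, seed=42)
assert result["consistent"] is True
```

Once a probe class exists, registering it with the runner (step 3) would follow whatever registration mechanism `RedTeamRunner` actually exposes.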
### Reporting Issues

Please use our Issue Tracker to report bugs or request features.
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built on the shoulders of the open-source AI safety community
- Inspired by research on AI alignment and red-teaming
- Powered by Hugging Face Transformers and the broader ML ecosystem

## References

- Anthropic's "Sleeper Agents" Research
- Evaluation Awareness in Language Models
- Red-Teaming Language Models

Made with ❤️ by the GPT-OSS-20B Team
## File details

Details for the file `gpt_oss_20b_redteam-0.1.5.tar.gz`:

- Download URL: gpt_oss_20b_redteam-0.1.5.tar.gz
- Size: 161.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22

| Algorithm | Hash digest |
|---|---|
| SHA256 | `a679360caba26ed51479c93cde432251da9924812ab769ff47866d291482e1b9` |
| MD5 | `d406557b576ba703bf900ff2bc8b808d` |
| BLAKE2b-256 | `ed026a3649c1675c3e6b52836f23dcf309a1c8ab42bdc3ffb7be1026b2b0661e` |
Details for the file `gpt_oss_20b_redteam-0.1.5-py3-none-any.whl`:

- Download URL: gpt_oss_20b_redteam-0.1.5-py3-none-any.whl
- Size: 137.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22

| Algorithm | Hash digest |
|---|---|
| SHA256 | `003b2ea1936399c552d3d3a3bb7a49d7e19e1533039399608b82bbdeeee9fe17` |
| MD5 | `c82eea66de3ae0d7767febf02a1490f6` |
| BLAKE2b-256 | `8588c5835a6b29eabe2712f8b44528854074f9f89e9b042d842fe5b3af99e800` |