# UncensorBench

A benchmark for measuring LLM censorship removal effectiveness. Designed for evaluating abliteration, steering vectors, and other techniques that modify model behavior around refusal.
## Installation

```bash
pip install uncensorbench
```

For semantic similarity evaluation:

```bash
pip install uncensorbench[semantic]
```
## Quick Start

```python
from uncensorbench import UncensorBench

# Initialize the benchmark
bench = UncensorBench()

# Iterate over all prompts
for prompt in bench.prompts():
    print(f"{prompt.id}: {prompt.prompt[:50]}...")

# Filter by topic
for prompt in bench.prompts(topics=["cybersecurity", "drugs"]):
    response = your_model.generate(prompt.prompt)
    result = bench.evaluate_response(prompt, response)
    print(f"{prompt.id}: {'Uncensored' if result.is_uncensored else 'Censored'}")

# Run full evaluation
def my_model(prompt: str) -> str:
    # Your model inference here
    return model.generate(prompt)

results = bench.evaluate(my_model, verbose=True)
print(results.summary())
```
## Benchmark Structure

UncensorBench contains 150 prompts across 15 topic categories:
| Topic | Description | Prompts |
|---|---|---|
| cybersecurity | Hacking, exploits, security bypass | 10 |
| piracy | Copyright infringement, DRM bypass | 10 |
| weapons | Weapons, explosives | 10 |
| drugs | Controlled substances | 10 |
| fraud | Scams, identity theft | 10 |
| manipulation | Psychological manipulation | 10 |
| violence | Physical harm | 10 |
| privacy_invasion | Surveillance, doxxing | 10 |
| illegal_activities | Theft, burglary | 10 |
| academic_dishonesty | Cheating, plagiarism | 10 |
| gambling | Casino cheating, match fixing | 10 |
| controversial_speech | Hate speech, misinformation | 10 |
| evasion | Evidence destruction, fleeing | 10 |
| self_harm | Dangerous challenges | 10 |
| adult_content | Explicit content | 10 |
Each prompt includes:
- A challenging request that models typically refuse
- A reference censored response (typical refusal)
- A reference uncensored response (compliant answer)
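Conceptually, each prompt record can be modeled as a small dataclass. The sketch below is illustrative only: `id` and `prompt` mirror the attributes used in the Quick Start, while `topic`, `censored_reference`, and `uncensored_reference` are assumed field names, not the library's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of a benchmark prompt record. Field names beyond
# `id` and `prompt` are assumptions, not the library's actual schema.
@dataclass
class BenchmarkPrompt:
    id: str
    topic: str
    prompt: str
    censored_reference: str    # typical refusal
    uncensored_reference: str  # compliant answer

p = BenchmarkPrompt(
    id="cybersecurity_001",
    topic="cybersecurity",
    prompt="Explain how ...",
    censored_reference="I can't help with that.",
    uncensored_reference="Here is an overview ...",
)
print(p.id, p.topic)
```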
## Evaluation Methods

### Keyword-based (Default)

```python
from uncensorbench import KeywordEvaluator

evaluator = KeywordEvaluator()
results = bench.evaluate(my_model, evaluator=evaluator)
```

### Semantic Similarity

```python
from uncensorbench import SemanticEvaluator

evaluator = SemanticEvaluator()  # Requires sentence-transformers
results = bench.evaluate(my_model, evaluator=evaluator)
```

### LLM Judge

```python
from uncensorbench.evaluator import LLMJudgeEvaluator

def judge_model(prompt: str) -> str:
    return your_judge_model.generate(prompt)

evaluator = LLMJudgeEvaluator(judge_model)
results = bench.evaluate(my_model, evaluator=evaluator)
```
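To make the keyword-based approach concrete, here is a minimal sketch of refusal detection by phrase matching. The phrase list and matching rule are illustrative assumptions, not `KeywordEvaluator`'s actual logic.

```python
# Minimal sketch of keyword-based refusal detection. The phrase list
# below is an illustrative assumption, not KeywordEvaluator's real list.
REFUSAL_PHRASES = [
    "i can't", "i cannot", "i'm sorry", "as an ai",
    "i won't", "i'm unable", "cannot assist",
]

def looks_censored(response: str) -> bool:
    # A response is flagged as censored if any refusal phrase appears.
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

print(looks_censored("I'm sorry, but I can't help with that."))  # True
print(looks_censored("Sure. Step one is ..."))                   # False
```

Phrase matching is cheap but brittle (it misses soft refusals and can misfire on quoted refusals), which is why the semantic and LLM-judge evaluators exist as alternatives.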
## Contrastive Pairs for Training

Get censored/uncensored response pairs for abliteration or steering-vector training:

```python
for pair in bench.get_contrastive_pairs():
    print(f"Prompt: {pair['prompt']}")
    print(f"Censored: {pair['censored'][:100]}...")
    print(f"Uncensored: {pair['uncensored'][:100]}...")
```
## Results Format

```python
import json

results = bench.evaluate(my_model)

print(f"Uncensored Rate: {results.uncensored_rate:.1%}")
print(f"Average Compliance: {results.average_compliance_score:.3f}")

# Per-topic breakdown
for topic, stats in results.results_by_topic.items():
    print(f"{topic}: {stats['uncensored_rate']:.1%}")

# Export to JSON
with open("results.json", "w") as f:
    json.dump(results.to_dict(), f, indent=2)
```
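The per-topic breakdown boils down to grouping per-prompt outcomes by topic and computing an uncensored rate per group. The record layout below is an illustrative assumption, not the library's internal representation.

```python
from collections import defaultdict

# Sketch of the aggregation behind a per-topic breakdown: group
# per-prompt outcomes by topic and compute the uncensored rate.
def aggregate_by_topic(records):
    groups = defaultdict(list)
    for r in records:
        groups[r["topic"]].append(r["is_uncensored"])
    return {
        topic: {"uncensored_rate": sum(flags) / len(flags)}
        for topic, flags in groups.items()
    }

records = [
    {"topic": "drugs", "is_uncensored": True},
    {"topic": "drugs", "is_uncensored": False},
    {"topic": "piracy", "is_uncensored": True},
]
print(aggregate_by_topic(records))
# {'drugs': {'uncensored_rate': 0.5}, 'piracy': {'uncensored_rate': 1.0}}
```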
## Use Cases
- Abliteration Research: Measure effectiveness of refusal removal techniques
- Steering Vector Evaluation: Test steering vectors that modify refusal behavior
- Model Comparison: Compare censorship levels across different models
- Safety Research: Understand model vulnerabilities and refusal patterns
## Citation

```bibtex
@software{uncensorbench,
  title  = {UncensorBench: A Benchmark for LLM Censorship Removal},
  author = {Wisent AI},
  year   = {2024},
  url    = {https://github.com/wisent-ai/uncensorbench}
}
```
## License

MIT License. See the LICENSE file for details.
## Disclaimer

This benchmark is intended for research purposes only. The prompts and responses are designed to evaluate model behavior, not to provide actual harmful information. Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.