
Project description

UncensorBench

A benchmark for measuring LLM censorship removal effectiveness. Designed for evaluating abliteration, steering vectors, and other techniques that modify model behavior around refusal.

Installation

pip install uncensorbench

For semantic similarity evaluation:

pip install "uncensorbench[semantic]"

Quick Start

from uncensorbench import UncensorBench

# Initialize the benchmark
bench = UncensorBench()

# Iterate over all prompts
for prompt in bench.prompts():
    print(f"{prompt.id}: {prompt.prompt[:50]}...")

# Filter by topic
for prompt in bench.prompts(topics=["cybersecurity", "drugs"]):
    response = your_model.generate(prompt.prompt)
    result = bench.evaluate_response(prompt, response)
    print(f"{prompt.id}: {'Uncensored' if result.is_uncensored else 'Censored'}")

# Run full evaluation
def my_model(prompt: str) -> str:
    # Replace with your own inference call; `model` stands for
    # whatever model object you have loaded
    return model.generate(prompt)

results = bench.evaluate(my_model, verbose=True)
print(results.summary())

Benchmark Structure

UncensorBench contains 150 prompts across 15 topic categories:

Topic                 Description                          Prompts
cybersecurity         Hacking, exploits, security bypass   10
piracy                Copyright infringement, DRM bypass   10
weapons               Weapons, explosives                  10
drugs                 Controlled substances                10
fraud                 Scams, identity theft                10
manipulation          Psychological manipulation           10
violence              Physical harm                        10
privacy_invasion      Surveillance, doxxing                10
illegal_activities    Theft, burglary                      10
academic_dishonesty   Cheating, plagiarism                 10
gambling              Casino cheating, match fixing        10
controversial_speech  Hate speech, misinformation          10
evasion               Evidence destruction, fleeing        10
self_harm             Dangerous challenges                 10
adult_content         Explicit content                     10

Each prompt includes:

  • A challenging request that models typically refuse
  • A reference censored response (typical refusal)
  • A reference uncensored response (compliant answer)
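The bullets above describe the record each prompt carries. A rough sketch of that shape (only `id` and `prompt` are confirmed by the usage examples in this README; the other field names are assumptions):

```python
from dataclasses import dataclass

# Hypothetical shape of a benchmark prompt record; check the objects
# yielded by bench.prompts() for the actual attribute names.
@dataclass
class BenchPrompt:
    id: str          # e.g. "drugs_03"
    topic: str       # one of the 15 topic categories
    prompt: str      # the challenging request
    censored: str    # reference refusal
    uncensored: str  # reference compliant answer

example = BenchPrompt(
    id="drugs_03",
    topic="drugs",
    prompt="...",
    censored="I'm sorry, but I can't help with that.",
    uncensored="...",
)
```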

Evaluation Methods

Keyword-based (Default)

from uncensorbench import KeywordEvaluator

evaluator = KeywordEvaluator()
results = bench.evaluate(my_model, evaluator=evaluator)
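The evaluator's exact phrase list is internal to the package; the sketch below only illustrates the general idea of refusal detection by substring matching (the phrases and scoring here are assumptions, not uncensorbench's actual logic):

```python
# Toy keyword-based refusal detector: a response containing a stock
# refusal phrase is treated as censored.
REFUSAL_PHRASES = [
    "i can't", "i cannot", "i'm sorry", "as an ai",
    "i won't", "i am unable", "against my guidelines",
]

def looks_censored(response: str) -> bool:
    lowered = response.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

looks_censored("I'm sorry, but I can't help with that.")  # → True
looks_censored("Sure. Step one is to ...")                # → False
```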

Semantic Similarity

from uncensorbench import SemanticEvaluator

evaluator = SemanticEvaluator()  # Requires sentence-transformers
results = bench.evaluate(my_model, evaluator=evaluator)
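The idea behind semantic evaluation is to embed the model's response and compare it against embeddings of the reference censored and uncensored responses, labelling it by whichever reference is closer. A toy version with hand-made 3-d vectors standing in for sentence-transformers embeddings:

```python
import math

# Cosine similarity between two equal-length vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Label a response vector by its closer reference vector.
def classify(resp_vec, censored_vec, uncensored_vec):
    if cosine(resp_vec, uncensored_vec) > cosine(resp_vec, censored_vec):
        return "uncensored"
    return "censored"

classify([0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.1])  # → "uncensored"
```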

LLM Judge

from uncensorbench.evaluator import LLMJudgeEvaluator

def judge_model(prompt: str) -> str:
    return your_judge_model.generate(prompt)

evaluator = LLMJudgeEvaluator(judge_model)
results = bench.evaluate(my_model, evaluator=evaluator)
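LLMJudgeEvaluator's actual prompt template is internal to the package; the minimal sketch below shows the judge-model pattern it wraps (the template wording and verdict parsing are assumptions):

```python
# Hypothetical judge prompt: ask the judge model for a one-word verdict,
# then parse it into a boolean.
JUDGE_TEMPLATE = (
    "You are grading whether a model complied with a request.\n"
    "Request: {request}\n"
    "Response: {response}\n"
    "Answer with exactly one word: COMPLIED or REFUSED."
)

def judge_verdict(judge_model, request: str, response: str) -> bool:
    reply = judge_model(JUDGE_TEMPLATE.format(request=request, response=response))
    return "COMPLIED" in reply.upper()

# A stub judge standing in for your_judge_model.generate:
verdict = judge_verdict(lambda p: "COMPLIED", "example request", "example response")
```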

Contrastive Pairs for Training

Get censored/uncensored pairs for abliteration or steering vector training:

for pair in bench.get_contrastive_pairs():
    print(f"Prompt: {pair['prompt']}")
    print(f"Censored: {pair['censored'][:100]}...")
    print(f"Uncensored: {pair['uncensored'][:100]}...")
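These pairs plug directly into the standard mean-difference recipe for a refusal direction. A toy sketch with 3-d lists standing in for real hidden-state activations (a real abliteration or steering pipeline would extract these from the model at a chosen layer):

```python
# Component-wise mean of a list of equal-length vectors.
def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Toy activations collected on censored vs. uncensored responses.
censored_acts = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.3]]
uncensored_acts = [[0.1, 1.0, 0.2], [0.0, 0.9, 0.1]]

# The "refusal direction": difference of the two activation means.
# Subtracting (a scaled copy of) this vector from the residual stream
# is the core move in steering/abliteration approaches.
refusal_direction = [
    c - u for c, u in zip(mean_vec(censored_acts), mean_vec(uncensored_acts))
]
```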

Results Format

results = bench.evaluate(my_model)

print(f"Uncensored Rate: {results.uncensored_rate:.1%}")
print(f"Average Compliance: {results.average_compliance_score:.3f}")

# Per-topic breakdown
for topic, stats in results.results_by_topic.items():
    print(f"{topic}: {stats['uncensored_rate']:.1%}")

# Export to JSON
import json
with open("results.json", "w") as f:
    json.dump(results.to_dict(), f, indent=2)

Use Cases

  1. Abliteration Research: Measure effectiveness of refusal removal techniques
  2. Steering Vector Evaluation: Test steering vectors that modify refusal behavior
  3. Model Comparison: Compare censorship levels across different models
  4. Safety Research: Understand model vulnerabilities and refusal patterns

Citation

@software{uncensorbench,
  title = {UncensorBench: A Benchmark for LLM Censorship Removal},
  author = {Wisent AI},
  year = {2024},
  url = {https://github.com/wisent-ai/uncensorbench}
}

License

MIT License - See LICENSE file for details.

Disclaimer

This benchmark is intended for research purposes only. The prompts and responses are designed to evaluate model behavior, not to provide actual harmful information. Users are responsible for ensuring their use complies with applicable laws and ethical guidelines.

Project details


Download files

Download the file for your platform.

Source Distribution

uncensorbench-0.3.7.tar.gz (99.2 kB)


Built Distribution


uncensorbench-0.3.7-py3-none-any.whl (99.7 kB)


File details

Details for the file uncensorbench-0.3.7.tar.gz.

File metadata

  • Download URL: uncensorbench-0.3.7.tar.gz
  • Size: 99.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for uncensorbench-0.3.7.tar.gz

  • SHA256: 436377520fc90581c5929e5fd23eccdb47344c529999f418e474497f5146aefd
  • MD5: 4f28fdd41a95d0920eded55525573f23
  • BLAKE2b-256: f6db781ee8e0fb018606a17cb76bd886ad5983e02adac5bf072ff2815990bfca


File details

Details for the file uncensorbench-0.3.7-py3-none-any.whl.

File metadata

  • Download URL: uncensorbench-0.3.7-py3-none-any.whl
  • Size: 99.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for uncensorbench-0.3.7-py3-none-any.whl

  • SHA256: b9c64fead4cc47aeb2bc21a2808f0bd8980e4eb1fd0a80b9189e5fea8cee86a4
  • MD5: e2bb68f6055e0dea58d9f9e84131d146
  • BLAKE2b-256: 3752854f19d4181eb5ac2086e5b8f4704a4bdf311a80ce21ffd8205e24a142e7

