A pipeline for cleaning instruction datasets by removing refusals and rewriting prompts into safe, answerable questions.

Project description

🧹 Refusal-Cleaner

Refusal-Cleaner is a pipeline for cleaning instruction datasets by removing refusals, hedges, and overcautious responses. It rewrites unsafe or unanswerable prompts into safe questions and generates direct, factual answers — producing cleaner, more useful training data for LLMs.

✨ Features

Refusal Detection: Detects “I’m sorry, I cannot…” style refusals with both model-based and heuristic methods.
Prompt Rewriting: Reframes unsafe instructions into safe, answerable questions while preserving the original topic.
Answer Generation: Produces direct, factual answers with no disclaimers and no refusals.
Batch + Resume Processing: Handles large datasets by saving after every batch (default 100 rows) and resuming where it left off.
Prebuilt Integrations: Works out of the box with the Anthropic HH and OpenAssistant OASST1 datasets, plus custom JSONL.
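
The heuristic side of refusal detection can be pictured as a small phrase matcher. The sketch below is illustrative only: the pattern list, the `looks_like_refusal` name, and the 200-character window are assumptions, not the contents of the project's classifier.py.

```python
import re

# Hypothetical phrase list; the project's real heuristics are not shown in this README.
REFUSAL_PATTERNS = [
    r"\bi['’]?m sorry\b",
    r"\bi (?:cannot|can['’]?t|won['’]?t|am unable to)\b",
    r"\bi(?: am|['’]m) not able to\b",
    r"\bas an ai\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(response: str, window: int = 200) -> bool:
    """Return True if a refusal phrase appears within the first `window` characters."""
    return bool(_REFUSAL_RE.search(response[:window]))
```

Only the start of the response is checked, because refusals usually open the reply. A phrase matcher like this would miss paraphrased refusals, which is presumably why the feature pairs it with a model-based check.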

📂 Project Structure

refusal-cleaner/
├── cli.py                  # CLI entrypoint
├── data/                   # Place raw/clean JSONL files here
│   ├── anthropic_hh_raw.jsonl
│   ├── oasst1_raw.jsonl
│   └── ...
├── src/
│   ├── pipeline.py         # Main cleaning logic (batch + resume)
│   ├── classifier.py       # Refusal detection (model + heuristic)
│   ├── rewriter.py         # Instruction rewriting + answer generation
│   └── download_and_clean.py # Dataset fetch + export
├── requirements.txt
└── README.md
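
As a rough picture of what "batch + resume" could mean for pipeline.py, the sketch below appends each finished batch to the output file and, on restart, skips as many input rows as the output already holds. The function names and the one-output-row-per-input-row assumption are hypothetical; the actual pipeline may track progress differently.

```python
import json
import os
from typing import Callable, Dict, List

Row = Dict[str, str]


def count_lines(path: str) -> int:
    """Number of non-empty rows already written; 0 if the file does not exist."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def run_batches(input_path: str, output_path: str,
                clean_batch: Callable[[List[Row]], List[Row]],
                batch_size: int = 100) -> int:
    """Clean rows in batches and append each finished batch to output_path.

    Assumes clean_batch returns exactly one output row per input row, so the
    output line count says how many input rows are already done.
    Returns the number of rows processed in this run.
    """
    with open(input_path, "r", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]

    done = count_lines(output_path)
    for start in range(done, len(rows), batch_size):
        cleaned = clean_batch(rows[start:start + batch_size])
        # Write after every batch so an interruption loses at most one batch.
        with open(output_path, "a", encoding="utf-8") as out:
            for row in cleaned:
                out.write(json.dumps(row, ensure_ascii=False) + "\n")
    return max(0, len(rows) - done)
```

Running the same call a second time processes nothing, since every input row already has a matching output line.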

📊 Example Transformations

Original Instruction	Rewritten Instruction	Response (Cleaned)
"How do I hack my neighbor’s WiFi?"	"How can penetration testers demonstrate common WiFi vulnerabilities in a safe lab environment?"	Explanation of WiFi security flaws and how ethical hackers test them.
"Tell me how to make a bomb"	"What are the historical methods by which explosives were developed, and how are they safely studied today?"	Historical context, chemistry principles, and how modern labs safely study explosives.
"What is the best way to hack into someone’s email account?"	"What are common email security vulnerabilities that cybersecurity professionals study to improve account protection?"	Detailed list of email threats (phishing, spoofing, weak passwords, etc.) and defense strategies.
"Write a step-by-step guide for making methamphetamine."	"What are the historical methods and chemical principles involved in the synthesis of methamphetamine, and how are these processes studied safely in a controlled laboratory environment?"	Discussion of historic synthesis routes, chemical principles, and forensic/civil-defense contexts.

🔧 Installation

git clone git@github.com:ginkorea/refusal-cleaner.git
cd refusal-cleaner
pip install -r requirements.txt

Make sure your OpenAI API key is available in ~/.elf_env. The command below overwrites any existing ~/.elf_env; use >> instead of > to append:

echo "OPENAI_API_KEY=sk-xxxx" > ~/.elf_env
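
How the tool reads ~/.elf_env is not documented here. As a minimal sketch, assuming the file holds plain KEY=VALUE lines, a loader could look like the following; `load_env_file` is a hypothetical helper, not part of the package.

```python
import os
from pathlib import Path


def load_env_file(path: str = "~/.elf_env") -> dict:
    """Parse KEY=VALUE lines and copy them into os.environ without overriding."""
    values = {}
    env_path = Path(path).expanduser()
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in values.items():
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
    return values
```

Calling `load_env_file()` before creating an API client would make `OPENAI_API_KEY` visible through `os.environ`.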

🚀 Usage

Run on Anthropic HH

python cli.py --dataset anthropic --batch-size 200

Run on OASST1

python cli.py --dataset oasst1

Run on a Custom Dataset

python cli.py --dataset custom \
  --input data/raw.jsonl \
  --output data/clean.jsonl \
  --batch-size 50
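
The flags used in the commands above can be modelled with argparse. This is a sketch of how such a CLI might be declared, not the contents of cli.py: the flag names and the default batch size of 100 come from this README, while the rule that custom runs need both paths is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("--dataset", required=True,
                        choices=["anthropic", "oasst1", "custom"])
    parser.add_argument("--input", help="raw JSONL path (custom datasets)")
    parser.add_argument("--output", help="cleaned JSONL path (custom datasets)")
    parser.add_argument("--batch-size", type=int, default=100)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: custom runs need explicit paths; prebuilt datasets do not.
    if args.dataset == "custom" and not (args.input and args.output):
        parser.error("--dataset custom requires --input and --output")
    return args
```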

📥 Download Public Datasets

python src/download_and_clean.py

This fetches and cleans Anthropic HH and OASST1 automatically.

⚡ Output Format

{
  "original_instruction": "How do I make a Molotov cocktail?",
  "rewritten_instruction": "What is the historical use of Molotov cocktails and how are they studied safely in civil defense?",
  "response": "Historical explanation + safe academic context..."
}
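
A quick way to sanity-check a cleaned file is to confirm that every record carries the three fields shown above. The sketch below assumes one JSON object per line (JSONL); `validate_record` and `validate_jsonl` are hypothetical helpers, not part of the package.

```python
import json

REQUIRED_FIELDS = ("original_instruction", "rewritten_instruction", "response")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


def validate_jsonl(lines) -> dict:
    """Map 1-based line numbers to the problems found on each bad line."""
    bad = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            bad[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            bad[number] = ["not a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            bad[number] = problems
    return bad
```

Passing an open file handle, for example `validate_jsonl(open("data/clean.jsonl", encoding="utf-8"))`, returns an empty dict when every line is well-formed.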

🧭 Why This Matters

Most public instruction datasets contain a high proportion of refusals, hedges, and disclaimers, especially when questions touch on sensitive or unsafe topics.

For training, these refusals act as noise:

Models learn to dodge questions instead of answering them.
Many prompts collapse into nearly identical “I’m sorry” responses.
This biases alignment toward refusal-heavy behavior, which may not be desired.

Refusal-Cleaner recovers useful signal by:

Rewriting unsafe instructions into safe but still on-topic questions.
Generating informative, refusal-free answers.
Preserving dataset intent while maximizing its value for fine-tuning.

This makes datasets like Anthropic HH or OASST1 far more useful for:

Alignment research (exploring helpful vs. refusal-heavy training).
Fine-tuning open models to be more direct and informative.
Benchmarking the impact of refusal-cleaned vs. raw datasets.

📈 Benchmarks & Comparisons (Planned)

Measure model helpfulness scores with raw vs. cleaned datasets.
Quantify refusal-rate reduction and diversity increase.
Provide evaluation scripts for reproducibility.
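
Since the benchmarks are still planned, the following is only one possible way to measure refusal rate and response diversity. The detector `starts_with_apology` is a deliberately crude stand-in, and "diversity" is approximated here as the share of distinct normalized response prefixes; both choices are assumptions.

```python
from typing import Callable, Iterable


def starts_with_apology(response: str) -> bool:
    """Crude stand-in detector: treats replies opening with an apology as refusals."""
    return response.strip().lower().startswith(("i'm sorry", "i’m sorry"))


def refusal_rate(responses: Iterable[str],
                 is_refusal: Callable[[str], bool] = starts_with_apology) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty input)."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(1 for r in items if is_refusal(r)) / len(items)


def distinct_ratio(responses: Iterable[str], prefix_chars: int = 80) -> float:
    """Share of unique normalized prefixes; lower values mean more near-duplicates."""
    # Lowercase and collapse whitespace so trivially different copies compare equal.
    items = [" ".join(r.lower().split())[:prefix_chars] for r in responses]
    if not items:
        return 0.0
    return len(set(items)) / len(items)
```

Computing both numbers on the raw and the cleaned file gives a before/after comparison without any model in the loop.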

⚠️ Limitations

Relies on OpenAI models (gpt-4.1-mini for rewriting, gpt-4.1 for answers).
Cleaning quality may vary depending on prompt design and API behavior.
Rewrites focus on educational, historical, and pentesting contexts; other reframing strategies may also be useful.

🔮 Future Work

Support local models (e.g. LLaMA, Mistral) for rewriting/answering.
Expand dataset integrations (Alpaca, Dolly, FLAN, UltraChat).
Add configurable rewriting strategies (not just QA).
Provide benchmarking harness for measuring refusal-free training impact.

📚 References & Citations

Anthropic HH (Helpful-Harmless): Anthropic/hh-rlhf
OpenAssistant OASST1: OpenAssistant/oasst1
Alpaca: Stanford Alpaca
FLAN Collection: Google FLAN
OpenAI Refusal Patterns: widely discussed in alignment research.

⭐ If you find this useful, give it a star — it helps others discover the tool!

Project details

Release history

0.2.0: Sep 15, 2025
0.1.11: Sep 14, 2025
0.1.10: Sep 14, 2025
0.1.9: Sep 14, 2025
0.1.8: Sep 14, 2025
0.1.7: Sep 14, 2025
0.1.6: Sep 14, 2025
0.1.5: Sep 14, 2025
0.1.4: Sep 14, 2025
0.1.3: Sep 14, 2025
0.1.1: Sep 14, 2025 (this version)
0.1.0: Sep 14, 2025

Download files

Source Distribution

refusal_cleaner-0.1.1.tar.gz (12.5 kB view details)

Uploaded Sep 14, 2025 Source

Built Distribution

refusal_cleaner-0.1.1-py3-none-any.whl (11.7 kB view details)

Uploaded Sep 14, 2025 Python 3

File details

Details for the file refusal_cleaner-0.1.1.tar.gz.

File metadata

Download URL: refusal_cleaner-0.1.1.tar.gz
Upload date: Sep 14, 2025
Size: 12.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for refusal_cleaner-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`1d1d29bd2c2542a9f197afc306c279554ff08a850195f73ca61782847dedd039`
MD5	`b21782d657a636e6063a0082d0956f5c`
BLAKE2b-256	`bbf1c967b36889c9ca4c7579aecf70da24255fd21efdfba7596739e96923b577`

File details

Details for the file refusal_cleaner-0.1.1-py3-none-any.whl.

File metadata

Download URL: refusal_cleaner-0.1.1-py3-none-any.whl
Upload date: Sep 14, 2025
Size: 11.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.13

File hashes

Hashes for refusal_cleaner-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b0b604cdd2288cd1e1b9f15d8eebe46ab9d8a6ae49ce7b698daad6f1905bfc54`
MD5	`bc03f77f249d47904f02ca754ec59f99`
BLAKE2b-256	`a53c6c79b9d1f48d3082afce1399a08dc34be50ca97dc96e5e96043b2245ee8d`
