ViToxReduce Pipeline

A comprehensive pipeline for reducing toxicity in text using a streamlined 3-tier architecture.

(Figure: pipeline architecture)

📋 Description

ViToxReduce is an automated system for reducing toxicity in text using a 3-tier architecture:

  1. Safety Checker: Detects toxic sentences
  2. Span Tagger: Locates toxic phrases
  3. Contextual Rewriter: Safely rewrites sentences
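The three tiers above compose into a simple early-exit control flow. The sketch below illustrates that flow with toy stand-in functions; the names and logic here are hypothetical and are not the actual `vitoxreduce` internals:

```python
def detoxify(sentence, is_toxic, find_toxic_spans, rewrite):
    """Run a sentence through the three tiers, exiting early when safe."""
    # Tier 1: Safety Checker -- safe sentences pass through untouched
    if not is_toxic(sentence):
        return sentence
    # Tier 2: Span Tagger -- locate the offending character ranges
    spans = find_toxic_spans(sentence)
    # Tier 3: Contextual Rewriter -- rewrite guided by the spans
    return rewrite(sentence, spans)

# Toy stand-ins that demonstrate the flow (not real models)
banned = {"idiot"}
is_toxic = lambda s: any(w in s for w in banned)
find_spans = lambda s: [(s.find(w), s.find(w) + len(w)) for w in banned if w in s]
rewrite = lambda s, spans: "".join(
    ch for i, ch in enumerate(s) if not any(a <= i < b for a, b in spans)
)

print(detoxify("you idiot", is_toxic, find_spans, rewrite))
print(detoxify("hello there", is_toxic, find_spans, rewrite))
```

The real pipeline swaps the three stand-ins for PhoBERT and BARTpho models, but the early-exit shape is the same.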

🏗️ System Architecture

The ViToxReduce pipeline uses a streamlined 3-tier architecture to efficiently process and detoxify text:

(Figure: pipeline architecture)

Example Flow

The following diagram shows an example of how the pipeline processes a toxic sentence:

(Figure: example flow)

Tier 1: Safety Checker

  • Function: Classifies each input sentence as safe or unsafe (toxic)
  • Model: PhoBERT (Sequence Classification)
  • Processing Flow:
    • If safe → Returns the original sentence unchanged and stops, saving compute on the roughly 50% of inputs that are non-toxic
    • If unsafe → Passes the sentence to Tier 2 for further processing

Tier 2: Span Tagger

  • Function: Scans sentences and labels each word to identify toxic phrase locations (Toxic Spans)
  • Model: PhoBERT (Token Classification / BIO Tagging)
  • Output: A list of spans S = [s_1, s_2, ...] (character indices)
  • Role: Provides "hints" to Tier 3 about which parts to focus on, preventing over-editing of safe regions

Tier 3: Contextual Rewriter

  • Function: Generates a new sentence that removes or replaces the toxic spans while preserving the context and intent of the remaining text
  • Model: BARTpho (Seq2Seq - Span-guided)
  • Input: Original sentence X + Span list S (combined into prompt: "Original sentence: X. Words to fix: S")
  • Generation Mechanism: Uses One-time Rewriting with Beam Search to select the best result (Top-1) without iterative checking, improving processing speed
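The span-guided input can be composed as below. The template paraphrases the description above; the shipped model may use a different (e.g. Vietnamese) template, and the commented `generate` call is only a sketch of one-time beam-search rewriting with a Hugging Face seq2seq model:

```python
def build_rewriter_prompt(sentence, span_texts):
    """Compose the span-guided prompt fed to the Tier 3 rewriter."""
    return f"Original sentence: {sentence}. Words to fix: {', '.join(span_texts)}"

prompt = build_rewriter_prompt("you stupid fool really", ["stupid fool"])
print(prompt)

# With a seq2seq model such as BARTpho, one-time rewriting with beam search
# would look roughly like this (not executed here):
#   ids = tokenizer(prompt, return_tensors="pt").input_ids
#   out = model.generate(ids, num_beams=5, num_return_sequences=1)
#   rewritten = tokenizer.decode(out[0], skip_special_tokens=True)
```

Taking only the top beam (Top-1) without a re-check loop is what makes the rewrite "one-time" and keeps latency low.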

📦 Installation

System Requirements

  • Python >= 3.7
  • CUDA (optional, for GPU acceleration)

Install via pip

pip install vitoxreduce

(Dev mode) From source:

pip install -r requirements.txt
pip install -e .

Required Models

You need three fine-tuned models; each can be loaded directly from the Hugging Face Hub:

  1. BARTpho Rewriter (Tier 3) — joshswift/bartpho-rewriter
  2. PhoBERT Span Locator (Tier 2) — joshswift/phobert-span
  3. PhoBERT Toxicity Classifier (Tier 1) — joshswift/phobert-toxicity

You can use the repo IDs directly (online), or download locally:

pip install huggingface_hub
huggingface-cli login          # if private or rate-limited
huggingface-cli download joshswift/bartpho-rewriter   --local-dir ./models/rewriter
huggingface-cli download joshswift/phobert-span       --local-dir ./models/span
huggingface-cli download joshswift/phobert-toxicity   --local-dir ./models/toxicity

Dataset (optional, for testing/eval)

  • Hugging Face dataset: joshswift/vitoxrewrite
  • Download locally (JSONL kept intact):
pip install huggingface_hub
huggingface-cli login  # if needed
huggingface-cli download joshswift/vitoxrewrite --local-dir ./dataset

After download, you will have:

  • dataset/vitoxrewrite_train.jsonl
  • dataset/vitoxrewrite_validation.jsonl
  • dataset/vitoxrewrite_test.jsonl
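Each file is standard JSONL (one JSON object per line) with a 'comment' field holding the input text. A minimal loader, demonstrated on a throwaway file (field values here are sample data, not the real dataset):

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Read a JSONL file into a list of dicts (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a temporary file shaped like the dataset
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as f:
    f.write('{"comment": "sentence one"}\n{"comment": "sentence two"}\n')
    tmp = f.name

rows = load_jsonl(tmp)
os.unlink(tmp)
print(len(rows), rows[0]["comment"])
```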

🚀 Quick Start

Option 1: Automated Setup Script (Recommended for First-Time Users)

The easiest way to get started is the automated setup script, which installs the package, downloads the models, and runs a test:

# Clone or download the repository
cd vitoxreduce_pipeline_github

# Run the automated setup script
python examples/example_usage.py

What the script does:

  • Installs vitoxreduce package from PyPI
  • Downloads the 3 required models from Hugging Face (skipped if they already exist)
  • Runs a smoke test with sample text
  • Saves results to ./results/smoke_test_TIMESTAMP.json

Customize the script:

# Use custom model directory
python examples/example_usage.py --models-dir ./my_models

# Use Hugging Face token (if repos are private/rate-limited)
python examples/example_usage.py --token hf_xxxxxxxxxxxxx

# Custom output file
python examples/example_usage.py --output ./my_results.json

Option 2: Manual Setup (Step-by-Step)

If you prefer to set up manually or need more control, follow these steps:

1) Install Package

pip install vitoxreduce

2) Download Models (Optional - can use online models instead)

pip install huggingface_hub
huggingface-cli download joshswift/bartpho-rewriter   --local-dir ./models/rewriter
huggingface-cli download joshswift/phobert-span       --local-dir ./models/span
huggingface-cli download joshswift/phobert-toxicity   --local-dir ./models/toxicity

3) Run CLI

# Single sentence (using online models)
vitoxreduce \
  --input "Từ lúc mấy bro cmt cực kì cl gì đấy..." \
  --rewriter_model joshswift/bartpho-rewriter \
  --span_locator_model joshswift/phobert-span \
  --toxicity_detector_model joshswift/phobert-toxicity \
  --output result.json \
  --verbose

# Or use local models
vitoxreduce \
  --input "Your text here" \
  --rewriter_model ./models/rewriter \
  --span_locator_model ./models/span \
  --toxicity_detector_model ./models/toxicity \
  --output result.json \
  --verbose

4) Python API

from vitoxreduce import ViToxReducePipeline

pipeline = ViToxReducePipeline(
    rewriter_model_path="joshswift/bartpho-rewriter",  # or local path
    span_locator_model_path="joshswift/phobert-span",
    toxicity_detector_model_path="joshswift/phobert-toxicity",
    num_beams=5,
)

result = pipeline.process("Your text here", verbose=True)
print("Rewritten:", result["rewritten"])
print("Safe?:", result["rewritten_is_safe"])
print("Toxicity:", result["toxicity_score"], "→", result["rewritten_toxicity_score"])

⚙️ Command-Line Arguments

Required Arguments

  • --rewriter_model: Path to BARTpho rewriter model (Tier 3) or Hugging Face repo ID
  • --span_locator_model: Path to PhoBERT span locator model (Tier 2) or Hugging Face repo ID
  • --toxicity_detector_model: Path to PhoBERT toxicity classifier (Tier 1) or Hugging Face repo ID
  • --input: Input text, text file (one sentence per line), or JSONL file with 'comment' field

Optional Arguments

  • --output: Output JSON file path (default: auto-generated or stdout)
  • --num_beams: Number of beams for beam search (default: 5)
  • --toxicity_threshold: Threshold for unsafe classification (default: 0.5)
  • --device: Device to use - cuda, cpu, or auto (default: auto)
  • --verbose: Print detailed processing information
  • --mode: Processing mode - auto, single, file, or jsonl (default: auto)

📊 Output Format

The pipeline returns a dictionary with the following keys:

{
    "original": "Original sentence",
    "is_safe": False,                    # True if safe, False if unsafe
    "toxicity_score": 0.85,              # Toxicity score (0.0-1.0)
    "spans": [(10, 15), (20, 25)],       # List of detected spans (character indices)
    "span_texts": ["toxic", "word"],     # List of span texts
    "rewritten": "Rewritten sentence",   # Rewritten sentence (or original if safe)
    "rewritten_is_safe": True,           # True if rewritten sentence is safe
    "rewritten_toxicity_score": 0.15,    # Toxicity score of rewritten sentence
    "processing_time": 1.2                # Processing time (seconds)
}
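Once you have a list of such result dicts (e.g. from batch mode), simple aggregate statistics fall out of plain dict handling. The `results` list below is sample data, not real model output:

```python
# Sample results in the format documented above (abbreviated to the
# fields used here)
results = [
    {"is_safe": False, "toxicity_score": 0.85, "rewritten_toxicity_score": 0.15},
    {"is_safe": True,  "toxicity_score": 0.05, "rewritten_toxicity_score": 0.05},
]

# How many sentences were flagged, and how much toxicity was removed
unsafe = [r for r in results if not r["is_safe"]]
avg_drop = sum(r["toxicity_score"] - r["rewritten_toxicity_score"]
               for r in unsafe) / len(unsafe)
print(f"{len(unsafe)} unsafe of {len(results)}, mean toxicity drop {avg_drop:.2f}")
```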

📁 Project Structure

vitoxreduce_pipeline_github/
├── vitoxreduce/                    # Main package
│   ├── pipeline.py                 # Main 3-tier pipeline
│   ├── tier1_toxicity_detector.py # Tier 1: Safety Checker
│   ├── tier2_span_locator.py      # Tier 2: Span Tagger
│   ├── tier3_rewrite_generator.py # Tier 3: Contextual Rewriter
│   └── ...
├── scripts/                        # CLI scripts
│   └── run_pipeline.py            # Main CLI entry point
├── examples/                       # Usage examples
│   └── example_usage.py           # Automated setup & test script
├── dataset/                        # Sample dataset (optional)
└── requirements.txt                # Python dependencies

🔍 Evaluation Metrics

The pipeline supports the following metrics (when reference is available):

  • BLEU: Measures similarity with reference
  • SIM: Semantic similarity (using SBERT)
  • FL: Fluency score (based on GPT-2 PPL)
  • STA: Toxicity Drop (toxicity reduction)
  • J-score: Joint score = SIM × FL × normalized_tox_drop
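The J-score definition above can be sketched as a small function. Note that the normalization of the toxicity drop here (dividing by the original toxicity) is an assumption; the repo may normalize differently:

```python
def j_score(sim, fl, tox_orig, tox_rewritten):
    """J = SIM x FL x normalized toxicity drop.

    The drop is clamped at zero and normalized by the original toxicity
    (an assumption -- the actual normalization may differ)."""
    norm_drop = max(0.0, tox_orig - tox_rewritten) / max(tox_orig, 1e-8)
    return sim * fl * norm_drop

print(round(j_score(sim=0.9, fl=0.8, tox_orig=0.85, tox_rewritten=0.15), 3))
```

Because the three factors are multiplied, a rewrite must score well on all of semantic similarity, fluency, and toxicity reduction to get a high J-score.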

Performance Comparison

(Figure: performance comparison chart)

The above chart shows ViToxReduce's performance compared to baseline methods across key metrics including BLEU, SIM, J-score, and toxicity reduction.

📖 API Documentation

ViToxReducePipeline

Main pipeline class for processing text.

__init__(self, toxicity_detector_model_path, span_locator_model_path, rewriter_model_path, ...)

Initialize the pipeline with model paths.

Parameters:

  • toxicity_detector_model_path (str, required): Path to PhoBERT toxicity classifier
  • span_locator_model_path (str, required): Path to PhoBERT span locator model
  • rewriter_model_path (str, required): Path to BARTpho rewriter model
  • toxicity_threshold (float, optional): Threshold for unsafe classification (default: 0.5)
  • num_beams (int, optional): Number of beams for beam search (default: 5)
  • device (str, optional): Device to use (cuda/cpu, default: auto)

process(self, text, verbose=False)

Process a single sentence through the 3-tier pipeline.

Parameters:

  • text (str): Sentence to process
  • verbose (bool): Print detailed processing information

Returns:

  • dict: Processing result with keys: original, is_safe, toxicity_score, spans, span_texts, rewritten, rewritten_is_safe, rewritten_toxicity_score, processing_time

process_batch(self, texts, verbose=False)

Process a batch of sentences.

Parameters:

  • texts (List[str]): List of sentences to process
  • verbose (bool): Print detailed processing information

Returns:

  • List[dict]: List of processing results

⚠️ Important Notes

  1. Model Paths: All 3 model paths are required. You can use Hugging Face repo IDs (e.g., joshswift/bartpho-rewriter) or local paths.

  2. Tier 1 Optimization: If a sentence is detected as safe, the pipeline returns the original sentence and skips Tier 2 & 3, saving computational resources.

  3. GPU/CPU Auto-Detection: The pipeline automatically detects and uses GPU if available. If no GPU is found, it automatically falls back to CPU. Use --device cpu to force CPU mode or --device cuda to force GPU mode.

  4. Output Files: Results are saved to JSON files containing original text, rewritten text, toxicity scores, spans, and processing statistics.
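The auto-detection described in note 3 typically boils down to a small torch probe like the following sketch (function name and signature are illustrative, not the package's exact code):

```python
def resolve_device(requested="auto", cuda_available=None):
    """Map the --device argument to a concrete device string.

    If cuda_available is None, probe torch (the pipeline depends on
    torch anyway); passing a bool makes the function testable offline."""
    if requested != "auto":
        return requested  # explicit "cuda" or "cpu" is honored as-is
    if cuda_available is None:
        import torch
        cuda_available = torch.cuda.is_available()
    return "cuda" if cuda_available else "cpu"

print(resolve_device("cpu"))
print(resolve_device("auto", cuda_available=False))
```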

🐛 Troubleshooting

Error: "Model not found"

  • Check that model paths are correct
  • Ensure models are trained and saved in the correct format

Error: "Out of memory"

  • Reduce --num_beams
  • Use --device cpu if GPU runs out of memory
  • Reduce batch size when processing

Error: "Span Locator model not loaded"

  • Ensure the correct path is specified with --span_locator_model

Import Errors

  • Ensure all dependencies are installed: pip install -r requirements.txt
  • Check Python version >= 3.7

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 License

Apache License 2.0

See LICENSE file for details.

👥 Authors

ViToxReduce Pipeline - Text Toxicity Reduction System (3-Tier Streamlined Architecture)
