A library for Robust Vietnamese Audio-Based Toxic Span Detection and Censoring


Project links

Model (Hugging Face)


ViToSA 2.0: A MULTI-TASK APPROACH TOWARDS ROBUST VIETNAMESE AUDIO-BASED TOXIC SPAN DETECTION | ICASSP 2026

Official implementation of the paper:
“A Multi-Task Approach Towards Robust Vietnamese Audio-Based Toxic Span Detection” (ICASSP 2026).

This package provides an end-to-end pipeline for Vietnamese speech-based toxic span detection, combining ASR and toxic span detection in a unified model. It also supports automatic audio censoring, replacing toxic spans with beep sounds in the output waveform.

Key Features

Automated Audio Censoring: Takes an input audio file containing toxic language and outputs a clean .wav file where profanity is masked with a beep.
Unified Multi-Task Architecture: Integrates ASR and Toxic Span Detection (TSD) into a single model, reducing inference time.
SOTA Performance: Achieves a macro-F1 of 0.9212 on the ViToSA-v2 dataset using PhoWhisper + BiLSTM-CRF + Knowledge Distillation.
High Efficiency: Reduces inference latency by over 56% compared to traditional pipelines.
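The headline result is a macro-averaged F1 score. To make the metric concrete, here is a minimal standard-library sketch for binary word-level labels (0 = non-toxic, 1 = toxic). This page does not say whether the paper scores at the word, token, or span level, so the label granularity is an assumption, and `macro_f1` is a hypothetical helper, not the official evaluation script.

```python
from typing import Sequence, Tuple


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class (0.0 if there are no true positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(
    y_true: Sequence[int], y_pred: Sequence[int], classes: Tuple[int, ...] = (0, 1)
) -> float:
    """Unweighted mean of per-class F1 scores (assumed labels: 0 = non-toxic, 1 = toxic)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    gold = [0, 0, 1, 1, 0]
    pred = [0, 1, 1, 0, 0]
    print(round(macro_f1(gold, pred), 4))
```

Averaging the per-class scores without weighting means the (usually rarer) toxic class counts as much as the non-toxic class, which is why macro-F1 is a common choice for imbalanced span detection.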

Installation

```
pip install vitosa-speech-II
```

System requirements

This package relies on pydub for audio processing, which requires ffmpeg to be installed.

Ubuntu / Debian
```
sudo apt-get install ffmpeg
```
macOS (Homebrew)
```
brew install ffmpeg
```
Windows: Download ffmpeg from https://ffmpeg.org and add the folder containing ffmpeg.exe to your system PATH.
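Before loading any audio, you can check from Python that an ffmpeg executable is reachable on PATH. This sketch uses only the standard library's `shutil.which`; the helper name `ffmpeg_available` is hypothetical and not part of vitosa-speech-II.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the system PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at:", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it and make sure it is on your PATH.")
```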

Quick Start

This library takes a raw audio file as input and produces a censored audio file as output.

1. Load the Model

The model is pre-trained on the ViToSA-v2 dataset.

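```python
import torch

# Module name assumed to match the wheel name (vitosa_speech_ii);
# adjust if the import fails.
from vitosa_speech_ii import load_my_model

# Automatically detect device (CUDA/CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model and its processor
model, processor = load_my_model(device)
```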

2. Run Inference (Detect & Censor)

```python
# Module name assumed to match the wheel name (vitosa_speech_ii);
# adjust if the import fails.
from vitosa_speech_ii import return_labels, censor_audio_with_beep
from IPython.display import Audio, display  # Optional: to play in a notebook

# Path to your input file
input_audio = "samples/toxic_speech.wav"

# Step 1: Detect toxic spans
words_with_labels = return_labels(input_audio, model, processor, device)

# Step 2: Generate censored audio
# This function creates a new audio file with beeps over the toxic words
output_audio_path = censor_audio_with_beep(
    audio_path=input_audio,
    model=model,
    processor=processor,
    words_with_labels=words_with_labels,
    device=device,
)

# Result
print(f"✅ Censored audio saved to: {output_audio_path}")

# Optional: play the result (in Jupyter/Colab)
# display(Audio(output_audio_path))
```
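If you want the toxic regions as time intervals (for example, to log them or to pass them to your own audio editor), consecutive toxic words can be merged into spans. The structure of `words_with_labels` is not documented on this page, so this sketch assumes a hypothetical list of `(word, start_seconds, end_seconds, label)` tuples with label 1 meaning toxic; `toxic_spans` and `max_gap` are illustrative names, not part of the package.

```python
from typing import List, Tuple

# Hypothetical word format: (word, start_seconds, end_seconds, label), label 1 = toxic.
Word = Tuple[str, float, float, int]


def toxic_spans(words: List[Word], max_gap: float = 0.0) -> List[Tuple[float, float]]:
    """Merge runs of consecutive toxic words into (start, end) spans in seconds.

    Two toxic words are merged only if they are adjacent in `words` and the
    silence between them is at most `max_gap` seconds. A non-toxic word
    always breaks a span, so it is never covered by a merged span.
    """
    spans: List[Tuple[float, float]] = []
    prev_toxic = False
    for _word, start, end, label in words:
        is_toxic = label == 1
        if is_toxic and prev_toxic and start - spans[-1][1] <= max_gap:
            # Extend the previous span instead of starting a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        elif is_toxic:
            spans.append((start, end))
        prev_toxic = is_toxic
    return spans


if __name__ == "__main__":
    demo = [
        ("w1", 0.0, 0.3, 0),
        ("w2", 0.3, 0.6, 1),
        ("w3", 0.6, 0.9, 1),
        ("w4", 0.9, 1.2, 0),
        ("w5", 1.2, 1.5, 1),
    ]
    print(toxic_spans(demo))
```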

Methodology

Our system works in two steps:

Detection: The multi-task model (PhoWhisper + BiLSTM-CRF) processes the audio to identify the exact start and end timestamps of toxic words.
Censoring: We rebuild the audio by keeping the non-toxic segments and placing a generated sine-wave beep over the exact time span of each toxic token, so the rest of the sentence remains intelligible.
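The censoring step can be illustrated with the standard library alone. This is a minimal sketch, not the package's implementation: it assumes mono float samples in [-1, 1], spans given in seconds, and a hypothetical `censor_with_beep` helper with an arbitrary 1 kHz tone. It replaces (rather than mixes) the samples inside each span, in line with the description of replacing toxic spans with beep sounds.

```python
import math
from typing import List, Sequence, Tuple


def censor_with_beep(
    samples: Sequence[float],
    sample_rate: int,
    spans: Sequence[Tuple[float, float]],
    freq_hz: float = 1000.0,
    amplitude: float = 0.5,
) -> List[float]:
    """Return a copy of `samples` with every (start, end) span replaced by a sine beep.

    `samples` are mono floats in [-1, 1]; span times are in seconds.
    Samples outside the spans are copied unchanged.
    """
    out = list(samples)
    n = len(out)
    for start, end in spans:
        # Convert span boundaries from seconds to sample indices, clamped to the signal.
        first = max(0, int(round(start * sample_rate)))
        last = min(n, int(round(end * sample_rate)))
        for i in range(first, last):
            t = (i - first) / sample_rate  # time since the start of this beep
            out[i] = amplitude * math.sin(2 * math.pi * freq_hz * t)
    return out


if __name__ == "__main__":
    sr = 8000
    audio = [0.1] * sr  # one second of a constant placeholder signal
    censored = censor_with_beep(audio, sr, [(0.25, 0.5)])
    print(censored[1999], censored[2002], censored[4000])
```

Because only the samples inside each span change, speech before and after a toxic word is preserved as-is. The package itself relies on pydub for audio I/O; this sketch avoids any audio library so the replacement logic stays visible.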

Contact

For more information, contact luannt@uit.edu.vn.


Release history

vitosa-speech-II 0.0.1 (Jan 24, 2026)

Download files

Source distribution: vitosa_speech_ii-0.0.1.tar.gz (9.3 kB), uploaded Jan 24, 2026

Built distribution: vitosa_speech_ii-0.0.1-py3-none-any.whl (8.2 kB), Python 3, uploaded Jan 24, 2026

File details

Details for the file vitosa_speech_ii-0.0.1.tar.gz.

File metadata

Download URL: vitosa_speech_ii-0.0.1.tar.gz
Upload date: Jan 24, 2026
Size: 9.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for vitosa_speech_ii-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`93ad3fff3f7eef2bbf01bf81ad8cc525d09f71027fa4863a58bd102f66b2c3e2`
MD5	`e83086793953a74fa14ef790ef16ffca`
BLAKE2b-256	`9e87d5e7f4bd25e4f0cc2dd73cdaf6be0442a795d245923f46a182c64936646d`


File details

Details for the file vitosa_speech_ii-0.0.1-py3-none-any.whl.

File metadata

Download URL: vitosa_speech_ii-0.0.1-py3-none-any.whl
Upload date: Jan 24, 2026
Size: 8.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for vitosa_speech_ii-0.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`53de97488242ee30de8ca1a7cbb8f7e95fc8b19b2a10261dee61e307bfedbe7a`
MD5	`25943db129a6e4d0d728e6313e1f6dc2`
BLAKE2b-256	`61f571af7a1c540ac5188477eaf9a86d28313c041377d39b9d6f5ab1ffc47619`

