A text filter that cleanses text of profanity and offensive language, tailored for Ukrainian, Russian, and Surzhyk.

slaviclean

SlaviCleaner is a profanity-filtering library that cleans offensive language from text, tailored specifically for Ukrainian, Russian, and Surzhyk. It detects, masks, and reports offensive words and offers three levels of filtering.

This module uses spaCy for natural language processing and detects profanities in obfuscated words, variant spellings of swear words, and inflected (morphological) forms.

Features

Detects and masks offensive words in Slavic languages (Ukrainian, Russian).
Handles obfuscated, substituted, and morphologically varied forms of profanity.
Provides detailed reporting of detected profanities, including tags that describe each match (e.g., euphemism, vulgar, loanword).
Lets you choose among three filtering levels: complete, basic, minimal.
Supports subtree-level profanity filtering.
Recognizes profanity that is already partially masked (e.g., г***м).
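
To make "obfuscated" and "substituted" forms concrete, here is a minimal sketch of one way such input could be normalized before lookup. It is an illustration only, not the library's implementation: the look-alike table `LOOKALIKES` and the helper names are hypothetical, and the spaced-letter rule is a simple heuristic that can also join legitimate runs of one-letter words.

```python
import re

# Hypothetical look-alike table (an assumption for illustration);
# the library's actual substitution rules are not documented here.
LOOKALIKES = str.maketrans({"4": "ч", "k": "к", "@": "а", "0": "о", "3": "з"})

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
SPACED_LETTERS = re.compile(r"\b(?:\w ){2,}\w\b")  # three or more single letters in a row
TOKEN = re.compile(r"\S+")


def collapse_spaced_letters(text: str) -> str:
    """Join single letters separated by single spaces into one word."""
    return SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


def deobfuscate_token(token: str) -> str:
    """Apply look-alike substitutions only to tokens that contain Cyrillic."""
    if CYRILLIC.search(token):
        return token.translate(LOOKALIKES)
    return token


def normalize(text: str) -> str:
    text = collapse_spaced_letters(text)
    return TOKEN.sub(lambda m: deobfuscate_token(m.group(0)), text)


print(normalize("к у р в а"))  # курва
print(normalize("су4k@"))      # сучка
```

Restricting substitutions to tokens that already contain Cyrillic keeps ordinary Latin words and numbers untouched.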

Installation

To install SlaviCleaner, run:

```shell
pip install slaviclean
```

Usage

Initializing

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner()
```

Initializing with preloads

You can preload the necessary language models for faster processing. The preload option loads the models for the supported languages (uk, ru, surzhyk).

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
```

Core Functions

`get_available_languages()`

Retrieves a set of languages supported by the profanity filter.

Returns:
- A set of language codes (e.g., {'uk', 'ru', 'surzhyk'}).

Example:

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()

print(languages)
# Output (a set, so the printed order may vary): {'uk', 'ru', 'surzhyk'}
```
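
Because `sanitize` takes a language code, it can help to check a user-supplied code against this set first. The following is a minimal sketch, assuming a hypothetical helper `resolve_language` that is not part of slaviclean. It falls back to 'surzhyk', the documented default for `sanitize`.

```python
def resolve_language(requested, available, fallback="surzhyk"):
    """Return a supported language code, or the fallback if the request is unknown.

    `available` is expected to be the set returned by
    scleaner.get_available_languages(); a literal set is used below
    so the sketch runs without the library installed.
    """
    code = (requested or "").strip().lower()
    if code in available:
        return code
    if fallback in available:
        return fallback
    raise ValueError(f"No supported language for {requested!r}")


available = {"uk", "ru", "surzhyk"}
print(resolve_language("UK", available))  # uk
print(resolve_language("pl", available))  # surzhyk
```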

`sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)`

Filters profanities from the given message and returns a detailed report.

Arguments:
- message (str): The input message to filter.
- lang (str): The language of the message (supports 'uk', 'ru', and 'surzhyk', default is 'surzhyk').
- min_subtree_size (float): Minimum size of the token subtree for dependency parsing (default is 3).
- mask_symbol (str): Symbol used to mask profanities (default is '*').
- slevel (str): Severity level of filtering (can be 'complete', 'basic', or 'minimal', default is 'complete').
- analyze_morph (bool): Whether to analyze the morphology of words (default is False).

Returns:
- A SanitizeReport containing the masked message and the list of detected profanities.

Example:

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"

sanitize_report = scleaner.sanitize(message, lang='uk')

print(sanitize_report)
# Output:
#   SanitizeReport(
#      message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась', 
#      masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась', 
#      profanities=[
#           Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']), 
#           Profanity(span=(36, 41), nearest='г***м', tags=['masked']), 
#           Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])
#
# Note: in this output the spaced-out "к у р в а" from the input appears
# collapsed to "курва" in the reported message, and the spans index into
# that reported message.
```
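
The spans in the report above can be used to apply your own masking. The sketch below assumes, based only on the example output, that each span is a half-open `(start, end)` character range into the report's `message`. The `Profanity` dataclass is a stand-in that mirrors the fields shown in the output so the sketch runs without the library, and `mask_spans` is a hypothetical helper, not a slaviclean function.

```python
from dataclasses import dataclass


@dataclass
class Profanity:
    """Stand-in mirroring the fields shown in the example output."""
    span: tuple
    nearest: str
    tags: list


def mask_spans(message, profanities, mask_symbol="*"):
    """Replace each character inside a reported span with mask_symbol."""
    chars = list(message)
    for p in profanities:
        start, end = p.span
        # One mask symbol per original character keeps later spans aligned.
        chars[start:end] = [mask_symbol] * (end - start)
    return "".join(chars)


reported = "От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
found = [
    Profanity((9, 14), "курва", ["vulgar", "euphemism", "loanword"]),
    Profanity((36, 41), "г***м", ["masked"]),
    Profanity((79, 84), "сучка", ["insulting", "slur", "vulgar"]),
]

masked = mask_spans(reported, found)
print(masked)
# От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась
```

Applied to the example, this reproduces the `masked_message` shown in the report, which is consistent with the half-open span assumption.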

Available Severity Levels

- complete: Cleans all profanities, including euphemisms, vulgarities, and loanwords.
- basic: Cleans the more aggressive profanities but leaves euphemisms unmasked.
- minimal: Cleans only the most offensive words.
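
One way to picture the levels is as a policy over the tags that appear in a report. The sketch below is purely hypothetical: this README does not document how the library maps tags to levels, so the `MOST_OFFENSIVE` set and the `should_mask` helper are assumptions made for illustration, using only tag names that appear in the example output.

```python
# Assumption: which tags count as "most offensive" is a guess for this sketch.
MOST_OFFENSIVE = {"slur", "insulting"}


def should_mask(tags, slevel="complete"):
    """Decide whether a match with these tags would be masked at a given level."""
    tags = set(tags)
    if slevel == "complete":
        return True                      # everything that was detected
    if slevel == "basic":
        return tags != {"euphemism"}     # skip matches that are only euphemisms
    if slevel == "minimal":
        return bool(tags & MOST_OFFENSIVE)
    raise ValueError(f"Unknown severity level: {slevel!r}")


print(should_mask(["euphemism"], "basic"))                    # False
print(should_mask(["insulting", "slur", "vulgar"], "minimal"))  # True
```

A policy like this can also be applied on the caller's side, for example to filter the `profanities` list of a report produced at the complete level.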

Supported Languages

SlaviCleaner currently supports the following languages:

Ukrainian (uk)
Russian (ru)
Surzhyk (surzhyk)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

The spaCy library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
The pymorphy3 library is used for morphological analysis.

Release history

0.1.1: Oct 27, 2025
0.1.0: Aug 20, 2025
0.0.6: Feb 13, 2025
0.0.5: Feb 13, 2025
0.0.4: Feb 13, 2025
0.0.3: Feb 12, 2025 (this version)
0.0.2: Feb 12, 2025
0.0.1: Feb 12, 2025

