A text filter that removes profanity and offensive language, tailored for Ukrainian, Russian, and Surzhyk.
slaviclean
SlaviCleaner is a profanity-filtering library tailored for Ukrainian and Russian. It detects, masks, and reports offensive words, and offers three levels of filtering.
The module uses spaCy for natural language processing and detects obfuscated words, variants of swear words, and their morphological forms.
Features
- Detects and masks offensive words in Slavic languages (Ukrainian, Russian).
- Handles obfuscated, substituted, and morphologically varied forms of profanity.
- Provides detailed reporting of profanities detected, including the type of relationship (e.g., euphemism, vulgar, loanword).
- Allows customization of the filtering level with three options: complete, basic, minimal.
- Offers support for subtree-level profanity filtering.
- Handles masked and obfuscated profanity patterns effectively.
Installation
To install SlaviCleaner, run:
pip install slaviclean
Usage
Initializing
from slaviclean import SlaviCleaner
scleaner = SlaviCleaner()
Initializing with preloads
You can preload the necessary language models for faster processing.
The preload option loads the models for the supported languages (uk, ru, surzhyk).
from slaviclean import SlaviCleaner
scleaner = SlaviCleaner(preload=True)
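Since preloading exists to make later calls faster, it can help to build the cleaner once and share it. The following is a minimal sketch, not part of the library: `get_cleaner` is a hypothetical helper name, and the only library call it relies on is `SlaviCleaner(preload=True)` as shown above.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_cleaner():
    """Create the preloaded cleaner on first use and reuse it afterwards.

    Hypothetical helper: the name and the caching approach are assumptions,
    not something slaviclean itself provides.
    """
    # Imported lazily so that importing this module does not load any models.
    from slaviclean import SlaviCleaner

    return SlaviCleaner(preload=True)
```

Every call to `get_cleaner()` after the first returns the same instance, so the models are loaded only once per process.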
Core Functions
get_available_languages()
Retrieves a set of languages supported by the profanity filter.
- Returns:
  - A set of language codes (e.g., {'uk', 'ru', 'surzhyk'}).
- Example:
from slaviclean import SlaviCleaner
scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()
print(languages)
# Output: {'uk', 'ru', 'surzhyk'}
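The returned set can be used to validate a language code before filtering. The sketch below is illustrative only: `pick_language` is a hypothetical helper, and the choice of `'surzhyk'` as the fallback simply mirrors the documented default of `sanitize`. A literal set stands in for the result of `scleaner.get_available_languages()` so the snippet runs without loading any models.

```python
def pick_language(requested, available, fallback="surzhyk"):
    """Return a supported language code, falling back when the request is unknown.

    Hypothetical helper, not part of slaviclean.
    """
    code = (requested or "").strip().lower()
    if code in available:
        return code
    if fallback in available:
        return fallback
    raise ValueError(f"Neither {requested!r} nor fallback {fallback!r} is supported")


# Stand-in for scleaner.get_available_languages()
available = {"uk", "ru", "surzhyk"}

print(pick_language("UK", available))  # uk
print(pick_language("pl", available))  # surzhyk
```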
sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)
Filters profanities from the given message and returns a detailed report.
- Arguments:
  - message (str): The input message to filter.
  - lang (str): The language of the message (supports 'uk', 'ru', and 'surzhyk'; default is 'surzhyk').
  - min_subtree_size (float): Minimum size of the token subtree for dependency parsing (default is 3).
  - mask_symbol (str): Symbol used to mask profanities (default is '*').
  - slevel (str): Severity level of filtering (can be 'complete', 'basic', or 'minimal'; default is 'complete').
  - analyze_morph (bool): Whether to analyze the morphology of words (default is False).
- Returns:
  - A SanitizeReport containing the masked message and a list of detected profanities.
- Example:
from slaviclean import SlaviCleaner
scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
sanitize_report = scleaner.sanitize(message, lang='uk')
print(sanitize_report)
# Output:
# SanitizeReport(
#     message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась',
#     masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась',
#     profanities=[
#         Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']),
#         Profanity(span=(36, 41), nearest='г***м', tags=['masked']),
#         Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])
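In the output above, each span appears to be a half-open (start, end) range into the report's `message` field: (9, 14) covers 'курва'. A minimal sketch of post-processing a report follows. It assumes the fields shown in the printed output (`message`, `profanities`, `span`, `tags`) are readable as attributes, which the source does not confirm. `FakeReport` and `FakeProfanity` are stand-ins defined here so the snippet runs without the library; `flagged_fragments` and `tag_counts` are hypothetical helpers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeProfanity:
    """Stand-in mirroring the fields printed for Profanity (assumed attributes)."""
    span: tuple
    nearest: str
    tags: list


@dataclass
class FakeReport:
    """Stand-in mirroring the fields printed for SanitizeReport (assumed attributes)."""
    message: str
    masked_message: str
    profanities: list


def flagged_fragments(report):
    """Slice each detected fragment out of the report's message."""
    return [report.message[start:end] for start, end in (p.span for p in report.profanities)]


def tag_counts(report):
    """Count how often each tag occurs across all detected profanities."""
    return Counter(tag for p in report.profanities for tag in p.tags)


report = FakeReport(
    message="От же ж, курва, страхуй",
    masked_message="От же ж, *****, страхуй",
    profanities=[FakeProfanity((9, 14), "курва", ["vulgar", "euphemism", "loanword"])],
)

print(flagged_fragments(report))  # ['курва']
print(tag_counts(report)["vulgar"])  # 1
```

If attribute access works as assumed, the same helpers could be applied to the object returned by `scleaner.sanitize(message, lang='uk')`.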
Available Severity Levels
- complete: Cleans all profanities, including euphemisms, vulgarities, and loanwords.
- basic: Cleans more aggressive profanity, without including euphemisms.
- minimal: Only cleans the most offensive words.
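When the severity level comes from user input or configuration, it may be worth checking it against the three documented values before passing it as `slevel`. This is a small sketch under that assumption: `check_slevel` is a hypothetical helper, and how the library itself reacts to an unknown level is not documented here.

```python
SEVERITY_LEVELS = ("complete", "basic", "minimal")  # values listed above


def check_slevel(value, default="complete"):
    """Normalize a severity level and reject anything outside the documented set.

    Hypothetical helper, not part of slaviclean.
    """
    if value is None:
        return default
    level = str(value).strip().lower()
    if level not in SEVERITY_LEVELS:
        raise ValueError(f"Unknown severity level {value!r}; expected one of {SEVERITY_LEVELS}")
    return level


print(check_slevel(" Basic "))  # basic
print(check_slevel(None))       # complete
```

The result can then be passed along, for example `scleaner.sanitize(message, lang='uk', slevel=check_slevel(setting))`.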
Supported Languages
SlaviCleaner currently supports the following languages:
- Ukrainian (uk)
- Russian (ru)
- Surzhyk (surzhyk)
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- The spaCy library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
- The pymorphy3 library is used for morphological analysis.