A text filter that removes profanity and offensive language, tailored for Ukrainian, Russian, and Surzhyk.

slaviclean

SlaviCleaner is a profanity-filtering library for Ukrainian, Russian, and Surzhyk text. It detects, masks, and reports offensive words, and offers three levels of filtering.

This module uses spaCy for natural language processing. It detects obfuscated words, variants of swear words, and their morphological forms.

Features

  • Detects and masks offensive words in Slavic languages (Ukrainian, Russian, and Surzhyk).
  • Handles masked, obfuscated, substituted, and morphologically varied forms of profanity.
  • Reports each detected profanity in detail, including tags that classify it (e.g., euphemism, vulgar, loanword).
  • Lets you choose one of three filtering levels: complete, basic, or minimal.
  • Supports subtree-level profanity filtering.
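
The obfuscation handling listed above can be pictured as a small normalization pass. The following is only a sketch of the general technique, not SlaviCleaner's implementation: the substitution table, the regular expression, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical look-alike table; a real filter would need a far larger one,
# including Cyrillic look-alikes.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "@": "a", "$": "s"})

# Three or more single characters separated by single spaces, e.g. "w o r d".
SPACED_LETTERS = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced(text):
    """Join letters that were spread out with spaces back into one token."""
    return SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


def deobfuscate_token(token):
    """Apply look-alike substitutions, but leave purely numeric tokens alone."""
    if any(ch.isalpha() for ch in token):
        return token.translate(LEET_MAP)
    return token


def normalize(text):
    """Collapse spaced-out words, then undo character substitutions per token."""
    collapsed = collapse_spaced(text)
    return " ".join(deobfuscate_token(tok) for tok in collapsed.split(" "))
```

With these definitions, `normalize("w o r d")` gives `"word"` and `normalize("h3ll0 2024")` gives `"hello 2024"`. The normalized text would then be matched against a word list, which this sketch leaves out.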

Installation

To install SlaviCleaner, run:

pip install slaviclean

Usage

Initializing

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner()

Initializing with preloads

You can preload the language models for faster processing. With preload=True, the models for all supported languages (uk, ru, surzhyk) are loaded when the cleaner is created.

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)

Core Functions

get_available_languages()

Retrieves a set of languages supported by the profanity filter.

  • Returns:

    • A set of language codes (e.g., {'uk', 'ru', 'surzhyk'}).
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()

print(languages)
# Output (set order may vary): {'uk', 'ru', 'surzhyk'}

sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)

Filters profanities from the given message and returns a detailed report.

  • Arguments:

    • message (str): The input message to filter.
    • lang (str): The language of the message (supports 'uk', 'ru', and 'surzhyk', default is 'surzhyk').
    • min_subtree_size (float): Minimum size of the token subtree for dependency parsing (default is 3).
    • mask_symbol (str): Symbol used to mask profanities (default is '*').
    • slevel (str): Severity level of filtering (can be 'complete', 'basic', or 'minimal', default is 'complete').
    • analyze_morph (bool): Whether to analyze the morphology of words (default is False).
  • Returns:

    • A SanitizeReport containing the message, the masked message, and a list of detected profanities, each with its span, nearest match, and tags.
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
 
sanitize_report = scleaner.sanitize(message, lang='uk')

print(sanitize_report)  
# Output: 
#   SanitizeReport(
#      message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась', 
#      masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась', 
#      profanities=[
#           Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']), 
#           Profanity(span=(36, 41), nearest='г***м', tags=['masked']), 
#           Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])
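
To show how the fields in the report above might be consumed, here is a minimal sketch. It does not import slaviclean: the `Profanity` dataclass is a stand-in that mirrors only the fields printed above, and `mask_spans` and `without_tag` are hypothetical helpers. It assumes spans are half-open `(start, end)` character offsets, which is consistent with the five-character match at `(9, 14)` in the example.

```python
from dataclasses import dataclass, field


@dataclass
class Profanity:
    """Stand-in mirroring the printed fields; not the class shipped by slaviclean."""
    span: tuple
    nearest: str
    tags: list = field(default_factory=list)


def mask_spans(text, profanities, mask_symbol="*"):
    """Overwrite every reported span with mask_symbol, keeping the text length."""
    chars = list(text)
    for p in profanities:
        start, end = p.span  # assumed half-open: text[start:end]
        chars[start:end] = mask_symbol * (end - start)
    return "".join(chars)


def without_tag(profanities, tag):
    """Drop findings that carry the given tag, e.g. 'euphemism'."""
    return [p for p in profanities if tag not in p.tags]


# Neutral placeholder data standing in for a real report.
text = "one BAD two UGLY"
found = [
    Profanity(span=(4, 7), nearest="bad", tags=["vulgar"]),
    Profanity(span=(12, 16), nearest="ugly", tags=["euphemism"]),
]
masked_all = mask_spans(text, found)
masked_strict_only = mask_spans(text, without_tag(found, "euphemism"))
```

Here `masked_all` is `"one *** two ****"` and `masked_strict_only` is `"one *** two UGLY"`. Filtering by tag on the caller's side is a separate mechanism from the `slevel` argument and is not guaranteed to give the same result.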

Available Severity Levels

  • complete
    Cleans all profanities, including euphemisms, vulgarities, and loanwords.
  • basic
    Cleans stronger profanity but leaves euphemisms untouched.
  • minimal
    Only cleans the most offensive words.

Supported Languages

SlaviCleaner currently supports the following languages:

  • Ukrainian (uk)
  • Russian (ru)
  • Surzhyk (surzhyk)
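
Since `lang` and `slevel` are plain strings, a caller may want to check them before invoking `sanitize`. The helper below is a hypothetical caller-side sketch built only from the values documented above; whether the library validates these arguments itself is not stated here, so treat it as an optional guard.

```python
# Values copied from the documentation above.
SUPPORTED_LANGS = {"uk", "ru", "surzhyk"}
SEVERITY_LEVELS = {"complete", "basic", "minimal"}


def check_sanitize_args(lang="surzhyk", slevel="complete"):
    """Return keyword arguments for sanitize(), or raise ValueError on a bad value."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}")
    if slevel not in SEVERITY_LEVELS:
        raise ValueError(f"unknown slevel {slevel!r}; expected one of {sorted(SEVERITY_LEVELS)}")
    return {"lang": lang, "slevel": slevel}
```

The returned dictionary could then be unpacked into the call, for example `scleaner.sanitize(message, **check_sanitize_args("uk", "basic"))`.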

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • The spaCy library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
  • The pymorphy3 library is used for morphological analysis.
