Text filter that cleanses text of profanity and offensive language, tailored for Ukrainian, Russian, and Surzhyk.

slaviclean

SlaviCleaner is a profanity-filtering library for Ukrainian, Russian, and Surzhyk text. It detects, masks, and reports offensive words and offers three levels of filtering.

This module uses spaCy for natural language processing and detects profanities even when they appear as obfuscated words, variants of swear words, or inflected (morphological) forms.

Features

  • Detects and masks offensive words in Ukrainian, Russian, and Surzhyk text.
  • Handles masked, obfuscated, substituted, and morphologically varied forms of profanity.
  • Reports each detected profanity in detail, including tags that describe its type (e.g., euphemism, vulgar, loanword).
  • Supports three filtering levels: complete, basic, and minimal.
  • Supports subtree-level profanity filtering.

Installation

To install SlaviCleaner, run:

pip install slaviclean

Usage

Initializing

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner()

Initializing with preloads

You can preload the necessary language models for faster processing. The preload option loads the models for the supported languages (uk, ru, surzhyk).

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
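
If preloading is the expensive step, one way to avoid repeating it is to build the cleaner once and reuse it across calls. The sketch below is only an illustration: `get_cleaner` is a hypothetical helper, not part of slaviclean, and it assumes that a single SlaviCleaner instance can safely be shared within one process.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_cleaner():
    """Return one shared, preloaded cleaner (hypothetical helper, not library API)."""
    # Imported lazily so that merely importing this module does not load any models.
    from slaviclean import SlaviCleaner

    return SlaviCleaner(preload=True)


# Usage: every caller receives the same instance.
# scleaner = get_cleaner()
```

The first call pays the loading cost; later calls return the cached instance.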

Core Functions

get_available_languages()

Retrieves a set of languages supported by the profanity filter.

  • Returns:

    • A set of language codes (e.g., {'uk', 'ru', 'surzhyk'}).
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()

print(languages)  
# Output: {'uk', 'ru', 'surzhyk'}
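
The returned set can be used to validate a language code before filtering. The following is a minimal sketch: `resolve_language` is a hypothetical helper, and falling back to 'surzhyk' is an assumption based on that code being the documented default for sanitize().

```python
def resolve_language(cleaner, requested, fallback="surzhyk"):
    """Return `requested` if the cleaner supports it, otherwise `fallback`.

    `cleaner` is any object exposing get_available_languages(),
    such as a SlaviCleaner instance.
    """
    available = cleaner.get_available_languages()
    if requested in available:
        return requested
    if fallback in available:
        return fallback
    raise ValueError(
        f"Neither {requested!r} nor {fallback!r} is supported; "
        f"available: {sorted(available)}"
    )


# Usage (assumes slaviclean is installed):
# from slaviclean import SlaviCleaner
# scleaner = SlaviCleaner(preload=True)
# lang = resolve_language(scleaner, "pl")
```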

sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)

Filters profanities from the given message and returns a detailed report.

  • Arguments:

    • message (str): The input message to filter.
    • lang (str): The language of the message (supports 'uk', 'ru', and 'surzhyk', default is 'surzhyk').
    • min_subtree_size (float): Minimum size of the token subtree for dependency parsing (default is 3).
    • mask_symbol (str): Symbol used to mask profanities (default is '*').
    • slevel (str): Severity level of filtering (can be 'complete', 'basic', or 'minimal', default is 'complete').
    • analyze_morph (bool): Whether to analyze the morphology of words (default is False).
  • Returns:

    • A SanitizeReport containing the masked message and list of detected profanities.
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
 
sanitize_report = scleaner.sanitize(message, lang='uk')

print(sanitize_report)  
# Output: 
#   SanitizeReport(
#      message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась', 
#      masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась', 
#      profanities=[
#           Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']), 
#           Profanity(span=(36, 41), nearest='г***м', tags=['masked']), 
#           Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])
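
In the output above, each span reads as a half-open (start, end) index pair into the report's message field: message[9:14] is 'курва' and message[79:84] is 'су4k@'. The sketch below builds on that observation. It assumes that the report and its profanities expose message, profanities, span, nearest, and tags as attributes, as the printed representation suggests; `flagged_fragments` and `tag_counts` are hypothetical helpers, not part of the library.

```python
from collections import Counter


def flagged_fragments(report):
    """Pair each flagged slice of report.message with its nearest known form."""
    fragments = []
    for profanity in report.profanities:
        start, end = profanity.span  # assumed half-open indices into report.message
        fragments.append((report.message[start:end], profanity.nearest))
    return fragments


def tag_counts(report):
    """Count how often each tag occurs across all detected profanities."""
    return Counter(tag for p in report.profanities for tag in p.tags)


# Usage (assumes slaviclean is installed):
# report = scleaner.sanitize(message, lang='uk')
# for fragment, nearest in flagged_fragments(report):
#     print(fragment, '->', nearest)
# print(tag_counts(report).most_common())
```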

Available Severity Levels

  • complete
    Cleans all profanities, including euphemisms, vulgarities, and loanwords.
  • basic
    Cleans the more aggressive profanity but leaves euphemisms in place.
  • minimal
    Cleans only the most offensive words.
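
To see how the levels differ on your own text, you can run the same message through each one and compare the masked results. This is a minimal sketch: `compare_levels` is a hypothetical helper, and it assumes that sanitize() accepts lang and slevel as keyword arguments and that the report exposes masked_message as an attribute.

```python
LEVELS = ("complete", "basic", "minimal")


def compare_levels(cleaner, message, lang="surzhyk"):
    """Return a dict mapping each severity level to the masked message it produces."""
    return {
        level: cleaner.sanitize(message, lang=lang, slevel=level).masked_message
        for level in LEVELS
    }


# Usage (assumes slaviclean is installed):
# from slaviclean import SlaviCleaner
# scleaner = SlaviCleaner(preload=True)
# for level, masked in compare_levels(scleaner, message, lang='uk').items():
#     print(f"{level:>8}: {masked}")
```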

Supported Languages

SlaviCleaner currently supports the following languages:

  • Ukrainian (uk)
  • Russian (ru)
  • Surzhyk (surzhyk)

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • The spaCy library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
  • The pymorphy3 library is used for morphological analysis.
