
A text filter that removes profanity and offensive language, tailored to Ukrainian, Russian, and Surzhyk.


slaviclean


SlaviCleaner is a profanity-filtering library that removes offensive language from text, tailored to Ukrainian, Russian, and Surzhyk. It detects, masks, and reports offensive words, and offers several levels of filtering.

The module uses spaCy for natural language processing. It detects obfuscated words, variant spellings of swear words, and inflected (morphological) forms.
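To make "obfuscated words" concrete, here is a minimal sketch of two common normalization steps: joining letters that were spaced out, and mapping look-alike characters back to Cyrillic. It is not SlaviCleaner's implementation. The names `normalize`, `SPACED`, and `LOOKALIKES` and the tiny substitution table are assumptions made for illustration.

```python
import re

# Hypothetical look-alike table; a real filter would need a much larger one.
LOOKALIKES = str.maketrans({"4": "ч", "k": "к", "@": "а", "0": "о", "3": "з"})

# Three or more single word characters separated by single spaces.
SPACED = re.compile(r"(?<!\w)(?:\w ){2,}\w(?!\w)")


def has_cyrillic(token):
    """Return True if the token contains at least one Cyrillic character."""
    return any("\u0400" <= ch <= "\u04ff" for ch in token)


def normalize(text):
    """Undo two simple obfuscations: spaced-out letters and look-alikes."""
    # Step 1: join spaced-out letters into one word.
    text = SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)
    # Step 2: map look-alikes only in tokens that already contain Cyrillic,
    # so standalone numbers and Latin words are left untouched.
    tokens = [
        t.translate(LOOKALIKES) if has_cyrillic(t) else t
        for t in text.split(" ")
    ]
    return " ".join(tokens)


print(normalize("ще й сумка, су4k@, відірвалась"))
```

A rule this simple also joins runs of genuine one-letter words, which is one reason a real filter needs dictionary and morphology checks on top of it.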

Features

  • Detects and masks offensive words in Slavic languages (Ukrainian, Russian, and Surzhyk).
  • Handles obfuscated, masked, substituted, and morphologically varied forms of profanity.
  • Reports each detected profanity in detail, including descriptive tags (e.g., euphemism, vulgar, loanword).
  • Offers three filtering levels: complete, basic, and minimal.
  • Supports subtree-level profanity filtering.

Installation

To install SlaviCleaner, run:

pip install slaviclean

Usage

Initializing

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner()

Initializing with preloads

You can preload the necessary language models for faster processing. The preload option loads the models for the supported languages (uk, ru, surzhyk).

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)

Core Functions

get_available_languages()

Retrieves a set of languages supported by the profanity filter.

  • Returns:

    • A set of language codes (e.g., {'uk', 'ru', 'surzhyk'}).
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()

print(languages)  
# Output (set order may vary): {'uk', 'ru', 'surzhyk'}
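The returned set can be used to validate a language code supplied by a user before filtering. Below is a minimal sketch. The helper `pick_language` is hypothetical, and the set literal stands in for the value returned by `get_available_languages()`. Falling back to 'surzhyk' mirrors the documented default of `sanitize`.

```python
def pick_language(requested, available, fallback="surzhyk"):
    """Return a supported language code, or the fallback if unsupported."""
    code = requested.strip().lower()
    if code in available:
        return code
    if fallback in available:
        return fallback
    raise ValueError(f"Neither {requested!r} nor {fallback!r} is supported")


# Stand-in for scleaner.get_available_languages()
available = {"uk", "ru", "surzhyk"}

print(pick_language(" UK ", available))
print(pick_language("pl", available))
```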

sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)

Filters profanities from the given message and returns a detailed report.

  • Arguments:

    • message (str): The input message to filter.
    • lang (str): The language of the message (supports 'uk', 'ru', and 'surzhyk', default is 'surzhyk').
    • min_subtree_size (float): Minimum size of the token subtree for dependency parsing (default is 3).
    • mask_symbol (str): Symbol used to mask profanities (default is '*').
    • slevel (str): Severity level of filtering (can be 'complete', 'basic', or 'minimal', default is 'complete').
    • analyze_morph (bool): Whether to analyze the morphology of words (default is False).
  • Returns:

    • A SanitizeReport containing the message, the masked message, and the list of detected profanities.
  • Example:

from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
 
sanitize_report = scleaner.sanitize(message, lang='uk')

print(sanitize_report)  
# Output: 
#   SanitizeReport(
#      message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась', 
#      masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась', 
#      profanities=[
#           Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']), 
#           Profanity(span=(36, 41), nearest='г***м', tags=['masked']), 
#           Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])
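In the output above, each span reads as a half-open `(start, end)` character offset into the report's `message`, with one mask symbol per masked character. This is inferred from the example, not confirmed by the library. The sketch below rebuilds the masked text from those spans using only the standard library. `mask_spans` is a hypothetical helper, not part of SlaviCleaner.

```python
def mask_spans(message, spans, mask_symbol="*"):
    """Mask each half-open (start, end) slice with a one-character symbol."""
    chars = list(message)
    for start, end in spans:
        # One mask symbol per character keeps the message length unchanged.
        chars[start:end] = mask_symbol * (end - start)
    return "".join(chars)


# Values copied from the report shown above.
message = "От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
spans = [(9, 14), (36, 41), (79, 84)]

masked = mask_spans(message, spans)
print(masked)
```

Because the message length is preserved, the same spans remain valid offsets into the masked text.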

Available Severity Levels

  • complete
    Cleans all profanities, including euphemisms, vulgarities, and loanwords.
  • basic
    Cleans the more aggressive profanities but leaves euphemisms in place.
  • minimal
    Cleans only the most offensive words.
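The three levels are coarse presets. For finer control, one option is to filter at 'complete' and then select reported entries by tag in the calling code. The sketch below is only an illustration. `Finding` is a hypothetical stand-in for the report's `Profanity` entries, and `with_any_tag` is a hypothetical helper. The tag names are taken from the example output. How `slevel` maps onto tags is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in mirroring the fields shown in the example report."""
    span: tuple
    nearest: str
    tags: list


def with_any_tag(findings, wanted):
    """Keep findings that carry at least one of the wanted tags."""
    wanted = set(wanted)
    return [f for f in findings if wanted & set(f.tags)]


# Entries copied from the example report.
findings = [
    Finding((9, 14), "курва", ["vulgar", "euphemism", "loanword"]),
    Finding((36, 41), "г***м", ["masked"]),
    Finding((79, 84), "сучка", ["insulting", "slur", "vulgar"]),
]

slurs = with_any_tag(findings, {"slur", "insulting"})
print([f.span for f in slurs])
```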

Supported Languages

SlaviCleaner currently supports the following languages:

  • Ukrainian (uk)
  • Russian (ru)
  • Surzhyk (surzhyk)

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • The spaCy library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
  • The pymorphy3 library is used for morphological analysis.
