A profanity filter for Russian comments.

About

check-swear is a machine learning and regular expression-based library designed to detect and filter profanity in text-based communication. Initially aimed at monitoring and improving the language used in school and student chats, check-swear offers a versatile solution that can be integrated into various environments requiring profanity filtering.

Features

  • Machine Learning Driven: Uses an SVM (Support Vector Machine) classification algorithm to capture context and nuance, aiming for high accuracy in detecting offensive language.
  • Regular Expression Support: Incorporates a comprehensive set of regular expressions to catch commonly used profane words and phrases.
  • Customizable Filters: Offers the flexibility to customize and extend the list of profane words based on the specific needs of different user groups or cultural sensitivities.
  • Easy Integration: Designed with simplicity in mind, check-swear can be easily integrated into chat applications, forums, and any platform requiring content moderation.

Getting Started

To get started with check-swear, simply install the package via pip:

pip install check-swear

Note on Importing the Library

Although the library is named check-swear, you import it with an underscore (_) in place of the hyphen (-). This is a common convention in Python packaging: a hyphen is not a valid character in a Python identifier, so module and package names cannot contain one, and the importable package is named check_swear.

import check_swear
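
As a quick sanity check after installation, the sketch below derives the import name from the distribution name and tests whether Python can locate that module. The helper names import_name and is_importable are illustrative and not part of check-swear. The hyphen-to-underscore mapping is only a convention, and some packages use an unrelated import name.

```python
import importlib.util


def import_name(distribution_name: str) -> str:
    """Guess the import name by swapping hyphens for underscores (a convention, not a rule)."""
    return distribution_name.replace("-", "_")


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


name = import_name("check-swear")
print(name)                 # check_swear
print(is_importable(name))  # True only if the package is installed
```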

Usage

from check_swear import SwearingCheck

sch = SwearingCheck() # create filter

rude_comment = "а не пошел бы ты нахуй, дружище"
friendly_comment = "svm - алгоритм машинного обучения"

sch.predict(rude_comment)
# [1]

sch.predict_proba(rude_comment)
# [0.9822432776183899]

sch.predict(friendly_comment)
# [0]

sch.predict_proba(friendly_comment)
# [0.027772391001567764]
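
In a moderation pipeline, the probability is usually compared against a cut-off. The following is a minimal sketch, assuming predict_proba returns one probability per comment when given a list of strings. The flag_comments helper, the 0.5 threshold, and the stub scorer are illustrative and not part of check-swear.

```python
from typing import Callable, List, Sequence, Tuple


def flag_comments(
    comments: Sequence[str],
    score: Callable[[List[str]], Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Pair each comment with its probability and a flag set when probability >= threshold."""
    probabilities = score(list(comments))
    if len(probabilities) != len(comments):
        raise ValueError("expected one probability per comment")
    return [
        (comment, float(p), float(p) >= threshold)
        for comment, p in zip(comments, probabilities)
    ]


def stub_score(texts: List[str]) -> List[float]:
    """Stand-in scorer so the sketch runs without the library installed."""
    return [0.98 if "rude" in text else 0.03 for text in texts]


for comment, probability, flagged in flag_comments(
    ["a rude remark", "a friendly remark"], stub_score
):
    print(flagged, round(probability, 2), comment)
```

With the library installed, SwearingCheck().predict_proba could be passed as the score argument in place of the stub, provided the one-probability-per-comment assumption holds for your settings.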

Model and Regular Expression Checks

The library uses a pre-trained SVM (Support Vector Machine) model for profanity detection. The model classifies text well but is not flawless. To improve accuracy, each comment is first scanned with two sets of regular expressions that catch clear profanity patterns, and only then processed by the machine learning model. To bypass this regex pre-check, pass reg_pred=False when creating the filter.

clear_ml_sch = SwearingCheck(reg_pred=False)

hard2detect = "а вот это охуеньчик))"

clear_ml_sch.predict_proba(hard2detect)
# [0.02542796]

sch.predict_proba(hard2detect)
# [0.5127139801037626]

Understanding Probability Scores: Benign comments sometimes contain character sequences that resemble profane words, which can lead the filter to assign a probability of roughly 30% that the comment is offensive. This is a cautious signal of possible profanity rather than a verdict. If the regular expressions match an entry in the default list or in any custom list you supply, the probability jumps to around 50%, reflecting stronger suspicion. Although the model was trained on more than 700,000 comments, the nuances of language and the ever-evolving lexicon of slang can still elude it. We are committed to continuously expanding our dataset of profane words and phrases.
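
The two outputs in the example above (0.02542796 without the regex pre-check, 0.5127139801037626 with it) are consistent with one simple rule: when a regular expression matches, the model probability p is mapped to 0.5 + p / 2, so the score cannot fall below 50%. The sketch below illustrates that rule. It is an inference from the printed numbers, not a confirmed description of the library's internals, and combine_scores is a hypothetical name.

```python
def combine_scores(ml_probability: float, regex_matched: bool) -> float:
    """Hypothetical rule: a regex match lifts the model score into the range [0.5, 1.0]."""
    if not 0.0 <= ml_probability <= 1.0:
        raise ValueError("ml_probability must be between 0 and 1")
    if regex_matched:
        return 0.5 + ml_probability / 2
    return ml_probability


# Model-only score from the example above, combined with a regex match.
print(combine_scores(0.02542796, regex_matched=True))   # approximately 0.5127
print(combine_scores(0.02542796, regex_matched=False))  # unchanged model score
```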

Additional Features of check-swear

  • Custom Stop Words List: Enhance regular expression detection by adding your own list of stop words.

  • Flexible Input Formats: The model accepts both single strings and lists of strings for analysis.

  • Bin Parameter: Divide large texts into manageable parts (bins) for efficient processing.

  • Transliteration Support: The library understands transliteration, recognizing Russian words written with English letters, making it robust in handling a variety of text inputs.
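
One way to picture a custom stop-word check is a single case-insensitive pattern built from the supplied words. The sketch below shows that general technique under the assumption that stop words are matched as substrings. It is not the library's actual matching logic, and build_stop_word_pattern is a hypothetical helper.

```python
import re
from typing import Iterable, Pattern


def build_stop_word_pattern(stop_words: Iterable[str]) -> Pattern[str]:
    """Compile one case-insensitive alternation from literal stop words."""
    escaped = [re.escape(word) for word in stop_words if word]
    if not escaped:
        return re.compile(r"(?!)")  # empty list: a pattern that never matches
    return re.compile("|".join(escaped), re.IGNORECASE)


pattern = build_stop_word_pattern(["питон"])
print(bool(pattern.search("но твой проект на Питоне")))  # True (substring, any case)
print(bool(pattern.search("всем привет")))               # False
```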

adv_sch = SwearingCheck(reg_pred=True, bins=3, stop_words=["питон"])

long_comment = "буду с тобой асболютно честен но твой проект на питоне это просто абсолютно полная hueta.."

adv_sch.predict_proba(long_comment)
# [0.02110824940143035, 0.5090685358094555, 0.9741733209291503]

adv_sch.output_text_
# ['буду с тобой асболютно честен', 'но твой проект на питоне', 'это просто абсолютно полная hueta..']

# list of strings
array_comment = ["всем привет", "ты s__УкА blYa"]

adv_sch.predict_proba(array_comment)
# [0.023436897211045367, 0.9999479672960417]
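
The three parts shown in output_text_ above are what an even, word-based split would produce for that sentence. A minimal sketch of such a split follows. It assumes whitespace tokenization and near-equal bin sizes, which may differ from how the library actually divides text, and split_into_bins is a hypothetical helper.

```python
from typing import List


def split_into_bins(text: str, bins: int) -> List[str]:
    """Split text on whitespace into `bins` consecutive parts of near-equal word count."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    words = text.split()
    bins = min(bins, len(words)) or 1  # never create more bins than words
    base, extra = divmod(len(words), bins)
    parts, start = [], 0
    for index in range(bins):
        size = base + (1 if index < extra else 0)  # first `extra` bins get one more word
        parts.append(" ".join(words[start:start + size]))
        start += size
    return parts


long_comment = "буду с тобой асболютно честен но твой проект на питоне это просто абсолютно полная hueta.."
for part in split_into_bins(long_comment, 3):
    print(part)
```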

Conclusion on Model Limitations

Please be aware that while check-swear is a robust tool for identifying profane content, it has limitations. Creative users can always find new ways to bypass filters with fresh slang or coded language. Even so, check-swear identifies the majority of profane comments (an F1 score of about 0.95), helping maintain respectful and professional discourse in various settings.
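
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from binary labels. The sample labels are made up for illustration and are unrelated to the data used to evaluate check-swear.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Compute F1 for the positive class (label 1) from true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # profane, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # profane, missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels: 1 = profane, 0 = clean.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```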

Download files

Source Distribution

check-swear-0.1.4.tar.gz (12.3 kB)

Uploaded Source

Built Distribution

check_swear-0.1.4-py3-none-any.whl (1.6 MB)

Uploaded Python 3

File details

Details for the file check-swear-0.1.4.tar.gz.

File metadata

  • Download URL: check-swear-0.1.4.tar.gz
  • Size: 12.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.10

File hashes

Hashes for check-swear-0.1.4.tar.gz
Algorithm Hash digest
SHA256 655550e4299b4373fde5b84bf689bd407ffd8e025240b77cc77384be6132ee16
MD5 90d22e5c871a9a76199d7c6f4bfb049e
BLAKE2b-256 3805056d1bef8c760e22983af64e2dae75f38a1ab15b34facec3c458b6ef74ab

File details

Details for the file check_swear-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: check_swear-0.1.4-py3-none-any.whl
  • Size: 1.6 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.10

File hashes

Hashes for check_swear-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 864ad81b963ae541c6cc05cff5f28b7e46dc74f6a87b807a8b69a44d95b98155
MD5 6012890881a941bb5f791aa34c4fb3b8
BLAKE2b-256 262efec9fc45f13aaca2e669e23dcf3c82634b8220dd2954d6dcaaf61cb530d9
