profanity-filter2: A Python library for detecting and filtering profanity

Attention!

This library is a fork of profanity-filter, whose author has dropped support for it.

Overview

profanity-filter is a universal library for detecting and filtering profanity. Support for English and Russian is included.

Features

  1. Full text or individual words censoring.
  2. Multilingual support, including profanity filtering in texts written in mixed languages.
  3. Deep analysis. The library detects not only exact matches of profane words but also derived and distorted forms, using Levenshtein automata. Dictionary words that merely contain a profane word as a part are ignored.
  4. spaCy component for using the library as part of a pipeline.
  5. Explanation of decisions (attribute original_profane_word).
  6. Partial word censoring.
  7. Extensibility support. New languages can be added by supplying dictionaries.
  8. RESTful web service.
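
To give a feel for the distance-based matching behind feature 3, here is a minimal sketch. It uses a plain dynamic-programming edit distance, not the Levenshtein automata the library relies on, and it skips the dictionary-word check. The names `levenshtein`, `closest_profane_word`, and the `max_distance` threshold are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_profane_word(word: str,
                         profane_words: Iterable[str],
                         max_distance: int = 2) -> Optional[str]:
    """Return the nearest profane word within max_distance, else None.

    max_distance is an illustrative threshold, not a library setting.
    """
    candidates = sorted(profane_words)
    if not candidates:
        return None
    lowered = word.lower()
    best = min(candidates, key=lambda profane: levenshtein(lowered, profane))
    return best if levenshtein(lowered, best) <= max_distance else None
```

With this sketch, `closest_profane_word('shiiit', {'shit'})` resolves to `'shit'` because two deletions turn one word into the other. An automaton-based approach reaches the same kind of answer without comparing against every dictionary entry one by one.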

Caveats

  1. Context-free. The library cannot detect profane phrases composed of decent words. Conversely, it cannot recognize an appropriate use of a profane word.

Usage

Here are basic examples of how to use the library. For more examples, see the tests folder.

Basics

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor("That's bullshit!")
# "That's ********!"

pf.censor_word('fuck')
# Word(uncensored='fuck', censored='****', original_profane_word='fuck')

Deep analysis

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor("fuckfuck")
# "********"

pf.censor_word('oofuko')
# Word(uncensored='oofuko', censored='******', original_profane_word='fuck')

pf.censor_whole_words = False
pf.censor_word('h0r1h0r1')
# Word(uncensored='h0r1h0r1', censored='***1***1', original_profane_word='h0r')
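
The effect of `censor_whole_words = False` in the last example can be pictured with a small standalone sketch. This is only an illustration of masking the matched fragment and leaving the rest of the word intact; it is not the library's implementation, and `censor_partial` is a hypothetical helper name.

```python
def censor_partial(word: str, profane_fragment: str, censor_char: str = '*') -> str:
    """Mask every occurrence of profane_fragment, keeping other characters."""
    if not profane_fragment:
        return word  # nothing to mask
    mask = censor_char * len(profane_fragment)
    return word.replace(profane_fragment, mask)
```

Calling `censor_partial('h0r1h0r1', 'h0r')` yields `'***1***1'`, which matches the `censored` field shown above.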

Multilingual analysis

from profanity_filter import ProfanityFilter

pf = ProfanityFilter(languages=['ru', 'en'])

pf.censor("Да бля, это просто shit какой-то!")
# "Да ***, это просто **** какой-то!"

Using as part of a spaCy pipeline

import spacy
from profanity_filter import ProfanityFilter

nlp = spacy.load('en')
profanity_filter = ProfanityFilter(nlps={'en': nlp})  # reuse spacy Language (optional)
nlp.add_pipe(profanity_filter.spacy_component, last=True)

doc = nlp('This is shiiit!')

doc._.is_profane
# True

doc[:2]._.is_profane
# False

for token in doc:
    print(f'{token}: '
          f'censored={token._.censored}, '
          f'is_profane={token._.is_profane}, '
          f'original_profane_word={token._.original_profane_word}'
    )
# This: censored=This, is_profane=False, original_profane_word=None
# is: censored=is, is_profane=False, original_profane_word=None
# shiiit: censored=******, is_profane=True, original_profane_word=shit
# !: censored=!, is_profane=False, original_profane_word=None

Customizations

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor_char = '@'
pf.censor("That's bullshit!")
# "That's @@@@@@@@!"

pf.censor_char = '*'
pf.custom_profane_word_dictionaries = {'en': {'love', 'dog'}}
pf.censor("I love dogs and penguins!")
# "I **** **** and penguins!"

pf.restore_profane_word_dictionaries()
pf.is_clean("That's awesome!")
# True

pf.is_clean("That's bullshit!")
# False

pf.is_profane("That's bullshit!")
# True

pf.extra_profane_word_dictionaries = {'en': {'chocolate', 'orange'}}
pf.censor("Fuck orange chocolates")
# "**** ****** **********"

Console Executable

$ profanity_filter -h
usage: profanity_filter [-h] [-t TEXT | -f PATH] [-l LANGUAGES] [-o OUTPUT_FILE] [--show]

Profanity filter console utility

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  Test the given text for profanity
  -f PATH, --file PATH  Test the given file for profanity
  -l LANGUAGES, --languages LANGUAGES
                        Test for profanity using specified languages (comma
                        separated)
  -o OUTPUT_FILE, --output OUTPUT_FILE
                        Write the censored output to a file
  --show                Print the censored text
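
If you want to call the console utility from Python, a thin wrapper along these lines may help. It is a sketch that uses only the flags listed in the help output above. The helper names `build_command` and `run_filter` are hypothetical, and it assumes the `profanity_filter` executable is on your PATH.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(text: str,
                  languages: Optional[Sequence[str]] = None,
                  show: bool = True) -> List[str]:
    """Assemble the argument list for the profanity_filter console utility."""
    command = ['profanity_filter', '-t', text]
    if languages:
        # The help text asks for a comma-separated list of languages.
        command += ['-l', ','.join(languages)]
    if show:
        command.append('--show')
    return command


def run_filter(text: str,
               languages: Optional[Sequence[str]] = None) -> subprocess.CompletedProcess:
    """Run the utility and capture its output (requires the package to be installed)."""
    return subprocess.run(build_command(text, languages),
                          capture_output=True, text=True, check=False)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or spaces.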

RESTful web service

Run:

$ uvicorn profanity_filter.web:app --reload
INFO: Uvicorn running on http://127.0.0.1:8000
...

Go to {BASE_URL}/docs for interactive documentation.
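
To check from a script that the service is up, you could probe the documentation page. This is a minimal sketch using only the standard library. `docs_url` and `docs_reachable` are hypothetical helper names, and the only path it relies on is the `/docs` path mentioned above.

```python
from urllib.error import URLError
from urllib.request import urlopen


def docs_url(base_url: str) -> str:
    """Build the documentation URL from the service base URL."""
    return base_url.rstrip('/') + '/docs'


def docs_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the documentation page answers with HTTP 200."""
    try:
        with urlopen(docs_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a malformed URL.
        return False
```

For the uvicorn run shown above, the base URL would be `http://127.0.0.1:8000`.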

Installation

The first two parts of the installation instructions are written for users who want to filter English profanity. If you want to filter profanity in another language, you still need to read them.

Basic installation

For a minimal setup, install profanity-filter2, which is bundled with spaCy, and download a spaCy model for tokenization and lemmatization:

$ pip install profanity-filter2
$ # Skip next line if you want to filter profanity in another language
$ python -m spacy download en_core_web_sm

For more info about Spacy models read: https://spacy.io/usage/models/.

Deep analysis

To get deep analysis functionality, install additional libraries and a dictionary for your language.

First, install the hunspell and hunspell-devel packages with your system package manager.

For Amazon Linux AMI run:

$ sudo yum install hunspell

For openSUSE run:

$ sudo zypper install hunspell hunspell-devel

Then run:

$ pip install -U profanity-filter2[deep-analysis] git+https://github.com/rominf/hunspell_serializable@49c00fabf94cacf9e6a23a0cd666aac10cb1d491#egg=hunspell_serializable git+https://github.com/rominf/pyffs@6c805fbfd7771727138b169b32484b53c0b0fad1#egg=pyffs
$ # Skip next lines if you want deep analysis support for another language (will be covered in next section)
$ cd profanity_filter/data
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic
$ mv en_US.aff en.aff
$ mv en_US.dic en.dic
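
The download-and-rename steps above can also be scripted. The following is a sketch under the assumption that the dictionary files stay at the URLs shown above. The helper names (`dictionary_urls`, `short_name`, `fetch_dictionaries`) are hypothetical and not part of the library.

```python
from pathlib import Path
from typing import List
from urllib.request import urlretrieve

# Base URL taken from the wget commands above.
BASE_URL = 'https://cgit.freedesktop.org/libreoffice/dictionaries/plain'


def dictionary_urls(locale_dir: str, locale: str) -> List[str]:
    """URLs of the .aff and .dic files, e.g. for ('en', 'en_US')."""
    return [f'{BASE_URL}/{locale_dir}/{locale}.{ext}' for ext in ('aff', 'dic')]


def short_name(filename: str) -> str:
    """Map 'en_US.aff' to 'en.aff', mirroring the mv commands."""
    stem, ext = filename.rsplit('.', 1)
    return f"{stem.split('_')[0]}.{ext}"


def fetch_dictionaries(locale_dir: str, locale: str, data_dir: str) -> List[Path]:
    """Download both files into data_dir under their short names."""
    saved = []
    for url in dictionary_urls(locale_dir, locale):
        target = Path(data_dir) / short_name(url.rsplit('/', 1)[1])
        urlretrieve(url, str(target))  # needs network access
        saved.append(target)
    return saved
```

For English, `fetch_dictionaries('en', 'en_US', 'profanity_filter/data')` would correspond to the commands above.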

Other language support

Let's take Russian as an example of how to add support for a new language.

Russian language support

First, we need to provide the file profanity_filter/data/ru_core_news_sm_profane_words.txt, which contains a newline-separated list of profane words. For Russian it is already present, so we skip file generation.

Next, we need to download the appropriate spaCy model. Unfortunately, a spaCy model for Russian is not yet ready, so we will use an English model for tokenization. If you have not installed the spaCy model for English yet, now is the right time to do so. As a consequence, even if you want to filter only Russian profanity, you need to specify English in the ProfanityFilter constructor, as shown in the usage examples.

Next, we download dictionaries in Hunspell format for deep analysis from https://cgit.freedesktop.org/libreoffice/dictionaries/plain/:

$ cd profanity_filter/data
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ru_RU/ru_RU.aff
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ru_RU/ru_RU.dic
$ mv ru_RU.aff ru.aff
$ mv ru_RU.dic ru.dic

Pymorphy2

For Russian and Ukrainian, we suggest installing pymorphy2 to achieve better results. To install pymorphy2 with the Russian dictionary, run:

$ pip install -U profanity-filter2[pymorphy2-ru] git+https://github.com/kmike/pymorphy2@ca1c13f6998ae2d835bdd5033c17197dcba84cf4#egg=pymorphy2

Multilingual support

You need to install the polyglot package and its requirements for language detection. See https://polyglot.readthedocs.io/en/latest/Installation.html for more detailed instructions.

For Amazon Linux AMI run:

$ sudo yum install libicu-devel

For openSUSE run:

$ sudo zypper install libicu-devel

Then run:

$ pip install -U profanity-filter2[multilingual]

RESTful web service

Run:

$ pip install -U profanity-filter2[web]

Troubleshooting

You can always check whether deep, morphological, and multilingual analyses will work by inspecting the value of the module variable AVAILABLE_ANALYSES. If you have followed all the steps and installed support for all analyses, you will see the following:

from profanity_filter import AVAILABLE_ANALYSES

print(', '.join(sorted(analysis.value for analysis in AVAILABLE_ANALYSES)))
# deep, morphological, multilingual

If something is not right, you can import dependencies yourself to see the import exceptions:

from profanity_filter.analysis.deep import *
from profanity_filter.analysis.morphological import *
from profanity_filter.analysis.multilingual import *
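
If you prefer a single report over three separate import attempts, a small helper can try each module and collect the errors. This is a sketch. `check_imports` is a hypothetical helper name, and the module names are the ones listed above.

```python
import importlib
from typing import Dict, Iterable, Optional

# Module names taken from the import statements above.
ANALYSIS_MODULES = (
    'profanity_filter.analysis.deep',
    'profanity_filter.analysis.morphological',
    'profanity_filter.analysis.multilingual',
)


def check_imports(module_names: Iterable[str] = ANALYSIS_MODULES) -> Dict[str, Optional[str]]:
    """Map each module name to None on success or to an error description."""
    results: Dict[str, Optional[str]] = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as error:  # a missing dependency may raise more than ImportError
            results[name] = f'{type(error).__name__}: {error}'
        else:
            results[name] = None
    return results


if __name__ == '__main__':
    for module, problem in check_imports().items():
        print(f'{module}: {problem or "OK"}')
```

Each entry that is not `OK` names the exception raised while importing, which usually points to the missing dependency.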

Credits

English profane word dictionary: https://github.com/areebbeigh/profanityfilter/ (author Areeb Beigh).

Russian profane word dictionary: https://github.com/PixxxeL/djantimat (author Ivan Sergeev).
