better_profanity

A Python library to clean swear words (and their leetspeak) in strings

Inspired by Ben Friedland's profanity package, this library is significantly faster than the original because it uses string comparison instead of regex.

It supports modified spellings (such as p0rn, h4ndjob, handj0b and b*tch).

Requirements

To make use of static typing and many relevant optimisations, this package only works with Python 3.5+.

Installation

$ pip install better_profanity

Unicode characters

Only Unicode characters from categories Ll, Lu, Mc and Mn are added. More on Unicode categories can be found here.

However, this library does not yet support all languages; Chinese, for example, is not supported.
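To see which characters fall into the four categories named above, the standard library's unicodedata module can report the category of any character. This is only a sketch: the names SUPPORTED_CATEGORIES and is_supported are made up for illustration and are not part of better_profanity.

```python
import unicodedata

# The four categories named in the text: lowercase letter, uppercase letter,
# spacing combining mark and non-spacing mark.
# SUPPORTED_CATEGORIES and is_supported are hypothetical names for this sketch.
SUPPORTED_CATEGORIES = {"Ll", "Lu", "Mc", "Mn"}


def is_supported(char):
    """Return True if the character's Unicode category is one of the four above."""
    return unicodedata.category(char) in SUPPORTED_CATEGORIES


# A Latin letter, a Cyrillic letter, a combining acute accent,
# a Chinese character and a digit.
for char in ["a", "Я", "\u0301", "中", "1"]:
    print(repr(char), unicodedata.category(char), is_supported(char))
```

Chinese characters belong to category Lo (other letter), which is consistent with Chinese not being supported.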

Usage

By default, on the first .censor() call, the function .load_censor_words() generates all possible leetspeak variants of the words in profanity_wordlist.txt; input texts are then compared against these variants. The full character mapping used by the library can be found in profanity.py.

For example, the word handjob would be expanded into:

'handjob', 'handj*b', 'handj0b', 'handj@b', 'h@ndjob', 'h@ndj*b', 'h@ndj0b', 'h@ndj@b',
'h*ndjob', 'h*ndj*b', 'h*ndj0b', 'h*ndj@b', 'h4ndjob', 'h4ndj*b', 'h4ndj0b', 'h4ndj@b'
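The expansion above can be sketched with itertools.product. The substitution table below is an assumption reconstructed from the sixteen variants listed and covers only the letters a and o; the library's actual mapping is the one in profanity.py, and leetspeak_variants is a hypothetical helper, not a library function.

```python
from itertools import product

# Hypothetical substitution table, limited to the two letters that vary in
# the example above. The library's real mapping lives in profanity.py.
SUBSTITUTIONS = {
    "a": ("a", "@", "*", "4"),
    "o": ("o", "*", "0", "@"),
}


def leetspeak_variants(word):
    """Return every spelling of `word` obtainable from SUBSTITUTIONS."""
    # Characters without an entry in the table have a single choice: themselves.
    choices = [SUBSTITUTIONS.get(char, (char,)) for char in word]
    return ["".join(combo) for combo in product(*choices)]


variants = leetspeak_variants("handjob")
print(len(variants))  # 16
print(variants[:4])   # ['handjob', 'handj*b', 'handj0b', 'handj@b']
```

With four choices for each of two letters, one word yields 4 × 4 = 16 variants, which shows how a few hundred base words can grow into a very large wordlist.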

Wordlist

Most of the words in the default wordlist are taken from Full List of Bad Words and Top Swear Words Banned by Google.

The wordlist contains a total of 106,992 words, including 318 words from the default profanity_wordlist.txt and their variants by modified spellings.

Its total size in memory is over 10.49 MB.

1. Censor swear words from a text

By default, profanity replaces each swear word with 4 asterisks (****).

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text)
    print(censored_text)
    # You **** of ****.

2. Censor doesn't care about word dividers

The function .censor() also hides words that are separated by dividers other than a space, such as _, , and .. The characters @, $, *, " and ' are exceptions: they are not treated as dividers.

from better_profanity import profanity

if __name__ == "__main__":
    text = "...sh1t...hello_cat_fuck,,,,123"

    censored_text = profanity.censor(text)
    print(censored_text)
    # ...****...hello_cat_****,,,,123
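One way to picture the divider behaviour is a scanner that collects word characters and checks each finished word against the wordlist. The sketch below is not the library's implementation. It assumes that ASCII letters, digits and the excepted symbols count as word characters and that everything else is a divider. BLOCKED_WORDS, mask and censor_sketch are hypothetical names.

```python
import string

# Assumption for this sketch: letters, digits and the excepted symbols
# (@, $, *, ", ') are part of a word; every other character divides words.
WORD_CHARACTERS = set(string.ascii_letters + string.digits + "@$*\"'")

# Hypothetical, already expanded wordlist (not the library's list).
BLOCKED_WORDS = {"sh1t", "fuck"}


def mask(word, censor_char):
    """Return four censor characters for a blocked word, else the word itself."""
    return censor_char * 4 if word.lower() in BLOCKED_WORDS else word


def censor_sketch(text, censor_char="*"):
    """Censor blocked words in `text` while leaving all dividers untouched."""
    pieces = []
    word = ""
    for char in text:
        if char in WORD_CHARACTERS:
            word += char
            continue
        # A divider ends the current word: emit the word, then the divider.
        pieces.append(mask(word, censor_char))
        pieces.append(char)
        word = ""
    pieces.append(mask(word, censor_char))
    return "".join(pieces)


print(censor_sketch("...sh1t...hello_cat_fuck,,,,123"))
# ...****...hello_cat_****,,,,123
```

Because dividers are copied through unchanged, the punctuation around each censored word survives exactly as typed.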

3. Censor swear words with custom character

The second parameter of .censor() sets the replacement character: each swear word is replaced with 4 instances of it.

from better_profanity import profanity

if __name__ == "__main__":
    text = "You p1ec3 of sHit."

    censored_text = profanity.censor(text, '-')
    print(censored_text)
    # You ---- of ----.

4. Check if the string contains any swear words

The function .contains_profanity() returns True if any word in the given string exists in the wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    dirty_text = "That l3sbi4n did a very good H4ndjob."

    print(profanity.contains_profanity(dirty_text))
    # True

5. Censor swear words with a custom wordlist

Function .load_censor_words() takes a List of strings as censored words. The provided list will replace the default wordlist.

from better_profanity import profanity

if __name__ == "__main__":
    custom_badwords = ['happy', 'jolly', 'merry']
    profanity.load_censor_words(custom_badwords)

    print(profanity.censor("Fuck you!"))
    # Fuck you!

    print(profanity.censor("Have a merry day! :)"))
    # Have a **** day! :)

6. Censor Unicode characters

No extra steps needed!

from better_profanity import profanity

if __name__ == "__main__":
    bad_text = "Эффекти́вного противоя́дия от я́да фу́гу не существу́ет до сих пор"
    profanity.load_censor_words(["противоя́дия"])

    censored_text = profanity.censor(bad_text)
    print(censored_text)
    # Эффекти́вного **** от я́да фу́гу не существу́ет до сих пор

Limitations

  1. As the library compares each word character by character, the censor can easily be bypassed by adding any character(s) to the word:
profanity.censor('I just have sexx')
# returns 'I just have sexx'

profanity.censor('jerkk off')
# returns 'jerkk off'
  2. Any word in the wordlist that contains non-space separators, such as s & m, cannot be recognised and therefore won't be filtered out. This problem was raised in issue #5.
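Both limitations follow from exact matching on individual words. The sketch below reproduces them with a plain set lookup. It is an illustration under assumed behaviour rather than the library's code, and BLOCKED_WORDS and is_blocked are hypothetical names.

```python
# Hypothetical wordlist: one plain word and one entry containing a separator.
BLOCKED_WORDS = {"sex", "s & m"}


def is_blocked(word):
    """Exact, case-insensitive lookup of a single word."""
    return word.lower() in BLOCKED_WORDS


# Limitation 1: one extra character is enough to miss the entry.
print(is_blocked("sex"))   # True
print(is_blocked("sexx"))  # False

# Limitation 2: once the text is split into words, the entry "s & m"
# never appears as a single word, so it can never match.
words = "they like s & m".split()
print(words)                              # ['they', 'like', 's', '&', 'm']
print(any(is_blocked(w) for w in words))  # False
```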

Testing

$ python tests.py

Versions

  • v0.3.3 - Fix incompatibility with Python 3.5.
  • v0.3.2 - Fix a typo in documentation.
  • v0.3.1 - Remove unused dependencies.
  • v0.3.0 - Add support for Unicode characters (Categories: Ll, Lu, Mc and Mn) #2.
  • v0.2.0 - Bug fix + faster censoring
  • v0.1.0 - Initial release

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Special thanks to

Acknowledgments
