A fast and flexible utility for censoring and filtering text
Fast Censor
fast_censor
- A fast and flexible package for detecting and censoring profanity or other strings in text, reported to be roughly 100 times faster than alternatives (see Benchmarks)
- Allows for detection with repeated characters and character substitution
- No dependencies; works with Python 3.6 through 3.11
Installation
From source
cd fast-censor # enter into project directory
python setup.py install
# or with pip locally
pip install -e .
From GitHub
pip install git+https://github.com/mbuchove/fast_censor.git
Usage
from fast_censor import FastCensor, WordListHandler
# load the default (encoded) profanity word list
censor = FastCensor()
# load a word list from an alternate path; this example is a plain-text word list without encoding
censor_clean = FastCensor(
    wordlist=WordListHandler.get_default_wordlist_path("clean_wordlist_decoded.txt"),
    wordlist_encoded=False,
)
censor_clean.add_words(['bat', 'rick'])
# censor texts or simply get the indices of matches
matches = censor_clean.check_text("this bat is for riii1ick")
# >>> [(5, 9), (17, 25)]
censored_text = censor_clean.censor("fuuudge you")
# >>> "******* you"
Character substitutions
FastCensor's profanity matcher can match words in which specified characters are replaced by others, as is common in leetspeak (1337 speak). A default set of commonly used substitutions is provided.
To set your own, pass a mapping from each character to its allowed substitutes into FastCensor, for example:
substitutions = {'a': '@4'}
- all matching is case-insensitive
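To make the idea concrete, here is a minimal standalone sketch of substitution-aware, case-insensitive matching using the standard `re` module. It does not use fast_censor and is not how the package is implemented; `build_substitution_pattern` is a hypothetical helper invented for this illustration.

```python
import re


def build_substitution_pattern(word, substitutions):
    """Hypothetical helper (not part of fast_censor): build a regex in which
    each character of `word` may also be written as any of its substitutes."""
    parts = []
    for ch in word:
        # Allowed characters: the letter itself plus its substitutes, if any.
        allowed = ch + substitutions.get(ch, "")
        parts.append("[" + re.escape(allowed) + "]")
    # IGNORECASE mirrors the case-insensitive matching described above.
    return re.compile("".join(parts), re.IGNORECASE)


substitutions = {"a": "@4"}
pattern = build_substitution_pattern("bat", substitutions)

print(bool(pattern.fullmatch("b@t")))  # True
print(bool(pattern.fullmatch("B4T")))  # True
print(bool(pattern.fullmatch("bet")))  # False
```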
Character repetition
By default, a word still matches when any of its characters is repeated any number of times. Repeats may be the character itself or any valid substitute for it.
For example, "baaa@@aatt" will match "bat".
You can turn this off by passing allow_repititions=False to censor_text or check_text.
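The behaviour described above can be sketched with a regular expression in which every character position accepts one or more occurrences. This is an illustrative stand-in rather than the package's own matcher; `build_repeat_pattern` and its `allow_repeats` flag are hypothetical names used only here.

```python
import re


def build_repeat_pattern(word, substitutions, allow_repeats=True):
    """Hypothetical helper (not part of fast_censor): each position matches the
    character or one of its substitutes, optionally repeated."""
    quantifier = "+" if allow_repeats else ""
    parts = []
    for ch in word:
        allowed = ch + substitutions.get(ch, "")
        parts.append("[" + re.escape(allowed) + "]" + quantifier)
    return re.compile("".join(parts), re.IGNORECASE)


substitutions = {"a": "@4"}
with_repeats = build_repeat_pattern("bat", substitutions)
without_repeats = build_repeat_pattern("bat", substitutions, allow_repeats=False)

print(bool(with_repeats.fullmatch("baaa@@aatt")))     # True
print(bool(without_repeats.fullmatch("baaa@@aatt")))  # False
print(bool(without_repeats.fullmatch("b@t")))         # True
```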
Delimiters
Use the delimiters parameter of FastCensor to set the delimiter characters, which determine the boundaries of a word.
Profanity matches will not extend across any delimiting character.
For example, if '_' is a delimiter, "ba_t" would not match "bat"
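As a rough illustration of why a match cannot cross a delimiter, the sketch below splits the text into tokens at delimiter characters and compares each token with the word. It is a simplification under stated assumptions: it only does whole-token, exact matching, it is not the package's algorithm, and `find_word` is a hypothetical function.

```python
import re


def find_word(text, word, delimiters):
    """Hypothetical helper (not part of fast_censor): return (start, end) spans
    of tokens equal to `word`. Assumes `delimiters` is a non-empty string."""
    # A token is a maximal run of non-delimiter characters.
    token_re = re.compile("[^" + re.escape(delimiters) + "]+")
    spans = []
    for token in token_re.finditer(text):
        if token.group().lower() == word.lower():
            spans.append((token.start(), token.end()))
    return spans


# '_' is a delimiter here, so "ba_t" is split into "ba" and "t" and cannot match.
print(find_word("a bat, ba_t", "bat", " ,_"))  # [(2, 5)]
```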
Editing and saving wordlist
censor.add_word('new_word') # to add a new word
censor.write_words_file("word_lists/new_wordlist_encoded.txt", encode=True)
Encoding
By default, the word lists are base64-encoded, so you can avoid displaying vulgar or offensive words.
If you would like to save a word list in plain text, set encode=False in write_words_file
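For readers unfamiliar with base64, the following sketch shows how a word list can be obscured and recovered with Python's standard `base64` module. It assumes each word is encoded separately, one per line; the actual file layout used by fast_censor is not described here, and `encode_words` and `decode_words` are hypothetical helpers.

```python
import base64


def encode_words(words):
    """Hypothetical helper: base64-encode each word on its own."""
    return [base64.b64encode(w.encode("utf-8")).decode("ascii") for w in words]


def decode_words(lines):
    """Hypothetical helper: reverse of encode_words."""
    return [base64.b64decode(line).decode("utf-8") for line in lines]


words = ["bat", "rick"]
encoded = encode_words(words)
print(encoded)                # ['YmF0', 'cmljaw==']
print(decode_words(encoded))  # ['bat', 'rick']
```

Base64 only hides the words from casual viewing; it is an encoding, not encryption.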
Benchmarks
See this Gist for performance measurements of filtering compared to other packages.