A fast, robust library to check for offensive language in strings.
profanity-check
A fast, robust Python library to check for profanity or offensive language in strings.
How It Works
profanity-check uses a linear SVM model trained on 200k human-labeled samples of clean and profane text strings. Its model is simple but surprisingly effective, meaning profanity-check is both robust and extremely performant.
Why Use profanity-check?
Many profanity detection libraries use a hard-coded list of bad words to detect and filter profanity. For example, profanity uses this wordlist, and even better-profanity still uses a wordlist. The weakness of this approach is that a fixed list can only catch the exact words it contains, so while these libraries might be performant, they are not accurate at all.
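To make the weakness of wordlist matching concrete, here is a minimal sketch of such a filter. The wordlist and both helper functions are hypothetical and are not taken from any of the libraries named above.

```python
import re

# Hypothetical, deliberately tiny wordlist for illustration only.
BLOCKLIST = {"hell", "ass"}


def substring_filter(text: str) -> bool:
    """Flag text if a blocked word appears anywhere, even inside another word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


def whole_word_filter(text: str) -> bool:
    """Flag text only if a blocked word appears as a standalone token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BLOCKLIST for token in tokens)


# Substring matching produces false positives on innocent words
# ("hello", "pass", and "class" all contain a blocked word).
print(substring_filter("hello, please pass the class notes"))  # True

# Whole-word matching avoids that, but trivial obfuscation slips through.
print(whole_word_filter("hello, please pass the class notes"))  # False
print(whole_word_filter("go to h3ll"))  # False
```

Neither variant can generalize beyond the exact strings in the list, which is the gap a trained model is meant to close.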
Other libraries like profanity-filter use more sophisticated methods that are much more accurate but at the cost of performance. A benchmark (performed December 2018 on a new 2018 MacBook Pro) using a Kaggle dataset of Wikipedia comments yielded roughly the following results:
Package | 1 Prediction (ms) | 10 Predictions (ms) | 100 Predictions (ms) |
---|---|---|---|
profanity-check | 0.2 | 0.5 | 3.5 |
profanity-filter | 60 | 1200 | 13000 |
profanity | 0.3 | 1.2 | 24 |
profanity-check is anywhere from 300 to 4000 times faster than profanity-filter in this benchmark!
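The quoted speedup range follows directly from the table. A short sketch of the arithmetic, using only the benchmark numbers listed there (the `speedup` helper is a made-up name for illustration):

```python
# Timings in milliseconds, copied from the benchmark table above.
timings_ms = {
    "profanity-check": {1: 0.2, 10: 0.5, 100: 3.5},
    "profanity-filter": {1: 60, 10: 1200, 100: 13000},
    "profanity": {1: 0.3, 10: 1.2, 100: 24},
}


def speedup(baseline: str, other: str) -> dict:
    """Return how many times faster `baseline` is than `other`, per batch size."""
    return {
        n: timings_ms[other][n] / timings_ms[baseline][n]
        for n in timings_ms[baseline]
    }


for package in ("profanity-filter", "profanity"):
    ratios = speedup("profanity-check", package)
    print(package, {n: round(r, 1) for n, r in ratios.items()})
```

Against profanity-filter the ratios come out at roughly 300, 2400, and 3700 for 1, 10, and 100 predictions. Against the wordlist-based profanity package the gap is much smaller, roughly 1.5 to 7 times.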
Installation
$ pip install profanity-check
Usage
from profanity_check import predict, predict_prob
predict(["predict() takes an array and returns a 1 for each string if it's offensive, else 0."])
# [0]
predict(['fuck you'])
# [1]
predict_prob(['predict_prob() takes an array and returns the probability each string is offensive'])
# [0.08686173]
predict_prob(['go to hell, you scum'])
# [0.7618861]
Note that both predict() and predict_prob() return numpy arrays.
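Because predict_prob() returns one probability per input string, a caller can apply their own cutoff to it. The helper below is a hypothetical sketch and not part of the library: `flag_offensive`, its `prob_fn` argument, and its `threshold` parameter are names made up for illustration, and the 0.5 default is an assumption.

```python
from typing import Callable, Iterable, List, Sequence


def flag_offensive(
    texts: Sequence[str],
    prob_fn: Callable[[Sequence[str]], Iterable[float]],
    threshold: float = 0.5,
) -> List[bool]:
    """Return True for each text whose offensive probability meets the threshold.

    `prob_fn` is expected to behave like profanity_check.predict_prob:
    it takes a sequence of strings and returns one probability per string.
    """
    probabilities = prob_fn(list(texts))
    # float() converts numpy scalars to plain Python floats before comparing.
    return [float(p) >= threshold for p in probabilities]


# Intended use (requires `pip install profanity-check`):
#   from profanity_check import predict_prob
#   flag_offensive(["go to hell, you scum"], predict_prob, threshold=0.7)
```

Passing the probability function in as an argument keeps the helper easy to test with a stub and leaves the choice of cutoff to the caller.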
More on How It Works
Special thanks to the authors of the datasets used in this project. profanity-check was trained on a combined dataset from two sources:
- t-davidson/hate-speech-and-offensive-language, used in their paper Automated Hate Speech Detection and the Problem of Offensive Language
- the Toxic Comment Classification Challenge on Kaggle.
profanity-check relies heavily on the excellent scikit-learn library. It's mostly powered by the scikit-learn classes CountVectorizer, LinearSVC, and CalibratedClassifierCV.
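As a rough illustration of how these three classes can fit together, the sketch below wires them into a single pipeline and trains it on a handful of made-up sentences. It is not the project's actual training code: the toy data, the `cv=2` setting, and the default hyperparameters are all assumptions chosen so the example runs quickly.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up training data (1 = offensive, 0 = clean); far too small to be useful.
texts = [
    "you are a stupid idiot",
    "shut up you worthless idiot",
    "what a stupid worthless jerk",
    "nobody likes you, jerk",
    "have a wonderful day",
    "thank you for the help",
    "the weather is lovely today",
    "see you at lunch tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# CountVectorizer turns each string into token counts, LinearSVC learns a linear
# decision boundary, and CalibratedClassifierCV maps SVM scores to probabilities
# (LinearSVC has no predict_proba of its own).
model = make_pipeline(
    CountVectorizer(),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)
model.fit(texts, labels)

# predict_proba returns one row per input: [P(clean), P(offensive)].
probabilities = model.predict_proba(["you stupid jerk", "lovely day for lunch"])
print(probabilities[:, 1])
```

With only eight training sentences the printed probabilities carry no real meaning; the point is the shape of the pipeline, not its accuracy.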
Hashes for profanity_check-1.0.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 04602f5cd9f4379c8518f62ff1a96eba581a8972c75b0f3d7831f4cbda70d2e9 |
MD5 | 8ca1e02da7a8578495d516a0f8696664 |
BLAKE2b-256 | 81eaff64fa9d8fe520fea274309e2c05ed9317b49291b1829a95e36c1d959dbc |