A package that detects and replaces curse words in a sentence using deep learning
anti-cursing
"anti-cursing" is a Python package that detects and replaces curse words and other abusive language in sentences or comments 🤬
You install it the same way you install any other package, and then you can use it in your code.
This is a very early version built around the first idea, and the whole package will be updated soon.
You can find the package on PyPI (https://pypi.org/project/anti-cursing/0.0.1/).
Please bear with the package the first time you use it: it downloads the model's weights and biases from Hugging Face, which takes some time and disk space.
Concept
While coding, you often need to detect a forbidden word and change it to another word. Hardcoding every case is very inconvenient, and the Python ecosystem has many packages that address this problem. One of them is "anti-cursing".
The package, which works exclusively for Korean, does not simply replace words from a preset list of banned words. Instead, it detects and replaces banned words with a trained deep learning model.
As a result, it can cope with new malicious words as long as the model is trained on them. For this purpose, semi-supervised learning through pseudo labeling is used.
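The pseudo labeling idea can be sketched roughly as follows. This is a minimal illustration, not the package's actual training code: the `pseudo_label` helper, the `toy_predict_proba` stand-in classifier, the 0.9 confidence threshold, and the English toy sentences are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (sentence, label): 1 = abusive, 0 = clean


def pseudo_label(
    predict_proba: Callable[[str], float],
    unlabeled: List[str],
    threshold: float = 0.9,
) -> List[Example]:
    """Label only the unlabeled sentences the model is confident about."""
    confident: List[Example] = []
    for sentence in unlabeled:
        p_abusive = predict_proba(sentence)
        if p_abusive >= threshold:
            confident.append((sentence, 1))
        elif p_abusive <= 1.0 - threshold:
            confident.append((sentence, 0))
        # Sentences in between are too uncertain and stay unlabeled.
    return confident


def toy_predict_proba(sentence: str) -> float:
    """Hypothetical stand-in for a fine-tuned classifier's P(abusive)."""
    scores = {"badword": 0.97, "hmm": 0.55}
    for token, score in scores.items():
        if token in sentence:
            return score
    return 0.02


labeled = [("you badword", 1), ("nice day", 0)]
unlabeled = ["what a badword", "hmm not sure", "lovely weather"]

# The confident pseudo-labels are added to the labeled set for retraining.
augmented = labeled + pseudo_label(toy_predict_proba, unlabeled)
```

In a real setup the model would be retrained on `augmented` and the loop repeated, so newly crawled, unlabeled comments can extend the training data without manual labeling.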
Additionally, instead of masking malicious words with special characters such as --- or ***, the package can convert them into emojis so the result reads more naturally.
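A rough sketch of emoji masking is shown below. It assumes a hypothetical `is_abusive` predicate in place of the trained model, a made-up emoji pool, and simple whitespace tokenization; the package itself may split and replace text differently.

```python
import random
from typing import Callable, Optional, Sequence

EMOJI_POOL = ("👼", "🌸", "🐣")  # hypothetical replacement set


def mask_with_emoji(
    sentence: str,
    is_abusive: Callable[[str], bool],
    emojis: Sequence[str] = EMOJI_POOL,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace every token flagged by is_abusive with an emoji."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    tokens = sentence.split()
    return " ".join(
        rng.choice(emojis) if is_abusive(token) else token for token in tokens
    )


def toy_is_abusive(token: str) -> bool:
    """Hypothetical stand-in for the model's per-token decision."""
    return token.strip(".,!?") in {"badword"}


masked = mask_with_emoji("you are such a badword", toy_is_abusive, emojis=("👼",))
```

Passing the detector in as a callable keeps the replacement step independent of how the abusive words are found.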
Contents
- Installation
- Usage
- Model comparison
- Dataset
- Used API
- License
- Working Example
- References
- Project Status
- Future Work
Installation
You can install the package using pip:
pip install anti-cursing
It doesn't work yet, but it will soon! 👨🏻‍💻
Usage
from anti_cursing.utils import antiCursing
antiCursing.anti_cur("나는 너가 좋지만, 너는 너무 개새끼야")
나는 너가 좋지만, 너는 너무 👼🏻야
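For filtering many comments at once, a thin wrapper such as the one below may be convenient. It is only a sketch: `load_filter` and `clean_comments` are hypothetical helper names, and the only package call assumed is `antiCursing.anti_cur` from the usage above.

```python
from typing import Callable, Iterable, List


def load_filter() -> Callable[[str], str]:
    """Return antiCursing.anti_cur, or an identity function if the import fails."""
    try:
        from anti_cursing.utils import antiCursing
    except ImportError:
        # Assumption: fall back to leaving text unchanged when the package
        # is not installed. A stricter application may prefer to raise.
        return lambda text: text
    return antiCursing.anti_cur


def clean_comments(
    comments: Iterable[str], filter_fn: Callable[[str], str]
) -> List[str]:
    """Apply the filter to every comment and return the cleaned list."""
    return [filter_fn(comment) for comment in comments]


# Typical use (the first call may download model weights):
# cleaned = clean_comments(["..."], load_filter())
```

Loading the filter once and reusing it avoids repeating any setup work for every comment.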
Model comparison
| Classification | KcElectra | KoBERT | RoBERTa-base | RoBERTa-large |
|---|---|---|---|---|
| Validation Accuracy | 0.88680 | 0.85721 | 0.83421 | 0.86994 |
| Validation Loss | 1.00431 | 1.23237 | 1.30012 | 1.16179 |
| Training Loss | 0.09908 | 0.03761 | 0.0039 | 0.06255 |
| Epoch | 10 | 40 | 20 | 20 |
| Batch-size | 8 | 32 | 16 | 32 |
| transformers | beomi/KcELECTRA-base | skt/kobert-base-v1 | xlm-roberta-base | klue/roberta-large |
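The table can be read programmatically, for example to pick the checkpoint with the best validation accuracy and to see how far validation loss sits above training loss. The numbers below are copied from the table; the dictionary layout and variable names are just one possible arrangement.

```python
# Metrics copied from the comparison table, keyed by checkpoint name.
results = {
    "beomi/KcELECTRA-base": {"val_acc": 0.88680, "val_loss": 1.00431, "train_loss": 0.09908},
    "skt/kobert-base-v1": {"val_acc": 0.85721, "val_loss": 1.23237, "train_loss": 0.03761},
    "xlm-roberta-base": {"val_acc": 0.83421, "val_loss": 1.30012, "train_loss": 0.0039},
    "klue/roberta-large": {"val_acc": 0.86994, "val_loss": 1.16179, "train_loss": 0.06255},
}

# Checkpoint with the highest validation accuracy.
best = max(results, key=lambda name: results[name]["val_acc"])

# Gap between validation and training loss for each checkpoint.
gaps = {
    name: round(m["val_loss"] - m["train_loss"], 5) for name, m in results.items()
}

for name in sorted(results, key=lambda n: results[n]["val_acc"], reverse=True):
    print(f"{name}: val_acc={results[name]['val_acc']:.5f}, loss gap={gaps[name]:.5f}")
```

In every column the validation loss is far above the training loss, which may indicate overfitting; KcELECTRA has both the highest validation accuracy and the smallest gap.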
Dataset
- Smilegate-AI
  - https://github.com/smilegate-ai/korean_unsmile_dataset
  - Korean Sentiment Analysis
  - paper
- Naver portal news articles crawling
  - https://news.naver.com
  - Unlabeled data for the test dataset
- Emoji Unicode crawling for encoding
Used API
- Google Translator
  - https://cloud.google.com/translate/docs (API docs)
License
This repository is licensed under the MIT license. See LICENSE for details.
Working example
A demo video will be added here.
References
Sentiment Analysis Based on Deep Learning : A Comparative Study
- Nhan Cach Dang, María N. Moreno-García, and Fernando De la Prieta. 2020. Sentiment Analysis Based on Deep Learning: A Comparative Study. Electronics, 9(3):483.
Attention is all you need
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Electra : Pre-training Text Encoders as Discriminators Rather Than Generators
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR 2020).
BIDAF : Bidirectional Attention Flow for Machine Comprehension
- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In International Conference on Learning Representations (ICLR 2017).
Effect of Negation in Sentences on Sentiment Analysis and Polarity Detection
- Partha Mukherjeea, Saptarshi Ghoshb, and Saptarshi Ghoshc. 2018. Effect of Negation in Sentences on Sentiment Analysis and Polarity Detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2129–2139.
KOAS : Korean Text Offensiveness Analysis System
- Seonghwan Kim, Seongwon Lee, and Seungwon Do. 2019. KOAS: Korean Text Offensiveness Analysis System. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1–11.
Korean Unsmile Dataset
- Seonghwan Kim, Seongwon Lee, and Seungwon Do. 2019. Korean Unsmile Dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1–11.
Project status
Future work
These sections will be updated soon. Please bear with me.