Ai Palette NLP toolkit
Project description
aipalette_nlp
aipalette_nlp is a Python package of NLP functions intended for use in future tasks at Ai Palette. More modules and functions will be added over time. For now, it has one module containing tokenizers for different languages and another module with several text preprocessing functions, including language detection.
How to Use
Install this package with pip (pip install aipalette_nlp), then import it directly in your code.
Modules
Module1: tokenizer
Below is an example of how to use the word_tokenize function in the tokenizer module, which automatically detects the input language and calls the corresponding tokenizer.
from aipalette_nlp.tokenizer import word_tokenize
text = "우아아 제 요리에 날개를 달아주는 아름다운 <키친콤마> 식품들이 도착했어요. 저당질, 저탄수화물로 만들어져 건강과 다이어트 그리고 맛까지 한꺼번에 챙길 수 있는 필수템입니다! 처음 호기심에서 시작한 저탄고지 키토식단을 유지한지 어느덧 2년 가까이 되었어요. 저탄고지는 살을 빼기위해 무작정 탄수화물을 끊는다거나 몸에 무리가 갈 수 있는 저칼로리 / 저염식이 아니에요. 내 몸에서 나타나는 반응에 좀더 귀기울이고 끊임없이 공부하고 좋은 음식을 섭취하려고 노력하는 라이프스타일 입니다."
print(word_tokenize(text))
Output:
{'tokenized_text': ['우아아', '제', '요리에', '날개를', '달아주는', '아름다운', '<키친콤마>', '식품들이', '도착했어요', '저당질,', '저탄수화물로', '만들어져', '건강과', '다이어트', '그리고', '맛까지', '한꺼번에', '챙길', '수', '있는', '필수템입니ᄃ', 'ᅡ!', '처음', '호기심에서', '시작 한', '저탄고지', '키토식단을', '유지한지', '어느덧', '2년', '가까이', '되었어요', '저탄고지는', '살을', '빼기위해', '무작정', '탄수화물을', '끊는다거나', '몸에', '무리가', '갈', '수', '있는', '저칼로리', '/', '', '저염식이', '아니에요', '내', '몸에서', '나타나는', '반응에', '좀더', '귀기울이고', '끊임없이', '공부하고', '좋은', '음식을', '섭취하려고', '노력하는', '라이프스타일', '입니다']}
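The returned value is a dictionary whose 'tokenized_text' key holds the token list, and the list can contain empty strings (see the token after '/' above). A minimal sketch of consuming that structure follows; the sample dictionary and the token_counts helper are hypothetical and only mimic the shape shown above, so the package itself is not imported.

```python
from collections import Counter

# Hypothetical sample shaped like the dictionary printed above;
# it is not real output from the package.
result = {"tokenized_text": ["수", "있는", "처음", "수", "있는", "/", ""]}


def token_counts(result):
    """Count the non-empty tokens in a word_tokenize-style result dictionary."""
    tokens = [tok for tok in result["tokenized_text"] if tok.strip()]
    return Counter(tokens)


counts = token_counts(result)
print(counts.most_common(2))
```

Filtering on tok.strip() drops both empty and whitespace-only tokens before counting.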
Module2: text_cleaning
Below is an example of how you can use the functions in the text_cleaning module.
from aipalette_nlp.preprocessing import detect_language, clean_text, remove_stopwords
text = """Dinner at @docksidevancouver . Patio season is definitely here!Support your local restaurants.
#foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover
#curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner"""
print("language detected of the given text is : ", detect_language(text))
print(remove_stopwords(text))
print(clean_text(text))
Output:
language detected of the given text is : en
dinner @docksidevancouver . patio season definitely here!support local restaurants. #foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover #curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner
{'hashtags': ['foodie', 'facestuffing', 'scoutmagazine', 'vancouvermagazine', 'dailyhivevancouver', 'ediblevancouver', 'eatmagazine', 'vancouverisawesome', 'vancouverfoodie', 'food', 'foodlover', 'curiocityvancouver', 'foodporn', 'foodlover', 'eat', 'foodgasm', 'foodinsta', 'foodinstagram', 'instafood', 'instafoodie', 'foodlover', 'foodpics', 'foodiesofinstagram', 'restaurant', 'homechef', 'foodphotography', 'nomnomnom', 'georgiastraight', 'docksiderestaurant', 'granvilleisland', 'gnocchi', 'dinner'], 'cleaned_text': 'dinner username patio season definitely support local restaurants', 'text_length': 65}
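The dictionary returned by clean_text can be post-processed like any other Python dictionary. The sketch below works on an abbreviated, hand-copied version of the output above (the hashtag list is shortened for illustration) and does not call the package. In this sample, 'text_length' matches the character count of 'cleaned_text'; whether that holds in general is an assumption.

```python
from collections import Counter

# Abbreviated copy of the clean_text output shown above
# (the hashtag list is shortened for illustration).
result = {
    "hashtags": ["foodie", "food", "foodlover", "foodporn", "foodlover", "dinner"],
    "cleaned_text": "dinner username patio season definitely support local restaurants",
    "text_length": 65,
}

# In this sample, text_length equals the number of characters in cleaned_text.
length_matches = len(result["cleaned_text"]) == result["text_length"]

# Hashtags can repeat, so count them before deduplicating.
hashtag_counts = Counter(result["hashtags"])
unique_hashtags = sorted(hashtag_counts)

print(length_matches)
print(hashtag_counts["foodlover"])
print(unique_hashtags)
```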
Complete list of tokenizers supported:
['english', 'french', 'italian', 'portuguese', 'spanish', 'swedish', 'turkish', 'russian', 'mandarin', 'thai', 'japanese', 'korean', 'vietnamese','german', 'arabic']
Text Processing/Cleaning Functions
The clean_text function from the text_cleaning module performs the following steps:
- replace the hashtags (#______) in the main caption with the original form of the word.
- replace all the mentioned usernames (@_______) with the word “<username>”.
- remove punctuation
- remove stopwords (using the nltk package)
- detect language
- replace all links/urls
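The steps above can be approximated with the standard library alone. The following is a rough sketch under stated assumptions, not the package's implementation: clean_text_sketch and STOPWORDS are hypothetical names, the stopword set is a tiny stand-in for the nltk lists, links are replaced with a space (the list does not say what they are replaced with), every hashtag word is kept in the text, and language detection is skipped.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the package is described
# as using nltk's stopword lists instead.
STOPWORDS = {"at", "is", "here", "your", "the", "a", "an", "and"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_sketch(text):
    """Approximate the listed cleaning steps, without language detection."""
    text = re.sub(r"https?://\S+", " ", text)    # replace links/URLs (with a space, by assumption)
    hashtags = re.findall(r"#(\w+)", text)       # collect hashtag words
    text = re.sub(r"@\w+", " <username> ", text) # mask mentioned usernames
    text = re.sub(r"#(\w+)", r"\1", text)        # keep the word behind each hashtag
    text = text.translate(PUNCT_TO_SPACE)        # remove punctuation
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    cleaned = " ".join(words)
    return {"hashtags": hashtags, "cleaned_text": cleaned, "text_length": len(cleaned)}


# example.com is a placeholder URL used only for this illustration.
sample = "Dinner at @dockside. Patio season is here! #foodie https://example.com/x"
out = clean_text_sketch(sample)
print(out)
```

Links are handled first so that a '#' fragment inside a URL is not mistaken for a hashtag. Because punctuation is removed after usernames are masked, the angle brackets in "<username>" are stripped, which is consistent with the plain 'username' seen in the sample output earlier.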
Languages supported by our language detector:
af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
File details
Details for the file aipalette_nlp-0.1.0.tar.gz.
File metadata
- Download URL: aipalette_nlp-0.1.0.tar.gz
- Upload date:
- Size: 24.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0db76860674d5c5b5a9af795db876c4d7404d0e073452b4fb7cf4329c592a424
MD5 | 0db6365ad7838aa9a4e2765618da9b79
BLAKE2b-256 | 6bcfe05ab033e7a29f5397de99c50cc2f58fb38bc374a0e91f8f1d2f8b9e742f
File details
Details for the file aipalette_nlp-0.1.0-py3-none-any.whl.
File metadata
- Download URL: aipalette_nlp-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
Algorithm | Hash digest
---|---
SHA256 | f036efd3a20a29c81912a5108542a3996666ac1e1cf6102f115a971aa2a4e962
MD5 | 7bf1d7250d496216509eb4adf15434fc
BLAKE2b-256 | a0e93368acd0bf0703fb028a3f32f8e1bbb64bfd013cd19e13011bae748c8f3d