
Ai Palette NLP toolkit


aipalette_nlp

aipalette_nlp is a Python package that collects NLP functions used in Ai Palette's current and future tasks; more modules and functions will be added over time. For now it contains a module of tokenizers for different languages and a text-preprocessing module whose functions include language detection.

How to Use

Install this package with pip (pip install aipalette_nlp) and import it directly in your code.

Modules

Module 1: tokenizer

Below is an example of how to use the word_tokenize function in the tokenizer module, which automatically detects the input language and calls the corresponding tokenizer.

from aipalette_nlp.tokenizer import word_tokenize

text = "우아아 제 요리에 날개를 달아주는 아름다운 <키친콤마> 식품들이 도착했어요. 저당질, 저탄수화물로 만들어져 건강과 다이어트 그리고 맛까지 한꺼번에 챙길 수 있는 필수템입니다! 처음 호기심에서 시작한 저탄고지 키토식단을 유지한지 어느덧 2년 가까이 되었어요. 저탄고지는 살을 빼기위해 무작정 탄수화물을 끊는다거나 몸에 무리가 갈 수 있는 저칼로리 / 저염식이 아니에요. 내 몸에서 나타나는 반응에 좀더 귀기울이고 끊임없이 공부하고 좋은 음식을 섭취하려고 노력하는 라이프스타일 입니다."  

print(word_tokenize(text)) 

Output:

{'tokenized_text': ['우아아', '제', '요리에', '날개를', '달아주는', '아름다운', '<키친콤마>', '식품들이', '도착했어요', '저당질,', '저탄수화물로', '만들어져', '건강과', '다이어트', '그리고', '맛까지', '한꺼번에', '챙길', '수', '있는', '필수템입니다!', '처음', '호기심에서', '시작한', '저탄고지', '키토식단을', '유지한지', '어느덧', '2년', '가까이', '되었어요', '저탄고지는', '살을', '빼기위해', '무작정', '탄수화물을', '끊는다거나', '몸에', '무리가', '갈', '수', '있는', '저칼로리', '/', '', '저염식이', '아니에요', '내', '몸에서', '나타나는', '반응에', '좀더', '귀기울이고', '끊임없이', '공부하고', '좋은', '음식을', '섭취하려고', '노력하는', '라이프스타일', '입니다']}
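
The same function works for text in any of the supported languages. Below is a minimal sketch, assuming the return format (a dict with a 'tokenized_text' key) is the same for every language, as the Korean example above suggests; the English sample text is made up for illustration.

from aipalette_nlp.tokenizer import word_tokenize

# The language should be detected automatically and the English tokenizer
# called under the hood.
english_text = "Patio season is definitely here! Support your local restaurants."
print(word_tokenize(english_text)["tokenized_text"])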

Module 2: text_cleaning

Below is an example of how you can use the functions in the text_cleaning module.

from aipalette_nlp.preprocessing import detect_language, clean_text, remove_stopwords

text = """Dinner at @docksidevancouver . Patio season is definitely here!Support your local restaurants.

#foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover
#curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics  #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner"""

print("language detected of the given text is : ", detect_language(text))
print(remove_stopwords(text))
print(clean_text(text))

Output:

Language detected for the given text is: en

dinner @docksidevancouver . patio season definitely here!support local restaurants. #foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover #curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner

{'hashtags': ['foodie', 'facestuffing', 'scoutmagazine', 'vancouvermagazine', 'dailyhivevancouver', 'ediblevancouver', 'eatmagazine', 'vancouverisawesome', 'vancouverfoodie', 'food', 'foodlover', 'curiocityvancouver', 'foodporn', 'foodlover', 'eat', 'foodgasm', 'foodinsta', 'foodinstagram', 'instafood', 'instafoodie', 'foodlover', 'foodpics', 'foodiesofinstagram', 'restaurant', 'homechef', 'foodphotography', 'nomnomnom', 'georgiastraight', 'docksiderestaurant', 'granvilleisland', 'gnocchi', 'dinner'], 'cleaned_text': 'dinner username patio season definitely support local restaurants', 'text_length': 65}
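
The three functions can also be combined into a small preprocessing step. The sketch below assumes only the return types shown in the output above: detect_language returns an ISO 639-1 code and clean_text returns a dict with 'hashtags', 'cleaned_text', and 'text_length' keys. The helper name preprocess_caption is made up for illustration.

from aipalette_nlp.preprocessing import detect_language, clean_text

def preprocess_caption(caption):
    # Clean the caption once, then attach the detected language code.
    cleaned = clean_text(caption)
    return {
        "language": detect_language(caption),
        "hashtags": cleaned["hashtags"],
        "cleaned_text": cleaned["cleaned_text"],
        "length": cleaned["text_length"],
    }

print(preprocess_caption("Dinner at @docksidevancouver #foodie #dinner"))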


Complete list of tokenizers supported:

['english', 'french', 'italian', 'portuguese', 'spanish', 'swedish', 'turkish', 'russian', 'mandarin', 'thai', 'japanese', 'korean', 'vietnamese', 'german', 'arabic']


Text Processing/Cleaning Functions

The clean_text function from the text_cleaning module performs the following steps (a rough sketch of equivalent logic follows the list):

  • replace hashtags (#______) in the main caption with the original form of the word
  • replace all mentioned usernames (@_______) with the word "<username>"
  • remove punctuation
  • remove stopwords (using the nltk package)
  • detect the language
  • replace all links/URLs
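
For reference, a rough re-implementation of these steps with re and nltk might look like the sketch below. This is not the package's actual code; the function name clean_text_sketch and the "<url>" placeholder are made up for illustration, and the nltk stopword lists must be downloaded once with nltk.download('stopwords').

import re
from nltk.corpus import stopwords

def clean_text_sketch(text, language="english"):
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)   # replace links/URLs
    text = re.sub(r"@\w+", "<username>", text)                # replace mentioned usernames
    text = re.sub(r"#(\w+)", r"\1", text)                     # keep the hashtag word, drop '#'
    text = re.sub(r"[^\w\s<>]", " ", text)                    # remove punctuation
    words = [w for w in text.lower().split()
             if w not in stopwords.words(language)]           # remove stopwords (nltk)
    return " ".join(words)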

Languages supported by our language detector:

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
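
detect_language returns a two-letter ISO 639-1 code, as the 'en' result above shows. A minimal sketch with a non-English caption (the French sample text is made up for illustration and would be expected to return 'fr'):

from aipalette_nlp.preprocessing import detect_language

print(detect_language("Bonjour tout le monde, le dîner au restaurant était délicieux."))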
