Ai Palette NLP toolkit

aipalette_nlp

aipalette_nlp is a Python package that collects NLP functions intended for use in future tasks at Ai Palette. It currently has two modules: one with tokenizers for different languages, and one with text-preprocessing functions, including language detection. More modules and functions are planned.

How to Use

Install the package with pip (pip install aipalette_nlp), then import it directly in your code.

Modules

Module1: tokenizer

Below is an example of the word_tokenize function from the tokenizer module. It automatically detects the input language and calls the corresponding tokenizer.

from aipalette_nlp.tokenizer import word_tokenize

text = "우아아 제 요리에 날개를 달아주는 아름다운 <키친콤마> 식품들이 도착했어요. 저당질, 저탄수화물로 만들어져 건강과 다이어트 그리고 맛까지 한꺼번에 챙길 수 있는 필수템입니다! 처음 호기심에서 시작한 저탄고지 키토식단을 유지한지 어느덧 2년 가까이 되었어요. 저탄고지는 살을 빼기위해 무작정 탄수화물을 끊는다거나 몸에 무리가 갈 수 있는 저칼로리 / 저염식이 아니에요. 내 몸에서 나타나는 반응에 좀더 귀기울이고 끊임없이 공부하고 좋은 음식을 섭취하려고 노력하는 라이프스타일 입니다."  

print(word_tokenize(text)) 

Output:

{'tokenized_text': ['우아아', '제', '요리에', '날개를', '달아주는', '아름다운', '<키친콤마>', '식품들이', '도착했어요', '저당질,', '저탄수화물로', '만들어져', '건강과', '다이어트', '그리고', '맛까지', '한꺼번에', '챙길', '수', '있는', '필수템입니ᄃ', 'ᅡ!', '처음', '호기심에서', '시작 한', '저탄고지', '키토식단을', '유지한지', '어느덧', '2년', '가까이', '되었어요', '저탄고지는', '살을', '빼기위해', '무작정', '탄수화물을', '끊는다거나', '몸에', '무리가', '갈', '수', '있는', '저칼로리', '/', '', '저염식이', '아니에요', '내', '몸에서', '나타나는', '반응에', '좀더', '귀기울이고', '끊임없이', '공부하고', '좋은', '음식을', '섭취하려고', '노력하는', '라이프스타일', '입니다']}
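
The detect-then-dispatch behaviour described above can be sketched as a lookup table from language name to tokenizer function. This is only an illustration, not the package's implementation: guess_language, whitespace_tokenize, character_tokenize, TOKENIZERS and word_tokenize_sketch are hypothetical names, and the script-based language guess is a crude stand-in for a real detector.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace (stand-in for a real per-language tokenizer)."""
    return text.split()


def character_tokenize(text):
    """Emit one token per non-space character (stand-in for a CJK tokenizer)."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: language name -> tokenizer function.
TOKENIZERS = {
    "english": whitespace_tokenize,
    "korean": whitespace_tokenize,
    "mandarin": character_tokenize,
}


def guess_language(text):
    """Very rough guess based on Unicode script ranges (assumption, not a real detector)."""
    if re.search(r"[\uac00-\ud7a3]", text):  # Hangul syllables
        return "korean"
    if re.search(r"[\u4e00-\u9fff]", text):  # CJK unified ideographs
        return "mandarin"
    return "english"


def word_tokenize_sketch(text):
    """Detect the language, pick its tokenizer, and wrap the tokens in a dict."""
    tokenizer = TOKENIZERS[guess_language(text)]
    return {"tokenized_text": tokenizer(text)}


print(word_tokenize_sketch("좋은 음식을 섭취"))
```

Keeping the tokenizers in a dictionary means a new language only needs a new entry, without touching the dispatch function.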

Module2: text_cleaning

Below is an example of how you can use the functions in the text_cleaning module.

from aipalette_nlp.preprocessing import detect_language, clean_text, remove_stopwords

text = """Dinner at @docksidevancouver . Patio season is definitely here!Support your local restaurants.

#foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover
#curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics  #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner"""

print("language detected of the given text is : ", detect_language(text))
print(remove_stopwords(text))
print(clean_text(text))

Output:

language detected of the given text is : en

dinner @docksidevancouver . patio season definitely here!support local restaurants. #foodie #facestuffing #scoutmagazine #vancouvermagazine #dailyhivevancouver #ediblevancouver #eatmagazine #vancouverisawesome #vancouverfoodie #food #foodlover #curiocityvancouver #foodporn #foodlover #eat #foodgasm #foodinsta #foodinstagram #instafood #instafoodie #foodlover #foodpics #foodiesofinstagram #restaurant #homechef #foodphotography #nomnomnom #georgiastraight #docksiderestaurant #granvilleisland #gnocchi #dinner

{'hashtags': ['foodie', 'facestuffing', 'scoutmagazine', 'vancouvermagazine', 'dailyhivevancouver', 'ediblevancouver', 'eatmagazine', 'vancouverisawesome', 'vancouverfoodie', 'food', 'foodlover', 'curiocityvancouver', 'foodporn', 'foodlover', 'eat', 'foodgasm', 'foodinsta', 'foodinstagram', 'instafood', 'instafoodie', 'foodlover', 'foodpics', 'foodiesofinstagram', 'restaurant', 'homechef', 'foodphotography', 'nomnomnom', 'georgiastraight', 'docksiderestaurant', 'granvilleisland', 'gnocchi', 'dinner'], 'cleaned_text': 'dinner username patio season definitely support local restaurants', 'text_length': 65}
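
Judging from the remove_stopwords output above, the function appears to lowercase the text, split it on whitespace, and drop stopword tokens, while leaving punctuation, mentions and hashtags untouched. The sketch below reproduces that behaviour under those assumptions. remove_stopwords_sketch and the tiny STOPWORDS set are hypothetical; the package itself is described as using the nltk stopword list.

```python
# Tiny illustrative stopword set (assumption); the real package is said to use nltk.
STOPWORDS = {"a", "an", "the", "at", "is", "are", "your", "here", "and", "of", "to", "in"}


def remove_stopwords_sketch(text):
    """Lowercase, split on whitespace, and drop tokens that are stopwords."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in STOPWORDS]
    return " ".join(kept)


sentence = ("Dinner at @docksidevancouver . Patio season is definitely "
            "here!Support your local restaurants.")
print(remove_stopwords_sketch(sentence))
```

Because the split is on whitespace only, a token such as here!support is not recognised as a stopword and survives, which matches the sample output.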


Complete list of tokenizers supported:

['english', 'french', 'italian', 'portuguese', 'spanish', 'swedish', 'turkish', 'russian', 'mandarin', 'thai', 'japanese', 'korean', 'vietnamese', 'german', 'arabic']


Text Processing/Cleaning Functions

The clean_text function from the text_cleaning module performs the following steps:

  • replaces the hashtags (#______) in the main caption with the original form of the word.
  • replaces every mentioned username (@_______) with the token “<username>”.
  • removes punctuation.
  • removes stopwords (using the nltk package).
  • detects the language.
  • replaces all links/URLs.
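
The steps above can be sketched with regular expressions from the standard library. This is a minimal sketch under stated assumptions, not the package's code: clean_text_sketch and the small STOPWORDS set are hypothetical, links are simply dropped (the source does not say what they are replaced with), a trailing run of hashtags is removed from the caption, and the language-detection step is left out.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the package is said to use nltk.
STOPWORDS = {"a", "an", "the", "at", "is", "are", "your", "here", "and", "of", "to", "in"}

URL = re.compile(r"https?://\S+|www\.\S+")
HASHTAG = re.compile(r"#(\w+)")
TRAILING_HASHTAGS = re.compile(r"(?:\s*#\w+)+\s*$")
MENTION = re.compile(r"@\w+")
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text_sketch(text):
    """Return hashtags, cleaned text and its length, mirroring the sample output shape."""
    text = URL.sub(" ", text)                     # drop links (assumed replacement)
    hashtags = [tag.lower() for tag in HASHTAG.findall(text)]
    text = TRAILING_HASHTAGS.sub(" ", text)       # remove the hashtag block at the end
    text = HASHTAG.sub(r"\1", text)               # in-caption hashtag -> plain word
    text = MENTION.sub(" <username> ", text)      # mention -> <username> token
    text = text.translate(PUNCT_TO_SPACE)         # punctuation -> spaces (also strips < >)
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    cleaned = " ".join(words)
    return {"hashtags": hashtags, "cleaned_text": cleaned, "text_length": len(cleaned)}


caption = ("Dinner at @docksidevancouver . Patio season is definitely "
           "here!Support your local restaurants.\n\n#foodie #dinner")
print(clean_text_sketch(caption))
```

Removing punctuation after the mention replacement is why “<username>” shows up as plain username in the cleaned text, and text_length is simply the character count of cleaned_text.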

Languages supported by the language detector:

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu
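
The detector reports two-letter codes, while the tokenizer list uses language names, so some mapping between the two is needed when combining them. The sketch below is one way to write it, using the ISO 639-1 codes of the fifteen tokenizer languages. CODE_TO_TOKENIZER and tokenizer_name_for are hypothetical names, and treating zh as mandarin is an assumption; how the package links the two lists internally is not documented here.

```python
# Hypothetical mapping from detector codes (ISO 639-1) to tokenizer names.
CODE_TO_TOKENIZER = {
    "en": "english",
    "fr": "french",
    "it": "italian",
    "pt": "portuguese",
    "es": "spanish",
    "sv": "swedish",
    "tr": "turkish",
    "ru": "russian",
    "zh": "mandarin",  # assumption: zh is handled by the mandarin tokenizer
    "th": "thai",
    "ja": "japanese",
    "ko": "korean",
    "vi": "vietnamese",
    "de": "german",
    "ar": "arabic",
}


def tokenizer_name_for(language_code):
    """Return the tokenizer name for a detected code, or None if no tokenizer exists."""
    return CODE_TO_TOKENIZER.get(language_code.lower())


print(tokenizer_name_for("ko"))  # a detected language with a tokenizer
print(tokenizer_name_for("zu"))  # detectable, but not in the tokenizer list
```

Returning None for codes such as zu makes the gap explicit: the detector covers far more languages than the tokenizers do, so callers need a fallback.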
