Korean tokenizer with character decomposition
Parasol Tokenizer
Parasol tokenizes Hangul after decomposition: it splits each Hangul syllable into its consonants and vowels (jamo) and then tokenizes the result.
- Original text : 고가도로에 삐져나온 초록잎 아마 이 도시에서 유일히 적응 못한 낭만일 거야
- Decomposed text : ㄱㅗㄱㅏㄷㅗㄹㅗㅇㅔ ㅃㅣㅈㅕㄴㅏㅇㅗㄴ ㅊㅗㄹㅗㄱㅇㅣㅍ ㅇㅏㅁㅏ ㅇㅣ ㄷㅗㅅㅣㅇㅔㅅㅓ ㅇㅠㅇㅣㄹㅎㅣ ㅈㅓㄱㅇㅡㅇ ㅁㅗㅅㅎㅏㄴ ㄴㅏㅇㅁㅏㄴㅇㅣㄹ ㄱㅓㅇㅑ
- Tokens : ▁ㄱㅗㄱㅏ / ㄷㅗㄹㅗ / ㅇㅔ / ▁ㅃㅣ / ㅈㅕㄴ / ㅏㅇㅗㄴ / ▁ㅊ / ㅗㄹ / ㅗㄱ / ㅇㅣ / ㅍ / ▁ㅇㅏㅁㅏ / ▁ㅇㅣ / ▁ㄷㅗㅅㅣ / ㅇㅔㅅㅓ / ▁ㅇㅠㅇㅣㄹ / ㅎㅣ / ▁ㅈㅓㄱㅇㅡㅇ / ▁ㅁㅗㅅㅎㅏㄴ / ▁ㄴㅏㅇㅁㅏㄴ / ㅇㅣㄹ / ▁ㄱㅓㅇㅑ
- Composed tokens : ▁고가 / 도로 / 에 / ▁삐 / 젼 / ㅏ온 / ▁ㅊ / ㅗㄹ / ㅗㄱ / 이 / ㅍ / ▁아마 / ▁이 / ▁도시 / 에서 / ▁유일 / 히 / ▁적응 / ▁못한 / ▁낭만 / 일 / ▁거야
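The decomposition step above can be illustrated with plain Unicode arithmetic. This is a minimal sketch, not Parasol's implementation (Parasol uses hgtk for this step); the `decompose` helper and the table names are hypothetical and exist only for illustration. It relies on the standard layout of the precomposed Hangul syllable block, which starts at U+AC00 and is ordered by initial consonant, vowel, and final consonant.

```python
# Compatibility jamo in the order used by the Unicode Hangul syllable block.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + "none"


def decompose(text: str) -> str:
    """Split every precomposed Hangul syllable into its jamo (hypothetical helper)."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 19 * 21 * 28:
            # offset = (initial * 21 + vowel) * 28 + final
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            # Spaces, punctuation, and non-Hangul characters pass through unchanged.
            out.append(ch)
    return "".join(out)


print(decompose("고가도로에 삐져나온 초록잎"))
```

The output of this sketch for the sample sentence should match the "Decomposed text" line above; whether hgtk handles every edge case the same way is an assumption.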
Installation
pip install parasol-nlp
Experiment
The figure shows the results of the perplexity comparison experiment. "with decomposition" means the text is tokenized after character decomposition; "no decomposition" means the text is tokenized directly.
Experiment source code is here.
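For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below is independent of the linked experiment code and makes no claim about how the experiment was run; the `perplexity` function and its input format (a list of natural-log token probabilities) are assumptions made for illustration.

```python
import math


def perplexity(log_probs):
    """Perplexity from a list of natural-log token probabilities (hypothetical helper)."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    # exp of the mean negative log-likelihood per token
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

Note that decomposed and non-decomposed tokenizations generally produce different token counts for the same text, so per-token perplexities are only directly comparable if they are normalized over the same unit.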
Usage
Tokenizer
Parasol uses SentencePiece's BPE model as the tokenizer and hgtk for decomposition.
from parasol import Tokenizer
# tokenize after decomposition
t1 = Tokenizer(decompose=True)
# tokenize without decomposition
t2 = Tokenizer(decompose=False)
Then:
>>> t1.tokenize("고가도로에 삐져나온 초록잎 아마 이 도시에서 유일히 적응 못한 낭만일 거야")
['▁고가', '도로', '에', '▁삐', '젼', 'ㅏ온', '▁ㅊ', 'ㅗ록', '잎', '▁아마', '▁이', '▁도시', '에서', '▁유일', '히', '▁적응', '▁못한', '▁낭만', '일', '▁거야']
>>> t2.tokenize("고가도로에 삐져나온 초록잎 아마 이 도시에서 유일히 적응 못한 낭만일 거야")
['▁고가', '도로', '에', '▁삐', '져', '나온', '▁초록', '잎', '▁아마', '▁이', '▁도시', '에서', '▁유일', '히', '▁적응', '▁못한', '▁낭만', '일', '▁거야']
# Output as vocabulary ids
>>> t1.tokenize("고가도로에 삐져나온 초록잎 아마 이 도시에서 유일히 적응 못한 낭만일 거야", as_id=True)
[17687, 2135, 36, 8351, 3904, 3842, 52, 12256, 27398, 3469, 30, 6105, 160, 3767, 198, 8953, 2345, 13164, 89, 6872]
Composer
The composer joins Hangul jamo back into syllables.
from parasol import Composer
c = Composer()
Then:
>>> c.compose("ㄷㅏㄹㅇㅣ ㄱㅣㅇㅜㄴ ㅂㅏㅁ ㅍㅓㄹㅓㄴㅂㅣㅊㅇㅣ ㅅㅡㅁㅕㄷㅡㄴ ㄱㅗㄹㅁㅗㄱㅇㅡㄹ ㄱㅓㄹㅇㅓㄱㅏㄷㅓㄴ ㄱㅣㄹㅇㅔ")
'달이 기운 밤 퍼런빛이 스며든 골목을 걸어가던 길에'
Composition is not perfect, however. For example:
>>> c.compose("ㅎㅐㅇㅇㅜㄴㅇㅡㄹ ㅂㅣㄹㅇㅓㅇㅛㅎㅎ")
'행운을 빌어욯ㅎ'
The original text here was 행운을 빌어요ㅎㅎ: the composer attached the first trailing ㅎ to 요 as a final consonant.
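The ambiguity behind this kind of failure can be reproduced with a simple greedy rule. The sketch below is an assumption about how such a composer might work, not Parasol's actual `Composer`; `greedy_compose` and the jamo tables are hypothetical names. The rule is: an initial consonant followed by a vowel starts a syllable, and the next consonant becomes its final consonant unless that consonant is itself followed by a vowel.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def greedy_compose(jamo: str) -> str:
    """Greedily join jamo into syllables (illustrative sketch only)."""
    out = []
    i, n = 0, len(jamo)
    while i < n:
        ch = jamo[i]
        # A syllable needs an initial consonant immediately followed by a vowel.
        if ch in CHO and i + 1 < n and jamo[i + 1] in JUNG:
            cho = CHO.index(ch)
            jung = JUNG.index(jamo[i + 1])
            jong = 0
            i += 2
            # Take the next consonant as a final unless it starts the next syllable.
            if i < n and jamo[i] in JONG[1:]:
                starts_next = i + 1 < n and jamo[i + 1] in JUNG
                if not starts_next:
                    jong = JONG.index(jamo[i])
                    i += 1
            out.append(chr(0xAC00 + (cho * 21 + jung) * 28 + jong))
        else:
            # Lone jamo, spaces, and other characters are copied unchanged.
            out.append(ch)
            i += 1
    return "".join(out)


print(greedy_compose("ㄷㅏㄹㅇㅣ ㄱㅣㅇㅜㄴ ㅂㅏㅁ"))
print(greedy_compose("ㅎㅐㅇㅇㅜㄴㅇㅡㄹ ㅂㅣㄹㅇㅓㅇㅛㅎㅎ"))
```

Under this rule the jamo sequence alone cannot tell a trailing standalone ㅎ apart from a final consonant, which is why information lost during decomposition cannot always be recovered.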