Trained Korean Lemmatizer
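Korean Predicate Analyzer (Korean Lemmatizer)
This package analyzes the conjugated (surface) forms of Korean verbs and adjectives. The Korean lemmatizer provides the following functions.
- Split an input word into its stem and ending (eomi)
- Restore an input word to its base form (lemma)
The principles behind this package's implementation are described in a github.io blog post.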
Usage
analyze, lemmatize, conjugate
The analyze function returns the morphemes of the given predicate word.
from soylemma import Lemmatizer
lemmatizer = Lemmatizer()
lemmatizer.analyze('차가우니까')
The return value is a list of tuples because a word can have more than one morpheme combination.
[(('차갑', 'Adjective'), ('우니까', 'Eomi'))]
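To make the idea of splitting a surface form into (stem, eomi) concrete, here is a minimal sketch of a dictionary-and-rule lookup. It is not soylemma's implementation: the names STEMS, EOMIS, RULES, and toy_analyze are hypothetical, and the tiny dictionaries and rules are assumptions chosen only to reproduce the example above.

```python
# Toy data for illustration only; these are NOT soylemma's dictionaries.
STEMS = {'차갑': 'Adjective', '예쁘': 'Adjective'}
EOMIS = {'우니까', '었던'}
# Assumed rule format: surface substring -> set of (stem ending, eomi beginning)
RULES = {'가우': {('갑', '우')}, '뻤': {('쁘', '었')}}


def toy_analyze(word):
    """Return every (stem, eomi) reading the toy dictionaries allow."""
    results = []
    # Case 1: the word is a plain concatenation of stem + eomi.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in STEMS and right in EOMIS:
            results.append(((left, STEMS[left]), (right, 'Eomi')))
    # Case 2: a surface substring fuses the stem ending and eomi beginning.
    for surface, canons in RULES.items():
        start = word.find(surface)
        while start != -1:
            head, tail = word[:start], word[start + len(surface):]
            for stem_end, eomi_begin in canons:
                stem, eomi = head + stem_end, eomi_begin + tail
                if stem in STEMS and eomi in EOMIS:
                    results.append(((stem, STEMS[stem]), (eomi, 'Eomi')))
            start = word.find(surface, start + 1)
    return results


print(toy_analyze('차가우니까'))
print(toy_analyze('한국어'))
```

With these toy dictionaries, a noun such as 한국어 matches neither case, so the result is an empty list.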
The lemmatize function returns the lemma of the given predicate word.
lemmatizer.lemmatize('차가우니까')
[('차갑다', 'Adjective')]
If the input word is not a predicate (a noun, for example), it returns an empty list.
lemmatizer.lemmatize('한국어') # []
The conjugate function returns surface forms. Pass the stem and eomi as arguments; it returns all possible surface forms for that pair.
lemmatizer.conjugate(stem='차갑', eomi='우니까')
['차가우니까', '차갑우니까']
lemmatizer.conjugate('예쁘', '었던')
['예뻤던', '예쁘었던']
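Conjugation can be viewed as the reverse lookup: given a stem and an eomi, find rules whose (stem ending, eomi beginning) pair matches the boundary, and emit the fused surface form alongside the plain concatenation. The sketch below is an assumption about how such a lookup could work, not the library's code; RULES and toy_conjugate are hypothetical names, and the two rules are picked to match the outputs shown above.

```python
# Assumed rule format: surface substring -> set of (stem ending, eomi beginning).
# These two rules are illustrative, not taken from soylemma's rule files.
RULES = {'가우': {('갑', '우')}, '뻤': {('쁘', '었')}}


def toy_conjugate(stem, eomi):
    """Return fused surface forms first, then the plain concatenation."""
    candidates = []
    for surface, canons in RULES.items():
        for stem_end, eomi_begin in canons:
            if stem.endswith(stem_end) and eomi.startswith(eomi_begin):
                # Replace the boundary characters with the fused surface substring.
                head = stem[:len(stem) - len(stem_end)]
                tail = eomi[len(eomi_begin):]
                candidates.append(head + surface + tail)
    candidates.append(stem + eomi)
    return candidates


print(toy_conjugate('차갑', '우니까'))
print(toy_conjugate('예쁘', '었던'))
```

Keeping the unfused concatenation in the result mirrors the outputs above, where forms such as 차갑우니까 are listed even though they are not natural Korean.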
update dictionaries and rules
For demonstration, we use the demo dictionary. The word 어여뻤어 cannot be analyzed because the adjective 어여쁘 is not registered in the dictionary.
from soylemma import Lemmatizer
lemmatizer = Lemmatizer(dictionary_name='demo')
print(lemmatizer.analyze('어여뻤어')) # []
So we add the word with its tag using the add_words function and analyze it again. Now the word 어여뻤어 is analyzed.
lemmatizer.add_words('어여쁘', 'Adjective')
lemmatizer.analyze('어여뻤어')
[(('어여쁘', 'Adjective'), ('었어', 'Eomi'))]
However, the word 파랬다 still cannot be analyzed because no lemmatization rule exists for the surface form 랬.
lemmatizer.analyze('파랬다') # []
This time, we add lemmatization rules using the add_lemma_rules function.
supplements = {
'랬': {('랗', '았')}
}
lemmatizer.add_lemma_rules(supplements)
After that, the word 파랬다 is analyzed, and the conjugation of 파랗 + 았다 is also available.
lemmatizer.analyze('파랬다')
[(('파랗', 'Adjective'), ('았다', 'Eomi'))]
lemmatizer.conjugate('파랗', '았다')
['파랬다', '파랗았다']
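The supplements argument is a dict that maps a surface substring to a set of (stem ending, eomi beginning) pairs. One plausible way to combine such a dict with existing rules is a per-key set union, sketched below. The merge_rules helper and the base rule are hypothetical; how add_lemma_rules merges rules internally is not stated here.

```python
from collections import defaultdict


def merge_rules(base, supplements):
    """Union two rule dicts key by key without mutating either input."""
    merged = defaultdict(set)
    for source in (base, supplements):
        for surface, canons in source.items():
            merged[surface].update(canons)
    return dict(merged)


# Hypothetical existing rule: 그랬다 = 그러 + 었다
base = {'랬': {('러', '었')}}
# The supplement from the example above: 파랬다 = 파랗 + 았다
supplements = {'랬': {('랗', '았')}}

rules = merge_rules(base, supplements)
print(rules)
```

Using sets as values means adding the same rule twice has no effect, and one surface form can keep several competing readings.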
debug on
If you want to see which subwords were considered as (stem, eomi) candidates, set debug=True.
lemmatizer.analyze('파랬다', debug=True)
[DEBUG] word: 파랬다 = 파랗 + 았다, conjugation: 랬 = 랗 + 았
[(('파랗', 'Adjective'), ('았다', 'Eomi'))]
lemmatization rule extractor
You can extract a lemmatization rule using the extract_rule function.
from soylemma import extract_rule
eojeol = '로드무비였다'
lw = '로드무비이'
lt = 'Adjective'
rw = '었다'
rt = 'Eomi'
extract_rule(eojeol, lw, lt, rw, rt)
('였다', ('이', '었다'))
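One way to derive such a rule is to strip the longest prefix shared by the surface word and the stem, then pair what remains of each side. The sketch below follows that idea and reproduces the output above, but it is only an assumption about the approach: toy_extract_rule is a hypothetical helper, it ignores the tag arguments, and the real extract_rule may work differently.

```python
def toy_extract_rule(eojeol, stem, eomi):
    """Return (surface, (stem ending, eomi)) or None if no rule is needed."""
    # Length of the longest common prefix of the surface word and the stem.
    n = 0
    while n < min(len(eojeol), len(stem)) and eojeol[n] == stem[n]:
        n += 1
    if n == len(stem):
        # The stem appears unchanged, so the word is a plain concatenation.
        return None
    surface = eojeol[n:]    # part of the word not explained by the shared prefix
    stem_end = stem[n:]     # part of the stem that was fused away
    return surface, (stem_end, eomi)


print(toy_extract_rule('로드무비였다', '로드무비이', '었다'))
print(toy_extract_rule('먹었다', '먹', '었다'))
```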