Trained Korean Lemmatizer

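Korean Predicate Analyzer (Korean Lemmatizer)

This package analyzes the conjugated (surface) forms of Korean verbs and adjectives. It provides the following features:

  1. Split an input word into its stem and ending (eomi)
  2. Restore an input word to its base form (lemma)

The principles behind the implementation are described in a github.io blog post.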

Usage

analyze, lemmatize, conjugate

The analyze function returns the morphemes of a given predicate (verb or adjective) word.

from soylemma import Lemmatizer

lemmatizer = Lemmatizer()
lemmatizer.analyze('차가우니까')

The return value is a list of tuples because a word can have more than one possible morpheme combination.

[(('차갑', 'Adjective'), ('우니까', 'Eomi'))]
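The nested shape of that return value can be unpacked directly in a loop. The snippet below is only a sketch of how one might consume it: the `analyses` value is copied from the output above rather than produced by calling soylemma, and `describe` is a hypothetical helper, not part of the package.

```python
# Sample value copied from the README output above; soylemma is not called here.
analyses = [(('차갑', 'Adjective'), ('우니까', 'Eomi'))]


def describe(analyses):
    """Render each (stem, eomi) candidate as 'stem/tag + eomi/tag'."""
    lines = []
    # Each candidate is a pair of (morpheme, tag) tuples: the stem, then the eomi.
    for (stem, stem_tag), (eomi, eomi_tag) in analyses:
        lines.append(f"{stem}/{stem_tag} + {eomi}/{eomi_tag}")
    return lines


print(describe(analyses))
```

An empty list from analyze simply yields an empty list here, so callers do not need a special case for unanalyzable words.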

The lemmatize function returns the lemma of the given predicate word.

lemmatizer.lemmatize('차가우니까')
[('차갑다', 'Adjective')]

If the input word is not a predicate, for example a noun, it returns an empty list.

lemmatizer.lemmatize('한국어') # []
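The two outputs shown so far are consistent with forming a lemma by appending 다 to the analyzed stem (차갑 becomes 차갑다). As a rough illustration of that relationship, and not a description of how soylemma computes it internally, a hypothetical helper `lemmas_from_analyses` could look like this:

```python
def lemmas_from_analyses(analyses):
    """Build (lemma, tag) pairs by appending '다' to each stem, skipping duplicates."""
    lemmas = []
    for (stem, tag), _eomi in analyses:
        candidate = (stem + '다', tag)
        # Several analyses can share one stem; keep each lemma once.
        if candidate not in lemmas:
            lemmas.append(candidate)
    return lemmas


# Input copied from the analyze output shown earlier in this README.
print(lemmas_from_analyses([(('차갑', 'Adjective'), ('우니까', 'Eomi'))]))
```

With no analyses (as for a noun such as 한국어), the helper returns an empty list, mirroring the behavior described above.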

The conjugate function returns surface forms. Pass the stem and the eomi as arguments; it returns all possible surface forms for that pair.

lemmatizer.conjugate(stem='차갑', eomi='우니까')
# ['차가우니까', '차갑우니까']
lemmatizer.conjugate('예쁘', '었던')
# ['예뻤던', '예쁘었던']
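One way to picture why two candidates come back for each pair: a rule may fuse the last stem syllable with the first eomi syllable, and the plain concatenation is kept as a fallback. The sketch below is a toy model under that assumption. The rule table `TOY_CONJUGATE_RULES` and the function `toy_conjugate` are hypothetical and hold only the two rules needed for the examples above; the package's real rule format and ordering may differ.

```python
# Hypothetical rule table:
# (last stem syllable, first eomi syllable) -> fused surface strings.
TOY_CONJUGATE_RULES = {
    ('갑', '우'): {'가우'},
    ('쁘', '었'): {'뻤'},
}


def toy_conjugate(stem, eomi, rules=TOY_CONJUGATE_RULES):
    """Return rule-based surface forms first, then the plain concatenation."""
    candidates = []
    key = (stem[-1], eomi[0])
    for fused in sorted(rules.get(key, ())):
        # Replace the boundary syllables with their fused form.
        candidates.append(stem[:-1] + fused + eomi[1:])
    plain = stem + eomi
    if plain not in candidates:
        candidates.append(plain)
    return candidates


print(toy_conjugate('차갑', '우니까'))
print(toy_conjugate('예쁘', '었던'))
```

Keeping the unfused concatenation in the result reflects the "all possible surface forms" behavior described above; a caller that wants only natural forms would need to filter the list.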

update dictionaries and rules

For demonstration, we use the demo dictionary.

어여뻤어 cannot be analyzed because the adjective 어여쁘 is not registered in the dictionary.

from soylemma import Lemmatizer

lemmatizer = Lemmatizer(dictionary_name='demo')
print(lemmatizer.analyze('어여뻤어')) # []

So we add the word with its tag using the add_words function. Running the analysis again, the word 어여뻤어 is now analyzed.

lemmatizer.add_words('어여쁘', 'Adjective')
lemmatizer.analyze('어여뻤어')
[(('어여쁘', 'Adjective'), ('었어', 'Eomi'))]

However, the word 파랬다 still cannot be analyzed because no lemmatization rule exists for its surface form.

lemmatizer.analyze('파랬다') # []

This time, we add lemmatization rules using the add_lemma_rules function.

supplements = {
    '랬': {('랗', '았')}
}

lemmatizer.add_lemma_rules(supplements)

After that, the word 파랬다 can be analyzed, and the conjugation of 파랗 + 았다 is also available.

lemmatizer.analyze('파랬다')
# [(('파랗', 'Adjective'), ('았다', 'Eomi'))]
lemmatizer.conjugate('파랗', '았다')
# ['파랬다', '파랗았다']
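To make the role of a lemma rule such as '랬' -> ('랗', '았') concrete, here is a small self-contained sketch of rule-based analysis. It assumes a simplified model: try every plain split, then try expanding each syllable that has a rule, and accept a candidate only when both the stem and the eomi are in the dictionaries. The names `TOY_STEMS`, `TOY_EOMIS`, `TOY_LEMMA_RULES`, and `toy_analyze` are hypothetical, and this is not soylemma's actual algorithm.

```python
# Hypothetical toy dictionaries and rules, just large enough for this example.
TOY_STEMS = {'파랗': 'Adjective'}
TOY_EOMIS = {'았다', '다'}
TOY_LEMMA_RULES = {'랬': {('랗', '았')}}


def toy_analyze(word, stems=TOY_STEMS, eomis=TOY_EOMIS, rules=TOY_LEMMA_RULES):
    """Return (stem, eomi) candidates found by plain splits and by lemma rules."""
    results = []
    # Case 1: the word is a plain concatenation of a known stem and eomi.
    for i in range(1, len(word)):
        stem, eomi = word[:i], word[i:]
        if stem in stems and eomi in eomis:
            results.append(((stem, stems[stem]), (eomi, 'Eomi')))
    # Case 2: one syllable is a fusion of a stem tail and an eomi head.
    for i, syllable in enumerate(word):
        for stem_tail, eomi_head in sorted(rules.get(syllable, ())):
            stem = word[:i] + stem_tail
            eomi = eomi_head + word[i + 1:]
            if stem in stems and eomi in eomis:
                results.append(((stem, stems[stem]), (eomi, 'Eomi')))
    return results


print(toy_analyze('파랬다', rules={}))  # no rule for '랬' yet
print(toy_analyze('파랬다'))            # with the supplementary rule
```

In this model, adding a word and adding a rule are independent steps: the rule proposes 파랗 + 았다, but the candidate survives only because 파랗 is also in the stem dictionary.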

debug on

If you want to see which subwords were considered as (stem, eomi) candidates, use the debug option.

lemmatizer.analyze('파랬다', debug=True)
[DEBUG] word: 파랬다 = 파랗 + 았다, conjugation: 랬 = 랗 + 았
[(('파랗', 'Adjective'), ('았다', 'Eomi'))]

lemmatization rule extractor

You can extract a lemmatization rule using the extract_rule function.

from soylemma import extract_rule

eojeol = '로드무비였다'
lw = '로드무비이'
lt = 'Adjective'
rw = '었다'
rt = 'Eomi'

extract_rule(eojeol, lw, lt, rw, rt)
('였다', ('이', '었다'))
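The result above pairs the part of the eojeol that differs from the stem with the (stem tail, eomi) it came from. One simple way to reproduce that output for this example is to strip the prefix shared by the eojeol and the stem. The function `toy_extract_rule` below is a hypothetical sketch of that idea. It ignores the tag arguments, and the real extract_rule may use a different procedure and return different spans for other inputs.

```python
def toy_extract_rule(eojeol, stem, eomi):
    """Return (surface, (stem_tail, eomi)), or None if the stem is unchanged."""
    # Count how many leading syllables the eojeol and the stem share.
    shared = 0
    for a, b in zip(eojeol, stem):
        if a != b:
            break
        shared += 1
    if shared == len(stem):
        # The whole stem survives in the surface form, so no rule is needed.
        return None
    return (eojeol[shared:], (stem[shared:], eomi))


print(toy_extract_rule('로드무비였다', '로드무비이', '었다'))
```

For a regular form such as 먹었다 with stem 먹 and eomi 었다, the sketch returns None because the stem appears unchanged in the surface form.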

