Modern Korean NLP Package

Project description

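moko is a module that extracts Sino-Korean words (한자어) from mixed-script Korean text written in Hangul and Hanja (국한문혼용).
It was developed as part of the Korean studies database construction research of the HK+ project group of 근대한국학연구소.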


Installation

$ pip install moko

Usage

Noun chunking

  • noun_chunk_dict: dictionary based word extraction
  • noun_chunk_model: noun chunking module with spacing model

Training data: editorial articles from 황성신문 (Hwangseong Sinmun), with word spacing added by researchers in the field

from moko import noun_chunker as nc

text = "泱泱大風이 固由於萬籟齊應이나 其初也엔 起於一蓬之末고 彼文明國之所謂 文明이 固謂其國民全軆之文明이나 其文明開發之原動力은"

dct_lst = nc.noun_chunk_dict(text)
print(dct_lst)

mdl_lst = nc.noun_chunk_model(text)
print(mdl_lst)

Parameter

  • char_num: controls word length, default is "4"
  • stopword_lst: list of stopwords; the default list contains 654 words ('今日', '今年', '一日'...)
  • usrword_lst: a list of words you want to include ('noun_chunk_dict' only)

Word count

  • word_count: simple word count
  • co_occurence_count: returns co-occurrence pairs

from moko import term_analyzer as ta

print(ta.word_count(noun_list))
print(ta.co_occurence_count(noun_list))
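
As a rough illustration of what the two counts mean, the sketch below uses only the standard library. It is not moko's code: the function names `count_words` and `count_pairs` are hypothetical, and treating each inner list as one document in which words co-occur is an assumption about the unit of co-occurrence.

```python
from collections import Counter
from itertools import combinations


def count_words(nouns):
    """Count how often each word occurs in a flat list of nouns."""
    return Counter(nouns)


def count_pairs(documents):
    """Count unordered pairs of distinct words that share a document.

    Assumption: `documents` is a list of noun lists, one per document.
    Each pair is counted at most once per document.
    """
    pairs = Counter()
    for nouns in documents:
        # sorted() gives every pair a canonical order, so (a, b) and
        # (b, a) are counted under the same key.
        for a, b in combinations(sorted(set(nouns)), 2):
            pairs[(a, b)] += 1
    return pairs


docs = [["文明", "國民", "文明"], ["文明", "國民", "原動力"]]
print(count_words(docs[0]))
print(count_pairs(docs))
```

Here 文明 and 國民 appear together in both documents, so that pair gets a count of 2.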

N-word window extraction around a keyword from noun_list

Overlapping windows are merged (Case2)

  • Case1: A, B, KEY, C, D
  • Case2: A, B, KEY, C, KEY, KEY, D, E

from moko import term_analyzer as ta

print(ta.extract_window(dct_lst,"文明",2))
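
The two cases can be sketched in plain Python as follows. This is an illustration of the merging idea only, not moko's implementation; `merged_windows` is a hypothetical name, and merging windows that merely touch (as well as those that overlap) is an assumption.

```python
def merged_windows(tokens, keyword, n):
    """Return windows of n tokens on each side of every keyword hit.

    Windows that overlap or touch are merged into a single window.
    """
    spans = []  # [start, end) index pairs, kept in ascending order
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        start = max(0, i - n)
        end = min(len(tokens), i + n + 1)
        if spans and start <= spans[-1][1]:
            # This window reaches into the previous one: extend it.
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [tokens[s:e] for s, e in spans]


case1 = ["X", "A", "B", "KEY", "C", "D", "Y"]
case2 = ["X", "A", "B", "KEY", "C", "KEY", "KEY", "D", "E", "Y"]
print(merged_windows(case1, "KEY", 2))
# [['A', 'B', 'KEY', 'C', 'D']]
print(merged_windows(case2, "KEY", 2))
# [['A', 'B', 'KEY', 'C', 'KEY', 'KEY', 'D', 'E']]
```

In Case2 the three hits are close enough that their windows chain together, so a single merged window is returned instead of three overlapping ones.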

To be added

  • Named Entity Recognition: person names, book titles, author names, institution names
  • Word Embedding: w2v(skip-gram), FastText
  • Handling of demonstrative pronouns and prefixes when using the spacing model

History

0.1.0.14 (2023-03-21) - First version of moko
