Modern Korean NLP Package
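moko is a module that extracts Sino-Korean (Hanja) words from mixed-script Korean texts written in both Hangul and Hanja.
It was developed as part of the Korean studies database construction research of the HK+ Project Group at the Institute for Modern Korean Studies (근대한국학연구소).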
Installation
$ pip install moko
Usage
Noun chunking
- noun_chunk_dict: dictionary-based word extraction
- noun_chunk_model: noun chunking module with a spacing model
Training data: editorial articles from the Hwangseong Sinmun (황성신문), with word spacing added by researchers in the field
from moko import noun_chunker as nc
text = "泱泱大風이 固由於萬籟齊應이나 其初也엔 起於一蓬之末고 彼文明國之所謂 文明이 固謂其國民全軆之文明이나 其文明開發之原動力은"
dct_lst = nc.noun_chunk_dict(text)
print(dct_lst)
mdl_lst = nc.noun_chunk_model(text)
print(mdl_lst)
Parameter
- char_num: controls the word length; default is "4"
- stopword_lst: stopword list; the default list contains 654 words ('今日', '今年', '一日', ...)
- usrword_lst: a list of words you want to include ('noun_chunk_dict' only)
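To make the role of these three parameters concrete, here is a minimal standalone sketch. It is not moko's implementation: the helper `filter_candidates`, the regular expression, and the rules themselves (a length cap for `char_num`, exact-match removal for `stopword_lst`, forced inclusion for `usrword_lst`) are assumptions made for illustration only.

```python
import re

# Hypothetical helper, NOT part of moko. It shows one plausible reading of
# char_num, stopword_lst and usrword_lst on plain Hanja runs.
HANJA_RUN = re.compile(r"[\u4e00-\u9fff]+")  # CJK Unified Ideographs block

def filter_candidates(text, char_num=4, stopword_lst=(), usrword_lst=()):
    """Return Hanja runs of at most char_num characters, minus stopwords.

    Assumption: a user word found inside a run is returned in place of the run.
    """
    stop = set(stopword_lst)
    out = []
    for run in HANJA_RUN.findall(text):
        user_hits = [w for w in usrword_lst if w in run]
        if user_hits:
            out.extend(user_hits)          # user words always win
        elif len(run) <= char_num and run not in stop:
            out.append(run)                # short enough and not a stopword
    return out

sample = "泱泱大風이 固由於萬籟齊應이나 今日 文明이"
result = filter_candidates(
    sample, char_num=4, stopword_lst=["今日"], usrword_lst=["萬籟"]
)
print(result)
```

In this sketch the seven-character run is dropped by the length cap except for the user word found inside it, and '今日' is removed as a stopword. The actual behavior of moko may differ.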
Word count
- word_count: simple word count
- co_occurence_count: returns co-occurrence pairs
from moko import term_analyzer as ta
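noun_list = dct_lst  # nouns extracted in the noun chunking example (assumed to be the expected input)
print(ta.word_count(noun_list))
print(ta.co_occurence_count(noun_list))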
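For readers unfamiliar with the two statistics, the following standalone sketch shows what a word count and a co-occurrence count typically compute. It does not reproduce moko's code: the function names `word_count_sketch` and `co_occurrence_sketch` are hypothetical, and the choice to count co-occurrence per document (a list of noun lists) is an assumption, since the unit moko uses is not stated here.

```python
from collections import Counter
from itertools import combinations

def word_count_sketch(nouns):
    """Count how often each word appears in a flat list of nouns."""
    return Counter(nouns)

def co_occurrence_sketch(docs):
    """Count unordered pairs of distinct words that share a document.

    Assumption: docs is a list of noun lists, one per document.
    """
    pairs = Counter()
    for nouns in docs:
        # sorted(set(...)) gives each pair once, in a stable order
        for a, b in combinations(sorted(set(nouns)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [["文明", "國民", "文明"], ["文明", "開發"], ["國民", "文明"]]
counts = word_count_sketch([w for doc in docs for w in doc])
pairs = co_occurrence_sketch(docs)
print(counts.most_common())
print(pairs.most_common())
```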
Extracts a window of N words on each side of a keyword in noun_list.
Overlapping windows are merged (Case2).
- Case1: A, B, KEY, C, D
- Case2: A, B, KEY, C, KEY, KEY, D, E
from moko import term_analyzer as ta
print(ta.extract_window(dct_lst,"文明",2))
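The two cases above can be reproduced with a short standalone sketch. This is an illustration of window extraction with merging, not moko's own code: the function name `extract_window_sketch` and the rule that windows are merged when their index ranges overlap are assumptions.

```python
def extract_window_sketch(nouns, keyword, n):
    """Return windows of up to n words on each side of keyword.

    Windows whose index ranges overlap are merged into one (Case2).
    """
    spans = []  # list of [start, end) index ranges
    for i, word in enumerate(nouns):
        if word != keyword:
            continue
        start, end = max(0, i - n), min(len(nouns), i + n + 1)
        if spans and start < spans[-1][1]:
            # overlaps the previous window: extend it instead of adding a new one
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [nouns[s:e] for s, e in spans]

case1 = extract_window_sketch(["X", "A", "B", "KEY", "C", "D", "Y"], "KEY", 2)
case2 = extract_window_sketch(
    ["X", "A", "B", "KEY", "C", "KEY", "KEY", "D", "E", "Y"], "KEY", 2
)
print(case1)
print(case2)
```

With N = 2, the first call yields the single window of Case1, and the second call merges the three overlapping windows into the single window of Case2.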
To be added
- Named Entity Recognition: person names, book titles, author names, institution names
- Word Embedding: w2v (skip-gram), FastText
- Handling of demonstrative pronouns and prefixes when the spacing model is used
History
0.1.0.14 (2023-03-21) - First version of moko