NK-HanDic package for installing via pip.
nkhandic-py: Python wrapper for the NK-HanDic MeCab dictionary
NK-HanDic (dictionary) repository: https://github.com/okikirmui/nkhandic
nkhandic is a Python helper package that makes it easy to use NK-HanDic, a MeCab dictionary for North Korean, from Python code.
⚠️ Important distinction
- NK-HanDic = the MeCab dictionary itself (linguistic resource)
- nkhandic (this package) = a Python interface / utility layer for NK-HanDic
The dictionary is developed and published separately;
this package focuses on Python usability.
If you need a dictionary for contemporary Korean, see HanDic (MeCab dictionary) and handic (Python wrapper).
Overview
nkhandic provides a convenient Python interface for the NK-HanDic North Korean morphological analysis dictionary.
It allows researchers and developers to perform North Korean morphological analysis from Python without manually configuring dictionary paths or MeCab options.
The package:
- Bundles a snapshot of the NK-HanDic dictionary
- Provides high-level Python APIs
- Handles Jamo-based input/output
- Supports Hanja-aware representations
- Works across Linux, macOS, and Windows environments
Relationship between NK-HanDic and this package
NK-HanDic (dictionary repository)
↓
MeCab dictionary files
↓
nkhandic (Python wrapper)
↓
Your Python code
- The linguistic design and dictionary entries live in the NK-HanDic repository
- This package bundles a released snapshot of the dictionary only to enable Python use
- Updates to dictionary content are driven by the NK-HanDic project
Quick Start (Python)
Installation
pip install nkhandic mecab-python3 jamotools
Minimal example
import nkhandic
text = "총비서동지께서 다음과 같이 말씀하시였다."
print(nkhandic.tokenize_hangul(text))
print(nkhandic.pos_tag(text))
print(nkhandic.convert_text_to_hanja_hangul(text))
Example output
[('총비서', 'NNG'), ('동지006', 'NNG'), ('께서', 'JKS'), ('다음01', 'NNG'), ('과12', 'JKB'), ('같이', 'MAG'), ('말씀', 'NNG'), ('하다02', 'XSV'), ('시', 'EP'), ('ㅆ', 'EP'), ('다06', 'EF'), ('.', 'SF')]
['총비서', '동지', '께서', '다음', '과', '같이', '말씀', '하', '시여', 'ㅆ', '다', '.']
總秘書同志께서 다음과 같이 말씀하시였다.
High-level API (Python convenience layer)
tokenize_hangul(text)
Return a list of tokens in Hangul base form (Unified Hangul Code).
- Internally uses NK-HanDic via MeCab
- Automatically restores Hangul syllables from Jamo
- Robust against unknown words
To obtain tokens in surface form instead of base form, pass mode="surface".
Example:
text = "말씀하시였다."
nkhandic.tokenize_hangul(text, mode="surface")
# ['말씀', '하', '시여', 'ㅆ', '다', '.']
nkhandic.tokenize_hangul(text)
# ['말씀', '하다02', '시', 'ㅆ', '다06', '.']
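The relationship between Jamo and complete Hangul syllables can be illustrated with the standard library alone. The sketch below assumes that the Jamo form consists of Unicode conjoining Jamo (the U+1100 block), in which case NFD and NFC normalization convert between the two forms. Whether nkhandic itself relies on unicodedata or on jamotools for this step is not stated here, and the helper names to_jamo and to_syllables are hypothetical.

```python
import unicodedata


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_syllables(text: str) -> str:
    """Recompose conjoining Jamo into precomposed Hangul syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


# Escapes keep the literal precomposed regardless of editor normalization.
word = "\ub9d0\uc500"  # the two syllables of the word used in the example above
jamo = to_jamo(word)

print(len(word), len(jamo))        # two syllables become six Jamo code points
print(to_syllables(jamo) == word)  # the round trip restores the original
```

A lone final consonant such as the past-tense marker has no syllable to attach to, so normalization leaves it unchanged; mapping it to a compatibility Jamo for display would need a separate step.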
tokenize(text)
Return tokens in Jamo surface form.
- Low-level wrapper around MeCab
text = "어쩌면 그리도 위대하신가."
nkhandic.tokenize(text)
# ['어쩌면', '그리도', '위대', '하', '시', 'ᆫ가', '.']
pos(text): lightweight POS
Return (surface, coarse_pos) pairs.
- Surface is returned in Jamo surface form
- POS corresponds to the first feature field
pos_tag(text)
Return a list of (token, POS) tuples.
- Uses NK-HanDic base forms (Unified Hangul Code) when available
- Falls back to surface forms for unknown words
- POS tags are based on the Sejong tag set (see https://docs.komoran.kr/firststep/postypes.html)
The following example compares pos() and pos_tag().
text = "말씀하시였다."
nkhandic.pos(text)
# [('말씀', 'Noun'), ('하', 'Suffix'), ('시여', 'Prefinal'), ('ᆻ', 'Prefinal'), ('다', 'Ending'), ('.', 'Symbol')]
nkhandic.pos_tag(text)
# [('말씀', 'NNG'), ('하다02', 'XSV'), ('시', 'EP'), ('ㅆ', 'EP'), ('다06', 'EF'), ('.', 'SF')]
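Because pos_tag() returns plain (token, tag) tuples, ordinary list processing is enough for downstream filtering. The sketch below works on sample data modeled on the pos_tag() example, so it runs without MeCab installed. It assumes that the trailing digits on base forms (as in the '02' suffix) are homograph indices that can be dropped for display; the helper names are hypothetical and not part of nkhandic.

```python
import re

# Sample data modeled on the pos_tag() example output.
tagged = [
    ("말씀", "NNG"), ("하다02", "XSV"), ("시", "EP"),
    ("ㅆ", "EP"), ("다06", "EF"), (".", "SF"),
]

_TRAILING_DIGITS = re.compile(r"\d+$")


def strip_homograph_number(token: str) -> str:
    """Remove a trailing numeric suffix such as the '02' in a base form."""
    return _TRAILING_DIGITS.sub("", token)


def select_by_tag_prefix(pairs, prefix):
    """Keep tokens whose Sejong tag starts with the given prefix."""
    return [
        strip_homograph_number(token)
        for token, tag in pairs
        if tag.startswith(prefix)
    ]


nouns = select_by_tag_prefix(tagged, "N")    # tags such as NNG, NNB, NP
endings = select_by_tag_prefix(tagged, "E")  # tags such as EP, EF
print(nouns)
print(endings)
```

Matching on a tag prefix rather than an exact tag keeps the filter short, at the cost of grouping subtypes together.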
parse(text)
Return raw MeCab output string.
- Includes all feature fields
- Intended for advanced use
print(nkhandic.parse("이제 우리앞에는 5개년계획기간이 2년 남아있다."))
output:
이제 Adverb,一般,*,*,*,이제01,이제,*,*,A,MAG
우리 Noun,代名詞,*,*,*,우리03,우리,*,*,A,NP
앞 Noun,普通,*,*,*,앞,앞,*,*,A,NNG
에 Ending,助詞,処格,*,*,에04,에,*,*,*,JKB
는 Ending,助詞,題目,*,*,는01,는,*,*,*,JX
5 Symbol,数字,*,*,*,*,*,*,*,*,SN
개년 Noun,依存名詞,助数詞,*,*,개년03,개년,個年,*,*,NNB
계획 Noun,普通,動作,*,*,계획01,계획,計劃,*,A,NNG
기간 Noun,普通,*,*,*,기간07,기간,期間,*,B,NNG
이 Ending,助詞,主格,*,*,이25,이,*,*,*,JKS
2 Symbol,数字,*,*,*,*,*,*,*,*,SN
년 Noun,依存名詞,助数詞,*,*,년02,년,年,*,A,NNB
남아 Verb,自立,*,語基3,*,남다01,남아,*,*,B,VV
있 Verb,非自立,*,語基1,3接続,있다01,있,*,*,A,VX
다 Ending,語尾,終止形,*,1接続,다06,다,*,*,*,EF
. Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
EOS
convert_text_to_hanja_hangul(text)
Convert text into a mixed Hanja + Hangul representation.
- Uses the NK-HanDic Hanja feature field (index 7)
- Preserves whitespace and punctuation
- Converts remaining Jamo into complete Hangul syllables
⚠️ Caution
Homonyms may be misidentified, e.g. 자신: 自信/自身
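The core idea of the conversion can be sketched without MeCab: for each token, use the Hanja field when the dictionary supplies one and keep the Hangul surface otherwise. This is a simplified illustration under stated assumptions, not the package's implementation. It takes ready-made (surface, hanja) pairs modeled on the parse() example, assumes "*" marks an empty field, and does not attempt the whitespace handling that the real function performs. The function name to_hanja_hangul is hypothetical.

```python
from typing import Iterable, Tuple

NO_VALUE = "*"  # placeholder assumed to mark an empty feature field


def to_hanja_hangul(tokens: Iterable[Tuple[str, str]]) -> str:
    """Join (surface, hanja) pairs, preferring Hanja where the field is filled."""
    parts = []
    for surface, hanja in tokens:
        has_hanja = bool(hanja) and hanja != NO_VALUE
        parts.append(hanja if has_hanja else surface)
    return "".join(parts)


# (Hangul surface, feature index 7) pairs modeled on the parse() example.
tokens = [("계획", "計劃"), ("기간", "期間"), ("이", "*")]
mixed = to_hanja_hangul(tokens)
print(mixed)
```

Since each token contributes exactly one Hanja candidate, a wrong analysis of a homonym carries straight through to the output, which is the situation the caution describes.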
Platform compatibility (important update)
Recent versions of nkhandic include a more robust MeCab initialization layer to improve cross-platform compatibility.
Earlier versions could fail on Windows or Conda environments due to platform-specific path handling issues.
Typical errors included:
[ifs] no such file or directory: /dev/null
or failures caused by Windows path escaping when dictionary paths contained spaces.
Improvements
The initialization logic now:
- Uses os.devnull instead of /dev/null
- Automatically quotes dictionary paths
- Normalizes Windows paths to forward-slash format
- Improves MeCab argument handling
These changes make the package more reliable on:
- Windows 10 / 11
- Miniconda / Anaconda environments
- Python installations where the dictionary path contains spaces
Most users do not need to change their code.
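For readers who build MeCab options by hand, the measures listed above can be sketched as follows. This is an illustration of the general technique, not the package's actual initialization code, which is not shown here. The function name build_mecab_args and the sample path are hypothetical; -r (resource file) and -d (dictionary directory) are standard MeCab options.

```python
import os


def build_mecab_args(dicdir: str) -> str:
    """Build a MeCab option string that tolerates spaces and backslashes."""
    # Forward slashes avoid backslash-escape problems inside the option string.
    normalized = dicdir.replace("\\", "/")
    # os.devnull is "/dev/null" on POSIX systems and "nul" on Windows.
    # Quoting the dictionary path keeps a path with spaces as one argument.
    return f'-r {os.devnull} -d "{normalized}"'


# Hypothetical Windows-style path containing a space.
args = build_mecab_args(r"C:\Users\Some User\nkhandic\dicdir")
print(args)
```

With mecab-python3 installed, a string like this can be passed to MeCab.Tagger(args), for example with nkhandic.DICDIR as the directory.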
Low-level access (for compatibility)
import nkhandic
print(nkhandic.DICDIR) # path to bundled NK-HanDic snapshot
print(nkhandic.VERSION) # NK-HanDic dictionary version
These are provided mainly for backward compatibility and inspection.
Typical use cases
- Using NK-HanDic conveniently from Python
- North Korean corpus analysis and language education research
- Preprocessing North Korean text for NLP pipelines
- Exploring Hangul / Hanja correspondences in North Korean
Features
Here is the list of features included in NK-HanDic. For more information, see the HanDic 품사 정보 (part-of-speech information).
- 품사1, 품사2, 품사3: part of speech (index: 0-2)
- 활용형: conjugation "base" (e.g. 語基1, 語基2, 語基3) (index: 3)
- 접속 정보: which "base" the ending attaches to (e.g. 1接続, 2接続) (index: 4)
- 사전 항목: base form (index: 5)
- 표층형: surface form (index: 6)
- 한자: Hanja, for Sino-Korean words (index: 7)
- 보충 정보: miscellaneous information (index: 8)
- 학습 수준: learning level (index: 9)
- 세종 계획 품사 태그: Sejong POS tag (index: 10)
- 조선어 표시: North Korean marker (index: 11)
- 조선어 보충 정보: miscellaneous information about North Korean (index: 12)
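The index positions above make it straightforward to turn one line of parse() output into named fields. The sketch below is a minimal illustration: the English field names are hypothetical labels chosen for this example, and it assumes MeCab's default output format, in which a tab separates the surface from the comma-separated features. The sample line is modeled on the parse() example and carries only the first eleven fields.

```python
import unicodedata
from typing import Dict

# Hypothetical English labels for feature indices 0-12, in order.
FIELD_NAMES = [
    "pos1", "pos2", "pos3", "conjugation_base", "connection",
    "base_form", "surface", "hanja", "note", "level", "sejong_tag",
    "nk_marker", "nk_note",
]


def parse_line(line: str) -> Dict[str, str]:
    """Split one 'surface<TAB>features' line into named fields."""
    token, _, features = line.partition("\t")
    values = features.split(",")
    # The first column is in Jamo form; NFC recomposes it into syllables.
    record = {"token": unicodedata.normalize("NFC", token)}
    # zip() stops at the shorter sequence, so lines with fewer fields still work.
    record.update(zip(FIELD_NAMES, values))
    return record


# Build the Jamo surface from escapes so the sample does not depend on
# how this file's own text is normalized.
jamo_surface = unicodedata.normalize("NFD", "\uacc4\ud68d")
sample = jamo_surface + "\tNoun,普通,動作,*,*,계획01,계획,計劃,*,A,NNG"

record = parse_line(sample)
print(record["token"], record["hanja"], record["sejong_tag"])
```

Returning a dictionary keeps later code readable (record["hanja"] instead of a bare index 7) and tolerates entries that omit the trailing North Korean fields.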
Citation
When citing dictionary content, please cite the NK-HanDic project:
NK-HanDic: morphological analysis dictionary for North Korean
https://github.com/okikirmui/nkhandic
When citing this Python package, please cite both the package and NK-HanDic.
License
This code is licensed under the MIT license. NK-HanDic is copyright Yoshinori Sugai and distributed under the BSD license.
Acknowledgment
This repository is forked from unidic-lite with some modifications and file additions and deletions.