NK-HanDic packaged for installation via pip.

nkhandic-py: Python wrapper for the NK-HanDic MeCab dictionary

👉 NK-HanDic (dictionary) repository: https://github.com/okikirmui/nkhandic

nkhandic is a Python helper package that makes it easy to use NK-HanDic, a MeCab dictionary for North Korean, from Python code.

⚠️ Important distinction

  • NK-HanDic = the MeCab dictionary itself (linguistic resource)
  • nkhandic (this package) = a Python interface / utility layer for NK-HanDic

The dictionary is developed and published separately;
this package focuses on Python usability.

If you need contemporary Korean instead, see HanDic (MeCab dictionary) and handic (Python wrapper).


Overview

nkhandic provides a convenient Python interface for the NK-HanDic North Korean morphological analysis dictionary.

It allows researchers and developers to perform North Korean morphological analysis from Python without manually configuring dictionary paths or MeCab options.

The package:

  • Bundles a snapshot of the NK-HanDic dictionary
  • Provides high-level Python APIs
  • Handles Jamo-based input/output
  • Supports Hanja-aware representations
  • Works across Linux, macOS, and Windows environments

Relationship between NK-HanDic and this package

NK-HanDic (dictionary repository)
        ↓
   MeCab dictionary files
        ↓
  nkhandic (Python wrapper)
        ↓
  Your Python code

  • The linguistic design and dictionary entries live in the NK-HanDic repository
  • This package bundles a released snapshot of the dictionary only to enable Python use
  • Updates to dictionary content are driven by the NK-HanDic project

🚀 Quick Start (Python)

Installation

pip install nkhandic mecab-python3 jamotools

Minimal example

import nkhandic

text = "총비서동지께서 다음과 같이 말씀하시였다."

print(nkhandic.tokenize_hangul(text))
print(nkhandic.pos_tag(text))
print(nkhandic.convert_text_to_hanja_hangul(text))

Example output

[('총비서', 'NNG'), ('동지006', 'NNG'), ('께서', 'JKS'), ('다음01', 'NNG'), ('과12', 'JKB'), ('같이', 'MAG'), ('말씀', 'NNG'), ('하다02', 'XSV'), ('시', 'EP'), ('ㅆ', 'EP'), ('다06', 'EF'), ('.', 'SF')]
['총비서', '동지', '께서', '다음', '과', '같이', '말씀', '하', '시여', 'ㅆ', '다', '.']
總秘書同志께서 다음과 같이 말씀하시였다.

High-level API (Python convenience layer)

tokenize_hangul(text)

Return a list of tokens in Hangul base form (Unified Hangul Code).

  • Internally uses NK-HanDic via MeCab
  • Automatically restores Hangul syllables from Jamo
  • Robust against unknown words

If you want tokens in surface form instead of base form, pass "surface" as the mode option.

Example:

text = "말씀하시였다."

nkhandic.tokenize_hangul(text, mode="surface")
# ['말씀', '하', '시여', 'ㅆ', '다', '.']

nkhandic.tokenize_hangul(text)
# ['말씀', '하다02', '시', 'ㅆ', '다06', '.']
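
The base forms in the example carry numeric suffixes such as 하다02 and 다06. A minimal sketch, assuming those trailing digits are homograph indices that a downstream task may not need, shows one way to strip them. The helper name strip_index is hypothetical and not part of nkhandic; the token list is copied from the example output.

```python
import re

# Hypothetical helper (not part of nkhandic): remove trailing ASCII digits
# from a base-form token, assuming they are homograph indices.
_INDEX = re.compile(r"[0-9]+$")


def strip_index(token: str) -> str:
    # Leave tokens made only of digits untouched (e.g. numerals in the text).
    if token.isdigit():
        return token
    return _INDEX.sub("", token)


# Token list copied from the tokenize_hangul() example.
base_tokens = ['말씀', '하다02', '시', 'ㅆ', '다06', '.']
stripped = [strip_index(t) for t in base_tokens]
print(stripped)
```

If the indices matter for disambiguation, keep the original tokens alongside the stripped ones rather than discarding them.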

tokenize(text)

Return tokens in Jamo surface form.

  • Low-level wrapper around MeCab

text = "어쩌면 그리도 위대하신가."

nkhandic.tokenize(text)
# ['어쩌면', '그리도', '위대', '하', '시', 'ᆫ가', '.']
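
Jamo-level tokens can be turned back into precomposed syllables when needed. The sketch here uses the standard library's unicodedata.normalize with NFC, which is a substitute technique and not necessarily what nkhandic does internally (the install line lists jamotools, so the package may rely on that instead; this is an assumption). The token list mirrors the tokenize() output, written with explicit escapes so the conjoining Jamo are unambiguous.

```python
import unicodedata

# Output of tokenize() from the example, as conjoining Jamo escapes.
jamo_tokens = [
    "\u110b\u1165\u110d\u1165\u1106\u1167\u11ab",  # eo-jjeo-myeon
    "\u1100\u1173\u1105\u1175\u1103\u1169",        # geu-ri-do
    "\u110b\u1171\u1103\u1162",                    # wi-dae
    "\u1112\u1161",                                # ha
    "\u1109\u1175",                                # si
    "\u11ab\u1100\u1161",                          # final n, then ga
    ".",
]


def compose(token: str) -> str:
    # NFC merges leading consonant + vowel (+ final consonant) into one syllable.
    return unicodedata.normalize("NFC", token)


composed = [compose(t) for t in jamo_tokens]
print(composed)
```

A token that begins with a final consonant keeps that consonant as a separate code point, because NFC has no preceding syllable to attach it to.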

pos(text): lightweight POS

Return (surface, coarse_pos) pairs.

  • Surface is returned in Jamo surface form
  • POS corresponds to the first feature field

pos_tag(text)

Return a list of (token, POS) tuples.

The following is an example for comparing pos() and pos_tag().

text = "말씀하시였다."

nkhandic.pos(text)
# [('말씀', 'Noun'), ('하', 'Suffix'), ('시여', 'Prefinal'), ('ᆻ', 'Prefinal'), ('다', 'Ending'), ('.', 'Symbol')]

nkhandic.pos_tag(text)
# [('말씀', 'NNG'), ('하다02', 'XSV'), ('시', 'EP'), ('ㅆ', 'EP'), ('다06', 'EF'), ('.', 'SF')]
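
As a small illustration of working with pos_tag() results, the following sketch filters tokens by tag and counts tag frequencies. The sample list is copied from the pos_tag() output so the block runs without MeCab installed; reading NNG as the common-noun tag follows the Sejong tag set, and select_by_tag is a hypothetical helper, not part of nkhandic.

```python
from collections import Counter

# pos_tag() output copied from the example.
tagged = [('말씀', 'NNG'), ('하다02', 'XSV'), ('시', 'EP'),
          ('ㅆ', 'EP'), ('다06', 'EF'), ('.', 'SF')]


def select_by_tag(pairs, wanted):
    """Return the tokens whose tag is in `wanted` (hypothetical helper)."""
    return [token for token, tag in pairs if tag in wanted]


nouns = select_by_tag(tagged, {"NNG"})
tag_counts = Counter(tag for _, tag in tagged)
print(nouns)
print(tag_counts.most_common(1))
```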

parse(text)

Return raw MeCab output string.

  • Includes all feature fields
  • Intended for advanced use

print(nkhandic.parse("이제 우리앞에는 5개년계획기간이 2년 남아있다."))

output:

이제	Adverb,一般,*,*,*,이제01,이제,*,*,A,MAG
우리	Noun,代名詞,*,*,*,우리03,우리,*,*,A,NP
앞	Noun,普通,*,*,*,앞,앞,*,*,A,NNG
에	Ending,助詞,処格,*,*,에04,에,*,*,*,JKB
는	Ending,助詞,題目,*,*,는01,는,*,*,*,JX
5	Symbol,数字,*,*,*,*,*,*,*,*,SN
개년	Noun,依存名詞,助数詞,*,*,개년03,개년,個年,*,*,NNB
계획	Noun,普通,動作,*,*,계획01,계획,計劃,*,A,NNG
기간	Noun,普通,*,*,*,기간07,기간,期間,*,B,NNG
이	Ending,助詞,主格,*,*,이25,이,*,*,*,JKS
2	Symbol,数字,*,*,*,*,*,*,*,*,SN
년	Noun,依存名詞,助数詞,*,*,년02,년,年,*,A,NNB
남아	Verb,自立,*,語基3,*,남다01,남아,*,*,B,VV
있	Verb,非自立,*,語基1,3接続,있다01,있,*,*,A,VX
다	Ending,語尾,終止形,*,1接続,다06,다,*,*,*,EF
.	Symbol,ピリオド,*,*,*,.,.,*,*,*,SF
EOS
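
The raw string can be split into surface and feature fields with plain string handling. This sketch assumes the layout shown in the output: one token per line, a tab between surface and features, comma-separated features, and a closing EOS line. The sample is a two-line excerpt of that output, and parse_lines is a hypothetical helper, not part of nkhandic.

```python
import csv

# Two-line excerpt of the parse() output; the Jamo surface is written as escapes.
raw = (
    "\u1100\u1168\u1112\u116c\u11a8\tNoun,普通,動作,*,*,계획01,계획,計劃,*,A,NNG\n"
    "5\tSymbol,数字,*,*,*,*,*,*,*,*,SN\n"
    "EOS\n"
)


def parse_lines(raw_output: str):
    """Split raw MeCab output into (surface, features) pairs."""
    rows = []
    for line in raw_output.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, feature_str = line.partition("\t")
        # csv copes with fields that are quoted because they contain commas.
        features = next(csv.reader([feature_str]))
        rows.append((surface, features))
    return rows


rows = parse_lines(raw)
for surface, features in rows:
    # Index 5 = base form, 7 = Hanja, 10 = POS tag.
    print(features[5], features[7], features[10])
```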

convert_text_to_hanja_hangul(text)

Convert text into mixed Hanja + Hangul representation.

  • Uses the NK-HanDic Hanja feature field (index 7)
  • Preserves whitespace and punctuation
  • Converts remaining Jamo into complete Hangul syllables

⚠️ Caution

Homonyms may be misidentified, e.g. 자신: 自信/自身.
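
The idea behind the conversion can be sketched on already-parsed tokens: where feature index 7 holds a Hanja form, use it; otherwise fall back to the Hangul surface form at index 6. This illustrates the principle only. It is not the package's implementation, and it leaves out the whitespace handling and Jamo composition listed in the bullets. The rows are abridged from the parse() example, and hanja_or_hangul is a hypothetical name.

```python
# Feature rows abridged from the parse() example: index 6 = surface, 7 = Hanja.
rows = [
    ["Noun", "依存名詞", "助数詞", "*", "*", "개년03", "개년", "個年", "*", "*", "NNB"],
    ["Noun", "普通", "動作", "*", "*", "계획01", "계획", "計劃", "*", "A", "NNG"],
    ["Noun", "普通", "*", "*", "*", "기간07", "기간", "期間", "*", "B", "NNG"],
    ["Ending", "助詞", "主格", "*", "*", "이25", "이", "*", "*", "*", "JKS"],
]


def hanja_or_hangul(features):
    # "*" marks an empty field in the parse() output.
    hanja = features[7]
    return hanja if hanja != "*" else features[6]


mixed = "".join(hanja_or_hangul(f) for f in rows)
print(mixed)
```

Because the choice is made per dictionary entry, a homonym whose entry carries the wrong Hanja is converted without any warning, which is the risk the caution describes.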


Platform compatibility (important update)

Recent versions of nkhandic include a more robust MeCab initialization layer to improve cross-platform compatibility.

Earlier versions could fail on Windows or Conda environments due to platform-specific path handling issues.

Typical errors included:

[ifs] no such file or directory: /dev/null

or failures caused by Windows path escaping when dictionary paths contained spaces.

Improvements

The initialization logic now:

  • Uses os.devnull instead of /dev/null
  • Automatically quotes dictionary paths
  • Normalizes Windows paths to forward-slash format
  • Improves MeCab argument handling

These changes make the package more reliable on:

  • Windows 10 / 11
  • Miniconda / Anaconda environments
  • Python installations where the dictionary path contains spaces

Most users do not need to change their code.
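
The improvements listed in this section can be sketched as a small argument builder. This is a hypothetical reconstruction, not the package's actual code: the function name is invented, the example path is made up, and the use of MeCab's -r (resource file) and -d (dictionary directory) options is an assumption about what the initialization passes.

```python
import os


def build_mecab_args(dicdir: str) -> str:
    """Build a MeCab option string (hypothetical helper, see assumptions)."""
    # Forward slashes avoid backslash-escaping problems on Windows.
    normalized = dicdir.replace("\\", "/")
    # os.devnull is "/dev/null" on POSIX and "nul" on Windows.
    # Quoting keeps a path that contains spaces together as one argument.
    return f'-r "{os.devnull}" -d "{normalized}"'


# Made-up Windows path containing a space.
print(build_mecab_args(r"C:\Users\Some User\nkhandic\dicdir"))
```

A string like this would typically be handed to MeCab.Tagger from mecab-python3, with nkhandic.DICDIR as the directory.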


Low-level access (for compatibility)

import nkhandic

print(nkhandic.DICDIR)   # path to bundled NK-HanDic snapshot
print(nkhandic.VERSION)  # NK-HanDic dictionary version

These are provided mainly for backward compatibility and inspection.


Typical use cases

  • Using NK-HanDic conveniently from Python
  • North Korean corpus analysis and language education research
  • Preprocessing North Korean text for NLP pipelines
  • Exploring Hangul / Hanja correspondences in North Korean

Features

The features included in NK-HanDic are listed below. For more information, see the HanDic 품사 정보 (part-of-speech information).

  • 품사1, 품사2, 품사3: part of speech (index: 0-2)
  • 활용형: conjugation "base" (e.g. 語基1, 語基2, 語基3) (index: 3)
  • 접속 정보: which "base" the ending attaches to (e.g. 1接続, 2接続) (index: 4)
  • 사전 항목: base form (index: 5)
  • 표층형: surface form (index: 6)
  • 한자: Hanja, for Sino-Korean words (index: 7)
  • 보충 정보: miscellaneous information (index: 8)
  • 학습 수준: learning level (index: 9)
  • 세종계획 품사 태그: Sejong Project POS tag (index: 10)
  • 조선어 표시: North Korean marker (index: 11)
  • 조선어 보충 정보: miscellaneous information about North Korean (index: 12)
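
For readability, the indices in this list can be given names. The English field names here are illustrative labels chosen for this sketch, not identifiers defined by NK-HanDic. Because the sample parse() output earlier shows 11 fields per token while this list runs to index 12, the helper tolerates rows shorter than the list.

```python
# Illustrative labels for feature indices 0-12 (not defined by NK-HanDic).
FIELD_NAMES = (
    "pos1", "pos2", "pos3",   # 0-2
    "conjugation_base",       # 3
    "connection",             # 4
    "base_form",              # 5
    "surface",                # 6
    "hanja",                  # 7
    "misc",                   # 8
    "learning_level",         # 9
    "pos_tag",                # 10
    "nk_marker",              # 11
    "nk_misc",                # 12
)


def label_features(features):
    # zip stops at the shorter sequence, so missing trailing fields are absent.
    return dict(zip(FIELD_NAMES, features))


# Feature string copied from the parse() example (no quoted commas in it).
row = "Verb,自立,*,語基3,*,남다01,남아,*,*,B,VV".split(",")
labeled = label_features(row)
print(labeled["base_form"], labeled["conjugation_base"], labeled["pos_tag"])
```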

Citation

When citing dictionary content, please cite the NK-HanDic project:

NK-HanDic: morphological analysis dictionary for North Korean
https://github.com/okikirmui/nkhandic

When citing this Python package, please cite both the package and NK-HanDic.


License

This code is licensed under the MIT license. NK-HanDic is copyright Yoshinori Sugai and distributed under the BSD license.


Acknowledgment

This repository is forked from unidic-lite with some modifications and file additions and deletions.
