
Word segmentation for the Korean language

Project description

Package hangul-korean

Hangul is the alphabet of the Korean language. hangul-korean is a package that currently contains a module for segmenting words in written Korean texts.

Segmentation of Korean Words 한국어 낱말 분절

Written Korean texts do employ white-space characters. More often than not, however, Korean words occur in a text concatenated immediately to adjacent words without an intervening space character. This low degree of separation between words in writing is due in part to an abundance of what linguists call "endoclitics" in the language.

As the language has been subjected to principled and rigorous study for a few decades, the issue of which strings of sounds, or letters, are words and which are not has been settled among a small group of linguists. This advance, however, has not yet been propagated to the general public, and NLP engineers working on Korean have to make do with whatever inconsistent grammars they happen to have access to. Thus, a major source of difficulty in developing competent Korean text processors has been, and still is, the notion of a word as the smallest syntactic unit.

The WordSegmenter class

The module tokenizer in this package defines the class WordSegmenter. It has, among other things, the following methods:

  • __init__(modelFile, word2idxFile)
  • infile(fileName)
  • outfile(fileName)
  • inputAsString(aStr)
  • doSegment()
  • segmentedOutput()

After creating a WordSegmenter object, say wsg, give it a file or a string and then issue wsg.doSegment(). The word-segmented text will be in a file (if specified with outfile()) or in an instance variable accessible via segmentedOutput().

Typical use

from hangul.tokenizer import WordSegmenter

wsg = WordSegmenter()
aPassage = "어휴쟤가 왜저래? 정말우스워죽겠네"  # Either give a string to inputAsString()
wsg.inputAsString(aPassage)                     # as its argument, or,
# inFile = "tobesegmented"        # to word-segment text contained in a file,
# wsg.infile(inFile)              # pass that file name to infile().
# wsg.outfile("output.txt")
wsg.doSegment() 
# If infile() was used, a file containing the word-segmented text is created.
# Its name can be set with outfile(); the default is "segmented_yyddd_hhmm.txt".
# with open("segmented_yyddd_hhmm.txt", "r") as f:
#   lines = f.readlines()
# for aLine in lines:
#   print(aLine)
# If inputAsString() was used:
print(wsg.segmentedOutput)
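A minimal sketch of the file-based path described in the comments above, assuming the infile()/outfile() methods as documented; the file names used here are hypothetical.

from hangul.tokenizer import WordSegmenter

wsg = WordSegmenter()
wsg.infile("tobesegmented_example.txt")   # hypothetical file holding unsegmented Korean text
wsg.outfile("segmented_example.txt")      # chosen output name; otherwise the dated default is used
wsg.doSegment()                           # writes the word-segmented text to the output file

with open("segmented_example.txt", "r") as f:
    for aLine in f:
        print(aLine, end="")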

Forms of a verb are not analyzed into morphs

The lexical category verb inflects in hundreds (or thousands) of ways in the language, and it is the only category that inflects. We do not analyze a verb form into its constituent morphs. Such an analysis is best reserved for a separate component of inflectional morphology, and it is certainly required for syntactic analyses of various kinds.
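As an illustrative check of this behavior (a sketch only; the exact output tokens are not reproduced here), one can segment a short sentence containing a heavily inflected verb form and observe that the form surfaces as a single token rather than as a stem plus endings:

from hangul.tokenizer import WordSegmenter

wsg = WordSegmenter()
# "우스워죽겠네" is an inflected verb form; per the note above, the segmenter
# is expected to keep such a form whole instead of splitting off its endings.
wsg.inputAsString("정말우스워죽겠네")
wsg.doSegment()
print(wsg.segmentedOutput)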

Slim size of the model

The model this package uses is very compact: it is merely ten megabytes.

Forthcoming in the package

The next version of this package may well contain a POS tagger. We would also like to see a higher F-measure for the word segmentation system, which currently stands at 0.970 (that of the open-access model is somewhat lower).

Status of papers that describe this package

A draft describing how the model for this package was obtained is to be submitted to a journal.

Download files

Download the file for your platform.

Source Distribution

hangul-korean-1.0rc2.tar.gz (5.5 kB)


Built Distribution

hangul_korean-1.0rc2-py3-none-any.whl (9.1 MB)


File details

Details for the file hangul-korean-1.0rc2.tar.gz.

File metadata

  • Download URL: hangul-korean-1.0rc2.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.2

File hashes

Hashes for hangul-korean-1.0rc2.tar.gz

  • SHA256: 39db724ce8f5781f2b7bc442388784c77307c136dec53fd15b833e410a8f1578
  • MD5: 01f3411e28abc463ad42a0c1fcf12d7a
  • BLAKE2b-256: f00ab7c771815cac84ccf0726a62813760728abc8b7e0421c82cc012bfc8c32b
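As a brief illustration of how these digests can be used (a sketch only; the local file path is an assumption), the SHA256 value above can be checked against a downloaded copy of the archive with the standard library:

import hashlib

# SHA256 digest listed above for hangul-korean-1.0rc2.tar.gz
expected = "39db724ce8f5781f2b7bc442388784c77307c136dec53fd15b833e410a8f1578"

with open("hangul-korean-1.0rc2.tar.gz", "rb") as f:   # assumed local path of the downloaded file
    actual = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual == expected else "hash mismatch")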


File details

Details for the file hangul_korean-1.0rc2-py3-none-any.whl.

File metadata

  • Download URL: hangul_korean-1.0rc2-py3-none-any.whl
  • Upload date:
  • Size: 9.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/51.3.3 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.2

File hashes

Hashes for hangul_korean-1.0rc2-py3-none-any.whl

  • SHA256: ed081e53aa103974deaf5d269183b82928e44055551b4ec9192a14607d6c7554
  • MD5: ee14ae16d79182496d1f3a571384a30b
  • BLAKE2b-256: 838cbd911dc6de8b6f69ab36c38b1aa0f8521bcf69c13cab172581af1e7642e3

