Model-based Korean Text Tokenizer in Python

Project description

PyKoTokenizer

PyKoTokenizer is a Korean text tokenizer for Korean Natural Language Processing tasks. It includes deep learning (RNN) model-based word tokenizers as well as morphological analyzer-based word tokenizers for the Korean language.

Segmentation of Korean Words

Written Korean does employ white-space characters. More often than not, however, Korean words appear in a text concatenated directly to adjacent words, with no intervening space. This low degree of separation between written words is due in part to an abundance of what linguists call "endoclitics" in the language.

Although the language has been subjected to principled and rigorous study for a few decades, the question of which strings of sounds, or letters, constitute words has been settled only among a small group of linguists. That consensus has not yet reached the general public, so NLP engineers working on Korean must make do with whatever inconsistent grammars they happen to have access to. Thus, a major source of difficulty in developing competent Korean text processors has been, and still is, the notion of a word as the smallest syntactic unit.

How to install

Before using this package, please make sure the following dependencies are installed on your system.

  • Python >= 3.6
  • numpy >= 1.19.0
  • pandas >= 1.1.5
  • tensorflow >= 2.6.2
  • h5py >= 3.1.0
  • konlpy >= 0.5.2

Use the following command to install the package:

pip install pykotokenizer
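To confirm that the versions on your system meet these minimums, you can print them directly. This is just a quick sanity-check sketch, not part of the package:

import numpy, pandas, tensorflow, h5py

# Compare the printed versions against the minimums listed above.
for mod in (numpy, pandas, tensorflow, h5py):
    print(mod.__name__, mod.__version__)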

How to Use

Model-based Tokenizers

Below, we show examples of using model-based tokenizers.

Using KoTokenizer

from pykotokenizer import KoTokenizer

tokenizer = KoTokenizer()

korean_text = "김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다."

tokenizer(korean_text)

Output:

"김 형호 영화 시장 분석가 는 ' 1987 ' 의 네이버 영화 정보 네티즌 10 점평 에서 언급 된 단어 들 을 지난 해 12 월 27 일 부터 올해 1 월 10 일 까지 통계 프로그램 R 과 KoNLP 패키지 로 텍스트 마이닝 하여 분석 했다 ."

Using KoSpacing

from pykotokenizer import KoSpacing

spacing = KoSpacing()

korean_text = "김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다."

spacing(korean_text)

Output:

"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."

Morphological analyzer based Tokenizers

Below, we show examples of using morphological analyzer-based tokenizers. These tokenizers depend on KoNLPy, so please install KoNLPy before using them. For installation instructions, visit https://konlpy.org/en/latest/install/ and follow the procedure there. Note that KoNLPy requires Java on your system.
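Since importing these classes will fail without KoNLPy (and KoNLPy itself needs a working Java runtime), a guarded import gives a clearer error message. A defensive sketch, not part of the package:

try:
    import konlpy  # only checking availability
except ImportError as err:
    raise SystemExit(
        "KoNLPy is not installed. See https://konlpy.org/en/latest/install/ "
        "(note that KoNLPy also requires Java)."
    ) from err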

Using KoKkma

from pykotokenizer import KoKkma

kokkma = KoKkma()

korean_text = "김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다."

kokkma(korean_text)

Output:

"김 형 호 영화 시장 분석가 는 ' 1987 ' 의 네이버 영화 정보 네티즌 10 점 평 에서 언급 되 ㄴ 단어 들 을 지난해 12 월 27 일 부터 올해 1 월 10 일 까지 통계 프로그램 R 과 KoNLP 패키지 로 텍스트 마이닝 하 여 분석 하 었 다 ."

Using KoKomoran

from pykotokenizer import KoKomoran

kokomoran = KoKomoran()

korean_text = "김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다."

kokomoran(korean_text)

Output:

"김형호 영화 시장 분석가 는 ' 1987 ' 의 네이버 영화 정보 네티즌 10 점 평 에서 언급 되 ㄴ 단어 들 을 지난해 12월 27 일 부터 올해 1월 10 일 까지 통계 프로그램 R 과 KoNLP 패키지 로 텍스트 마 이닝 하 아 분석 하 았 다 ."

Credits

This package is a revamped and customized version of the following two sources:


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pykotokenizer-0.0.3.tar.gz (11.3 MB, Source)

Built Distribution

pykotokenizer-0.0.3-py3-none-any.whl (11.3 MB, Python 3)

File details

Details for the file pykotokenizer-0.0.3.tar.gz.

File metadata

  • Download URL: pykotokenizer-0.0.3.tar.gz
  • Size: 11.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.3 pkginfo/1.8.2 requests/2.23.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.9

File hashes

Hashes for pykotokenizer-0.0.3.tar.gz:

  • SHA256: da787f7be36c50b459b6735a3fc76a63bbb68f093ddbf25b534bce0ec5efddd1
  • MD5: 317894e4b7e6381375b9ed326e7b3c99
  • BLAKE2b-256: 3b2e121126022e2f5f857609dcb5963290ccbb3844f6fc6c083a071a9c3ed305

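To check a downloaded archive against the SHA256 digest above, Python's standard hashlib is enough. A minimal sketch (the file is assumed to be in the current directory):

import hashlib

expected = "da787f7be36c50b459b6735a3fc76a63bbb68f093ddbf25b534bce0ec5efddd1"

# Hash the downloaded archive and compare against the published digest.
with open("pykotokenizer-0.0.3.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else "MISMATCH")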

File details

Details for the file pykotokenizer-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: pykotokenizer-0.0.3-py3-none-any.whl
  • Size: 11.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.3 pkginfo/1.8.2 requests/2.23.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.9

File hashes

Hashes for pykotokenizer-0.0.3-py3-none-any.whl:

  • SHA256: 4e20e6d0afe168530102b005973f0e78a4385abd4c9d6bc9a4713baadd0a7dc1
  • MD5: 73c773b9b583948b9116e3c3fc51c7c1
  • BLAKE2b-256: 3b51c94b1251fe8786644242a5b766724620778c6d0e9d6355095676e6468a00

