
Korean morphological analyzer based on the BERT architecture


kotok

kotok is a Korean morphological analysis tool based on the BERT architecture. It lemmatizes and POS-tags Korean sentences, and it can also detect and correct spacing and spelling errors in Korean text.

Features

kotok has the following features:

  1. Correct spacing errors
  2. Correct spelling errors
  3. Split Korean text into morphemes
  4. Assign POS tags to morphemes (using the Sejong POS tag set)
  5. Lemmatize morphemes

Implementation details

Spacing error detection and correction

All code related to spacing is located in the kotok/spacing directory.

Spacing errors are detected by fine-tuning a BERT model for token classification with the task of detecting spacing errors. The model is trained on a dataset of Korean text with simulated spacing errors to predict whether a specific token has a missing or extra space.

Spacing errors are corrected by inserting or removing spaces between tokens based on the predictions of the spacing error detection model. All spacing possibilities are considered and the one that achieves the lowest error score using the spacing error detection model is chosen to be the correct spacing variant. See kotok/spacing/inference.py for the implementation.
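The search over spacing variants can be sketched as follows. Note that `error_score` below is only a stand-in for the BERT spacing-error model (the real scorer lives in kotok/spacing/inference.py), and the reference string is invented for illustration; exhaustive enumeration like this is only feasible for short inputs.

```python
from itertools import product

def spacing_variants(text: str):
    """Enumerate every way to place spaces between the characters of `text`."""
    chars = text.replace(" ", "")
    for gaps in product([False, True], repeat=len(chars) - 1):
        parts = [chars[0]]
        for ch, space in zip(chars[1:], gaps):
            if space:
                parts.append(" ")
            parts.append(ch)
        yield "".join(parts)

def error_score(candidate: str) -> float:
    """Stand-in for the spacing-error model: counts character positions that
    differ from a hypothetical correctly-spaced reference."""
    reference = "한국어 형태소"  # invented reference for this sketch
    diffs = sum(a != b for a, b in zip(candidate, reference))
    return diffs + abs(len(candidate) - len(reference))

# Pick the variant with the lowest error score, as described above.
best = min(spacing_variants("한국어형태소"), key=error_score)
```

In kotok the score comes from the token-classification model's predictions rather than a string comparison, but the candidate-selection step is the same shape: minimize the error score over all spacing variants.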

Spelling error detection and correction

All code related to spelling is located in the kotok/error directory.

Spelling errors are detected by fine-tuning a BERT model for token classification with the task of detecting spelling errors. The model is trained on a dataset of Korean text with simulated spelling errors to predict whether a specific token is a spelling error.

Spelling errors are generated by the TypoTransformer class in kotok/error/typo.py which is able to generate likely spelling errors based on common Korean typo patterns.

Spelling errors are corrected by replacing the misspelled token with the token corrections generated by the TypoTransformer class. The token correction with the highest probability of being the correct spelling is chosen as the corrected token. See kotok/error/inference.py for the implementation.
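The selection of the best correction can be sketched like this. The candidate/probability pairs below are invented; in kotok they would come from the TypoTransformer class combined with the classification model's probabilities.

```python
def correct_token(token: str, scored_candidates: list[tuple[str, float]]) -> str:
    """Replace a misspelled token with the highest-probability correction.

    `scored_candidates` stands in for TypoTransformer output paired with
    model probabilities; the numbers used here are made up.
    """
    best, _ = max(scored_candidates, key=lambda pair: pair[1])
    return best

# 안뇽 is a common casual misspelling of 안녕 ("hello")
fixed = correct_token("안뇽", [("안녕", 0.92), ("안능", 0.05), ("안뇽", 0.03)])
```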

Morpheme splitting, POS-tagging and lemmatization

All code related to morpheme splitting, POS-tagging and lemmatization is located in the kotok directory.

Morpheme splitting and POS-tagging are performed by fine-tuning a BERT model for token classification. Training data is generated by tokenizing plain text files with Kiwi.

The resulting morphemes are lemmatized (i.e. stemmed) by a rule-based lemmatizer. The lemmatizer is based on the Korean stemming system of Yomitan, a web-browser dictionary extension.
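A minimal illustration of suffix-rewrite lemmatization follows. The rules are invented for this sketch; kotok's actual rule tables, adapted from Yomitan, are far more extensive and handle jamo-level changes.

```python
# Invented suffix-rewrite rules: (conjugated ending, replacement stem ending).
RULES = [
    ("습니다", ""),  # formal polite ending, e.g. 먹습니다 -> 먹다
    ("어요", ""),    # informal polite ending, e.g. 먹어요 -> 먹다
]

def lemmatize(morpheme: str) -> str:
    """Rewrite a conjugated verb/adjective form back to its 다-dictionary form."""
    for suffix, replacement in RULES:
        if morpheme.endswith(suffix):
            return morpheme[: -len(suffix)] + replacement + "다"
    return morpheme  # no rule matched; assume the form is already a lemma
```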

Installation for development

Create and activate a virtual environment:

Linux and macOS:

python3 -m venv .venv
source .venv/bin/activate

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\Activate.ps1

Install the required packages:

pip install -r requirements.txt

Train the classification models

To run kotok, the classification models need to be trained. If not using pre-trained model files, follow the instructions below to generate the models from scratch.

Acquire training data

The default training data set can be downloaded by running the following command:

python -m kotok data_dl

If a custom training data set is to be used, place plain text files into the data/txt directory. The directory is recursively searched for all .txt files.

Train the classification models

Choose a BERT-based tokenizer model to be fine-tuned for the three classification tasks. Specify the model name or path with the -m option in all of the following commands. The best results have been observed with the klue/bert-base model.

Train the spacing error classification model

# Simulate spacing errors in the training data and label them
python -m kotok.spacing data -m <tokenizer model name or path>

# Train the spacing error classification model
python -m kotok.spacing train -m <tokenizer model name or path> -o <output model directory>

Train the spelling error classification model

# Simulate spelling errors in the training data and label them
python -m kotok.error data -m <tokenizer model name or path>

# Train the spelling error classification model
python -m kotok.error train -m <tokenizer model name or path> -o <output model directory>

Train the POS-tagging and lemmatization model

# Prepare training and validation data
python -m kotok data -m <tokenizer model name or path>

# Train the POS-tagging and lemmatization model
python -m kotok train -m <tokenizer model name or path> -o <output model directory>

Run kotok as a command line tool

Run the following command to start the command-line interface, which reads Korean text to be analyzed:

python -m kotok inference \
    -m <pos tokenizer model name or path> -cm <fine-tuned pos classification model directory> \
    -em <error tokenizer model name or path> -ecm <fine-tuned error classification model directory> \
    -sm <spacing tokenizer model name or path> -scm <fine-tuned spacing error classification model directory>

User dictionary

User dictionary entries are stored in TSV (tab-separated values) files with the following format:

<word> <pos tag>

To tag words that merely begin with the specified word, use the following format. This is useful for specifying words that conjugate, such as verbs.

<word>* <pos tag>

If the POS tag should be enforced, i.e. the user dictionary entry should only be applied if the POS tag fits the context, use the following format:

<word> <pos tag>!

If the word can stand in for several POS tags, use the following format. This is especially useful for closely related POS tags, such as NNG and NNP.

<word> <pos tag>!<check pos tag1>,<check pos tag2>,...
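Putting these formats together, a hypothetical user dictionary file could look like this (fields are tab-separated; the entries are illustrative examples, not shipped with kotok):

```
김치찌개	NNG
먹*	VV
서울	NNP!
불고기	NNG!NNG,NNP
```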

To enable the user dictionary, the -u option should be used with the path to the user dictionary file or directory. If a directory is specified, all tsv files in the directory are loaded recursively.

Further options

Further command line options can be found by running python -m kotok inference --help.

Use kotok as a library

To use kotok as a library, the Analyzer class can be imported and used as follows:

from kotok import Analyzer

analyzer = Analyzer(
    model="<pos tokenizer model name or path>",
    classification_model="<fine-tuned pos classification model directory>",
    error_model="<error tokenizer model name or path>",
    error_classification_model="<fine-tuned error classification model directory>",
    spacing_model="<spacing tokenizer model name or path>",
    spacing_classification_model="<fine-tuned spacing error classification model directory>",
    lemma_data="<lemmatization data directory>",
)

result = analyzer.analyze("안녕하세요.")
print(result) # [안녕/NNG, 하/XSA, 세요/EF, ./SF]

Detailed information on the Analyzer class can be found in its docstrings.

License

kotok is licensed under the GNU General Public License v3.0. See the LICENSE file for more information.
