Skip to main content

Noise Generator for Korean Text

Project description

Noise Generator for Korean Text Grammar Error Correction Model

This model is using python-mecab-ko as the main tokenizer.

If you need to know the token types check the website below

https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=aramjo&logNo=221404488280

Requirements

Python >= 3.7

pip install wget
pip install konlpy
pip install hangul_utils
pip install hangul_jamo
pip install inko
pip install g2pk
pip install gensim

Currently not available in Windows OS.
Only available at Linux or Colab

How to use

Preparation

git clone https://github.com/JKJIN1999/gec_noise_generator_ko.git

Manualy Download Wiki.ko.vec from this link

https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ko.vec

Place the wiki.ko.vec file to gec_noise_generator_ko/src/resources

Type this in your terminal

python3 gec_noise_generator_ko/src/gecnk/noise.py --e (Error type of your preference)

The result will appear in results file.

Requires a Text file with its directory as the First argument when calling the function.
Text data should be in txt file and each sentence should be organized through each line.

Functions (Error Type) (second argument)

  • spacing_error

Creates Random spacing error according to rule written below
(명사, 관형어) + (명사 , 관형어 의존명사) 인 경우 띄어쓰기를 만든다
조사와 접사를 띄어써서 생기는 오류
단어 가운대에 띄어쓰기를 해서 생기는 오류

  • punctuation_error

Converts puntuation within the same type
문장속의 토큰중 품사가 “S-” 로 시작하는 랜덤하게 고른 토큰을 같은 품사표 내에서 랜덤하게 변경한다.

  • phonetic_first_error or phonetic_last_error

Converts the first or last character if it exists in phonetic data list
여 > 녀 / 율 > 률

  • remove_josa_error

Randomly remove josa from sentence
Remove token which type starts with "J"

  • addition_error

Randomly add consonant to character which has "ㅇ" as its starting consonant or doens't have the last consonant
아기 > 바기 / 다치 > 닫치

  • separation_error

From the randomly selected word, decompose the letter to consonant and vowel
할 > ㅎㅏㄹ

  • typing_language_error

Convert Korean text to English text regarding to the same keyboard position
한글 > gksrmf / 고양이 > rhdiddl

  • postposition_diff_josa_error or postposition_same_josa_error

Convert a Josa to either different or same type of Josa from the Josa dataset.
를 > 을 / 에게 > 할

  • busa_error

Convert busa "이" to "히" either way
부단히 > 부단이 / 같이 > 같히

  • middle_shiot_error

If there is "ㅅ" as the last consonant in a word longer than 2 characters, erase the last consonant "ㅅ"
숫자 > 수자 / 찻잔 > 차잔

  • grapheme_to_phonem_error

Convert the word's textual form as it is pronounciated. If the textual outcome based on the pronounciation of that specific word is not same as the current textual form of the word
행복하다 > 행보카다 / 같이 > 가치

  • overlapping_sound_error

If there are two continuous Tensed Consonant letter positioning in each first consonant letter, convert the second letter's first consonant into Basic Consonant.
딱딱하다 > 딱닥하다 / 쌉쌀하다 > 쌉살하다

  • final_suffix_error

Convert the randomly selected final suffix into a different final suffix that doesn't match the original suffix
하겠습니다 > 하겠습네까 / 하고있다 > 하고있니

  • mag_error & maj_error

By using the similarity function in Gensim KeyedVector, convert the typical pumsa type (mag or maj) into a different busa.
mag 얼마나 > 어떻게 / maj 하지만 > 그러나

  • polite_speech_error

Misusage of two types of polite speech josa, nominative josa and adverbal josa
이,가 > 께서 / 에게 > 께

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gec_noise_generator_ko-0.1.2.tar.gz (10.6 kB view hashes)

Uploaded Source

Built Distribution

gec_noise_generator_ko-0.1.2-py3-none-any.whl (12.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page