Noise Generator for Korean Text
Noise Generator for Korean Text Grammar Error Correction Model
This package uses python-mecab-ko as its main tokenizer.
To look up the token (part-of-speech) types, see the page below:
https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=aramjo&logNo=221404488280
Requirements
Python >= 3.7
pip install wget
pip install konlpy
pip install hangul_utils
pip install hangul_jamo
pip install inko
pip install g2pk
pip install gensim
Currently not supported on Windows.
Manually download wiki.ko.vec from this link:
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ko.vec
Place the wiki.ko.vec file in gec_noise_generator_ko/project/resources
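Because the vector file is large, it can be worth checking that the download is intact before running the generator. The fastText `.vec` text format starts with a header line holding the vocabulary size and the vector dimension, and the minimal sketch below reads it. The helper name `read_vec_header` is illustrative and not part of this package.

```python
from pathlib import Path


def read_vec_header(path):
    """Return (vocabulary_size, dimension) from the first line of a .vec file."""
    with Path(path).open(encoding="utf-8") as handle:
        count, dim = handle.readline().split()
    return int(count), int(dim)
```

For example, `read_vec_header("gec_noise_generator_ko/project/resources/wiki.ko.vec")` should return two positive integers if the file is in place and readable.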
How to use
Preparation
pip install gec_noise_generator_ko
from project import *
noise("text_directory")
import project
nsgec = noise_generate.NoiseGenerate("text data directory", "error Type")
The result is written to the results file.
The first argument must be the path to a text file.
The text data must be a .txt file with one sentence per line.
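The input format described above (one sentence per line in a .txt file) can be produced with a few lines of standard-library Python. This is a minimal sketch: the helper name `write_sentences` and the choice of UTF-8 encoding are assumptions, not something the package documents.

```python
from pathlib import Path


def write_sentences(path, sentences):
    """Write one stripped, non-empty sentence per line and return the line count."""
    lines = [s.strip() for s in sentences if s.strip()]
    # Assumption: the generator reads UTF-8 text.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

The resulting file path is what would be passed as the first argument.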
Functions (error type, passed as the second argument)
- spacing_error
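Creates random spacing errors according to the rules below:
When a (noun, adnominal) is followed by a (noun, adnominal dependent noun), a space is inserted between them.
Errors caused by writing a josa (particle) or an affix apart from its word.
Errors caused by inserting a space in the middle of a word.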
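The third rule (a space in the middle of a word) is the simplest to illustrate. The sketch below is only an approximation of that rule: the function name `split_word` is hypothetical, and the package's own choice of which word and which position to split may differ.

```python
import random


def split_word(word, rng):
    """Insert one space at a random position strictly inside the word."""
    if len(word) < 2:
        return word  # a single character has no interior position
    cut = rng.randint(1, len(word) - 1)
    return word[:cut] + " " + word[cut:]
```

Passing a seeded generator, for example `split_word("띄어쓰기", random.Random(0))`, keeps the output reproducible.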
- punctuation_error
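Converts punctuation to another mark of the same type.
Randomly picks a token in the sentence whose part-of-speech tag starts with "S" and replaces it with a random token from the same part-of-speech table.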
- phonetic_first_error or phonetic_last_error
Converts the first or last character of a word if it appears in the phonetic data list
여 > 녀 / 율 > 률
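A minimal sketch of this substitution is shown below. The table `PHONETIC_PAIRS` is a hypothetical stand-in that holds only the two examples above; the package's real phonetic data list is not reproduced here, and `phonetic_error` is an illustrative name.

```python
# Hypothetical stand-in for the package's phonetic data list.
PHONETIC_PAIRS = {"여": "녀", "율": "률"}


def phonetic_error(word, position="first"):
    """Swap the first or last character of `word` if it is in PHONETIC_PAIRS."""
    if not word:
        return word
    index = 0 if position == "first" else len(word) - 1
    replacement = PHONETIC_PAIRS.get(word[index])
    if replacement is None:
        return word  # character is not in the table, leave the word alone
    return word[:index] + replacement + word[index + 1:]
```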
- remove_josa_error
Randomly removes a josa (particle) from the sentence
Removes a token whose type starts with "J"
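The rule can be sketched on tokens that have already been tagged, represented as `(surface, tag)` pairs. The function name `remove_random_josa` and the hand-written input in the usage note are assumptions made for illustration; in the package the tags would come from the tokenizer.

```python
import random


def remove_random_josa(tagged_tokens, rng):
    """Drop one randomly chosen token whose tag starts with "J"."""
    josa_positions = [
        i for i, (_, tag) in enumerate(tagged_tokens) if tag.startswith("J")
    ]
    if not josa_positions:
        return list(tagged_tokens)  # no josa to remove
    drop = rng.choice(josa_positions)
    return [tok for i, tok in enumerate(tagged_tokens) if i != drop]
```

With the hand-tagged input `[("학교", "NNG"), ("에", "JKB"), ("가", "VV")]`, the only candidate is `("에", "JKB")`, so that token is the one removed.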
- addition_error
Randomly adds a consonant to a character that has "ㅇ" as its initial consonant or that has no final consonant
아기 > 바기 / 다치 > 닫치
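Both variants can be expressed with Unicode arithmetic on precomposed Hangul syllables, where a code point equals 0xAC00 + (initial × 21 + medial) × 28 + final. The sketch below takes the new consonant as an explicit index rather than choosing it at random, and all function names are hypothetical rather than taken from the package.

```python
HANGUL_BASE = 0xAC00  # code point of "가"
INITIAL_IEUNG = 11    # index of "ㅇ" among the 19 initial consonants


def is_syllable(ch):
    return "가" <= ch <= "힣"


def split_syllable(ch):
    """Return (initial, medial, final) indices of a precomposed syllable."""
    offset = ord(ch) - HANGUL_BASE
    return offset // 588, (offset % 588) // 28, offset % 28


def join_syllable(initial, medial, final):
    return chr(HANGUL_BASE + (initial * 21 + medial) * 28 + final)


def replace_ieung_initial(ch, new_initial):
    """Replace a leading "ㅇ" with another initial consonant (given by index)."""
    if not is_syllable(ch):
        return ch
    initial, medial, final = split_syllable(ch)
    if initial != INITIAL_IEUNG:
        return ch
    return join_syllable(new_initial, medial, final)


def add_final(ch, new_final):
    """Add a final consonant (given by index) to a syllable that has none."""
    if not is_syllable(ch):
        return ch
    initial, medial, final = split_syllable(ch)
    if final != 0:
        return ch
    return join_syllable(initial, medial, new_final)
```

Index 7 is "ㅂ" among the initial consonants and "ㄷ" among the final consonants, so `replace_ieung_initial("아", 7)` gives "바" and `add_final("다", 7)` gives "닫", matching the two examples above.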
- separation_error
Decomposes the letters of a randomly selected word into their consonants and vowels
할 > ㅎㅏㄹ
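The decomposition itself can be sketched with plain Unicode arithmetic, without the Hangul helper libraries listed under Requirements. This is an illustrative stand-in, not the package's implementation, and `decompose` is a hypothetical name.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 means "no final consonant".
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(text):
    """Split every precomposed Hangul syllable in `text` into its jamo."""
    out = []
    for ch in text:
        if "가" <= ch <= "힣":
            offset = ord(ch) - 0xAC00
            out.append(
                INITIALS[offset // 588]
                + MEDIALS[(offset % 588) // 28]
                + FINALS[offset % 28]
            )
        else:
            out.append(ch)  # leave non-Hangul characters unchanged
    return "".join(out)


print(decompose("할"))  # ㅎㅏㄹ
```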
- typing_language_error
Converts Korean text to the English letters at the same keyboard positions
한글 > gksrmf / 고양이 > rhdiddl
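The conversion amounts to decomposing each syllable and looking up the key for each component on the standard two-set (dubeolsik) layout. The sketch below is a standalone illustration under that assumption; it does not use the inko package from the requirements, and `to_qwerty` is a hypothetical name.

```python
# Keys on a two-set Korean keyboard, in Unicode jamo order.
INITIAL_KEYS = ["r", "R", "s", "e", "E", "f", "a", "q", "Q", "t",
                "T", "d", "w", "W", "c", "z", "x", "v", "g"]
MEDIAL_KEYS = ["k", "o", "i", "O", "j", "p", "u", "P", "h", "hk", "ho",
               "hl", "y", "n", "nj", "np", "nl", "b", "m", "ml", "l"]
FINAL_KEYS = ["", "r", "R", "rt", "s", "sw", "sg", "e", "f", "fr", "fa",
              "fq", "ft", "fx", "fv", "fg", "a", "q", "qt", "t", "T",
              "d", "w", "c", "z", "x", "v", "g"]


def to_qwerty(text):
    """Type each Hangul syllable as the Latin keys at the same positions."""
    out = []
    for ch in text:
        if "가" <= ch <= "힣":
            offset = ord(ch) - 0xAC00
            out.append(
                INITIAL_KEYS[offset // 588]
                + MEDIAL_KEYS[(offset % 588) // 28]
                + FINAL_KEYS[offset % 28]
            )
        else:
            out.append(ch)  # non-Hangul characters pass through
    return "".join(out)


print(to_qwerty("한글"))    # gksrmf
print(to_qwerty("고양이"))  # rhdiddl
```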
- postposition_diff_josa_error or postposition_same_josa_error
Converts a josa to a josa of either a different type or the same type, taken from the josa dataset.
를 > 을 / 에게 > 할
- busa_error
Swaps the adverb (busa) endings "이" and "히", in either direction
부단히 > 부단이 / 같이 > 같히
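At the string level the swap is a one-character replacement at the end of the word. The sketch below assumes the caller has already established that the word is an adverb, which the package presumably does through its tokenizer; `swap_adverb_ending` is a hypothetical name.

```python
def swap_adverb_ending(word):
    """Swap a trailing "이" for "히" or a trailing "히" for "이"."""
    if word.endswith("이"):
        return word[:-1] + "히"
    if word.endswith("히"):
        return word[:-1] + "이"
    return word  # no matching ending


print(swap_adverb_ending("부단히"))  # 부단이
print(swap_adverb_ending("같이"))    # 같히
```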
- middle_shiot_error
If a word longer than 2 characters has "ㅅ" as a final consonant, removes that final "ㅅ"
숫자 > 수자 / 찻잔 > 차잔
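Since "ㅅ" occupies slot 19 of the 28 final-consonant positions in a precomposed syllable, removing it is a subtraction on the code point. The sketch below is an assumption-laden illustration: it applies to any word of two or more syllables so that it reproduces the two-syllable examples above, it only touches non-final syllables, and `drop_middle_siot` is a hypothetical name.

```python
HANGUL_BASE = 0xAC00
FINAL_SIOT = 19  # index of "ㅅ" among the 28 final-consonant slots


def drop_middle_siot(word):
    """Remove a final "ㅅ" from every syllable except the last one."""
    chars = list(word)
    for i, ch in enumerate(chars[:-1]):  # the last syllable is left alone
        if "가" <= ch <= "힣" and (ord(ch) - HANGUL_BASE) % 28 == FINAL_SIOT:
            chars[i] = chr(ord(ch) - FINAL_SIOT)  # same syllable, no final
    return "".join(chars)


print(drop_middle_siot("숫자"))  # 수자
print(drop_middle_siot("찻잔"))  # 차잔
```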
- grapheme_to_phonem_error
Rewrites a word the way it is pronounced, provided that the pronunciation-based spelling differs from the word's current spelling
행복하다 > 행보카다 / 같이 > 가치
- overlapping_sound_error
If two consecutive syllables both start with a tense consonant, converts the initial consonant of the second syllable into its basic (plain) consonant.
딱딱하다 > 딱닥하다 / 쌉쌀하다 > 쌉살하다
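In Unicode order each tense initial consonant sits directly after its plain counterpart, so the conversion is again a code-point shift. The sketch below changes only the first matching pair it finds, which is an assumption about the intended behavior; `relax_second_tense` and `initial_index` are hypothetical names.

```python
HANGUL_BASE = 0xAC00
# Initial-consonant index of each tense consonant -> its plain counterpart:
# ㄲ>ㄱ, ㄸ>ㄷ, ㅃ>ㅂ, ㅆ>ㅅ, ㅉ>ㅈ
TENSE_TO_PLAIN = {1: 0, 4: 3, 8: 7, 10: 9, 13: 12}


def initial_index(ch):
    """Return the initial-consonant index, or None for non-Hangul input."""
    if "가" <= ch <= "힣":
        return (ord(ch) - HANGUL_BASE) // 588
    return None


def relax_second_tense(word):
    """Make the second of two consecutive tense-initial syllables plain."""
    chars = list(word)
    for i in range(1, len(chars)):
        first = initial_index(chars[i - 1])
        second = initial_index(chars[i])
        if first in TENSE_TO_PLAIN and second in TENSE_TO_PLAIN:
            shift = (second - TENSE_TO_PLAIN[second]) * 588
            chars[i] = chr(ord(chars[i]) - shift)
            break  # only the first matching pair is changed
    return "".join(chars)


print(relax_second_tense("딱딱하다"))  # 딱닥하다
print(relax_second_tense("쌉쌀하다"))  # 쌉살하다
```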
- final_suffix_error
Converts a randomly selected final suffix into a different final suffix
하겠습니다 > 하겠습네까 / 하고있다 > 하고있니
- mag_error & maj_error
Uses the similarity function of Gensim KeyedVectors to replace a word of the given part-of-speech (pumsa) type, MAG or MAJ, with a different adverb (busa).
mag 얼마나 > 어떻게 / maj 하지만 > 그러나
- polite_speech_error
Misuses two types of honorific josa: the nominative josa and the adverbial josa
이,가 > 께서 / 에게 > 께
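On tagged tokens, the two substitutions above can be sketched as a lookup keyed by surface form and tag. The tags `JKS` (nominative josa) and `JKB` (adverbial josa) follow the MeCab-ko tag set; the table `HONORIFIC_SWAPS`, the function name, and the decision to replace every match rather than a random one are assumptions made for illustration.

```python
# Hypothetical table built from the examples above.
HONORIFIC_SWAPS = {
    ("이", "JKS"): "께서",
    ("가", "JKS"): "께서",
    ("에게", "JKB"): "께",
}


def apply_polite_error(tagged_tokens):
    """Replace plain nominative/adverbial josa with their honorific forms."""
    return [
        (HONORIFIC_SWAPS.get((surface, tag), surface), tag)
        for surface, tag in tagged_tokens
    ]
```

Keying on the tag as well as the surface form keeps a verb stem such as `("가", "VV")` from being rewritten.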