Noise Generator for Korean Text
Project description
Noise Generator for Korean Text Grammar Error Correction Model
This tool uses python-mecab-ko as its main tokenizer.
For the list of token (part-of-speech) types, see the page below:
https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=aramjo&logNo=221404488280
Requirements
Python >= 3.7
pip install wget
pip install konlpy
pip install hangul_utils
pip install hangul_jamo
pip install inko
pip install g2pk
pip install gensim
Currently not available on Windows.
Manually download wiki.ko.vec from this link:
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ko.vec
Place the wiki.ko.vec file in gec_noise_generator_ko/project/resources
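Once the file is in place, a quick sanity check can confirm that it is readable. The sketch below assumes the usual fastText text layout, in which the first line holds the vocabulary size and the vector dimension. `read_vec_header` is a hypothetical helper written for this example and is not part of this package.

```python
from pathlib import Path

# Location taken from the instructions above; adjust it to your own checkout.
VEC_PATH = Path("gec_noise_generator_ko/project/resources/wiki.ko.vec")


def read_vec_header(path):
    """Return (vocabulary size, vector dimension) from the first line of a .vec file."""
    with open(path, encoding="utf-8") as handle:
        count, dim = handle.readline().split()
    return int(count), int(dim)


if __name__ == "__main__":
    if VEC_PATH.is_file():
        print(read_vec_header(VEC_PATH))
    else:
        print(f"{VEC_PATH} not found; download it first")
```

Only the first line is read, so the check stays fast even though the full file is large.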
How to use
Preparation
pip install gec_noise4korean
from project import noise_generate  # assumption: noise_generate is a module inside the project package
nsgec = noise_generate.NoiseGenerate("text data directory", "error type")
The result is written to the results file.
The first argument must be the path to a text file.
The text data must be a .txt file with one sentence per line.
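A minimal sketch of preparing such an input file is shown here. `write_sentences` is a hypothetical helper, not part of this package; it only enforces the one-sentence-per-line layout described above.

```python
def write_sentences(path, sentences):
    """Write one sentence per line, skipping blank entries."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            # Collapse inner newlines and whitespace runs so each sentence stays on one line.
            cleaned = " ".join(sentence.split())
            if cleaned:
                handle.write(cleaned + "\n")
```

For example, `write_sentences("input.txt", ["고양이가 잔다.", "나는 학교에 간다."])` produces a two-line file that can then be passed as the first argument.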
Functions (error type, passed as the second argument)
- spacing_error
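Creates random spacing errors according to the rules below:
When a (noun, adnominal) is followed by a (noun, adnominal dependent noun), a space is inserted between them.
Errors caused by writing a josa (particle) or an affix apart from the word it attaches to.
Errors caused by inserting a space in the middle of a word.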
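The third rule (a space in the middle of a word) can be sketched without a tokenizer. This is only an illustration under simple assumptions: words are whitespace-separated, and `split_word_error` is a hypothetical name rather than the package's own function. The first two rules need part-of-speech tags and are not covered here.

```python
import random


def split_word_error(sentence, rng=None):
    """Insert one space inside a randomly chosen word of two or more characters."""
    rng = rng or random.Random()
    words = sentence.split()
    candidates = [i for i, word in enumerate(words) if len(word) >= 2]
    if not candidates:
        return sentence  # nothing long enough to split
    i = rng.choice(candidates)
    cut = rng.randint(1, len(words[i]) - 1)  # split point strictly inside the word
    words[i] = words[i][:cut] + " " + words[i][cut:]
    return " ".join(words)
```

Passing a seeded `random.Random` makes the generated noise reproducible.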
- punctuation_error
Converts punctuation within the same type
A randomly chosen token whose part-of-speech tag starts with "S" is randomly replaced with another token of the same tag.
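As a rough sketch of the punctuation swap, the code below replaces one punctuation mark with another from the same group. The groups are hypothetical and only loosely follow mecab-ko style symbol tags (SF for sentence-final marks, SC for separators). The package itself selects tokens by their tokenizer tag, which this example does not do.

```python
import random

# Hypothetical groups; the real tool relies on the tokenizer's "S..." tags.
PUNCT_GROUPS = {"SF": ".?!", "SC": ",/:"}


def punctuation_error(sentence, rng=None):
    """Replace one randomly chosen punctuation mark with another of the same group."""
    rng = rng or random.Random()
    positions = [
        (i, group)
        for i, ch in enumerate(sentence)
        for group in PUNCT_GROUPS.values()
        if ch in group
    ]
    if not positions:
        return sentence  # no punctuation to change
    i, group = rng.choice(positions)
    replacement = rng.choice([c for c in group if c != sentence[i]])
    return sentence[:i] + replacement + sentence[i + 1:]
```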
- phonetic_first_error or phonetic_last_error
Converts the first or last character of a word if it appears in the phonetic data list
여 > 녀 / 율 > 률
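A minimal sketch of the two phonetic variants follows. The lookup table contains only the two pairs shown above; the package's actual phonetic data list is not reproduced here, and both function bodies are illustrative assumptions.

```python
# Only the pairs given in the description; the real data list is larger.
PHONETIC_PAIRS = {"여": "녀", "율": "률"}


def phonetic_first_error(word):
    """Replace the first character when it appears in the phonetic table."""
    if word and word[0] in PHONETIC_PAIRS:
        return PHONETIC_PAIRS[word[0]] + word[1:]
    return word


def phonetic_last_error(word):
    """Replace the last character when it appears in the phonetic table."""
    if word and word[-1] in PHONETIC_PAIRS:
        return word[:-1] + PHONETIC_PAIRS[word[-1]]
    return word
```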
- remove_josa_error
Randomly removes a josa from the sentence
Removes a token whose type starts with "J"
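The removal step can be sketched on input that is already tagged. The example assumes each word is given as a list of (morpheme, tag) pairs whose surfaces concatenate back to the word; real tokenizer output may need extra handling for contracted forms. The tags in the usage note are hand-written for illustration.

```python
import random


def remove_josa_error(tagged_sentence, rng=None):
    """Drop one randomly chosen morpheme whose tag starts with "J".

    tagged_sentence: list of words, each a list of (morpheme, tag) pairs.
    """
    rng = rng or random.Random()
    josa_positions = [
        (w, m)
        for w, word in enumerate(tagged_sentence)
        for m, (_, tag) in enumerate(word)
        if tag.startswith("J")
    ]
    drop = rng.choice(josa_positions) if josa_positions else None
    words = []
    for w, word in enumerate(tagged_sentence):
        surface = "".join(s for m, (s, _) in enumerate(word) if (w, m) != drop)
        if surface:
            words.append(surface)
    return " ".join(words)
```

With `[[("고양이", "NNG"), ("가", "JKS")], [("잔다", "VV")]]` as input, the only josa is removed and the sentence becomes 고양이 잔다.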
- addition_error
Randomly adds a consonant to a character that has "ㅇ" as its initial consonant or that has no final consonant
아기 > 바기 / 다치 > 닫치
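Both variants can be expressed with Unicode arithmetic on precomposed Hangul syllables. The sketch below is an assumption about how such an edit could be done, not the package's code; the caller picks which consonant to add, whereas the tool chooses one at random.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals
BASE, N_JUNG, N_JONG = 0xAC00, 21, 28


def _is_syllable(ch):
    return 0xAC00 <= ord(ch) <= 0xD7A3


def _split(ch):
    """Return (initial, vowel, final) indices of a precomposed syllable."""
    cho, rest = divmod(ord(ch) - BASE, N_JUNG * N_JONG)
    jung, jong = divmod(rest, N_JONG)
    return cho, jung, jong


def _join(cho, jung, jong):
    return chr(BASE + (cho * N_JUNG + jung) * N_JONG + jong)


def replace_initial_ieung(ch, new_initial):
    """Swap an initial "ㅇ" for another consonant, e.g. 아 -> 바."""
    if not _is_syllable(ch):
        return ch
    cho, jung, jong = _split(ch)
    if CHO[cho] != "ㅇ":
        return ch
    return _join(CHO.index(new_initial), jung, jong)


def add_final(ch, new_final):
    """Add a final consonant to a syllable that has none, e.g. 다 -> 닫."""
    if not _is_syllable(ch):
        return ch
    cho, jung, jong = _split(ch)
    if jong != 0:
        return ch
    return _join(cho, jung, JONG.index(new_final))
```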
- separation_error
Decomposes the letters of a randomly selected word into consonants and vowels
할 > ㅎㅏㄹ
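The decomposition itself follows from the layout of the Unicode Hangul syllable block. The sketch below is a self-contained illustration; the package lists hangul_utils and hangul_jamo as dependencies, but this example does not call them, and decomposing every syllable of the chosen word is an assumption.

```python
import random

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(ch):
    """Split one precomposed syllable into its jamo; other characters pass through."""
    code = ord(ch) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        return ch
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHO[cho] + JUNG[jung] + JONG[jong]


def separation_error(sentence, rng=None):
    """Decompose every syllable of one randomly chosen word."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence
    i = rng.randrange(len(words))
    words[i] = "".join(decompose(ch) for ch in words[i])
    return " ".join(words)
```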
- typing_language_error
Converts Korean text to the English characters at the same keyboard positions
한글 > gksrmf / 고양이 > rhdiddl
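The keyboard mapping can be sketched by decomposing each syllable and looking up the key sequence for each jamo on the standard two-set (dubeolsik) layout. The package lists inko as a dependency, presumably for this conversion; the example below does not use it, and `hangul_to_qwerty` is a hypothetical name.

```python
# Key sequences on the two-set Korean layout, indexed in Unicode jamo order.
CHO_KEYS = ["r", "R", "s", "e", "E", "f", "a", "q", "Q", "t",
            "T", "d", "w", "W", "c", "z", "x", "v", "g"]
JUNG_KEYS = ["k", "o", "i", "O", "j", "p", "u", "P", "h", "hk", "ho",
             "hl", "y", "n", "nj", "np", "nl", "b", "m", "ml", "l"]
JONG_KEYS = ["", "r", "R", "rt", "s", "sw", "sg", "e", "f", "fr", "fa",
             "fq", "ft", "fx", "fv", "fg", "a", "q", "qt", "t", "T",
             "d", "w", "c", "z", "x", "v", "g"]


def hangul_to_qwerty(text):
    """Convert precomposed Hangul syllables to the keys that would type them."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:
            cho, rest = divmod(code, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHO_KEYS[cho] + JUNG_KEYS[jung] + JONG_KEYS[jong])
        else:
            out.append(ch)  # spaces, punctuation, and Latin text are kept as-is
    return "".join(out)
```

Compound vowels and compound finals map to two keystrokes, which is why some entries contain two letters.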
- postposition_diff_josa_error or postposition_same_josa_error
Converts a josa to a josa of either a different or the same type, drawn from the josa dataset
를 > 을 / 에게 > 할
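The difference between the two variants can be illustrated with a tiny, hypothetical josa table grouped by mecab-ko style tags. The package's real josa dataset is not shown in this document, so the entries and function names below are assumptions made for the example.

```python
import random

# Hypothetical miniature dataset; the tool ships its own josa data.
JOSA_BY_TAG = {
    "JKS": ["이", "가"],  # subject markers
    "JKO": ["을", "를"],  # object markers
    "JX": ["은", "는"],   # auxiliary particles
}


def _tag_of(josa):
    for tag, members in JOSA_BY_TAG.items():
        if josa in members:
            return tag
    return None


def same_type_josa(josa, rng=None):
    """Pick a different josa that shares the same tag."""
    rng = rng or random.Random()
    tag = _tag_of(josa)
    if tag is None:
        return josa
    return rng.choice([j for j in JOSA_BY_TAG[tag] if j != josa])


def diff_type_josa(josa, rng=None):
    """Pick a josa that belongs to a different tag."""
    rng = rng or random.Random()
    tag = _tag_of(josa)
    if tag is None:
        return josa
    pool = [j for t, members in JOSA_BY_TAG.items() if t != tag for j in members]
    return rng.choice(pool)
```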
- busa_error
Swaps the busa (adverb) endings "이" and "히" in either direction
부단히 > 부단이 / 같이 > 같히
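A minimal sketch of the ending swap is given here. It works on a single word and does not check that the word is actually an adverb, which the tool presumably decides from the tokenizer's tags; `busa_error` as written is an illustration only.

```python
def busa_error(word):
    """Swap a trailing "히" for "이", or a trailing "이" for "히"."""
    if word.endswith("히"):
        return word[:-1] + "이"
    if word.endswith("이"):
        return word[:-1] + "히"
    return word
```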
- middle_shiot_error
If a word longer than 2 characters has "ㅅ" as a final consonant, removes that "ㅅ"
숫자 > 수자 / 찻잔 > 차잔
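Removing a final "ㅅ" is a small piece of Unicode arithmetic, since "ㅅ" has index 19 among the 28 possible finals of a precomposed syllable. The sketch below makes two assumptions that the description does not settle: the default minimum length of 2 is chosen to match the examples, and only non-final syllables are changed.

```python
BASE, N_JONG, SIOT_FINAL = 0xAC00, 28, 19  # "ㅅ" is final-consonant index 19


def _has_siot_final(ch):
    return 0xAC00 <= ord(ch) <= 0xD7A3 and (ord(ch) - BASE) % N_JONG == SIOT_FINAL


def middle_shiot_error(word, min_len=2):
    """Remove the first final "ㅅ" found in a non-final syllable of the word."""
    if len(word) < min_len:
        return word
    for i, ch in enumerate(word[:-1]):
        if _has_siot_final(ch):
            # Subtracting the final index yields the same syllable without a final.
            return word[:i] + chr(ord(ch) - SIOT_FINAL) + word[i + 1:]
    return word
```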
Hashes for gec_noise_generator_ko-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 7873dfa0dc71caa67f2b51ab7468b0627b0654357a6cbc5d37aca4e5e52b47f2
MD5 | 88d716fba6235981e729c3af969931f3
BLAKE2b-256 | 69430807f176246cb7779af0264f8ccdd6584a33073eea9e50ac0b025bfbdfe9
Hashes for gec_noise_generator_ko-0.0.1-py3.10.egg
Algorithm | Hash digest
---|---
SHA256 | 9886539778bdc962cc586b5a6dbf0049ad3cdd60dcce6d77f948a7809e79248e
MD5 | e25a48ec7a1fc27c039d76e0bb03a664
BLAKE2b-256 | c185d1fb03a18e6ed249a5f771ee25f0adf0a48f34f409e61ce0a99fe490dbab
Hashes for gec_noise_generator_ko-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d4574063f00af20b7e6d45bc7bedf99da4d5d158e4fec3fc9b4ef2d82c575068
MD5 | bb5a877ec0524f402fccfc8984fc1497
BLAKE2b-256 | eab637733b3667c0320950aadd690bf8e317c5b1bc0016a8cb8a90d6b2f6ed9f