Korean Translation and Augmentation with fine-tuned NLLB

KoTAN: Korean Translation and Augmentation with fine-tuned NLLB

KoTAN is a package for Korean data augmentation and for en->ko and ko->en translation. The translation model is a fine-tuned version of Facebook's NLLB model. Data augmentation is done by back-translation. Speech-style conversion options are also provided.
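The back-translation idea can be sketched as follows. This is a minimal illustration and not KoTAN's implementation: `back_translate`, the `Translator` type, and the toy lookup-table translators are hypothetical names introduced here. In practice the two callables would wrap real ko->en and en->ko models.

```python
from typing import Callable, List

# A translator maps a batch of sentences to a batch of translations.
Translator = Callable[[List[str]], List[str]]


def back_translate(
    sentences: List[str], to_pivot: Translator, from_pivot: Translator
) -> List[str]:
    """Round-trip sentences through a pivot language to obtain paraphrases."""
    pivot_sentences = to_pivot(sentences)
    return from_pivot(pivot_sentences)


# Toy stand-ins for real translation models (hypothetical lookup tables).
KO_TO_EN = {"안녕": "hello"}
EN_TO_KO = {"hello": "안녕하세요"}


def toy_ko_to_en(sentences: List[str]) -> List[str]:
    # Unknown sentences pass through unchanged.
    return [KO_TO_EN.get(s, s) for s in sentences]


def toy_en_to_ko(sentences: List[str]) -> List[str]:
    return [EN_TO_KO.get(s, s) for s in sentences]


augmented = back_translate(["안녕"], toy_ko_to_en, toy_en_to_ko)
```

The round trip returns a sentence with the same meaning but possibly different wording, which is what makes it usable as augmented training data.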

Package install

  • torch=2.0.0 (CUDA 12.0) and python>=3.8 are supported.
  • You can install the package with the command below.
pip3 install kotan

Usage

  • You can use KoTAN with the commands below.
  • Import the package.
>>> from kotan import KoTAN
  • Available tasks
>>> KoTAN.available_tasks()
  • Available languages
>>> KoTAN.available_lang()
  • Data augmentation options
>>> KoTAN.available_level()
  • origin: the NLLB model before fine-tuning.
  • fine: the NLLB model after fine-tuning.
  • Speech-style conversion options
>>> KoTAN.available_style()
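  • formal: 문어체 (written style)
  • informal: 구어체 (colloquial style)
  • android: 안드로이드 (android)
  • azae: 아재 (middle-aged man)
  • chat: 채팅 (chat)
  • choding: 초등학생 (elementary school student)
  • emoticon: 이모티콘 (emoticon)
  • enfp: enfp
  • gentle: 신사 (gentleman)
  • halbae: 할아버지 (grandfather)
  • halmae: 할머니 (grandmother)
  • joongding: 중학생 (middle school student)
  • king: 왕 (king)
  • naruto: 나루토 (Naruto)
  • seonbi: 선비 (Confucian scholar)
  • sosim: 소심한 (timid)
  • translator: 번역기 (machine translator)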
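When style names come from user input or a config file, it can help to check them before constructing a model. The sketch below uses a hard-coded copy of the style names listed above. `STYLES` and `check_style` are hypothetical helpers, not part of KoTAN, and `KoTAN.available_style()` remains the authoritative source.

```python
import difflib

# Style names copied from the list above (an assumption: the installed
# package version may expose a different set).
STYLES = (
    "formal", "informal", "android", "azae", "chat", "choding",
    "emoticon", "enfp", "gentle", "halbae", "halmae", "joongding",
    "king", "naruto", "seonbi", "sosim", "translator",
)


def check_style(style: str) -> str:
    """Return the style unchanged if it is known, else raise ValueError."""
    if style in STYLES:
        return style
    # Suggest up to three similarly spelled style names, if any.
    hints = difflib.get_close_matches(style, STYLES, n=3)
    suffix = f" Did you mean: {', '.join(hints)}?" if hints else ""
    raise ValueError(f"Unknown style {style!r}.{suffix}")
```

A call such as `check_style("king")` returns the name unchanged, while a misspelled name raises `ValueError` before any model is loaded.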

Translation

>>> from kotan import KoTAN
>>> mt = KoTAN(task="translation", tgt="en")
>>> inputs = ['나는 온 세상 사람들이 행복해지길 바라', '나는 선한 영향력을 펼치는 사람이 되고 싶어']
>>> mt.predict(inputs)
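For large inputs, one option is to call `predict` on fixed-size chunks rather than on the whole list at once. The sketch below assumes only that the model object has a `predict(list_of_str)` method returning one output per input, as the example above suggests. `predict_in_batches` and `UpperCaseModel` are hypothetical names, and the stand-in model is used so the snippet runs without downloading any weights.

```python
from typing import List


def predict_in_batches(model, sentences: List[str], batch_size: int = 16) -> List[str]:
    """Call model.predict on consecutive chunks and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outputs: List[str] = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        outputs.extend(model.predict(chunk))
    return outputs


class UpperCaseModel:
    """Stand-in for a translation model; records the size of each call."""

    def __init__(self) -> None:
        self.call_sizes: List[int] = []

    def predict(self, inputs: List[str]) -> List[str]:
        self.call_sizes.append(len(inputs))
        return [text.upper() for text in inputs]


demo_model = UpperCaseModel()
demo_outputs = predict_in_batches(demo_model, ["a", "b", "c"], batch_size=2)
```

With a real model, `mt` from the example above would take the place of `demo_model`. A suitable `batch_size` depends on available memory.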

Data Augmentation

Original NLLB model (before fine-tuning)

>>> from kotan import KoTAN
>>> aug = KoTAN(task="augmentation", level="origin")
>>> inputs = ['나는 온 세상 사람들이 행복해지길 바라', '나는 선한 영향력을 펼치는 사람이 되고 싶어']
>>> aug.predict(inputs)

NLLB model fine-tuned on AI Hub datasets

>>> from kotan import KoTAN
>>> aug = KoTAN(task="augmentation", level="fine")
>>> inputs = ['나는 온 세상 사람들이 행복해지길 바라', '나는 선한 영향력을 펼치는 사람이 되고 싶어']
>>> aug.predict(inputs)

Applying a speech-style conversion option

>>> from kotan import KoTAN
>>> aug = KoTAN(task="augmentation", style="chat")
>>> inputs = ['나는 온 세상 사람들이 행복해지길 바라', '나는 선한 영향력을 펼치는 사람이 되고 싶어']
>>> aug.predict(inputs)
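Augmented sentences may come back identical to their inputs, so a common follow-up step is to merge them with the originals and drop duplicates. The sketch below is one way to do that. `merge_augmented` is a hypothetical helper, not part of KoTAN, and it assumes the augmented outputs are plain strings.

```python
from typing import Iterable, List


def merge_augmented(originals: Iterable[str], augmented: Iterable[str]) -> List[str]:
    """Combine original and augmented sentences, keeping first occurrences only."""
    seen = set()
    merged: List[str] = []
    for sentence in list(originals) + list(augmented):
        # Collapse runs of whitespace so trivially different copies match.
        normalized = " ".join(sentence.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            merged.append(normalized)
    return merged
```

Originals are listed first, so when an augmented sentence matches an original, the original's position is kept and the copy is dropped.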

Speech-style conversion

>>> from kotan import KoTAN
>>> style = KoTAN(task="augmentation", style="king")
>>> inputs = ['나는 온 세상 사람들이 행복해지길 바라', '나는 선한 영향력을 펼치는 사람이 되고 싶어']
>>> style.predict(inputs)

Demo

Hugging Face KoTAN Space

Citation

@misc{KoTAN,
  author       = {Juhwan Lee and Jisu Kim},
  title        = {KoTAN: Korean Translation and Augmentation with fine-tuned NLLB},
  howpublished = {\url{https://github.com/KoJLabs/KoTAN}},
  year         = {2023},
}

Contributors

Jisu Kim, Juhwan Lee

License

The KoTAN project is licensed under the Apache License 2.0.
