vina2vi

vina2vi stands for "Vietnamese no accent to Vietnamese".
It is a Python package that aims to help foreigners decipher Vietnamese messages.

Among other things, we plan to make vina2vi capable of

  • Restoring Vietnamese diacritics
  • Correcting spelling
  • Translating acronyms, đổi vần, etc.

Installation

I developed vina2vi with Python 3.10, but any Python version >= 3.8 should work.

Run the following command to install vina2vi:

pip install vina2vi

Alternatively, install the latest commit from GitLab as follows:

pip install git+https://gitlab.com/phunc20/vina2vi

Usage

I only work on this project in my spare time, and progress is slow. This README is likely to change quickly and substantially, so please make sure the README you are reading matches the version you have installed. For the moment, the package does not contain much that is very useful; I will add more over time.

I roughly classify the code into the following categories:

  • models
  • metrics
  • util

vina2vi/models/

For example, one can try out a Transformer model:

In [1]: from vina2vi.models.tf.tuto_transformer import Translator

In [2]: translator = Translator.from_pretrained()

In [3]: translator.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'sang này tôi đàn con tôi đi công viên chơi .'

In [4]: translator.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'trường đại học xã hội và nhân văn'

Or a bigram model:

In [1]: from vina2vi.models.char_based.bigram import Bigram

In [2]: bigram = Bigram.from_pretrained()

In [3]: bigram.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'Sang này tôi đãn cón tôi đi cóng viện chôi.'

In [4]: bigram.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'Trường đãi hôc xã hôi và nhàn vàn'

In particular, the CRF model in vina2vi/models/crf.py is borrowed directly from trungtv. The only reason I use a GitHub installation link, as in requirements.txt, is that the original repo has a small bug and the package no longer seems to be maintained. I therefore forked it and made a few commits to fix the bug.

However, PyPI does not seem to accept a pyproject.toml that lists a GitHub link as a dependency. Therefore, to use the CRF model, please install the pyvi package as specified in requirements.txt.

vina2vi/util.py

For example, the utility function is_foreign tells whether a string contains non-Vietnamese characters. As the name suggests,

  • If the string contains characters outside the modern Vietnamese alphabet, then is_foreign returns True
  • If the string consists exclusively of characters from the modern Vietnamese alphabet, then is_foreign returns False
    • Languages whose alphabets are subsets of the Vietnamese alphabet are thus treated as Vietnamese
    • Currently, chữ Nôm is not treated as Vietnamese; this may change in the future

In [1]: from vina2vi.util import Vietnamese

In [2]: Vietnamese.is_foreign("Российская Федерация\tRossiyskaya Federatsiya")
Out[2]: True

In [3]: Vietnamese.is_foreign("\n\tRossiyskaya Federatsiya")
Out[3]: False

In [4]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t碎呐㗂越南")
Out[4]: True

In [5]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t")
Out[5]: False
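
The behaviour shown above can be approximated with the standard library alone. The following is a minimal sketch, not the package's actual implementation: the names is_foreign_sketch and VIETNAMESE_LETTERS are hypothetical, and two rules are assumptions chosen only to agree with the four outputs above: plain ASCII letters count as Vietnamese, and non-letters (whitespace, digits, punctuation) are ignored.

```python
import string
import unicodedata

# Base vowels of the Vietnamese alphabet, written as a plain letter plus an
# optional combining mark (breve, circumflex, or horn).
_BASES = ["a", "a\u0306", "a\u0302", "e", "e\u0302", "i",
          "o", "o\u0302", "o\u031b", "u", "u\u031b", "y"]
# No tone, grave, acute, hook above, tilde, dot below.
_TONES = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def _build_vietnamese_letters():
    # Assumption: ASCII letters are accepted, since "Rossiyskaya Federatsiya"
    # is reported as not foreign in the session above.
    letters = set(string.ascii_letters) | {"\u0111", "\u0110"}  # đ, Đ
    for base in _BASES:
        for tone in _TONES:
            # NFC composes base + marks into one precomposed character.
            ch = unicodedata.normalize("NFC", base + tone)
            letters.update((ch, ch.upper()))
    return frozenset(letters)


VIETNAMESE_LETTERS = _build_vietnamese_letters()


def is_foreign_sketch(text):
    """Return True if text contains a letter outside VIETNAMESE_LETTERS."""
    text = unicodedata.normalize("NFC", text)
    # Only letters are checked; whitespace, digits and punctuation are skipped.
    return any(ch.isalpha() and ch not in VIETNAMESE_LETTERS for ch in text)
```

Normalizing to NFC before the membership test means that a string typed with combining marks is handled the same way as one typed with precomposed characters.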

There are also four useful normalizers (from Hugging Face's tokenizers library)

  1. uncased_vi_normalizer
  2. cased_vi_normalizer
  3. uncased_vina_normalizer
  4. cased_vina_normalizer

which help

  1. Ensure that visually similar characters are mapped to exactly the same characters
  2. (In the case of (un)cased_vina_normalizer) Remove diacritics
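
To make the two effects concrete, here is a small standard-library sketch. It is only a stand-in for illustration: the actual normalizers are objects from Hugging Face's tokenizers library, and the function names unify and strip_diacritics, as well as the choice of NFC/NFD Unicode normalization, are assumptions rather than the package's code.

```python
import unicodedata


def unify(text, lowercase=False):
    """Map visually identical spellings to one canonical form (NFC)."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def strip_diacritics(text, lowercase=False):
    """Remove Vietnamese diacritics, e.g. for building unaccented input."""
    # NFD splits each letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # đ/Đ carry a stroke, not a combining mark, so they are mapped by hand.
    stripped = stripped.replace("\u0111", "d").replace("\u0110", "D")
    return stripped.lower() if lowercase else stripped


# The same word typed with combining marks and with a precomposed letter.
with_combining = "Vie\u0323\u0302t"
precomposed = "Vi\u1ec7t"
print(with_combining == precomposed)                # False
print(unify(with_combining) == unify(precomposed))  # True
print(strip_diacritics("Trường đại học"))           # Truong dai hoc
```

The lowercase flag mirrors the cased/uncased distinction in the normalizer names.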

Evaluation and Data

vina2vi/metrics/evaluate_models.py gives a quick overview of the models' performance.

$ python evaluate_models.py
                                 mean  median
baseline                       0.7862  0.7813
bigram                         0.8405  0.8399
crf_trungtv                    0.8591  0.8561
tf_tuto_transformer (cased)    0.8691  0.8758
tf_tuto_transformer (uncased)  0.8918  0.9101

Here, mean and median are the mean and median similarity scores computed on the test dataset described below.
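
The similarity measure itself is not specified in this README. As a rough illustration of how such a table could be produced, the sketch below assumes a character-level ratio from the standard library's difflib; the functions similarity and summarize are hypothetical and are not taken from evaluate_models.py.

```python
from difflib import SequenceMatcher
from statistics import mean, median


def similarity(prediction, reference):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    # Assumption: difflib's ratio stands in for the unspecified metric.
    return SequenceMatcher(None, prediction, reference).ratio()


def summarize(predictions, references):
    """Return the mean and median similarity over paired sentences."""
    scores = [similarity(p, r) for p, r in zip(predictions, references)]
    return {"mean": round(mean(scores), 4), "median": round(median(scores), 4)}
```

Each model's restored text would be passed as predictions, with the original accented text as references.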

vina2vi/data/: Text of different types is collected in a balanced way: modern literature, lyrics, classics, etc.

$ wc -c vina2vi/data/*.txt
 3576 vina2vi/data/canh_dong_bat_tan.txt
 2261 vina2vi/data/cho_toi_xin_mot_ve_di_tuoi_tho.txt
 1775 vina2vi/data/chuyen_tinh_nguoi_trinh_nu_ten_thi.txt
 3174 vina2vi/data/tron_tim.txt
 1613 vina2vi/data/truyen_kieu.txt
12399 total
