vina2vi

vina2vi stands for "Vietnamese no accent to Vietnamese".
It is a Python package that aims to help foreigners decipher Vietnamese messages.

Among other things, we plan to make vina2vi capable of

  • Restoring Vietnamese diacritics
  • Correcting spelling
  • Translating acronyms, đổi vần, etc.

Installation

I developed vina2vi with Python 3.10, but I expect any Python version >= 3.8 to work.

Run the following command to install vina2vi:

pip install vina2vi

Alternatively, one can also install the latest commit from GitLab as follows.

pip install git+https://gitlab.com/phunc20/vina2vi

Usage

I work on this project only in my spare time, so progress is slow. This README is expected to change often and substantially, so please make sure the README you are reading matches the version you installed. For the moment, the package does not contain much that is really useful; I will add more as time goes by.

The code is roughly organized into the following categories:

  • models
  • metrics
  • util

vina2vi/models/

For example, one can play with a Transformer model:

In [1]: from vina2vi.models.tf.tuto_transformer import Translator

In [2]: translator = Translator.from_pretrained()

In [3]: translator.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'sang này tôi đàn con tôi đi công viên chơi .'

In [4]: translator.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'trường đại học xã hội và nhân văn'

Or with a bigram model:

In [1]: from vina2vi.models.char_based.bigram import Bigram

In [2]: bigram = Bigram.from_pretrained()

In [3]: bigram.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'Sang này tôi đãn cón tôi đi cóng viện chôi.'

In [4]: bigram.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'Trường đãi hôc xã hôi và nhàn vàn'

In particular, the CRF model in vina2vi/models/crf.py is borrowed directly from trungtv. The only reason requirements.txt points to a GitHub installation link is that the original repo has a small bug and the package no longer seems to be maintained. I therefore forked it and made a few commits to fix the bug.

However, it seems that PyPI does not accept a pyproject.toml that lists a GitHub link as a dependency. Therefore, to use the CRF model, please install the pyvi package as specified in requirements.txt.

vina2vi/util.py

For example, there is a utility function, is_foreign, that tells whether a string contains non-Vietnamese characters. As the name suggests,

  • If the string contains characters outside the modern Vietnamese alphabet, then is_foreign returns True
  • If the string consists exclusively of characters of the modern Vietnamese alphabet, then is_foreign returns False
    • Languages whose alphabet is a subset of the Vietnamese one are thus considered Vietnamese
    • Currently, we do not consider chữ Nôm as Vietnamese; maybe we will in the future

In [1]: from vina2vi.util import Vietnamese

In [2]: Vietnamese.is_foreign("Российская Федерация\tRossiyskaya Federatsiya")
Out[2]: True

In [3]: Vietnamese.is_foreign("\n\tRossiyskaya Federatsiya")
Out[3]: False

In [4]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t碎呐㗂越南")
Out[4]: True

In [5]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t")
Out[5]: False
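
To make the behaviour above more concrete, here is a minimal sketch of how such a check could be written with the standard library only. It is not the actual implementation of Vietnamese.is_foreign; the names ALLOWED and is_foreign_sketch are hypothetical. It assumes that all ASCII letters count as Vietnamese (consistent with the "Rossiyskaya Federatsiya" example) and that whitespace, digits and punctuation are simply ignored.

```python
import string
import unicodedata

# The 12 Vietnamese vowels without tone marks, and the 5 combining tone marks
# (grave, acute, tilde, hook above, dot below).
_BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
_TONE_MARKS = "\u0300\u0301\u0303\u0309\u0323"


def _vietnamese_letters() -> frozenset:
    """Build the set of letters treated as Vietnamese (an assumption of this sketch)."""
    letters = set(string.ascii_lowercase) | set(_BASE_VOWELS) | {"đ"}
    for vowel in _BASE_VOWELS:
        for mark in _TONE_MARKS:
            # NFC turns "vowel + combining mark" into a single precomposed letter.
            letters.add(unicodedata.normalize("NFC", vowel + mark))
    return frozenset(letters | {ch.upper() for ch in letters})


ALLOWED = _vietnamese_letters()


def is_foreign_sketch(text: str) -> bool:
    """Return True if any letter in `text` lies outside the ALLOWED set."""
    text = unicodedata.normalize("NFC", text)
    # Non-letters (whitespace, digits, punctuation) are ignored by assumption.
    return any(ch.isalpha() and ch not in ALLOWED for ch in text)
```

Under these assumptions the sketch agrees with the four is_foreign examples shown above; the real function may treat digits, punctuation or other edge cases differently.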

There are also four useful normalizers (from Hugging Face's tokenizers library)

  1. uncased_vi_normalizer
  2. cased_vi_normalizer
  3. uncased_vina_normalizer
  4. cased_vina_normalizer

which help

  1. Make sure that similar-looking characters are not only similar but also exactly the same
  2. (In the case of (un)cased_vina_normalizer) Remove diacritics
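
The two effects listed above can be illustrated with the standard library's unicodedata module. This is only a sketch of the idea, not the package's normalizers (which come from Hugging Face's tokenizers library); the function names normalize_vi and normalize_vina are hypothetical, and the `lowercase` flag is assumed to correspond to the "uncased" variants.

```python
import unicodedata


def normalize_vi(text: str, lowercase: bool = False) -> str:
    """Unify visually identical characters by composing them (NFC)."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def normalize_vina(text: str, lowercase: bool = False) -> str:
    """Like normalize_vi, but also remove Vietnamese diacritics."""
    # NFD splits each letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, breve, circumflex, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" have no Unicode decomposition, so they are mapped by hand.
    stripped = stripped.replace("đ", "d").replace("Đ", "D")
    stripped = unicodedata.normalize("NFC", stripped)
    return stripped.lower() if lowercase else stripped


# "Việt" typed with a precomposed "ệ" versus "e" + two combining marks:
composed = "Vi\u1ec7t"
decomposed = "Vie\u0323\u0302t"
print(composed == decomposed)                              # the raw strings differ
print(normalize_vi(composed) == normalize_vi(decomposed))  # equal after NFC
print(normalize_vina("Trường đại học"))
```

The explicit handling of "đ" is the one step that is easy to forget: stripping combining marks alone leaves it untouched.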

Evaluation and Data

vina2vi/metrics/evaluate_models.py gives a quick overview of the models' performance.

$ python evaluate_models.py
                                 mean  median
baseline                       0.7862  0.7813
bigram                         0.8405  0.8399
crf_trungtv                    0.8591  0.8561
tf_tuto_transformer (cased)    0.8691  0.8758
tf_tuto_transformer (uncased)  0.8918  0.9101

where mean and median are the mean and median similarities computed on a test dataset (described below).
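
This README does not state which similarity measure evaluate_models.py uses. As a rough sketch only, assuming a character-level similarity such as the ratio from difflib.SequenceMatcher (an assumption, not necessarily the metric used here), the two columns could be computed as follows. The names similarity and summarize are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean, median
from typing import Sequence, Tuple


def similarity(prediction: str, truth: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings.

    Assumption: difflib's ratio stands in for the package's real metric.
    """
    return SequenceMatcher(None, prediction, truth).ratio()


def summarize(predictions: Sequence[str], truths: Sequence[str]) -> Tuple[float, float]:
    """Return (mean, median) similarity over paired predictions and truths."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    scores = [similarity(p, t) for p, t in zip(predictions, truths)]
    return mean(scores), median(scores)
```

With a measure like this, a baseline row would naturally correspond to scoring the accent-free input itself against the ground truth, which is why even the baseline scores well above zero.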

vina2vi/data/ holds the test texts. They are chosen to balance different types of text: modern literature, lyrics, classics, etc.

$ wc -c vina2vi/data/*.txt
 3576 vina2vi/data/canh_dong_bat_tan.txt
 2261 vina2vi/data/cho_toi_xin_mot_ve_di_tuoi_tho.txt
 1775 vina2vi/data/chuyen_tinh_nguoi_trinh_nu_ten_thi.txt
 3174 vina2vi/data/tron_tim.txt
 1613 vina2vi/data/truyen_kieu.txt
12399 total
