vina2vi
vina2vi stands for "Vietnamese no accent to Vietnamese". It is a Python package that aims to help foreigners decipher Vietnamese messages.
Among other things, we plan to make vina2vi capable of:
- Restoring Vietnamese diacritics
- Correcting spelling
- Translating acronyms, đổi vần, etc.
Installation
During development I used Python 3.10, but any Python version >= 3.8 should work.
Run the following command to install vina2vi:
pip install vina2vi
Alternatively, one can also install the latest commit from GitLab as follows.
pip install git+https://gitlab.com/phunc20/vina2vi
Usage
I only work on this project in my spare time, and progress is slow. This README is likely to change often and substantially, so please make sure the README you are reading matches the version you have installed. For the moment, the package does not contain much that is truly useful; I will add more over time.
The code is roughly organized into the following categories:
- models
- metrics
- util
vina2vi/models/
For example, one can try to play with a Transformer model
In [1]: from vina2vi.models.tf.tuto_transformer import Translator
In [2]: translator = Translator.from_pretrained()
In [3]: translator.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'sang này tôi đàn con tôi đi công viên chơi .'
In [4]: translator.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'trường đại học xã hội và nhân văn'
Or with a bigram model
In [1]: from vina2vi.models.char_based.bigram import Bigram
In [2]: bigram = Bigram.from_pretrained()
In [3]: bigram.translate("Sang nay toi dan con toi di cong vien choi.")
Out[3]: 'Sang này tôi đãn cón tôi đi cóng viện chôi.'
In [4]: bigram.translate("Truong dai hoc xa hoi va nhan van")
Out[4]: 'Trường đãi hôc xã hôi và nhàn vàn'
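To give a feel for what a character-level bigram approach can look like, here is a toy sketch. It is not the code in vina2vi/models/char_based/bigram.py: the names strip_char, train_bigram and restore are hypothetical, and the training text is a single illustrative phrase. For each pair (previous restored character, current unaccented character) it remembers the accented character seen most often, then decodes greedily from left to right.

```python
import unicodedata
from collections import Counter, defaultdict

START = "^"  # hypothetical start-of-text marker


def strip_char(ch: str) -> str:
    """Return the unaccented base letter of a single NFC character."""
    if ch == "đ":  # đ/Đ have no Unicode decomposition, so map them by hand
        return "d"
    if ch == "Đ":
        return "D"
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch)
        if not unicodedata.combining(c)
    )
    return base or ch


def train_bigram(corpus):
    """Count accented characters per (previous char, unaccented char) pair."""
    counts = defaultdict(Counter)
    for line in corpus:
        prev = START
        for ch in unicodedata.normalize("NFC", line):
            counts[(prev, strip_char(ch))][ch] += 1
            prev = ch
    return counts


def restore(text, counts):
    """Greedy decoding: pick the most frequent accented char for each pair."""
    out = []
    prev = START
    for ch in text:
        seen = counts.get((prev, ch))
        best = seen.most_common(1)[0][0] if seen else ch  # unseen pair: keep as-is
        out.append(best)
        prev = best
    return "".join(out)


model = train_bigram(["tôi đi"])
print(restore("toi di", model))  # tôi đi
```

Because such a model only looks one character back, it has no notion of whole words, so it can easily emit syllables that do not exist in Vietnamese.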
In particular, the CRF model in vina2vi/models/crf.py is borrowed directly from trungtv. The only reason I point to a GitHub installation link, as in requirements.txt, is that the original repo has a small bug and the package no longer seems to be maintained. I therefore forked it and made a few commits to fix the bug.
However, PyPI does not seem to accept a pyproject.toml that lists a GitHub link as a dependency. Therefore, in order to use the CRF model, please install the pyvi package as specified in requirements.txt.
vina2vi/util.py
For example, there is a utility function, is_foreign, that tells whether a string contains non-Vietnamese characters. As the name suggests,
- If the string contains characters outside the modern Vietnamese alphabet, then is_foreign returns True
- If the string consists exclusively of characters of the modern Vietnamese alphabet, then is_foreign returns False
- Languages whose alphabets are a subset of the Vietnamese one are thus treated as Vietnamese
- Currently, we do not consider chữ Nôm as Vietnamese; maybe we will in the future
In [1]: from vina2vi.util import Vietnamese
In [2]: Vietnamese.is_foreign("Российская Федерация\tRossiyskaya Federatsiya")
Out[2]: True
In [3]: Vietnamese.is_foreign("\n\tRossiyskaya Federatsiya")
Out[3]: False
In [4]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t碎呐㗂越南")
Out[4]: True
In [5]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t")
Out[5]: False
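The rule described above can be sketched with the standard library. This is only an illustration, not the package's implementation: the module-level is_foreign function, BASE_VOWELS, TONE_MARKS and build_allowed are hypothetical names, and the choice to accept all basic Latin letters, digits, punctuation and whitespace is an assumption (the outputs shown above only confirm that whitespace and letters such as "F" are accepted).

```python
import string
import unicodedata

# The twelve Vietnamese vowel letters; normalized so each is one code point.
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# No tone, grave, acute, hook above, tilde, dot below (combining marks).
TONE_MARKS = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def build_allowed() -> frozenset:
    """Collect every character this sketch treats as Vietnamese."""
    letters = set(string.ascii_letters) | {"đ", "Đ"}
    for vowel in BASE_VOWELS + BASE_VOWELS.upper():
        for tone in TONE_MARKS:
            # NFC turns vowel + tone mark into a single precomposed letter.
            letters.add(unicodedata.normalize("NFC", vowel + tone))
    # Assumption: digits, punctuation and whitespace are not "foreign".
    extras = set(string.digits + string.punctuation + string.whitespace)
    return frozenset(letters | extras)


ALLOWED = build_allowed()


def is_foreign(text: str) -> bool:
    """True if any character falls outside the allowed set."""
    return any(ch not in ALLOWED for ch in unicodedata.normalize("NFC", text))


print(is_foreign("Российская Федерация\tRossiyskaya Federatsiya"))  # True
print(is_foreign("Tôi nói tiếng Việt Nam\t"))                        # False
```

Normalizing the input to NFC first means a letter typed as a base vowel plus combining marks is compared in the same form as the precomposed letters in the allowed set.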
There are also four useful normalizers (from Hugging Face's tokenizers library):
- uncased_vi_normalizer
- cased_vi_normalizer
- uncased_vina_normalizer
- cased_vina_normalizer
They help
- Make sure that similar-looking characters are not only similar but also exactly the same
- (In the case of (un)cased_vina_normalizer) Remove diacritics
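Both effects can be illustrated with the standard library alone. The sketch below is not how the package builds its normalizers (those come from the tokenizers library); unify and remove_diacritics are hypothetical helper names, and using Unicode NFC/NFD normalization for this purpose is an assumption about what "exactly the same" means here.

```python
import unicodedata


def unify(text: str, lowercase: bool = False) -> str:
    """Bring look-alike character sequences to one canonical form (NFC)."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def remove_diacritics(text: str) -> str:
    """Decompose letters, drop combining marks, and map đ/Đ by hand."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")


composed = "\u1ebf"            # ế as a single code point
decomposed = "e\u0302\u0301"   # e + combining circumflex + combining acute

print(composed == decomposed)                # False: same look, different code points
print(unify(composed) == unify(decomposed))  # True
print(remove_diacritics("Trường đại học xã hội và nhân văn"))
# Truong dai hoc xa hoi va nhan van
```

The letter đ needs explicit handling because Unicode defines no decomposition for it, so stripping combining marks alone would leave it untouched.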
Evaluation and Data
vina2vi/metrics/evaluate_models.py gives a quick overview of the models' performance.
$ python evaluate_models.py
                                 mean  median
baseline                       0.7862  0.7813
bigram                         0.8405  0.8399
crf_trungtv                    0.8591  0.8561
tf_tuto_transformer (cased)    0.8691  0.8758
tf_tuto_transformer (uncased)  0.8918  0.9101
where mean and median stand for the mean and median similarities calculated on a test dataset (described below).
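How such a mean/median summary can be computed is sketched below. The similarity measure used by evaluate_models.py is not stated in this README, so difflib's SequenceMatcher ratio serves here purely as a stand-in; the similarity function and the two sentence pairs are illustrative assumptions, not the package's metric or test data.

```python
from difflib import SequenceMatcher
from statistics import mean, median


def similarity(prediction: str, reference: str) -> float:
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, prediction, reference).ratio()


# Illustrative pairs only: (model output, ground truth with diacritics).
references = ["trường đại học", "tôi đi công viên"]
predictions = ["trường đãi hôc", "tôi đi cóng viện"]

scores = [similarity(p, r) for p, r in zip(predictions, references)]
print(f"mean={mean(scores):.4f} median={median(scores):.4f}")
```

Reporting the median alongside the mean helps show whether a few very bad sentences are dragging the average down.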
vina2vi/data/: Different types of text are collected in a balanced way: modern literature, lyrics, classics, etc.
$ wc -c vina2vi/data/*.txt
3576 vina2vi/data/canh_dong_bat_tan.txt
2261 vina2vi/data/cho_toi_xin_mot_ve_di_tuoi_tho.txt
1775 vina2vi/data/chuyen_tinh_nguoi_trinh_nu_ten_thi.txt
3174 vina2vi/data/tron_tim.txt
1613 vina2vi/data/truyen_kieu.txt
12399 total