vina2vi

vina2vi stands for "Vietnamese no accent to Vietnamese".
It is a Python package that aims to help foreigners decipher Vietnamese messages.
(More precisely, it targets foreigners who already know the basics of the language.)

Among other things, this Python package aims to

  • Restore Vietnamese diacritics
  • Translate acronyms, đổi vần, etc.
  • Correct spelling

Installation

I developed the package with Python 3.10, but I think any Python version >= 3.8 should work.

Run the following to install vina2vi:

pip install vina2vi

Alternatively, one can also install the latest version from GitLab as follows.

pip install git+https://gitlab.com/phunc20/vina2vi

Usage

I only work on this project in my spare time, and I work slowly. This README is likely to change quickly and substantially, so please make sure the README you are reading matches the version you have installed. For the moment, the package does not contain much that is very useful. I will add more over time.
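Since the README changes from version to version, it can help to check which version is installed before following the examples. The snippet below is a minimal sketch that uses only the Python standard library; the helper name installed_version is made up for illustration and is not part of vina2vi.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The README states that Python >= 3.8 should be fine.
python_ok = sys.version_info >= (3, 8)
print("Python >= 3.8:", python_ok)

# Prints None when vina2vi is not installed in the current environment.
print("vina2vi version:", installed_version("vina2vi"))
```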

I roughly classify the code into the following categories:

  • models
  • metrics
  • util

vina2vi/models/

For example, one can try out one of the models:

In [1]: from vina2vi.models.tf.tuto_transformer import Tokenizers, Translator, Transformer

In [2]: tokenizers = Tokenizers.from_pretrained()

In [3]: input_vocab_size = tokenizers.vina.get_vocab_size().numpy().tolist()

In [4]: target_vocab_size = tokenizers.vi.get_vocab_size().numpy().tolist()

In [5]: transformer = Transformer.from_pretrained(input_vocab_size=input_vocab_size, target_vocab_size=target_vocab_size)

In [6]: translator = Translator(tokenizers, transformer)

In [7]: translator.translate("Sang nay toi dan con toi di cong vien choi.")
Out[7]: 'sang này tôi đàn con tôi đi công viên chơi .'

In [8]: translator.translate("Truong dai hoc xa hoi va nhan van")
Out[8]: 'trường đại học xã hội và nhân văn'
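A quick sanity check for outputs like the ones above is to strip the diacritics from the prediction and confirm that it reduces to the original input. The sketch below is not part of vina2vi; the helpers strip_diacritics and same_base_letters are hypothetical names, and the approach (Unicode NFD decomposition, then dropping combining marks) is a generic technique that I assume is adequate for Vietnamese text. The letter đ has no Unicode decomposition, so it is handled separately.

```python
import unicodedata


def strip_diacritics(text):
    """Remove Vietnamese diacritics, e.g. 'trường' -> 'truong'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # đ/Đ are standalone letters without a decomposition.
    return without_marks.replace("đ", "d").replace("Đ", "D")


def same_base_letters(source, prediction):
    """Check that a prediction only adds diacritics to the source.

    Case and whitespace are ignored, because the model output above is
    lowercased and puts a space before the final period.
    """
    def canonical(text):
        return "".join(strip_diacritics(text).lower().split())

    return canonical(source) == canonical(prediction)


print(strip_diacritics("trường đại học xã hội và nhân văn"))
print(same_base_letters(
    "Sang nay toi dan con toi di cong vien choi.",
    "sang này tôi đàn con tôi đi công viên chơi .",
))
```

Note that this check only confirms that the base letters were preserved; it says nothing about whether the restored diacritics are the right ones.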

vina2vi/util.py

There is a utility function, is_foreign, that tells whether a string contains non-Vietnamese characters. As the name suggests,

  • If the string contains characters outside the modern Vietnamese alphabet, then is_foreign returns True
  • If the string consists exclusively of characters from the modern Vietnamese alphabet, then is_foreign returns False
    • Languages whose alphabets are a subset of the Vietnamese one are thus treated as Vietnamese
    • Currently, we do not consider chữ Nôm as Vietnamese; maybe we will in the future

In [1]: from vina2vi.util import Vietnamese

In [2]: Vietnamese.is_foreign("Российская Федерация\tRossiyskaya Federatsiya")
Out[2]: True

In [3]: Vietnamese.is_foreign("\n\tRossiyskaya Federatsiya")
Out[3]: False

In [4]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t碎呐㗂越南")
Out[4]: True

In [5]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t")
Out[5]: False
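To make the behaviour above more concrete, here is a minimal sketch of how such a check could be written. It is not the implementation used in vina2vi. It rests on several assumptions of mine: all 26 basic Latin letters are allowed (so that "Rossiyskaya Federatsiya" is not foreign, as in the session above), only alphabetic characters are inspected (whitespace, digits and punctuation are ignored), and the names is_foreign_sketch and build_allowed_letters are hypothetical.

```python
import string
import unicodedata

# The 12 Vietnamese vowel letters, normalized so each is a single code point.
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# No tone, grave, acute, hook above, tilde, dot below (combining marks).
TONE_MARKS = ["", "\u0300", "\u0301", "\u0309", "\u0303", "\u0323"]


def build_allowed_letters():
    """Build the set of lowercase letters treated as Vietnamese (assumption)."""
    letters = set(string.ascii_lowercase) | {"đ"}
    for vowel in BASE_VOWELS:
        for mark in TONE_MARKS:
            # NFC composes vowel + tone mark into one precomposed character.
            letters.add(unicodedata.normalize("NFC", vowel + mark))
    return letters


ALLOWED_LETTERS = build_allowed_letters()


def is_foreign_sketch(text):
    """Return True if any alphabetic character falls outside ALLOWED_LETTERS."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return any(ch.isalpha() and ch not in ALLOWED_LETTERS for ch in normalized)


print(is_foreign_sketch("Tôi nói tiếng Việt Nam\t碎呐㗂越南"))
print(is_foreign_sketch("Tôi nói tiếng Việt Nam\t"))
```

Building the set from base vowels and tone marks avoids typing out every accented letter by hand, at the cost of relying on Unicode NFC composition.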
