vina2vi

vina2vi stands for "Vietnamese no accent to Vietnamese".
It is a Python package that aims to help foreigners decipher Vietnamese messages.
(More precisely, it targets foreigners who already know the basics of the language.)

Among other things, the package aims to:

  • Restore Vietnamese diacritics
  • Translate acronyms, đổi vần, etc.
  • Correct spelling

Installation

I developed the package with Python 3.10, but I expect any Python version >= 3.8 to work.

Run the following to install vina2vi:

pip install vina2vi

Alternatively, install the latest version from GitLab:

pip install git+https://gitlab.com/phunc20/vina2vi

Usage

I only work on this project in my spare time, and I work slowly. This README is likely to change quickly and substantially, so please make sure the README you are reading matches the version you have installed. For the moment, the package does not contain much that is very useful; I will add more over time.
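Because the README tracks a specific release, it can help to check which version is installed before following it. Below is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8). The helper name installed_version is hypothetical and not part of vina2vi.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The README expects Python >= 3.8 to work; this only reports the fact.
    print("Python >= 3.8:", sys.version_info >= (3, 8))
    print("vina2vi version:", installed_version("vina2vi"))
```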

I roughly classify the code into the following categories:

  • models
  • metrics
  • util

vina2vi/models/

For example, one can try out one of the models:

In [1]: from vina2vi.models.tf.tuto_transformer import Tokenizers, Translator, Transformer

In [2]: tokenizers = Tokenizers.from_pretrained()

In [3]: input_vocab_size = tokenizers.vina.get_vocab_size().numpy().tolist()

In [4]: target_vocab_size = tokenizers.vi.get_vocab_size().numpy().tolist()

In [5]: transformer = Transformer.from_pretrained(input_vocab_size=input_vocab_size, target_vocab_size=target_vocab_size)

In [6]: translator = Translator(tokenizers, transformer)

In [7]: translator.translate("Sang nay toi dan con toi di cong vien choi.")
Out[7]: 'sang này tôi đàn con tôi đi công viên chơi .'

In [8]: translator.translate("Truong dai hoc xa hoi va nhan van")
Out[8]: 'trường đại học xã hội và nhân văn'
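
To try the model on your own sentences, you need unaccented input and some way to compare the output with the original. The sketch below is not part of vina2vi: remove_diacritics and word_accuracy are hypothetical helpers built on the standard library only. It strips diacritics via Unicode decomposition and scores a prediction word by word, ignoring case because the outputs above are lowercased.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Turn accented Vietnamese into its unaccented form."""
    # đ/Đ have no Unicode decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category "Mn") and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of reference words reproduced at the same position."""
    pred = unicodedata.normalize("NFC", predicted).lower().split()
    ref = unicodedata.normalize("NFC", reference).lower().split()
    if not ref:
        return 0.0
    hits = sum(p == r for p, r in zip(pred, ref))
    return hits / len(ref)


reference = "Trường đại học xã hội và nhân văn"
print(remove_diacritics(reference))  # Truong dai hoc xa hoi va nhan van
print(word_accuracy("trường đại học xã hội và nhân văn", reference))  # 1.0
```

The position-wise comparison assumes both strings split into the same number of words. In Out[7] the final period is detached from the last word, so punctuation may need to be split off or stripped before scoring.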

vina2vi/util.py

There is a utility function, is_foreign, that tells whether a string contains non-Vietnamese characters. As the name suggests,

  • If the string contains characters outside the modern Vietnamese alphabet, then is_foreign returns True
  • If the string consists exclusively of characters from the modern Vietnamese alphabet, then is_foreign returns False
    • Languages whose alphabets are a subset of the Vietnamese alphabet are thus treated as Vietnamese
    • Currently, we do not consider chữ Nôm as Vietnamese; maybe we will in the future

In [1]: from vina2vi.util import Vietnamese

In [2]: Vietnamese.is_foreign("Российская Федерация\tRossiyskaya Federatsiya")
Out[2]: True

In [3]: Vietnamese.is_foreign("\n\tRossiyskaya Federatsiya")
Out[3]: False

In [4]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t碎呐㗂越南")
Out[4]: True

In [5]: Vietnamese.is_foreign("Tôi nói tiếng Việt Nam\t")
Out[5]: False
