A utility library for processing Vietnamese text

chiecthuyenngoaixa

chiecthuyenngoaixa is a Python library that provides functions and classes for common tasks in processing Vietnamese text, such as removing diacritics, converting numbers to words, sorting strings, validation, and more.

This library is written in pure Python and has no dependencies. Python 3.8 and above are supported.

Installation

chiecthuyenngoaixa is available on PyPI. Open a terminal (or Command Prompt on Windows) and run the following command:

pip install chiecthuyenngoaixa

If you are using Poetry, use this instead:

poetry add chiecthuyenngoaixa

Basic usage

Once installed, the library is available as the ctnx module (an abbreviation of chiecthuyenngoaixa).

Some commonly used functions and classes can be imported directly. For example:

  • To convert Vietnamese text to ASCII-only text:
>>> from ctnx import remove_diacritics
>>> remove_diacritics("Đàn ong thấy cái lon thì bu vào.")
'Dan ong thay cai lon thi bu vao.'
  • To convert a number to Vietnamese text:
>>> from ctnx import num_to_words
>>> num_to_words(123456789021003.45)
'một trăm hai mươi ba nghìn bốn trăm năm mươi sáu tỉ bảy trăm tám mươi chín triệu không trăm hai mươi mốt nghìn không trăm linh ba phẩy bốn mươi lăm'
  • To sort Vietnamese strings:
>>> from ctnx import ViSortKey
>>> lines = ['Hà Nam', 'Hải Dương', 'Hà Nội', 'Hà Tĩnh', 'Hải Phòng', 'Hậu Giang', 'Hoà Bình', 'Hưng Yên', 'Hạ Long', 'Hà Giang', 'Điện Biên']
>>> sorted(lines, key=ViSortKey)
['Điện Biên', 'Hà Giang', 'Hà Nam', 'Hà Nội', 'Hà Tĩnh', 'Hải Dương', 'Hải Phòng', 'Hạ Long', 'Hậu Giang', 'Hoà Bình', 'Hưng Yên']
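
To give a feel for what diacritic removal involves, here is a minimal standard-library sketch. It is not taken from the library's source: the name strip_vietnamese_diacritics is hypothetical, and the approach (Unicode NFD decomposition plus a hand-written mapping for Đ/đ, which has no canonical decomposition) is only one plausible way to do it.

```python
import unicodedata


# Hypothetical helper for illustration only; not part of ctnx.
def strip_vietnamese_diacritics(text: str) -> str:
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Đ/đ are standalone letters that NFD leaves untouched, so map them by hand.
    return stripped.replace("Đ", "D").replace("đ", "d")


print(strip_vietnamese_diacritics("Đàn ong thấy cái lon thì bu vào."))
# Dan ong thay cai lon thi bu vao.
```

The explicit Đ/đ step is the part that generic accent-stripping recipes tend to miss, which is one reason a dedicated function such as remove_diacritics is convenient.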

Other functions and classes live in separate submodules. For example:

  • To normalize Vietnamese text written with confusable (look-alike) characters back to ordinary text:
>>> from ctnx.misc import normalize_confusables
>>> normalize_confusables("𝕮𝖍𝖎ế𝖈 𝖙𝖍𝖚𝖞ề𝖓 𝖓𝖌𝖔à𝖎 𝖝𝖆")
'Chiếc thuyền ngoài xa'
  • To validate a Vietnamese citizen identity card (Căn cước công dân) number and extract information from it:
>>> from ctnx import validation
>>> validation.is_valid_cccd("024192123456")
True
>>> validation.parse_cccd("024192123456")
CccdResult(id='123456', is_male=False, birth_year=1992, birth_country='vn', birth_province='Bắc Giang')
  • To extract tones from a Vietnamese syllable or text:
>>> from ctnx.misc import separate_tone
>>> separate_tone("Đẩu")
('Đâu', '?')
>>> tone_names = {'': 'thanh', '/': 'sắc', '\\': 'huyền', '?': 'hỏi', '~': 'ngã', '.': 'nặng'}
>>> ' '.join(tone_names[separate_tone(syll)[1]] for syll in "Tôi thầm cảm ơn Đẩu đã giữ mình ở nán lại".split(' '))
'thanh huyền hỏi thanh hỏi ngã ngã huyền hỏi sắc nặng'
  • To manipulate Vietnamese syllables:
>>> from ctnx.syllable import Syllable
>>> text = "ba ngày một trận nhẹ năm ngày một trận nặng"
>>> a = [Syllable.from_string(x) for x in text.split(' ')]
>>> a
[Syllable(b, a, ), Syllable(ng, ay, , \), Syllable(m, ô, t, .), Syllable(tr, â, n, .), Syllable(nh, e, , .), Syllable(n, ă, m), Syllable(ng, ay, , \), Syllable(m, ô, t, .), Syllable(tr, â, n, .), Syllable(n, ă, ng, .)]
>>> for syll in a:
...     syll.onset = 'nh'
...
>>> a
[Syllable(nh, a, ), Syllable(nh, ay, , \), Syllable(nh, ô, t, .), Syllable(nh, â, n, .), Syllable(nh, e, , .), Syllable(nh, ă, m), Syllable(nh, ay, , \), Syllable(nh, ô, t, .), Syllable(nh, â, n, .), Syllable(nh, ă, ng, .)]
>>> ' '.join(str(x) for x in a)
'nha nhày nhột nhận nhẹ nhăm nhày nhột nhận nhặng'
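
The tone-extraction example above can be approximated with the standard library alone. The sketch below is an assumption about one way to do it, not the implementation of separate_tone: the helper name split_tone and the TONE_MARKS table are hypothetical, and the tone symbols simply mirror those shown in the example.

```python
import unicodedata
from typing import Tuple

# Combining tone marks mapped to the symbols used in the example above.
TONE_MARKS = {
    "\u0301": "/",   # acute: sắc
    "\u0300": "\\",  # grave: huyền
    "\u0309": "?",   # hook above: hỏi
    "\u0303": "~",   # tilde: ngã
    "\u0323": ".",   # dot below: nặng
}


# Hypothetical helper for illustration only; not part of ctnx.
def split_tone(syllable: str) -> Tuple[str, str]:
    tone = ""
    kept = []
    # NFD exposes the tone mark as a separate combining character.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC re-attaches vowel-quality marks (as in â, ê, ô, ơ, ư, ă).
    return unicodedata.normalize("NFC", "".join(kept)), tone


print(split_tone("Đẩu"))
```

Only the five tone marks are removed; the circumflex, breve, and horn are part of the vowel itself, so they survive the round trip through NFD and NFC.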

For more usage examples, see the documentation hosted at chiecthuyenngoaixa.readthedocs.io.
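
As a side note on the citizen ID example above, the fields it reports follow from the fixed layout of a CCCD number. The sketch below assumes the publicly documented 12-digit layout (three-digit province or country code, one digit encoding century and gender, two-digit birth year, six-digit serial). The names CccdFields and split_cccd are hypothetical; unlike the library's parse_cccd, this sketch does not look up province names or check that the province code exists.

```python
from typing import NamedTuple, Optional


class CccdFields(NamedTuple):  # hypothetical type, not the library's CccdResult
    province_code: str
    is_male: bool
    birth_year: int
    serial: str


# Hypothetical helper for illustration only; not part of ctnx.
def split_cccd(number: str) -> Optional[CccdFields]:
    # A CCCD number is exactly 12 ASCII digits.
    if len(number) != 12 or not (number.isascii() and number.isdigit()):
        return None
    # Fourth digit: even means male, odd means female;
    # 0-1 is the 1900s, 2-3 the 2000s, and so on.
    century_gender = int(number[3])
    return CccdFields(
        province_code=number[:3],
        is_male=century_gender % 2 == 0,
        birth_year=1900 + 100 * (century_gender // 2) + int(number[4:6]),
        serial=number[6:],
    )


print(split_cccd("024192123456"))
```

For the sample number, the fourth digit 1 gives a female holder born in the 1900s, and the following 92 gives the birth year 1992, which agrees with the parse_cccd output shown earlier.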

