
A Vietnamese tokenizer package based on deep learning

VieTokenizer

The model is a simple BiLSTM network trained with supervised learning on a large pre-segmented dataset. For each syllable, it predicts 1 if the syllable continues the word begun by the preceding syllable and 0 otherwise. For example, "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1], which yields the segmentation "Tôi tên là Nguyễn_Tiến_Đạt".
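The labeling scheme can be illustrated with a minimal sketch. The helper names `labels_to_segmented` and `segmented_to_labels` are hypothetical and are not part of the vietokenizer API; the sketch assumes only that a label of 1 attaches a syllable to the previous one with an underscore, as the example above suggests.

```python
from typing import List


def labels_to_segmented(syllables: List[str], labels: List[int]) -> str:
    """Join syllables into words: label 1 attaches a syllable to the previous one."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    words: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            # Continue the current word.
            words[-1] += "_" + syllable
        else:
            # Start a new word.
            words.append(syllable)
    return " ".join(words)


def segmented_to_labels(segmented: str) -> List[int]:
    """Recover the 0/1 labels from an underscore-joined segmentation."""
    labels: List[int] = []
    for word in segmented.split():
        parts = word.split("_")
        # First syllable of a word gets 0, every following syllable gets 1.
        labels.extend([0] + [1] * (len(parts) - 1))
    return labels


example = "Tôi tên là Nguyễn Tiến Đạt"
result = labels_to_segmented(example.split(), [0, 0, 0, 0, 1, 1])
print(result)  # Tôi tên là Nguyễn_Tiến_Đạt
```

The inverse helper shows how a pre-segmented corpus could be turned into training labels under the same assumption.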

Installation 🎉

  • This repository is tested on Python 3.7+ and TensorFlow 2.8+.
  • VieTokenizer can be installed using pip as follows:
pip install vietokenizer
  • VieTokenizer can also be installed from source with the following commands:
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e . 

Usage 🔥

>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
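Since the tokenizer returns a single string with words separated by spaces and multi-syllable words joined by underscores, downstream code usually needs a list of words. The following is a minimal sketch of that post-processing; `to_words` is a hypothetical helper, and the input is copied from the first output above so the snippet runs without the package installed.

```python
from typing import List


def to_words(segmented: str) -> List[str]:
    """Split a segmented string into words, restoring spaces inside compounds."""
    return [token.replace("_", " ") for token in segmented.split()]


# Output of the first call above, copied verbatim.
segmented = (
    "Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên "
    "Đại_học CN GTVT tại Hà_Nội ."
)

words = to_words(segmented)
compounds = [token for token in segmented.split() if "_" in token]

print(words)
print(compounds)
```

Note that punctuation marks come back as separate tokens, so they appear as their own entries in `words`.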

License

Apache 2.0 License.
Copyright © 2022 Nguyễn Tiến Đạt. All rights reserved.
