Vietnamese Tokenizer package based on a deep learning method

Project description

VieTokenizer

This is a new package that supports Vietnamese word segmentation based on a deep learning method. The model is a simple bi-LSTM network trained on a pre-labeled dataset. For example, the raw input "Tôi tên là Nguyễn Tiến Đạt" is paired with the segmented label "Tôi tên là Nguyễn_Tiến_Đạt". For each word, the model predicts 1 if the word should be joined to the preceding word and 0 otherwise, so "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1].
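The 0/1 labeling scheme above can be illustrated with a small helper that turns a word list and its predicted labels back into segmented text. This is a minimal sketch of the decoding step, not code from the package itself; `merge_labels` is a hypothetical name:

```python
def merge_labels(words, labels):
    """Join each word labeled 1 to the previous word with an underscore."""
    out = []
    for word, label in zip(words, labels):
        if label == 1 and out:
            out[-1] = out[-1] + "_" + word
        else:
            out.append(word)
    return " ".join(out)

words = "Tôi tên là Nguyễn Tiến Đạt".split()
labels = [0, 0, 0, 0, 1, 1]
print(merge_labels(words, labels))  # Tôi tên là Nguyễn_Tiến_Đạt
```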

Installation 🎉

  • This repository is tested on Python 3.7+ and TensorFlow 2.8+.
  • VieTokenizer can be installed using pip as follows: 🍰
pip install vietokenizer
  • VieTokenizer can also be installed from source with the following commands:
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .

Usage 🔥

>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim loại nặng_thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
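Because multi-word tokens come back joined by underscores, the segmented output is easy to post-process with plain string operations. A minimal sketch, using the first example output above:

```python
segmented = 'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'

# Split on whitespace: each element is one token (single- or multi-word).
tokens = segmented.split()

# Multi-word tokens are exactly those containing an underscore.
multiword = [t for t in tokens if "_" in t]
print(multiword)  # ['Nguyễn_Tiến_Đạt', 'sinh_viên', 'Đại_học', 'Hà_Nội']

# The original surface form can be recovered by replacing underscores with spaces.
print(multiword[0].replace("_", " "))  # Nguyễn Tiến Đạt
```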

License

Apache 2.0 License.
Copyright © 2022 Nguyễn Tiến Đạt. All rights reserved.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vietokenizer-1.0.1.tar.gz (8.0 kB)

Uploaded Source

Built Distribution

vietokenizer-1.0.1-py3-none-any.whl (9.0 kB)

Uploaded Python 3

File details

Details for the file vietokenizer-1.0.1.tar.gz.

File metadata

  • Download URL: vietokenizer-1.0.1.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.7

File hashes

Hashes for vietokenizer-1.0.1.tar.gz
Algorithm Hash digest
SHA256 8595a4833f095f01d3898d03a4f5669ac39cf528509a9a8f56f01b244714a184
MD5 c9126aee39f529375fe51bbf102d9c3d
BLAKE2b-256 ae7c7764df6f8c67cea26d61bc8d8206ac7972fad5b6871947dae6fb70e26802

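To check a downloaded file against the hashes above, you can compute its digest locally with the standard library. A minimal sketch; the file path assumes the sdist has been downloaded to the current directory:

```python
import hashlib

def sha256_of(path):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected SHA256 from the table above.
expected = "8595a4833f095f01d3898d03a4f5669ac39cf528509a9a8f56f01b244714a184"

# Uncomment after downloading the sdist:
# assert sha256_of("vietokenizer-1.0.1.tar.gz") == expected
```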

File details

Details for the file vietokenizer-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for vietokenizer-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 b71c8839c00ecc5e675f3a67dd7b6297ec2f62376eb381dc22902eb11d6231b1
MD5 e973c6c3df8b4251daa9e5fe339eb94f
BLAKE2b-256 5e5dea5685206fc43edec6f748cd0b47f445d21042631248d70f4b7a379f1266

