Vietnamese Tokenizer package based on deep learning methods
VieTokenizer
This is a new package for Vietnamese word segmentation based on deep learning. The model is a simple BiLSTM network trained on a pre-labeled dataset in which each raw sentence is paired with its segmented form, for example the input "Tôi tên là Nguyễn Tiến Đạt" and the target "Tôi tên là Nguyễn_Tiến_Đạt". For each syllable, the model predicts 1 if the syllable is joined to the preceding one and 0 otherwise, so "Tôi tên là Nguyễn_Tiến_Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1].
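The labeling scheme can be sketched in a few lines of Python. This only illustrates the encoding described above and is not the package's actual preprocessing code; the helper names `to_labels` and `from_labels` are hypothetical.

```python
def to_labels(segmented):
    """Turn underscore-joined text into (syllables, labels).

    Label 0 marks a syllable that starts a word; label 1 marks a
    syllable that is joined to the one before it.
    """
    syllables, labels = [], []
    for word in segmented.split():
        parts = word.split("_")
        syllables.extend(parts)
        # First syllable opens the word, the rest continue it.
        labels.extend([0] + [1] * (len(parts) - 1))
    return syllables, labels


def from_labels(syllables, labels):
    """Rebuild underscore-joined text from syllables and 0/1 labels."""
    words = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            words[-1] += "_" + syllable
        else:
            words.append(syllable)
    return " ".join(words)


syllables, labels = to_labels("Tôi tên là Nguyễn_Tiến_Đạt")
print(labels)                          # [0, 0, 0, 0, 1, 1]
print(from_labels(syllables, labels))  # Tôi tên là Nguyễn_Tiến_Đạt
```

Under this reading, segmentation becomes a per-syllable binary classification task, which is the kind of sequence labeling a BiLSTM handles directly.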
Installation 🎉
- This repository is tested on Python 3.7+ and TensorFlow 2.8+.
- VieTokenizer can be installed using pip as follows:
pip install vietokenizer 🍰
- VieTokenizer can also be installed from source with the following commands:
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .
Usage 🔥
>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim loại nặng_thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
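The tokenizer returns one string in which words are separated by spaces and the syllables of a multi-syllable word are joined by underscores. If a list of words is more convenient downstream, the string can be split afterwards. A minimal sketch, using the first output string shown above instead of calling the package; `split_words` is a hypothetical helper and not part of vietokenizer:

```python
def split_words(segmented, keep_underscores=False):
    """Split tokenizer output into a list of words.

    With keep_underscores=False, the underscores inside multi-syllable
    words are turned back into spaces.
    """
    words = segmented.split()
    if keep_underscores:
        return words
    return [word.replace("_", " ") for word in words]


# Output string copied from the usage example above.
output = 'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
words = split_words(output)
print(words[3])    # Nguyễn Tiến Đạt
print(len(words))  # 14
```

Punctuation marks come back as separate items because the tokenizer output already puts spaces around them.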
License
Apache 2.0 License.
Copyright © 2022 Nguyễn Tiến Đạt. All rights reserved.