A Vietnamese tokenizer package based on a deep learning method
VieTokenizer
The model is a simple BiLSTM network trained with supervised learning on a large pre-segmented dataset. For each syllable, it predicts 1 if the syllable continues the word started by the preceding syllable, and 0 otherwise. For example, "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1]: "Tiến" and "Đạt" are labeled 1 because they join "Nguyễn" to form the single word "Nguyễn_Tiến_Đạt".
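As a rough illustration of how such a label sequence maps back to segmented text, here is a minimal sketch. The helper name `merge_syllables` is hypothetical and is not part of the VieTokenizer API; it only assumes the labeling convention described above (1 means "attach to the previous syllable").

```python
from typing import List


def merge_syllables(syllables: List[str], labels: List[int]) -> str:
    """Join syllables into words using 0/1 labels (hypothetical helper).

    A label of 1 attaches the syllable to the previous one with "_";
    a label of 0 starts a new word.
    """
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")

    words: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            # Continue the current word.
            words[-1] += "_" + syllable
        else:
            # Start a new word.
            words.append(syllable)
    return " ".join(words)


sentence = "Tôi tên là Nguyễn Tiến Đạt"
labels = [0, 0, 0, 0, 1, 1]
segmented = merge_syllables(sentence.split(), labels)
print(segmented)  # Tôi tên là Nguyễn_Tiến_Đạt
```

The package itself may post-process predictions differently (for instance, when separating punctuation); this sketch covers only the labeling scheme.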
Installation 🎉
- This repository has been tested on Python 3.7+ and TensorFlow 2.8+.
- VieTokenizer can be installed using pip as follows:
pip install vietokenizer
- VieTokenizer can also be installed from source with the following commands:
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .
Usage 🔥
>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
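The returned string separates words with spaces and joins the syllables of a multi-syllable word with underscores. If a list of words is more convenient downstream, one possible approach (a sketch that uses only plain string operations on the output shown above, not any additional VieTokenizer function) is:

```python
# Output string copied from the first usage example above.
segmented = (
    "Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên "
    "Đại_học CN GTVT tại Hà_Nội ."
)

# Each whitespace-separated token is one word or punctuation mark.
tokens = segmented.split()

# Restore the spaces inside multi-syllable words for display.
words = [token.replace("_", " ") for token in tokens]

# Keep only the words made of more than one syllable.
multi_syllable = [word for word in words if " " in word]
print(multi_syllable)
```

Note that replacing every underscore assumes the input text contained no literal underscores of its own.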
License
Apache 2.0 License.
Copyright © 2022 Nguyễn Tiến Đạt. All rights reserved.