Vietnamese Tokenizer package based on deep learning method
VieTokenizer
The model is a simple bidirectional LSTM (Bi-LSTM) network trained on a large pre-segmented dataset. For each syllable in the input, it predicts 1 if the syllable continues the word started by the preceding syllable and 0 if it begins a new word. For example, "Tôi tên là Nguyễn Tiến Đạt" corresponds to the label sequence [0, 0, 0, 0, 1, 1]: "Tiến" and "Đạt" attach to "Nguyễn", so the three syllables form the single word "Nguyễn_Tiến_Đạt".
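The labeling scheme can be illustrated without the model itself. The sketch below is not taken from the VieTokenizer source; the function names `labels_from_segmented` and `join_syllables` are hypothetical, and the code assumes only the convention described above (0 starts a new word, 1 continues the previous one, and joined syllables are written with underscores).

```python
from typing import List, Tuple


def labels_from_segmented(text: str) -> Tuple[List[str], List[int]]:
    """Split underscore-segmented text into syllables and 0/1 labels.

    Hypothetical helper: the first syllable of each word gets 0,
    every following syllable of the same word gets 1.
    """
    syllables: List[str] = []
    labels: List[int] = []
    for word in text.split():
        for position, syllable in enumerate(word.split("_")):
            syllables.append(syllable)
            labels.append(0 if position == 0 else 1)
    return syllables, labels


def join_syllables(syllables: List[str], labels: List[int]) -> str:
    """Rebuild segmented text from syllables and predicted 0/1 labels."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    words: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            # Label 1: attach this syllable to the word being built.
            words[-1] += "_" + syllable
        else:
            # Label 0 (or the very first syllable): start a new word.
            words.append(syllable)
    return " ".join(words)


syllables, labels = labels_from_segmented("Tôi tên là Nguyễn_Tiến_Đạt")
print(labels)                             # [0, 0, 0, 0, 1, 1]
print(join_syllables(syllables, labels))  # Tôi tên là Nguyễn_Tiến_Đạt
```

The first function shows how a pre-segmented corpus could yield training labels, and the second shows how a predicted label sequence maps back to segmented text.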
Installation 🎉
- This repository is tested on Python 3.7+ and TensorFlow 2.8+.
- VieTokenizer can be installed using pip as follows:
pip install vietokenizer
- VieTokenizer can also be installed from source with the following commands:
git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .
Usage 🔥
>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
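The tokenizer returns one string in which the syllables of a multi-syllable word are joined by underscores and punctuation is separated by spaces. A minimal sketch of turning that string into a list of words follows; `to_words` is a hypothetical helper, not part of the package, and it assumes the output format shown in the examples above.

```python
from typing import List


def to_words(segmented: str) -> List[str]:
    """Split tokenizer output into tokens, restoring spaces inside words."""
    return [token.replace("_", " ") for token in segmented.split()]


# Output string copied from the first usage example.
segmented = (
    "Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên "
    "Đại_học CN GTVT tại Hà_Nội ."
)
words = to_words(segmented)
print(words[3])    # Nguyễn Tiến Đạt
print(len(words))  # 14
```

Note that a literal underscore in the input text would also be turned into a space by this helper, so it suits plain prose better than identifiers or URLs.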
License
Apache 2.0 License.
Copyright © 2022 Nguyễn Tiến Đạt. All rights reserved.
Download files
File details
Details for the file vietokenizer-1.0.3.tar.gz.
File metadata
- Download URL: vietokenizer-1.0.3.tar.gz
- Upload date:
- Size: 8.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cd67607c4d9d903453bb8ecc47c5949b18dae06d5e0afb80613f6405fca3215 |
| MD5 | ed67009a9e97c868eb33f28343891361 |
| BLAKE2b-256 | 4932f72483004b65fae8efa6e2a9b248c235383ca5b0fee43d705ea7412ba8ae |
File details
Details for the file vietokenizer-1.0.3-py3-none-any.whl.
File metadata
- Download URL: vietokenizer-1.0.3-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a9d79d54c6a8b2993fc3660c05679e591726b42310309a32e230f676f565c3f5 |
| MD5 | 5439ecde9851e8758afef9f42ebbde25 |
| BLAKE2b-256 | b76ccde13f142dff5e822957aaa6236d54d381009690374ec0d1c79b008628d2 |
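A downloaded file can be checked against the SHA256 digests listed above using Python's standard `hashlib` module. The sketch below is only an illustration: the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the path passed in is whatever local path the file was saved to.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("vietokenizer-1.0.3.tar.gz", "9cd67607c4d9d903453bb8ecc47c5949b18dae06d5e0afb80613f6405fca3215")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.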