Vietnamese tokenization, preprocessing, and NLP models
Genz Tokenize
Installation:
pip install genz-tokenize
Using the tokenizer
>>> from genz_tokenize import Tokenize
# using the built-in vocab
>>> tokenize = Tokenize()
>>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
# {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}
>>> print(tokenize.decode([1, 770, 2]))
# <s> sinh_viên </s>
# using your own vocab files
>>> tokenize = Tokenize.fromFile('vocab.txt','bpe.codes')
Using the BERT tokenizer (inherits from the Transformers PreTrainedTokenizer)
>>> from genz_tokenize import TokenizeForBert
# using the built-in vocab
>>> tokenize = TokenizeForBert()
>>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
# {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}
# using your own vocab files
>>> tokenize = TokenizeForBert.fromFile('vocab.txt','bpe.codes')
Embedding matrix from FastText
>>> from genz_tokenize import get_embedding_matrix
>>> embedding_matrix = get_embedding_matrix()
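The returned matrix can seed a Keras embedding layer. A minimal sketch, assuming embedding_matrix is a NumPy array of shape (vocab_size, embedding_dim) aligned with the built-in vocab:
>>> import tensorflow as tf
>>> vocab_size, embedding_dim = embedding_matrix.shape
>>> embedding_layer = tf.keras.layers.Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False)  # keep the pretrained FastText vectors frozen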
Preprocessing data
>>> from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
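A minimal usage sketch, assuming each cleaning function takes and returns a string (the sample text is illustrative; vncore_tokenize, which word-segments via VnCoreNLP, is omitted here):
>>> text = 'Xin chào các bạn!!! 😀'
>>> text = convert_unicode(text)      # normalize Unicode composition
>>> text = remove_emoji(text)         # strip emoji
>>> text = remove_punctuations(text)  # drop punctuation
>>> print(text)
# Xin chào các bạn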
Models
1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT
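All of these are built from the same Config object; the Trainer section below walks through Seq2Seq and a RoBERTa classifier in full. A minimal instantiation sketch for TransformerClassification, assuming it follows the same Model(config) constructor pattern as Seq2Seq (num_class is an assumed attribute name, by analogy with RobertaConfig further down):
>>> from genz_tokenize.base_model.utils import Config
>>> from genz_tokenize.base_model.models import TransformerClassification
>>> config = Config()
>>> config.vocab_size = 100
>>> config.maxlen = 20
>>> config.num_class = 2  # assumed attribute name
>>> model = TransformerClassification(config)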
Trainer
>>> import tensorflow as tf
>>> from genz_tokenize.base_model.utils import Config
>>> from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
>>> from genz_tokenize.base_model.training import TrainArgument, Trainer
# create config with hyperparameters
>>> config = Config()
>>> config.vocab_size = 100
>>> config.target_vocab_size = 120
>>> config.units = 16
>>> config.maxlen = 20
# initialize the model
>>> model = Seq2Seq(config)
>>> x = tf.zeros(shape=(10, config.maxlen))
>>> y = tf.zeros(shape=(10, config.maxlen))
# create dataset
>>> BUFFER_SIZE = len(x)
>>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
>>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
>>> dataset_train = dataset_train.batch(2)
>>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)
>>> args = TrainArgument(batch_size=2, epochs=2)
>>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
>>> trainer.train()
BERT models
>>> from genz_tokenize.models.bert import DataCollection
>>> from genz_tokenize.models.bert.training import TrainArg, Trainner
>>> from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig
>>> import tensorflow as tf
>>> x = tf.zeros(shape=(10, 10), dtype=tf.int32)
>>> mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
>>> y = tf.zeros(shape=(10, 2), dtype=tf.int32)
>>> dataset = DataCollection(
        input_ids=x,
        attention_mask=mask,
        token_type_ids=None,
        dec_input_ids=None,
        dec_attention_mask=None,
        dec_token_type_ids=None,
        y=y
    )
>>> tf_dataset = dataset.to_tf_dataset(batch_size=2)
>>> config = RobertaConfig()
>>> config.num_class = 2
>>> model = RoBertaClassification(config)
>>> arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
>>> trainer = Trainner(model, arg, tf_dataset)
>>> trainer.train()