Vietnamese tokenization, preprocessing, and NLP models
Project description
Genz Tokenize
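The package is distributed on PyPI as genz-tokenize (see the distribution files below), so it can presumably be installed with the standard pip command, pip install genz-tokenize.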
Using the tokenizer
from genz_tokenize import Tokenize
# use the vocabulary bundled with the library
tokenize = Tokenize()
print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
# {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}
print(tokenize.decode([1, 770, 2]))
# <s> sinh_viên </s>
# or load the tokenizer from your own vocabulary and BPE codes files
tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')
Preprocessing data
from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
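A minimal usage sketch (assumption: each helper takes a raw string and returns the cleaned string; the sample sentence and the cleaning order are illustrative):

text = 'Xin chào!!! 😀 tôi là sinh viên công nghệ'
text = convert_unicode(text)       # assumed: normalize Vietnamese Unicode composition
text = remove_emoji(text)          # assumed: strip emoji characters
text = remove_punctuations(text)   # assumed: drop punctuation marks
# vncore_tokenize is assumed to perform Vietnamese word segmentation
# (e.g. 'sinh viên' -> 'sinh_viên') and may require VnCoreNLP to be set up
print(text)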
Models
1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT
Trainer
import tensorflow as tf

from genz_tokenize.base_model.utils import Config
from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
from genz_tokenize.base_model.training import TrainArgument, Trainer
# create the hyperparameter config
config = Config()
config.vocab_size = 100
config.target_vocab_size = 120
config.units = 16
config.maxlen = 20
# initialize the model
model = Seq2Seq(config)
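# dummy all-zero token-id sequences standing in for real (source, target) training data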
x = tf.zeros(shape=(10, config.maxlen))
y = tf.zeros(shape=(10, config.maxlen))
# create dataset
BUFFER_SIZE = len(x)
dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
dataset_train = dataset_train.shuffle(BUFFER_SIZE)
dataset_train = dataset_train.batch(2)
dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)
args = TrainArgument(batch_size=2, epochs=2)
trainer = Trainer(model=model, args=args, data_train=dataset_train)
trainer.train()
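The other base models are imported from the same module and presumably follow the same workflow; a sketch assuming Transformer accepts the same Config object and tf.data pipeline as Seq2Seq:

# reuse config, dataset_train and args from above with a different model (assumption)
model = Transformer(config)
trainer = Trainer(model=model, args=args, data_train=dataset_train)
trainer.train()

The BERT-style models come with their own data collator and trainer: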
from genz_tokenize.models.bert import DataCollection
from genz_tokenize.models.bert.training import TrainArg, Trainner
from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig
import tensorflow as tf
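# dummy inputs: token ids, attention mask, and one-hot labels for config.num_class = 2 classes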
x = tf.zeros(shape=(10, 10), dtype=tf.int32)
mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
y = tf.zeros(shape=(10, 2), dtype=tf.int32)
dataset = DataCollection(
input_ids=x,
attention_mask=mask,
token_type_ids=None,
dec_input_ids=None,
dec_attention_mask=None,
dec_token_type_ids=None,
y=y
)
tf_dataset = dataset.to_tf_dataset(batch_size=2)
config = RobertaConfig()
config.num_class = 2
model = RoBertaClassification(config)
arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
trainer = Trainner(model, arg, tf_dataset)
trainer.train()
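Since the models appear to be Keras models, the trained weights can presumably be saved and restored with the standard tf.keras weight API (this is an assumption, not documented here):

# assumption: the model is a tf.keras.Model subclass
model.save_weights('roberta_classification_ckpt')
restored = RoBertaClassification(config)
# for subclassed models, restoration may be deferred until the model is first called
restored.load_weights('roberta_classification_ckpt')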
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
genz-tokenize-1.2.7.tar.gz (544.8 kB, details below)
Built Distribution
genz_tokenize-1.2.7-py3-none-any.whl (552.5 kB, details below)
File details
Details for the file genz-tokenize-1.2.7.tar.gz.
File metadata
- Download URL: genz-tokenize-1.2.7.tar.gz
- Upload date:
- Size: 544.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 26c79eda6a71b382b7ceae23f226a0c5a7bdde4f1aed6e913e8da534a42dbdd0
MD5 | 293e211019372e2d0c1dc92bdd19ab74
BLAKE2b-256 | 9e761ee0f4c7eafcda8327c448c92f661de6131f97104c67dfe85c6616a6668b
File details
Details for the file genz_tokenize-1.2.7-py3-none-any.whl.
File metadata
- Download URL: genz_tokenize-1.2.7-py3-none-any.whl
- Upload date:
- Size: 552.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.8.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | f1e97a1b87ae05e2957205f3cab45d1737cdcbc7bf9ad2f98228c06cab3c5135
MD5 | 6908bc8ddbd4c44f1378b06c3e177025
BLAKE2b-256 | 7eba1d4fb3044e49fb385eba5a1263e960d2231f67afd50133db4d8e7f0d78d8