
Vietnamese tokenization, preprocessing and NLP models

Project description

Genz Tokenize

Installation:

pip install genz-tokenize
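
A quick check that the package imports after installation (a hypothetical shell session):

python -c "import genz_tokenize"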

Using the tokenizer

    >>> from genz_tokenize import Tokenize
    # using the vocabulary bundled with the library
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len = 10, padding = True, truncation = True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # using your own vocabulary files
    >>> tokenize = Tokenize.fromFile('vocab.txt','bpe.codes')
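
The encoder returns plain Python lists, so the output can be wrapped in tensors before being fed to the TensorFlow models below. A minimal sketch, assuming the result behaves like the dict shown in the example output above:

    >>> import tensorflow as tf
    >>> enc = tokenize('sinh_viên công_nghệ', 'hello', max_len=10, padding=True, truncation=True)
    # add a batch dimension of 1; the keys follow the example output above
    >>> input_ids = tf.constant([enc['input_ids']])
    >>> attention_mask = tf.constant([enc['attention_mask']])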

Using the BERT tokenizer, which inherits from Transformers' PreTrainedTokenizer

    >>> from genz_tokenize import TokenizeForBert
    # using the vocabulary bundled with the library
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length',truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # using your own vocabulary files
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt','bpe.codes')
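
Because TokenizeForBert inherits from Transformers' PreTrainedTokenizer, the standard keyword arguments of that base class should also be available. A sketch, assuming the usual return_tensors behaviour of the base class:

    >>> batch = tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True, return_tensors='tf')
    # batch['input_ids'] should now be a tf.Tensor of shape (2, 5)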

Embedding matrix from fastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
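
The matrix can be used as pretrained weights for a Keras Embedding layer. A minimal sketch, assuming embedding_matrix is a 2-D array of shape (vocab_size, embedding_dim):

    >>> import tensorflow as tf
    >>> vocab_size, embedding_dim = embedding_matrix.shape
    # keep the pretrained fastText vectors frozen during training
    >>> embedding_layer = tf.keras.layers.Embedding(vocab_size, embedding_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), trainable=False)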

Preprocessing data

    >>> from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
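
These helpers can be chained into a small cleaning pipeline. A sketch, assuming each function takes a string and returns a string (vncore_tokenize presumably performs word segmentation via VnCoreNLP, joining multi-word tokens with underscores):

    >>> text = 'Sinh viên Công nghệ thông tin 😍!!!'
    >>> text = convert_unicode(text)      # normalise the Vietnamese unicode form
    >>> text = remove_emoji(text)
    >>> text = remove_punctuations(text)
    >>> text = vncore_tokenize(text)      # e.g. 'Sinh_viên Công_nghệ thông_tin'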

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT

Trainer

    >>> import tensorflow as tf
    >>> from genz_tokenize.base_model.utils import Config
    >>> from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
    >>> from genz_tokenize.base_model.training import TrainArgument, Trainer
    # create the hyperparameter config
    >>> config = Config()
    >>> config.vocab_size = 100
    >>> config.target_vocab_size = 120
    >>> config.units = 16
    >>> config.maxlen = 20
    # initialise the model
    >>> model = Seq2Seq(config)
    >>> x = tf.zeros(shape=(10, config.maxlen))
    >>> y = tf.zeros(shape=(10, config.maxlen))
    # create dataset
    >>> BUFFER_SIZE = len(x)
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    >>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    >>> dataset_train = dataset_train.batch(2)
    >>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    >>> args = TrainArgument(batch_size=2, epochs=2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()

Training the BERT-style models:

    >>> from genz_tokenize.models.bert import DataCollection
    >>> from genz_tokenize.models.bert.training import TrainArg, Trainner
    >>> from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig
    >>> import tensorflow as tf

    >>> x = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> y = tf.zeros(shape=(10, 2), dtype=tf.int32)

    >>> dataset = DataCollection(
                    input_ids=x,
                    attention_mask=mask,
                    token_type_ids=None,
                    dec_input_ids=None,
                    dec_attention_mask=None,
                    dec_token_type_ids=None,
                    y=y
                )
    >>> tf_dataset = dataset.to_tf_dataset(batch_size=2)

    >>> config = RobertaConfig()
    >>> config.num_class = 2
    >>> model = RoBertaClassification(config)
    >>> arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
    >>> trainer = Trainner(model, arg, tf_dataset)
    >>> trainer.train()
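
After training, the weights can be persisted with the standard Keras API. A sketch, assuming the models subclass tf.keras.Model (the checkpoint path is only illustrative):

    >>> model.save_weights('roberta_classification.ckpt')
    >>> model.load_weights('roberta_classification.ckpt')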

Download files

Download the file for your platform.

Source Distribution

genz-tokenize-1.2.6.tar.gz (62.5 MB)


Built Distribution

genz_tokenize-1.2.6-py3-none-any.whl (63.9 MB)


File details

Details for the file genz-tokenize-1.2.6.tar.gz.

File metadata

  • Download URL: genz-tokenize-1.2.6.tar.gz
  • Upload date:
  • Size: 62.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for genz-tokenize-1.2.6.tar.gz

  • SHA256: f89f7bdfb301ea640dff324e9887b956f5846e4a8eb837a3555ea73d9dd6be3d
  • MD5: a7b5d8ff8139b0ac5bfe2abb5c3db752
  • BLAKE2b-256: 9138672e52ac4e31616f721621b2d71629bdd2769f68422e2cef696834a64e57


File details

Details for the file genz_tokenize-1.2.6-py3-none-any.whl.

File metadata

  • Download URL: genz_tokenize-1.2.6-py3-none-any.whl
  • Upload date:
  • Size: 63.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.10

File hashes

Hashes for genz_tokenize-1.2.6-py3-none-any.whl

  • SHA256: a9a34cbc57993e4ab7018cd82cd1a52f8f763507be87702c11203e528e20798e
  • MD5: 41ca6c160caa79c9e95c66410ac8ef10
  • BLAKE2b-256: 00dda3fdf297ca98aa7d2cb929d59f8b20b459cd53eb93d9725b611f33baac67

