Vietnamese tokenization, preprocessing, and NLP models

Genz Tokenize

Tokenization

    >>> from genz_tokenize import Tokenize
    # use the vocabulary bundled with the library
    >>> tokenize = Tokenize()
    >>> print(tokenize('sinh_viên công_nghệ', 'hello', max_len=10, padding=True, truncation=True))
    # {'input_ids': [1, 770, 1444, 2, 2, 30469, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'sequence_id': [None, 0, 0, None, None, 1, None]}

    >>> print(tokenize.decode([1, 770, 2]))
    # <s> sinh_viên </s>

    # load your own vocabulary and BPE codes
    >>> tokenize = Tokenize.fromFile('vocab.txt', 'bpe.codes')
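
The layout of the pair encoding shown above can be reproduced by hand. The following is a minimal sketch, not the library's implementation: the special ids (`<s>` = 1, `</s>` = 2, padding = 0) and the token ids are read off the example output, and `encode_pair` is a hypothetical helper. The truncation here is a naive cut at `max_len`, which may differ from what `Tokenize` actually does.

```python
# Special ids inferred from the example output above (assumption).
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def encode_pair(first_ids, second_ids, max_len):
    """Lay out two id lists as <s> A </s> </s> B </s>, then pad to max_len."""
    input_ids = [BOS_ID] + first_ids + [EOS_ID, EOS_ID] + second_ids + [EOS_ID]
    # None marks special tokens, 0 the first sequence, 1 the second.
    sequence_id = ([None] + [0] * len(first_ids) + [None, None]
                   + [1] * len(second_ids) + [None])
    # Naive truncation: cut everything past max_len.
    input_ids = input_ids[:max_len]
    sequence_id = sequence_id[:max_len]
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "sequence_id": sequence_id,
    }


# 770 = sinh_viên, 1444 = công_nghệ, 30469 = hello (ids taken from the output above)
encoded = encode_pair([770, 1444], [30469], max_len=10)
print(encoded)
```

The attention mask is 1 for every real token, including the special ones, and 0 for padding, so a model can ignore the padded positions.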

BERT tokenizer, inherited from the Transformers PreTrainedTokenizer

    >>> from genz_tokenize import TokenizeForBert
    # use the vocabulary bundled with the library
    >>> tokenize = TokenizeForBert()
    >>> print(tokenize(['sinh_viên công_nghệ', 'hello'], max_length=5, padding='max_length', truncation=True))
    # {'input_ids': [[1, 770, 1444, 2, 0], [1, 30469, 2, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]}

    # load your own vocabulary and BPE codes
    >>> tokenize = TokenizeForBert.fromFile('vocab.txt', 'bpe.codes')
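
To make the batch output above easier to read, here is a small sketch of `padding='max_length'` combined with `truncation=True` for single sequences. It is an illustration under assumptions, not the tokenizer's code: `encode_batch` is a hypothetical helper, the special ids are taken from the example output, and the truncation step mirrors the usual Transformers behaviour of trimming content tokens before the special tokens are added, so the closing `</s>` survives.

```python
# Special ids inferred from the example output above (assumption).
BOS_ID, EOS_ID, PAD_ID = 1, 2, 0


def encode_batch(batch_ids, max_length):
    """Wrap each id list in <s> ... </s>, truncate, and pad to max_length."""
    input_ids, token_type_ids, attention_mask = [], [], []
    for ids in batch_ids:
        kept = ids[: max_length - 2]          # reserve room for <s> and </s>
        row = [BOS_ID] + kept + [EOS_ID]
        pad = max_length - len(row)
        input_ids.append(row + [PAD_ID] * pad)
        token_type_ids.append([0] * max_length)  # single sequence: all zeros
        attention_mask.append([1] * len(row) + [0] * pad)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


batch = encode_batch([[770, 1444], [30469]], max_length=5)
print(batch)
```

Every row has the same length, which is what allows the batch to be converted into a single rectangular tensor.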

Embedding matrix from fastText

    >>> from genz_tokenize import get_embedding_matrix
    >>> embedding_matrix = get_embedding_matrix()
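
The project description does not say how `get_embedding_matrix` builds its result. As a general illustration only, an embedding matrix is usually aligned with the tokenizer vocabulary so that row `i` holds the vector for token id `i`. The sketch below assumes contiguous ids starting at 0; `build_embedding_matrix`, `vocab`, and `vectors` are hypothetical names, and the numbers are toy values.

```python
def build_embedding_matrix(vocab, vectors, dim):
    """Return a len(vocab) x dim matrix; tokens without a vector stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for token, idx in vocab.items():
        vec = vectors.get(token)
        if vec is not None:
            matrix[idx] = list(vec)
    return matrix


# Toy vocabulary and pretrained vectors (made-up values for illustration).
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "sinh_viên": 3}
vectors = {"sinh_viên": [0.1, 0.2, 0.3]}
matrix = build_embedding_matrix(vocab, vectors, dim=3)
print(matrix[3])
```

A matrix built this way can initialise an embedding layer whose input dimension equals the vocabulary size.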

Preprocessing data

    >>> from genz_tokenize.preprocess import remove_punctuations, convert_unicode, remove_emoji, vncore_tokenize
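
The signatures and exact behaviour of these helpers are not documented here. The sketch below shows what comparable cleaning steps could look like with the standard library alone. All function names are hypothetical stand-ins, not the library's functions, and it assumes that `convert_unicode` refers to Unicode normalization. The emoji pattern is a rough range, not an exhaustive one.

```python
import re
import string
import unicodedata

# Keep "_" because word-segmented Vietnamese joins syllables with it (sinh_viên).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("_", ""))

# Rough emoji ranges (assumption, not exhaustive).
_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def normalize_unicode(text):
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_emoji(text):
    return _EMOJI_RE.sub("", text)


def strip_punctuation(text):
    return text.translate(_PUNCT_TABLE)


def clean(text):
    """Normalize, drop emoji and ASCII punctuation, collapse whitespace."""
    text = strip_punctuation(strip_emoji(normalize_unicode(text)))
    return " ".join(text.split())


print(clean("Xin chào, sinh_viên! \U0001F600"))
```

Normalizing first matters for Vietnamese: the same accented letter can be stored as one code point or as a base letter plus combining marks, and a vocabulary lookup would otherwise treat the two forms as different tokens.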

Models

1. Seq2Seq with Bahdanau Attention
2. Transformer classification
3. Transformer
4. BERT

Trainer

    >>> import tensorflow as tf
    >>> from genz_tokenize.base_model.utils import Config
    >>> from genz_tokenize.base_model.models import Seq2Seq, Transformer, TransformerClassification
    >>> from genz_tokenize.base_model.training import TrainArgument, Trainer
    # set the hyperparameters
    >>> config = Config()
    >>> config.vocab_size = 100
    >>> config.target_vocab_size = 120
    >>> config.units = 16
    >>> config.maxlen = 20
    # initialize the model
    >>> model = Seq2Seq(config)
    # dummy inputs and targets: 10 sequences of length maxlen
    >>> x = tf.zeros(shape=(10, config.maxlen))
    >>> y = tf.zeros(shape=(10, config.maxlen))
    # build the dataset
    >>> BUFFER_SIZE = len(x)
    >>> dataset_train = tf.data.Dataset.from_tensor_slices((x, y))
    >>> dataset_train = dataset_train.shuffle(BUFFER_SIZE)
    >>> dataset_train = dataset_train.batch(2)
    >>> dataset_train = dataset_train.prefetch(tf.data.experimental.AUTOTUNE)

    >>> args = TrainArgument(batch_size=2, epochs=2)
    >>> trainer = Trainer(model=model, args=args, data_train=dataset_train)
    >>> trainer.train()

    # BERT-style models use their own data collection and trainer
    >>> import tensorflow as tf
    >>> from genz_tokenize.models.bert import DataCollection
    >>> from genz_tokenize.models.bert.training import TrainArg, Trainner
    >>> from genz_tokenize.models.bert.roberta import RoBertaClassification, RobertaConfig

    >>> x = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> mask = tf.zeros(shape=(10, 10), dtype=tf.int32)
    >>> y = tf.zeros(shape=(10, 2), dtype=tf.int32)

    >>> dataset = DataCollection(
                    input_ids=x,
                    attention_mask=mask,
                    token_type_ids=None,
                    dec_input_ids=None,
                    dec_attention_mask=None,
                    dec_token_type_ids=None,
                    y=y
                )
    >>> tf_dataset = dataset.to_tf_dataset(batch_size=2)

    >>> config = RobertaConfig()
    >>> config.num_class = 2
    >>> model = RoBertaClassification(config)
    >>> arg = TrainArg(epochs=2, batch_size=2, learning_rate=1e-2)
    >>> trainer = Trainner(model, arg, tf_dataset)
    >>> trainer.train()

Download files

Source distribution: genz-tokenize-1.2.7a1.tar.gz (62.5 MB)

Built distribution: genz_tokenize-1.2.7a1-py3-none-any.whl (63.9 MB, Python 3)
